Databricks ❤️ Hugging Face: up to 40% faster training and tuning of Large Language Models

Published April 26, 2023

Generative AI has been taking the world by storm. As the data and AI company, we have been on this journey with the release of the open source large language model Dolly, as well as databricks-dolly-15k, the internally crowdsourced dataset we used to fine-tune it, which is licensed for research and commercial use. Both the model and the dataset are available on Hugging Face. We’ve learned a lot throughout this process, and today we’re excited to announce our first of many official commits to the Hugging Face codebase, which allows users to easily create a Hugging Face Dataset from an Apache Spark™ dataframe.

“It's been great to see Databricks release models and datasets to the community, and now we see them extending that work with direct open source commitment to Hugging Face. Spark is one of the most efficient engines for working with data at scale, and it's great to see that users can now benefit from that technology to more effectively fine-tune models from Hugging Face.”

- Clem Delangue, Hugging Face CEO

Hugging Face gets first-class Spark support

Over the past few weeks, we’ve gotten many requests from users asking for an easier way to load their Spark dataframe into a Hugging Face dataset that can be used for model training or tuning. Prior to today’s release, to get data from a Spark dataframe into a Hugging Face dataset, users had to write the data to Parquet files and then point the Hugging Face dataset at those files to reload them. For example:

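from datasets import load_dataset

# Write each Spark dataframe to Parquet (write.parquet returns None, so nothing is assigned)
train.write.parquet(train_dbfs_path, mode="overwrite")
test.write.parquet(test_dbfs_path, mode="overwrite")

# Reload the Parquet files as a Hugging Face dataset
train_test = load_dataset("parquet", data_files={"train": f"/dbfs{train_dbfs_path}/*.parquet", "test": f"/dbfs{test_dbfs_path}/*.parquet"})

# 16GB == 22min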

Not only was this cumbersome, but it also meant that data had to be written to disk and then read in again. On top of that, the data would get rematerialized once loaded back into the dataset, which eats up more resources and, therefore, more time and cost. Using this method, we saw that a relatively small (16GB) dataset took about 22 minutes to go from Spark dataframe to Parquet, and then back into the Hugging Face dataset.

With the latest Hugging Face release, we make it much simpler for users to accomplish the same task by calling the new “from_spark” function in Datasets:

from datasets import Dataset

# df is any Spark dataframe, for example one loaded from a Delta table

dataset = Dataset.from_spark(df)

# 16GB == 12min
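To make the snippet above concrete, here is a minimal sketch of the same call wrapped in a helper, plus a small local demo. The helper name `spark_to_hf_dataset`, the `demo` function, and the two sample rows are hypothetical and only for illustration; the sketch assumes a version of the `datasets` library that includes `Dataset.from_spark`, and the demo additionally assumes `pyspark` with a working Java runtime.

```python
def spark_to_hf_dataset(df, cache_dir=None):
    """Convert a Spark DataFrame into a Dataset (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require the library.
    from datasets import Dataset

    # cache_dir is optional; None keeps the library's default cache location.
    return Dataset.from_spark(df, cache_dir=cache_dir)


def demo():
    """Build a tiny local Spark DataFrame and convert it (illustrative data)."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("from_spark_demo")
        .getOrCreate()
    )
    df = spark.createDataFrame(
        [
            ("What is Spark?", "A distributed data engine."),
            ("What is Dolly?", "An open source LLM."),
        ],
        schema=["instruction", "response"],
    )
    # Any Spark transformation can run before the conversion.
    df = df.filter(df.instruction.isNotNull())
    return spark_to_hf_dataset(df)
```

Calling `demo()` in an environment with Spark available should return a `Dataset` with the `instruction` and `response` columns. On a multi-node cluster, `cache_dir` would likely need to point at storage that both the driver and the workers can reach.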

The from_spark function lets users load and transform data efficiently with Spark for training or fine-tuning a model, then map their Spark dataframe into a Hugging Face dataset for simple integration into their training pipelines. This combines the cost savings and speed of Spark with optimizations like memory-mapping and smart caching from Hugging Face datasets. These improvements cut the processing time for our example 16GB dataset by more than 40%, from 22 minutes down to only 12 minutes.
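As a quick check of the figures quoted above, the arithmetic can be sketched as follows. The function names are hypothetical; the only inputs are the 22-minute and 12-minute timings reported for the 16GB example.

```python
def percent_reduction(before_min, after_min):
    """Share of the original processing time that was saved, in percent."""
    return (before_min - after_min) / before_min * 100


def speedup(before_min, after_min):
    """How many times faster the new path is than the old one."""
    return before_min / after_min


# Timings reported for the 16GB example dataset.
reduction = percent_reduction(22, 12)
factor = speedup(22, 12)

print(f"{reduction:.1f}% less time, {factor:.2f}x faster")
# prints "45.5% less time, 1.83x faster"
```

A reduction of roughly 45% is consistent with the "more than 40%" claim for this single measurement.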

Why does this matter?

As we transition to this new AI paradigm, organizations will need to use their extremely valuable data to augment their AI models if they want to get the best performance within their specific domain. This will almost certainly require work in the form of data transformations, and doing this efficiently over large datasets is something Spark was designed to do. Integrating Spark with Hugging Face gives you the cost-effectiveness and performance of Spark while retaining the pipeline integration that Hugging Face provides.

Continued Open-Source Support

We see this release as a new avenue to further contribute to the open source community, something that we believe Hugging Face does extremely well, as it has become the de facto repository for open source models and datasets. This is only the first of many contributions. We already have plans to add streaming support through Spark to make dataset loading even faster.

In order to become the best platform for users to jump into the world of AI, we’re working hard to provide the best tools to successfully train, tune, and deploy models. Not only will we continue contributing to Hugging Face, but we’ve also started releasing improvements to our other open source projects. A recent MLflow release added support for the transformers library, OpenAI integration, and LangChain support. We also announced AI Functions within Databricks SQL, which let users easily integrate OpenAI (or their own deployed models in the future) into their queries. To top it all off, we also released a PyTorch distributor for Spark to simplify distributed PyTorch training on Databricks.

This article was originally published on April 26, 2023 on the Databricks blog.
