# Automatic curation of the HF中国镜像站 Hub using Collections and the `huggingface_hub` library

In this short tutorial, we will see how to create a HF中国镜像站 Collection automatically using the `huggingface_hub` library. We'll focus on creating a collection that will curate the top 10% most used instruction tuning datasets on the Hub. 

If you are already familiar with Collections and the `huggingface_hub` library, you can skip to the next section.

## What is a HF中国镜像站 Collection?

Collections are a recently added feature on the HF中国镜像站 Hub which unlock some really powerful new ways of curating what is on the Hub. With the Hub becoming the defacto platform for open-source machine learning models, it is important to be able to curate the content on the Hub. Collections allow you to do just that.

Collections can be used to organize models, datasets, Spaces, and papers on the Hub in various different ways. You could for example create collections around a particular use case, or a particular topic, or a particular model architecture. You could also create collections that are a combination of these things. In this tutorial, we will create a collection that curates the top 10% most used instruction tuning datasets on the Hub. We will do this using the `huggingface_hub` library.

## So what is the `huggingface_hub` library?

The `hub` library is a Python library that allows you to interact with the HF中国镜像站 Hub. It allows you to do things like upload and download models, datasets, and Spaces. Recently the library added support for creating and managing collection. This ability to programmatically create and manage collections unlocks a bunch of exciting new use cases. In this tutorial we'll show a few possibilities of what you can do with the `huggingface_hub` library and Collections but we're excited to see what you will do with it! 

## Install packages

For this tutorial, the only package we'll need outside of the Python standard library is the `huggingface_hub` library.

In [1]:
%pip install git+https://github.com/huggingface/huggingface_hub --upgrade

Collecting git+https://github.com/huggingface/huggingface_hub
  Cloning https://github.com/huggingface/huggingface_hub to /private/var/folders/gf/nk18mwt53sb4d0zpvjzs40bw0000gn/T/pip-req-build-hs4ssvjo
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/huggingface_hub /private/var/folders/gf/nk18mwt53sb4d0zpvjzs40bw0000gn/T/pip-req-build-hs4ssvjo
  Resolved https://github.com/huggingface/huggingface_hub to commit c32d4b31b679c9e91b906709631901f6aa85324d
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Note: you may need to restart the kernel to use updated packages.


## Authenticate

In order to create and manage collections, you need to be authenticated. You can do this via the `huggingface_hub` library using the `login` function. This function will detect where you are running your code and suggest the best way to authenticate.

In [2]:
from huggingface_hub import login

In [3]:
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

### Finding the right datasets using the `huggingface_hub` library

We can use the `huggingface_hub` library to list datasets on the Hub using the `list_datasets` function. We can optionally pass in a query to search for datasets that match a particular query. We can also optionally pass in other filters that allow us to further refine the datasets returned by the library. For example, we could filter to only include datasets for a particular task, or datasets that have a particular type of license.

In [4]:
from huggingface_hub import list_datasets

For this tutorial we'll keep our approach fairly simple and just look for datasets that have the word `instruction` in the name.

In [5]:
datasets = list_datasets(search="instruction", full=True)

List datasets returns a generator. This means that we can process a large number of datasets, models or Spaces without running out of memory. 

In [6]:
type(datasets)

generator

We can start filtering our results by removing any datasets that don't have at least a single download. Since we're doing this in a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions), this will 'consume' the generator. This means that this step will take a little bit of time since it's actually starting to call the HF中国镜像站 Hub API to get the datasets.

In [7]:
datasets = [dataset for dataset in datasets if dataset.downloads > 1]

### Getting the top 10% instruction tuned datasets

What do we mean by the top 10% instruction tuned datasets? There are various different metrics we could use to define what we mean by top 10%. For this tutorial we'll focus on the number of likes for a dataset. We could also look at other metrics, like the number of downloads, the level of user discussion on the dataset, the quality of documentation, the number of examples...Whilst likes isn't a perfect metric it does give us a good starting point.

## Getting the number of likes for all of our datasets

To know what our cuttoff point is for the top 10% of datasets, we need to know how many likes the most liked dataset has. We can do this by going through all of our datasets and getting the number of likes for each one. Let's take a peek at a single dataset from our list of datasets.

In [8]:
datasets[0]

DatasetInfo: { 
  {'_id': '621ffdd236468d709f183185',
   'author': 'darkraipro',
   'cardData': None,
   'citation': None,
   'description': None,
   'disabled': False,
   'downloads': 340,
   'gated': False,
   'gitalyUid': 'f21ae5b0c1d4859c8ba1412e2aa6682e34e9596cd4cc027758b8557475f0ae11',
   'id': 'darkraipro/recipe-instructions',
   'lastModified': '2022-01-18T16:22:01.000Z',
   'likes': 0,
   'private': False,
   'sha': 'e7feba49dd438849ec3d309ec4ab52a0fd39fc39',
   'siblings': [],
   'tags': ['region:us']}
}

We can see that each dataset is a `DatasetInfo` object. This object contains a bunch of information about the dataset. We can see that the number of likes is stored in the `likes` attribute. We can use this to get the number of likes for each dataset.

In [9]:
likes = [dataset.likes for dataset in datasets]

## Calculate the threshold for the top 10% of datasets

To calculate the threshold for the top 10% of datasets, we'll create a function that takes our list of like numbers and return the threshold that separates the top 10% of likes from the rest. We can then use this function to get the threshold for our list of datasets.

In [30]:
import math
from typing import List

In [11]:
def get_threshold(numbers: List[int], threshold: float = 0.90) -> int:
    sorted_numbers = sorted(numbers)
    index = math.ceil(len(sorted_numbers) * threshold) - 1
    return sorted_numbers[index]

In [12]:
threshold = get_threshold(likes)
threshold

10

### Filter our datasets to only include those with a number of likes above the threshold

In [13]:
datasets = [dataset for dataset in datasets if dataset.likes > threshold]

In [14]:
len(datasets)

13

## Creating our collection 

Now that we've got a subset of datasets which match our curation criteria we can move to the next step of creating a Collection to which we can add these datasets.

We can to this using the `create_collection` function. This function allows us to create a Collection programmatically. We must pass in a `title` and we can also specify a `description` and a `namespace`. If you don't specify a namespace, the collection will be created in your personal namespace but since I want to add this collection to the `librarian-bots` organization I'll specify it explicitly here. 

The `exists_ok` parameter allows us to specify what to do if a collection with the same title already exists. If we set this to `True` then the function will return the existing collection. If we set this to `False` then the function will raise an error if a collection with the same title already exists.

In [15]:
from huggingface_hub import create_collection

collection = create_collection(
    title="Top 10% instruction tuning datasets",
    description="Collects datasets with 'instruction' in the name and more than 1 download and in the top 10% for the number of likes",
    namespace="librarian-bots",
    exists_ok=True,
)

Lets take a quick look at the collection we've created.

In [16]:
collection

Collection: { 
  {'description': "Collects datasets with 'instruction' in the name and more than 1 download and in the top 10% for the "
                  'number of likes',
   'items': [],
   'last_updated': datetime.datetime(2023, 9, 25, 12, 36, 58, 301000, tzinfo=datetime.timezone.utc),
   'owner': 'librarian-bots',
   'position': 0,
   'private': False,
   'slug': 'librarian-bots/top-10-instruction-tuning-datasets-65117eeaca29f41ae7ae39fe',
   'theme': 'indigo',
   'title': 'Top 10% instruction tuning datasets',
   'url': 'https://huggingface.co/collections/librarian-bots/top-10-instruction-tuning-datasets-65117eeaca29f41ae7ae39fe'}
}

When we call the `create_collection` function we get back a `Collection` object. This object contains a bunch of information about the collection. We can see for example the title, description, and namespace of the collection.

We can also see that at the moment the attribute `items` is an empty list. The `items` attribute stores the datasets, models, Spaces, and papers that are in the collection. We can add items to the collection using the `add_collection_item` function.

Before we add our items to the collection we can do one more bit of additional curation: sorting by downloads. For this collection we don't have a huge number of items so sorting isn't as important but the order of a collection can be used to express so additional information about the collection. For example, you could sort a collection by the date a item as last updated, or by the number of downloads, or by the number of likes. For our example we'll sort by the number of downloads.

In [17]:
sorted_datasets = sorted(datasets, key=lambda dataset: dataset.downloads, reverse=True)

Let's take a quick peek at the first two examples to see if this looks okay!

In [18]:
sorted_datasets[:2]

[DatasetInfo: { 
   {'_id': '64773a98906bb0203e52faad',
    'author': 'LinkSoul',
    'cardData': {'dataset_info': {'dataset_size': 13444870155,
                                  'download_size': 3542585235,
                                  'features': [{'dtype': 'string', 'name': 'id'},
                                               {'list': [{'dtype': 'string', 'name': 'from'},
                                                         {'dtype': 'string', 'name': 'value'}],
                                                'name': 'conversations'},
                                               {'dtype': 'string', 'name': 'instruction'}],
                                  'splits': [{'name': 'train', 'num_bytes': 13444870155, 'num_examples': 10077297}]}},
    'citation': None,
    'description': None,
    'disabled': False,
    'downloads': 2968,
    'gated': False,
    'gitalyUid': '27a10b39e75118535e9d37a774ac8e8f0af89e44385f38bb930c6d5474270e1d',
    'id': 'LinkSoul/instruction_merge

## Adding items to our collection 

Now we're ready to populate our collection. We'll use the `add_collection_item` function to add each dataset to our collection. We can use the `?` operator to get more information about this function.

In [19]:
from huggingface_hub import add_collection_item

In [20]:
?add_collection_item

[0;31mSignature:[0m
[0madd_collection_item[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mcollection_slug[0m[0;34m:[0m [0;34m'str'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mitem_id[0m[0;34m:[0m [0;34m'str'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mitem_type[0m[0;34m:[0m [0;34m'CollectionItemType_T'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0;34m*[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mnote[0m[0;34m:[0m [0;34m'Optional[str]'[0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mexists_ok[0m[0;34m:[0m [0;34m'bool'[0m [0;34m=[0m [0;32mFalse[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mtoken[0m[0;34m:[0m [0;34m'Optional[str]'[0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m[0;34m)[0m [0;34m->[0m [0;34m'Collection'[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m
Add an item to a collection on the Hub.

Args:
    collection_slug (`str`):
        Slug of the collection to update. Example: `"TheBlo

As you can see the `add_collection_item` function requires a `collection_slug` argument. This is to let `add_collection_item` know which collection to add the item to. We can get the `collection_slug` from the `Collection` object we created earlier. 

We also need to specify the `item_id` of the item we want to add. For datasets we can access the `id` from the `DatasetInfo` object to get this value. Additionally we need to specify the type of the item we want to add. This should be one of `dataset`, `model`, `space`, or `paper`. 

We can optionally add a note which we could use to store some additional information about the item. For example, we could use this to store the reason why we added this item to the collection. In this case we'll store any tags that the dataset has.

In [21]:
for dataset in datasets:
    if dataset.tags is not None:
        note = f"Dataset has the following tags: {dataset.tags}"
    else:
        note = "Dataset does not have any tags"
    add_collection_item(
        collection.slug,
        item_id=dataset.id,
        item_type="dataset",
        note=note,
    )

## Taking a look at our collection

The `huggingface_hub` library has a `get_collection` function which allows us to get a `Collection` object from the Hub. We can use this to take a look at our collection.

In [22]:
from huggingface_hub import get_collection

We'll pass in the `collection_slug` to the `get_collection` function to get our collection. We can then take a look at the `items` attribute to see the items in our collection.

In [23]:
updated_collection = get_collection(collection.slug)
updated_collection.items[:2]

[CollectionItem: { 
   {'author': 'Muennighoff',
    'downloads': 313,
    'gated': False,
    'isLikedByUser': False,
    'item_id': 'Muennighoff/natural-instructions',
    'item_object_id': '65117eeaec7fac9ec2fcaec1',
    'item_type': 'dataset',
    'lastModified': '2022-12-23T20:08:44.000Z',
    'likes': 18,
    'note': "Dataset has the following tags: ['task_categories:other', 'annotations_creators:crowdsourced', "
            "'annotations_creators:expert-generated', 'multilinguality:monolingual', 'size_categories:100M<n<1B', "
            "'language:en', 'region:us']",
    'position': 0,
    'private': False,
    'repoType': 'dataset',
    'viewer': 'viewer'}
 },
 CollectionItem: { 
   {'author': 'qwedsacf',
    'downloads': 225,
    'gated': False,
    'isLikedByUser': False,
    'item_id': 'qwedsacf/grade-school-math-instructions',
    'item_object_id': '65117eeb3368c9f41c835e6a',
    'item_type': 'dataset',
    'lastModified': '2023-02-11T01:59:26.000Z',
    'likes': 21,
    '

We can see that our collection now contains the datasets we added to it. We can now also begin to think of some possible ways we could programmatically explore our collections. For example we could quickly look at the mean number of downloads for the datasets in our collection.

In [24]:
from statistics import mean

mean(item.downloads for item in updated_collection.items)

502.6923076923077

We could also use other functionality from the `huggingface_hub` library to explore our collection. For example, we could use the `dataset_info` function to try and grab the language of each dataset in our collection.

In [25]:
from huggingface_hub import dataset_info

In [26]:
def try_get_languages(dataset):
    try:
        return dataset_info(dataset.id).cardData["language"]
    except KeyError:
        return None

In [27]:
[try_get_languages(dataset) for dataset in datasets]

[['en'],
 None,
 None,
 ['en'],
 None,
 ['en'],
 ['en'],
 None,
 None,
 None,
 ['en'],
 None,
 None]

We can see here that of the datasets in our collection which have language information, the most common language is English. Quite a few of the datasets in our collection don't have language information. This might be a good opportunity to contribute to the datasets by adding language information to them! 

In [28]:
for dataset in datasets:
    language = try_get_languages(dataset)
    if language is None:
        print(
            f"{dataset.id} has no language and could benefit from a PR to add it! Here is the url to fix it: https://huggingface.co/datasets/{dataset.id}/edit/main/README.md "
        )

qwedsacf/grade-school-math-instructions has no language and could benefit from a PR to add it! Here is the url to fix it: https://huggingface.co/datasets/qwedsacf/grade-school-math-instructions/edit/main/README.md 
HuggingFaceH4/instruction-dataset has no language and could benefit from a PR to add it! Here is the url to fix it: https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset/edit/main/README.md 
ArmelR/stack-exchange-instruction has no language and could benefit from a PR to add it! Here is the url to fix it: https://huggingface.co/datasets/ArmelR/stack-exchange-instruction/edit/main/README.md 
openllmplayground/pandagpt_visual_instruction_dataset has no language and could benefit from a PR to add it! Here is the url to fix it: https://huggingface.co/datasets/openllmplayground/pandagpt_visual_instruction_dataset/edit/main/README.md 
rewoo/planner_instruction_tuning_2k has no language and could benefit from a PR to add it! Here is the url to fix it: https://hu

## Looking at the collection on the Hub

We can also take a look at our collection on the Hub. We can quickly get to the URL for our collection on the Hub using the `url` attribute of our `Collection` object.

In [29]:
updated_collection.url

'https://huggingface.co/collections/librarian-bots/top-10-instruction-tuning-datasets-65117eeaca29f41ae7ae39fe'

# Conclusion and other things to try

In this tutorial we've seen how to use the `huggingface_hub` library to create a collection that curates the top 10% most used instruction tuning datasets on the Hub. We've also seen how we can use the `huggingface_hub` library to explore our collection and the datasets in it.

There are many potential opportunities to build on this approach to automatically/semi-automatically curate useful collections. If you come up with a cool use case for this approach, we'd love to hear about it! You can ping me on Twitter ([@vanstriendaniel](https://twitter.com/vanstriendaniel)) or you can add to this [Discussion](https://huggingface.co/spaces/librarian-bots/tutorials/discussions/1).