# Fine-Tuning DistilBERT on Drugs.com Reviews for Depression

**Author**: Zakia Salod

**Affiliation**: University of KwaZulu-Natal (UKZN), Durban, South Africa

**Contact**: zakia.salod@gmail.com

**Machine Used**: Google Colab T4 GPU

**Last Updated**: 9 December 2023

**Description**:
This notebook fine-tunes the DistilBERT model on the subset of the Drugs.com reviews dataset that concerns depression. It loads the data, filters it down to the relevant reviews, cleans the review text, and balances the training classes. It then fine-tunes DistilBERT for binary sequence classification to distinguish high-quality from low-quality reviews. The fine-tuned model is evaluated and saved, with the option to upload it to the Hugging Face Hub. The aim is to apply DistilBERT to a text classification task in the pharmaceutical domain.


**License**:
This work is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). Free for educational and research use.



## STEP 1: SETTING UP THE ENVIRONMENT

### Load Necessary Libraries

In [None]:
# Enable automatic module reloading to reflect changes in external .py files
%load_ext autoreload
# Reload all modules before executing code, keeping modules up-to-date
%autoreload 2

### Install Required Packages

In [None]:
!pip install transformers



In [None]:
!pip install datasets torch wandb

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting wandb
  Downloading wandb-0.16.1-py3-none-any.whl (2.1 MB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Collecting GitPython!=3.1.29,>=1.0.0 (from wandb)
  Downloading Gi

In [None]:
!pip install torch



In [None]:
!pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.25.0


In [None]:
!pip install transformers[torch]



In [None]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


### Import Necessary Libraries

In [None]:
import numpy as np
from datasets import load_dataset, concatenate_datasets
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer, Trainer, TrainingArguments
import evaluate
import html
import re
import csv
from collections import Counter
import pandas as pd

### Initialize Weights & Biases for Experiment Tracking

In [None]:
# Import Weights & Biases for experiment tracking
import wandb

In [None]:
# Initialize wandb for tracking and visualizing the training
wandb.init(project="distilbert-drugscom_depression_reviews", name="run_1")

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## STEP 2: LOAD AND PRE-PROCESS DRUG REVIEWS DATASET

In [None]:
dataset_name = "Zakia/drugscom_reviews"

In [None]:
train_ds = load_dataset(dataset_name, split="train")


In [None]:
test_ds = load_dataset(dataset_name, split="test")

In [None]:
# Filter out rows with missing drugName, condition, or review for the train dataset
train_ds = train_ds.filter(lambda x: all([x.get("drugName"), x.get("condition"), x.get("review")]))

Filter:   0%|          | 0/161297 [00:00<?, ? examples/s]

In [None]:
# Filter out rows with missing drugName, condition, or review for the test dataset
test_ds = test_ds.filter(lambda x: all([x.get("drugName"), x.get("condition"), x.get("review")]))

Filter:   0%|          | 0/53766 [00:00<?, ? examples/s]
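
The `all([...])` check in the two filters above relies on Python truthiness, so it drops rows where a field is `None` or an empty string, not only rows where the key is absent. A minimal sketch on hand-made rows (the records and the helper name `has_required_fields` are invented for illustration):

```python
# Toy rows standing in for dataset examples (values are invented for illustration)
rows = [
    {"drugName": "DrugA", "condition": "Depression", "review": "Helped a lot."},
    {"drugName": "DrugB", "condition": None, "review": "No condition recorded."},
    {"drugName": "DrugC", "condition": "Depression", "review": ""},
]

def has_required_fields(row):
    # None and "" are both falsy, so either one causes the row to be dropped
    return all([row.get("drugName"), row.get("condition"), row.get("review")])

kept = [row for row in rows if has_required_fields(row)]
print([row["drugName"] for row in kept])
```

Only the first row survives, because the other two each have one falsy field.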

In [None]:
# Filter the dataset for the condition 'Depression' for the train dataset
train_ds = train_ds.filter(lambda example: example["condition"] == "Depression")

Filter:   0%|          | 0/160398 [00:00<?, ? examples/s]

In [None]:
# Filter the dataset for the condition 'Depression' for the test dataset
test_ds = test_ds.filter(lambda example: example["condition"] == "Depression")

Filter:   0%|          | 0/53471 [00:00<?, ? examples/s]

In [None]:
# Function to clean review text
def clean_review(text):
    # Check if the text is a string
    if not isinstance(text, str):
        return ""  # Return an empty string if the input is not a string
    text = html.unescape(text)  # Decode HTML entities
    text = re.sub(r'"', '', text)  # Remove quotes
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    return text

In [None]:
# Clean the reviews
# Apply the clean_review function in a batched manner
def clean_reviews(batch):
    # Apply clean_review to each review in the batch and return the modified batch
    return {"review": [clean_review(review) for review in batch["review"]]}
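
To see what the cleaning steps do in practice, here is a small self-contained sketch that repeats `clean_review` and applies it to one string. The sample text is invented in the style of a raw review and is not taken from the dataset.

```python
import html
import re

def clean_review(text):
    # Same three steps as the function above: unescape, drop quotes, strip tags
    if not isinstance(text, str):
        return ""
    text = html.unescape(text)
    text = re.sub(r'"', '', text)
    text = re.sub(r'<.*?>', '', text)
    return text

# Invented sample containing an HTML entity, wrapping quotes, and a tag
raw = '"It&#039;s helped my mood.<br/>No side effects so far."'
cleaned = clean_review(raw)
print(cleaned)
```

Because tags are replaced with an empty string rather than a space, sentences separated only by a tag such as `<br/>` end up joined without a space.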

In [None]:
# Clean the reviews for the train dataset
train_ds = train_ds.map(clean_reviews, batched=True)

Map:   0%|          | 0/9069 [00:00<?, ? examples/s]

In [None]:
# Clean the reviews for the test dataset
test_ds = test_ds.map(clean_reviews, batched=True)

Map:   0%|          | 0/3095 [00:00<?, ? examples/s]

In [None]:
RATING_THRESHOLD = 5  # Assuming a rating above 5 is considered positive

In [None]:
useful_counts = train_ds["usefulCount"]  # This extracts the list of useful counts of the train dataset

In [None]:
USEFUL_COUNT_THRESHOLD = np.percentile(useful_counts, 75)  # Calculate the 75th percentile

In [None]:
print(USEFUL_COUNT_THRESHOLD)

65.0
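
With default settings, `np.percentile` interpolates linearly between the sorted values. The sketch below illustrates the same calculation on a handful of invented counts, using the standard library's `statistics.quantiles` with `method="inclusive"`, which applies the same interpolation rule. It is only an illustration of how the threshold is derived, not a replacement for the cell above.

```python
import statistics

# Invented usefulCount values, for illustration only
useful_counts = [3, 8, 12, 20, 41, 65, 90, 150]

# Quartile cut points; method="inclusive" interpolates linearly between
# sorted values, the same rule np.percentile applies by default
q1, q2, q3 = statistics.quantiles(useful_counts, n=4, method="inclusive")
threshold = q3  # 75th percentile

# Reviews strictly above the threshold count as "highly useful"
above = [count for count in useful_counts if count > threshold]
print(threshold, above)
```

Since the comparison is strict (`>`), a review whose count equals the threshold is not treated as highly useful.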


In [None]:
# Define the function that computes the high_quality_review label for one example
def high_quality_review(example):
    is_high_rating = example["rating"] > RATING_THRESHOLD  # Assuming a rating above 5 is positive
    is_high_usefulCount = example["usefulCount"] > USEFUL_COUNT_THRESHOLD
    return {"high_quality_review": int(is_high_rating and is_high_usefulCount)}
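
The label is 1 only when both conditions hold, and both comparisons are strict. A short sketch on invented records makes the boundary cases explicit; the threshold values are the ones defined and printed earlier in this notebook.

```python
RATING_THRESHOLD = 5
USEFUL_COUNT_THRESHOLD = 65.0  # value printed above for the training split

def high_quality_review(example):
    is_high_rating = example["rating"] > RATING_THRESHOLD
    is_high_useful_count = example["usefulCount"] > USEFUL_COUNT_THRESHOLD
    return {"high_quality_review": int(is_high_rating and is_high_useful_count)}

# Invented records covering the boundary cases
samples = [
    {"rating": 9, "usefulCount": 120},  # both conditions hold
    {"rating": 9, "usefulCount": 65},   # usefulCount equals the threshold, not above it
    {"rating": 5, "usefulCount": 200},  # rating equals the threshold, not above it
]
labels = [high_quality_review(sample)["high_quality_review"] for sample in samples]
print(labels)
```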

In [None]:
# Apply the high_quality_review function to the train dataset
train_ds = train_ds.map(high_quality_review)

Map:   0%|          | 0/9069 [00:00<?, ? examples/s]

In [None]:
# Apply the high_quality_review function to the test dataset
test_ds = test_ds.map(high_quality_review)

Map:   0%|          | 0/3095 [00:00<?, ? examples/s]

In [None]:
# Balance the train dataset
# Filter the high_quality_review == 1 and == 0 separately for the train dataset
high_quality_reviews = train_ds.filter(lambda example: example["high_quality_review"] == 1)
low_quality_reviews = train_ds.filter(lambda example: example["high_quality_review"] == 0)

Filter:   0%|          | 0/9069 [00:00<?, ? examples/s]

Filter:   0%|          | 0/9069 [00:00<?, ? examples/s]

In [None]:
# Downsample the low_quality_reviews to match the number of high_quality_reviews in the train dataset
low_quality_reviews_downsampled = low_quality_reviews.shuffle(seed=42).select(range(len(high_quality_reviews)))

In [None]:
# Combine the two datasets back together and shuffle
balanced_train_ds = concatenate_datasets([high_quality_reviews, low_quality_reviews_downsampled]).shuffle(seed=42)

In [None]:
print("Number of records in the balanced training dataset:", len(balanced_train_ds))

Number of records in the balanced training dataset: 4240


In [None]:
# Count for training dataset
train_high_quality_counts = Counter([example["high_quality_review"] for example in balanced_train_ds])
print("Training dataset high quality review counts:", train_high_quality_counts)

Training dataset high quality review counts: Counter({0: 2120, 1: 2120})
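
The balancing steps above (split by class, shuffle and truncate the majority class, concatenate, shuffle again) can be sketched in plain Python. This is a stand-in for the `datasets` calls rather than the same code path, and the label list is invented.

```python
import random
from collections import Counter

# Invented imbalanced labels: 3 positives, 9 negatives
labels = [1] * 3 + [0] * 9
positives = [label for label in labels if label == 1]
negatives = [label for label in labels if label == 0]

rng = random.Random(42)  # fixed seed so the sample is reproducible
rng.shuffle(negatives)
negatives_downsampled = negatives[:len(positives)]  # keep as many negatives as positives

balanced = positives + negatives_downsampled
rng.shuffle(balanced)  # mix the classes so batches are not ordered by label
print(Counter(balanced))
```

Downsampling discards data: in this notebook the 9069 depression reviews in the training split shrink to 4240, so 4829 low-quality reviews are left unused.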


In [None]:
print("Number of records in the testing dataset:", len(test_ds))

Number of records in the testing dataset: 3095


In [None]:
# Count for testing dataset
test_high_quality_counts = Counter([example["high_quality_review"] for example in test_ds])
print("Testing dataset high quality review counts:", test_high_quality_counts)

Testing dataset high quality review counts: Counter({0: 2328, 1: 767})
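
Only the training split is balanced; the test split keeps its natural class ratio. That matters when interpreting accuracy measured on it, because a classifier that always predicts the majority class already scores well. A quick calculation from the counts printed above:

```python
from collections import Counter

# Class counts reported above for the (unbalanced) test split
test_counts = Counter({0: 2328, 1: 767})

total = sum(test_counts.values())
majority_class, majority_count = test_counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(f"Always predicting class {majority_class} gives accuracy {baseline_accuracy:.4f}")
```

The baseline is roughly 0.75, so test accuracy is best read relative to that figure rather than relative to 0.5.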


## STEP 3: PREPARE THE DISTILBERT MODEL FOR FINE-TUNING

In [None]:
# Initialize the DistilBERT model
model_name = "distilbert-base-uncased"

In [None]:
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Initialize the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained(model_name)


In [None]:
# Function to encode the texts
def encode(examples):
    tokenized_output = tokenizer(examples["review"], truncation=True, padding="max_length", max_length=512)
    # Add labels to the tokenized output dictionary
    tokenized_output['labels'] = examples["high_quality_review"]
    return tokenized_output
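
The `encode` function asks the tokenizer to truncate and pad every review to exactly 512 tokens. The sketch below is a hand-written, simplified illustration of that behaviour for a single sequence, not the tokenizer's own code. `truncate_and_pad` is a hypothetical helper and the content IDs are arbitrary; the special-token IDs 0, 101, and 102 are those of `[PAD]`, `[CLS]`, and `[SEP]` in the uncased BERT vocabulary that DistilBERT reuses.

```python
# Special-token IDs in the distilbert-base-uncased vocabulary
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def truncate_and_pad(content_ids, max_length):
    # Reserve two positions for [CLS] and [SEP], then cut the content to fit
    kept = content_ids[: max_length - 2]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Right-pad up to max_length; padded positions get attention_mask 0
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }

short = truncate_and_pad([7592, 2088], max_length=6)           # shorter than the limit: padded
long = truncate_and_pad(list(range(1000, 1010)), max_length=6)  # longer than the limit: truncated
print(short)
print(long)
```

Padding every review to 512 tokens is simple but spends compute on padding for short reviews. Padding each batch to its own longest sequence (for example with `DataCollatorWithPadding` from `transformers`) is a common alternative, though this notebook does not use it.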

In [None]:
# Tokenize all reviews without truncation and record each review's token count
lengths = [len(tokenizer.encode(review)) for review in balanced_train_ds['review']]

# Get the maximum length
max_length = max(lengths)

# This is the longest review in the dataset
print(f"The longest review is {max_length} tokens long.")

Token indices sequence length is longer than the specified maximum sequence length for this model (567 > 512). Running this sequence through the model will result in indexing errors


The longest review is 672 tokens long.


In [None]:
# Apply the encode function to the balanced training dataset
train_encoded = balanced_train_ds.map(encode, batched=True)

Map:   0%|          | 0/4240 [00:00<?, ? examples/s]

In [None]:
# Apply the encode function to the test dataset
test_encoded = test_ds.map(encode, batched=True)

Map:   0%|          | 0/3095 [00:00<?, ? examples/s]

In [None]:
# Save the train dataset to a file
# Define the file path in the current directory
file_path = "./train_dataset.tsv"

# Write the dataset to a TSV file
with open(file_path, "w", newline='', encoding="utf-8") as file:
    writer = csv.writer(file, delimiter="\t")
    # Write the header
    writer.writerow(train_encoded.column_names)
    # Write the data rows
    for example in train_encoded:
        writer.writerow([example[col] for col in train_encoded.column_names])

In [None]:
# Save the test dataset to a file
# Define the file path in the current directory
file_path = "./test_dataset.tsv"

# Write the dataset to a TSV file
with open(file_path, "w", newline='', encoding="utf-8") as file:
    writer = csv.writer(file, delimiter="\t")
    # Write the header
    writer.writerow(test_encoded.column_names)
    # Write the data rows
    for example in test_encoded:
        writer.writerow([example[col] for col in test_encoded.column_names])
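
One consequence of writing the encoded datasets with `csv.writer` is that list-valued columns such as `input_ids` are stored as their string representation, so they need parsing when the file is read back. A minimal round-trip sketch on invented rows (the file name and values are illustrative only):

```python
import ast
import csv
import os
import tempfile

# Invented rows shaped like the encoded dataset: a text column and a list column
rows = [
    {"review": "Worked well for me", "input_ids": [101, 2499, 102]},
    {"review": "No change", "input_ids": [101, 2053, 102]},
]
columns = ["review", "input_ids"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.tsv")

    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file, delimiter="\t")
        writer.writerow(columns)
        for row in rows:
            writer.writerow([row[col] for col in columns])  # lists are written via str()

    with open(path, newline="", encoding="utf-8") as file:
        loaded = list(csv.DictReader(file, delimiter="\t"))

# Every field comes back as a string, so the list has to be parsed explicitly
restored = ast.literal_eval(loaded[0]["input_ids"])
print(type(loaded[0]["input_ids"]).__name__, restored)
```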

## STEP 4: FINE-TUNING THE MODEL

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./distilbert-drugscom_depression_reviews",
    learning_rate=3e-5,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",  # Align with evaluation_strategy
    push_to_hub=False,
    report_to="wandb",
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    seed=42
)

In [None]:
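# Define metrics for evaluation
# Load the accuracy metric once so it is not reloaded on every evaluation call
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)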
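
`compute_metrics` turns the model's raw logits into class predictions with an argmax and then measures the fraction that match the labels. The following plain-Python sketch mirrors those two steps on invented logits; it stands in for `np.argmax` and the `evaluate` accuracy metric rather than calling them.

```python
# Invented logits for four examples, one row per example: [score_class_0, score_class_1]
logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, 0.4], [0.9, 0.8]]
labels = [0, 1, 0, 0]

# argmax over the class axis: index of the larger score in each row
predictions = [row.index(max(row)) for row in logits]

# accuracy: fraction of predictions that match the reference labels
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)
```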

In [None]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_encoded,
    eval_dataset=test_encoded,
    compute_metrics=compute_metrics
)

In [None]:
# Train the model
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.3799 | 0.797809 | 0.765751 |


TrainOutput(global_step=265, training_loss=0.18166468638294148, metrics={'train_runtime': 256.159, 'train_samples_per_second': 16.552, 'train_steps_per_second': 1.035, 'total_flos': 561661770301440.0, 'train_loss': 0.18166468638294148, 'epoch': 1.0})
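
The `global_step=265` in the output follows directly from the dataset size and batch size. A quick check, assuming a single GPU and no gradient accumulation (the `TrainingArguments` above leave `gradient_accumulation_steps` at its default of 1):

```python
import math

num_examples = 4240  # size of the balanced training set
batch_size = 16      # per_device_train_batch_size
num_epochs = 1       # num_train_epochs

# One optimizer step per batch; a final partial batch, if any, still counts
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)
```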

## STEP 5: SAVE THE FINE-TUNED DISTILBERT MODEL: distilbert-drugscom_depression_reviews

In [None]:
from huggingface_hub import notebook_login  # Log in to our Hugging Face account so we can upload models to the Hub

In [None]:
# Log in to Hugging Face within the notebook
notebook_login()
!git config --global credential.helper store


In [None]:
# Save and push the model to the hub
model.push_to_hub("distilbert-drugscom_depression_reviews")
tokenizer.push_to_hub("distilbert-drugscom_depression_reviews")


CommitInfo(commit_url='https://huggingface.co/Zakia/distilbert-drugscom_depression_reviews/commit/01897dd83328aa1b4e138231b6afa228d12a4efb', commit_message='Upload tokenizer', commit_description='', oid='01897dd83328aa1b4e138231b6afa228d12a4efb', pr_url=None, pr_revision=None, pr_num=None)