vasantharan/Multimodal_Hate_Speech_Detection_in_Dravidian_languages

+# Multimodal Classification Model (Tamil, Malayalam, Telugu)
+This repository contains deep learning models for **text and audio classification** in three languages: **Tamil, Malayalam, and Telugu**.
+---
+## 📌 Overview
+The models accept **text and audio inputs** and classify them into predefined categories. Each language has its own trained models and label encoders:
+- **Text Model:** Mean-pooled `xlm-roberta-large` embeddings are passed to a Keras classifier.
+- **Audio Model:** Time-averaged **MFCC features** (40 coefficients) are passed to a CNN-based classifier.
+---
+## 🛠 1. Setup
+### 1.1 Clone the Repository
+```bash
+git clone https://huggingface.co/<your-model-repo>
+cd <your-model-repo>
+```
+### 1.2 Install Dependencies
+Ensure Python is installed, then run:
+```bash
+pip install -r requirements.txt
+```
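Before loading any model, it can help to confirm that the packages used by the snippets in this README are importable. The following is a minimal sketch: the package list is an assumption inferred from the imports shown in this README (plus `openpyxl`, which pandas needs to read `.xlsx` files), not from `requirements.txt`, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Import names used in this README (assumed; requirements.txt is authoritative).
REQUIRED = ["tensorflow", "torch", "transformers", "numpy", "pandas",
            "openpyxl", "librosa", "indicnlp", "advertools"]

def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the packages can be found, not that their versions match the ones the models were trained with.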
+---
+## 📂 2. Directory Structure
+```
+├── audio_label_encoders/       # Label encoders for audio models
+├── audio_models/               # Trained audio classification models
+├── text_label_encoders/        # Label encoders for text models
+└── text_models/                # Trained text classification models
+```
+Each folder contains three files, corresponding to **Tamil, Malayalam, and Telugu**.
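Given this layout, the four artifact paths for a language can be built in one place. The sketch below is an assumption-laden convenience: it supposes that the Malayalam and Telugu files follow the same naming pattern as the Tamil files loaded in section 3.1, which this README does not confirm. `artifact_paths` and `missing_artifacts` are hypothetical helpers.

```python
from pathlib import Path

LANGUAGES = ("tamil", "malayalam", "telugu")

def artifact_paths(lang, root="."):
    """Build the four artifact paths for one language.

    ASSUMPTION: Malayalam and Telugu files use the same naming pattern
    as the Tamil files; check the actual file names in the repository.
    """
    if lang not in LANGUAGES:
        raise ValueError(f"Unsupported language: {lang!r}")
    root = Path(root)
    return {
        "text_label_encoder": root / "text_label_encoders" / f"{lang}_label_encoder.pkl",
        "audio_label_encoder": root / "audio_label_encoders" / f"{lang}_audio_label_encoder.pkl",
        "text_model": root / "text_models" / f"{lang}_text_model.h5",
        "audio_model": root / "audio_models" / f"{lang}_audio_model.keras",
    }

def missing_artifacts(lang, root="."):
    """Return the names of artifacts whose files do not exist on disk."""
    return [name for name, path in artifact_paths(lang, root).items()
            if not path.exists()]
```

Calling `missing_artifacts("tamil")` from the repository root before loading anything gives a quick check that the clone is complete.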
+---
+## 🚀 3. How to Use
+### 3.1 Load the Models
+```python
+import pickle
+import numpy as np
+import tensorflow as tf
+import torch
+from transformers import AutoTokenizer, AutoModel
+# Load the label encoders (only unpickle files from a source you trust)
+with open("text_label_encoders/tamil_label_encoder.pkl", "rb") as f:
+    tamil_text_label_encoder = pickle.load(f)
+with open("audio_label_encoders/tamil_audio_label_encoder.pkl", "rb") as f:
+    tamil_audio_label_encoder = pickle.load(f)
+# Load the trained Keras classifiers
+text_model = tf.keras.models.load_model("text_models/tamil_text_model.h5")
+audio_model = tf.keras.models.load_model("audio_models/tamil_audio_model.keras")
+```
+---
+## 📝 4. Text Classification
+### 4.1 Preprocess Text
+```python
+from indicnlp.tokenize import indic_tokenize
+import advertools as adv
+# A set gives constant-time membership checks while filtering
+stopwords = set(adv.stopwords["tamil"])
+def preprocess_tamil_text(text):
+    """Tokenize Tamil text and drop stopwords."""
+    tokens = indic_tokenize.trivial_tokenize(text, lang="ta")
+    tokens = [token for token in tokens if token not in stopwords]
+    return " ".join(tokens)
+```
+### 4.2 Extract Features and Predict
+```python
+def extract_embeddings(model_name, texts):
+    """Return one mean-pooled embedding per input text."""
+    # Note: the tokenizer and model are reloaded on every call
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    model = AutoModel.from_pretrained(model_name)
+    model.eval()
+    embeddings = []
+    batch_size = 16
+    with torch.no_grad():
+        for i in range(0, len(texts), batch_size):
+            batch_texts = texts[i:i + batch_size]
+            encoded_inputs = tokenizer(batch_texts, padding=True, truncation=True,
+                                       max_length=128, return_tensors="pt")
+            outputs = model(**encoded_inputs)
+            # Mean over all token positions (padding included)
+            batch_embeddings = outputs.last_hidden_state.mean(dim=1).numpy()
+            embeddings.extend(batch_embeddings)
+    return np.array(embeddings)
+feature_extractor = "xlm-roberta-large"
+text = "உங்கள் உதவி மிகவும் பயனுள்ளதாக இருந்தது"
+processed_text = preprocess_tamil_text(text)
+text_embeddings = extract_embeddings(feature_extractor, [processed_text])
+text_predictions = text_model.predict(text_embeddings)
+predicted_label = tamil_text_label_encoder.inverse_transform(np.argmax(text_predictions, axis=1))
+print("Predicted Label:", predicted_label[0])
+```
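`extract_embeddings` averages `last_hidden_state` over every token position, padding included. The sketch below uses plain Python lists (no model required) to show how that plain mean differs from a mean restricted to positions where the attention mask is 1. It is only an illustration: the classifiers in this repository were presumably trained on the plain mean, so changing the pooling at inference time would likely change the predictions. `plain_mean` and `masked_mean` are hypothetical helper names.

```python
def plain_mean(token_vectors):
    """Average all token vectors, padding positions included."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return plain_mean(kept)

# Toy sequence: two real tokens followed by two padding positions.
# Zeros stand in for the padding outputs; real padding vectors are generally nonzero.
tokens = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]

print(plain_mean(tokens))         # [1.0, 2.0]
print(masked_mean(tokens, mask))  # [2.0, 4.0]
```

With a batch of one text, as in the single-sentence example, there is no padding and the two means coincide. The difference only appears when texts of different lengths share a batch.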
+---
+## 🔊 5. Audio Classification
+### 5.1 Preprocess Audio
+```python
+import librosa
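+def extract_audio_features(file_path, sr=22050, n_mfcc=40):
+    """Return one n_mfcc-dimensional vector: the MFCCs averaged over time."""
+    audio, _ = librosa.load(file_path, sr=sr)  # resamples to `sr`
+    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
+    return np.mean(mfccs.T, axis=0)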
+```
+### 5.2 Predict Audio Class
+```python
+def predict_audio(file_path):
+    features = extract_audio_features(file_path)
+    # Shape (batch, 40, 1, 1); the 40 must match n_mfcc in extract_audio_features
+    reshaped_features = features.reshape((1, 40, 1, 1))
+    predicted_class = np.argmax(audio_model.predict(reshaped_features), axis=1)
+    predicted_label = tamil_audio_label_encoder.inverse_transform(predicted_class)
+    return predicted_label[0]
+audio_file = "test_audio.wav"
+predicted_audio_label = predict_audio(audio_file)
+print("Predicted Audio Label:", predicted_audio_label)
+```
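`predict_audio` returns only the top label. When a score is also useful, one output row can be decoded by hand. The sketch below assumes that the model's final layer is a softmax, so each row holds one probability per class (this README does not confirm that), and it uses made-up values in place of `audio_model.predict(...)` and the encoder's class list. `top_prediction` is a hypothetical helper.

```python
def top_prediction(probabilities, classes):
    """Return (label, score) for the highest-scoring class in one output row."""
    if len(probabilities) != len(classes):
        raise ValueError("Expected one probability per class")
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return classes[best], probabilities[best]

# Hypothetical values for illustration only.
example_classes = ["class_a", "class_b", "class_c"]
example_row = [0.1, 0.7, 0.2]

label, score = top_prediction(example_row, example_classes)
print(f"{label} ({score:.2f})")  # class_b (0.70)
```

If the pickled encoders are scikit-learn `LabelEncoder` objects, which the `inverse_transform` calls suggest, their `classes_` attribute would supply the real class list in index order.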
+---
+## 📊 6. Batch Processing for a Dataset
+### 6.1 Load Dataset
+```python
+import os
+import pandas as pd
+def load_dataset(base_dir="../test", lang="tamil"):
+    """Pair each transcript with its audio path, or "Nil" when the .wav file is missing."""
+    lang_dir = os.path.join(base_dir, lang)
+    audio_dir = os.path.join(lang_dir, "audio")
+    text_dir = os.path.join(lang_dir, "text")
+    # Use the first .xlsx file found in the text directory
+    xlsx_files = [f for f in os.listdir(text_dir) if f.endswith(".xlsx")]
+    if not xlsx_files:
+        raise FileNotFoundError(f"No .xlsx transcript file found in {text_dir}")
+    text_df = pd.read_excel(os.path.join(text_dir, xlsx_files[0]))
+    audio_files = set(os.listdir(audio_dir))  # list the directory once, not per row
+    dataset = []
+    for file, transcript in zip(text_df["File Name"], text_df["Transcript"]):
+        wav_name = f"{file}.wav"
+        audio_path = os.path.join(audio_dir, wav_name) if wav_name in audio_files else "Nil"
+        dataset.append({"File Name": audio_path, "Transcript": transcript})
+    return pd.DataFrame(dataset)
+dataset_df = load_dataset()
+```
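The pairing rule in `load_dataset` (use `<File Name>.wav` when it exists in the audio folder, otherwise record `"Nil"`) can be tried without an Excel file. The sketch below isolates that step. `pair_with_audio` is a hypothetical helper, and the temporary directory stands in for the real audio folder.

```python
import os
import tempfile

def pair_with_audio(file_names, audio_dir):
    """Map each file name to its .wav path in audio_dir, or "Nil" if absent."""
    available = set(os.listdir(audio_dir))  # read the directory once
    paired = {}
    for name in file_names:
        wav = name + ".wav"
        paired[name] = os.path.join(audio_dir, wav) if wav in available else "Nil"
    return paired

# Demonstration with a throwaway directory containing a single audio file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "clip_001.wav"), "wb").close()
    result = pair_with_audio(["clip_001", "clip_002"], tmp)
    print(result["clip_001"].endswith("clip_001.wav"))  # True
    print(result["clip_002"])                           # Nil
```

Rows marked `"Nil"` still carry a transcript, so they can receive a text prediction even though the audio prediction is skipped.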
+### 6.2 Predict Text and Audio in Bulk
+```python
+# Text: preprocess, embed, and classify every transcript
+dataset_df["Transcript"] = dataset_df["Transcript"].apply(preprocess_tamil_text)
+text_embeddings = extract_embeddings(feature_extractor, dataset_df["Transcript"].tolist())
+text_predictions = text_model.predict(text_embeddings)
+text_labels = tamil_text_label_encoder.inverse_transform(np.argmax(text_predictions, axis=1))
+dataset_df["Predicted Text Label"] = text_labels
+# Audio: classify each file, skipping rows without a matching .wav
+dataset_df["Predicted Audio Label"] = dataset_df["File Name"].apply(lambda x: predict_audio(x) if x != "Nil" else "No Audio")
+# Save the results as a tab-separated file
+dataset_df.to_csv("predictions.tsv", sep="\t", index=False)
+```
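Once `predictions.tsv` has been written, a quick tally of the predicted labels helps spot obvious problems, such as every row receiving the same class. The sketch below uses only the standard library and relies on the column names written by the bulk prediction step. `summarize_predictions` is a hypothetical helper.

```python
import csv
from collections import Counter

def summarize_predictions(tsv_path):
    """Count predicted labels per modality in a tab-separated predictions file."""
    text_counts, audio_counts = Counter(), Counter()
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            text_counts[row["Predicted Text Label"]] += 1
            audio_counts[row["Predicted Audio Label"]] += 1
    return text_counts, audio_counts

# Usage, after the bulk prediction step has created the file:
# text_counts, audio_counts = summarize_predictions("predictions.tsv")
# print(text_counts.most_common())
# print(audio_counts.most_common())
```

Rows without audio appear under the `"No Audio"` key of the audio tally, which gives a direct count of missing files.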
+---
+## ☁️ 7. Deployment on Hugging Face
+```bash
+pip install huggingface_hub
+huggingface-cli login
+```
+```python
+from huggingface_hub import upload_file
+upload_file(path_or_fileobj="text_models/tamil_text_model.h5", path_in_repo="text_models/tamil_text_model.h5", repo_id="<your-hf-repo>")
+```
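The call above uploads a single file. To push every model and label encoder, the same call can be repeated over the four artifact folders. The sketch below is one way to do that, assuming the directory layout from section 2. `files_to_upload` and `upload_all` are hypothetical helpers, and `upload_all` needs network access and a prior `huggingface-cli login`.

```python
from pathlib import Path

ARTIFACT_DIRS = ("audio_label_encoders", "audio_models",
                 "text_label_encoders", "text_models")

def files_to_upload(root="."):
    """List (local_path, path_in_repo) pairs for every file in the artifact folders."""
    root = Path(root)
    pairs = []
    for folder in ARTIFACT_DIRS:
        folder_path = root / folder
        if not folder_path.is_dir():
            continue  # skip folders that are absent from this checkout
        for path in sorted(folder_path.iterdir()):
            if path.is_file():
                pairs.append((str(path), path.relative_to(root).as_posix()))
    return pairs

def upload_all(repo_id, root="."):
    """Upload each artifact file to the given Hub repository."""
    # Imported lazily so that files_to_upload works without huggingface_hub installed.
    from huggingface_hub import upload_file
    for local_path, path_in_repo in files_to_upload(root):
        upload_file(path_or_fileobj=local_path,
                    path_in_repo=path_in_repo,
                    repo_id=repo_id)
```

Printing the result of `files_to_upload()` first gives a dry run of what would be sent before any upload starts.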
+---
+## 📬 Contact
+For issues or improvements, open an issue on the repository or email [**[email protected]**](mailto:[email protected]).
+---
+**License:** CC BY-NC 4.0