English-Swahili Subtitle Translator (v1)

A specialized machine translation model optimized for subtitle content conversion between English and Swahili, preserving timing formats and handling common subtitle annotations.

Model Details

  • Architecture: MarianMT (Transformer-based)
  • Training Data: 500k subtitle pairs + general domain text
  • Max Sequence Length: 512 tokens
  • Special Features:
    • Preserves timestamps (00:01:23,456 --> 00:01:25,678)
    • Handles subtitle annotations ([MUSIC], (OFF-SCREEN))
    • Context-aware translation for dialogue continuity
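
The timing-line format listed above can be recognized with a small regular expression, which helps when checking that timing lines pass through a pipeline untouched. This is a minimal sketch: the names `TIMING_LINE` and `parse_timing` are illustrative assumptions, not part of the model or of `pysrt`.

```python
import re

# Matches an SRT timing line such as "00:01:23,456 --> 00:01:25,678".
# TIMING_LINE and parse_timing are hypothetical names used for illustration.
TIMING_LINE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)

def parse_timing(line):
    """Return (start_ms, end_ms) for a timing line, or None for any other line."""
    match = TIMING_LINE.match(line.strip())
    if match is None:
        return None
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
    start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    return start, end
```

Lines for which `parse_timing` returns `None` are dialogue or annotations; only those need to reach the translator.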

First install requirements:

pip install transformers sentencepiece pysrt

Basic Translation

from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model = MarianMTModel.from_pretrained("ngosha/English-Swahili-Subs-Translator-v1")
tokenizer = MarianTokenizer.from_pretrained("ngosha/English-Swahili-Subs-Translator-v1")

def translate_subtitle(text):
    # Tokenize one subtitle string; input beyond the maximum length is truncated
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs)
    # Decode the single generated sequence back to text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage

print(translate_subtitle("We need to hurry, the show starts in 5 minutes!"))
# Output: "Tunahitaji kufanya haraka, kipindi kinaanza baada ya dakika 5!"
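
`translate_subtitle` handles one string per call. For a whole file it is usually faster to translate several subtitles per forward pass. The sketch below is one way to do that, assuming the `model` and `tokenizer` loaded above are passed in as arguments. The name `translate_batch` and the default `batch_size` are illustrative assumptions, not something the model card specifies.

```python
def translate_batch(texts, model, tokenizer, batch_size=16):
    """Translate a list of subtitle strings, batch_size strings per forward pass.

    translate_batch and batch_size=16 are illustrative; tune the batch size
    to the available memory.
    """
    translations = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Pad to the longest string in the batch; truncate over-long input
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        # batch_decode returns one decoded string per generated sequence
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

The results come back in the same order as `texts`, so they can be zipped back onto the original subtitle entries.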

Full Subtitle File Processing

import pysrt

def translate_srt_file(input_path, output_path):
    subs = pysrt.open(input_path)

    for sub in subs:
        # Join the lines of one subtitle into a single string for translation
        clean_text = ' '.join(line.strip() for line in sub.text.split('\n') if line.strip())

        # Replace only the text. pysrt stores the timing in sub.start and
        # sub.end and writes it back unchanged, so it must not be added to the text.
        sub.text = translate_subtitle(clean_text)

    subs.save(output_path, encoding='utf-8')

# Usage
translate_srt_file("episode.srt", "translated_episode.srt")

Best Practices

1. Preserve Formatting:

Keep annotations like [MUSIC] unchanged

  text = "[INTENSE MUSIC] The final battle begins"
  translation = "[INTENSE MUSIC] Vita vya mwisho vyaanza"
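
One way to guarantee that annotations stay unchanged is to split them out before translation and send only the remaining text to the model. The sketch below assumes annotations are always written in square brackets or parentheses, as in the examples above. `ANNOTATION` and `translate_preserving_annotations` are hypothetical names, and `translate` stands for any function that translates one string, such as `translate_subtitle`.

```python
import re

# Assumption: an annotation is anything in [...] or (...).
# Parenthesized dialogue would also be left untranslated by this pattern.
ANNOTATION = re.compile(r"(\[[^\]]*\]|\([^)]*\))")

def translate_preserving_annotations(text, translate):
    """Translate the text around annotations and copy annotations verbatim."""
    pieces = []
    # Splitting on a capturing group keeps the annotations in the result list
    for part in ANNOTATION.split(text):
        if ANNOTATION.fullmatch(part):
            pieces.append(part)  # annotation: keep as-is
        elif part.strip():
            pieces.append(translate(part.strip()))  # dialogue: translate
    return ' '.join(pieces)
```

Called as `translate_preserving_annotations(text, translate_subtitle)`, the model never sees the bracketed text, so it cannot alter it.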

2. Handle Line Breaks:

Split long lines into subtitle-friendly chunks

import textwrap

def split_subtitle(text, max_length=42):
    # Wrap at word boundaries so that no word is cut in half
    return '\n'.join(textwrap.wrap(text, width=max_length))

3. Context Window:

Use previous 2 lines as context

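from collections import deque

context = deque(maxlen=2)  # the two previous source lines

def contextual_translate(text):
    # Prepend the previous lines so the model sees them as context.
    # Note: the returned string covers the context lines as well as `text`.
    translated = translate_subtitle(' '.join([*context, text]))
    context.append(text)
    return translated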

Limitations

  • Max 3 lines per subtitle segment
  • Best performance on conversational text
  • May require post-editing for:
    • Cultural references
    • Idiomatic expressions
    • Proper noun pronunciations
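
Segments that exceed the three-line limit can be found before translation with a simple check. This is a minimal sketch; `cues_over_line_limit` is a hypothetical helper that takes plain subtitle texts, for example `[sub.text for sub in pysrt.open(path)]`.

```python
def cues_over_line_limit(cue_texts, max_lines=3):
    """Return the indexes of subtitle texts that span more than max_lines lines."""
    return [
        index
        for index, text in enumerate(cue_texts)
        if len(text.splitlines()) > max_lines
    ]
```

The returned indexes point at segments that may need to be split or shortened by hand before they are translated.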

Ethical Considerations

  • May reflect biases in training data
  • Cultural nuances in Swahili dialects:
    • Prefer Tanzanian variants
    • Avoid regional slang
  • Always human-validate translations for sensitive content

Finetuned by Emmanuel Minga (0755652681)

Model size: 74.4M params (Safetensors, tensor type F32)