A tiny vision-language model
F5-TTS & E2-TTS: Zero-Shot Voice Cloning (Unofficial Demo)
Generate depth maps from images
In-browser speech recognition with word-level timestamps