Reangle-A-Video: 4D Video Generation as Video-to-Video Translation • Paper • 2503.09151 • Published 2 days ago • 26
YuE: Scaling Open Foundation Models for Long-Form Music Generation • Paper • 2503.08638 • Published 3 days ago • 54
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning • Paper • 2503.04812 • Published 10 days ago • 12
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer • Paper • 2503.07027 • Published 4 days ago • 23
Token-Efficient Long Video Understanding for Multimodal LLMs • Paper • 2503.04130 • Published 8 days ago • 79
UniTok: A Unified Tokenizer for Visual Generation and Understanding • Paper • 2502.20321 • Published 15 days ago • 29
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks • Paper • 2502.17157 • Published 18 days ago • 51
MLGym: A New Framework and Benchmark for Advancing AI Research Agents • Paper • 2502.14499 • Published 22 days ago • 179
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features • Paper • 2502.14786 • Published 22 days ago • 129
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling • Paper • 2501.16975 • Published Jan 28 • 26
Diffusion Adversarial Post-Training for One-Step Video Generation • Paper • 2501.08316 • Published Jan 14 • 33
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching • Paper • 2412.17153 • Published Dec 22, 2024 • 34
Large Motion Video Autoencoding with Cross-modal Video VAE • Paper • 2412.17805 • Published Dec 23, 2024 • 24
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion • Paper • 2412.09626 • Published Dec 12, 2024 • 20
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training • Paper • 2412.09619 • Published Dec 12, 2024 • 26