Submitted by zhangysk 116 OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models · 19 authors 6
Submitted by VictoriaLinML 51 Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models · 11 authors 2
Submitted by wenqsun 51 DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion · 7 authors 4
Submitted by j-min 29 M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding · 5 authors 4
Submitted by passing2961 23 Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model · 5 authors 3
Submitted by jonathan-roberts1 22 Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? · 3 authors 3
Submitted by shehan97 21 VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos · 7 authors 3
Submitted by Lmxyy 18 SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models · 10 authors 3
Submitted by notmahi 18 DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation · 7 authors 2
Submitted by scofield7419 17 RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval · 2 authors 3
Submitted by ChuhanLi 17 M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models · 6 authors 2
Submitted by kmcode 15 SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation · 6 authors 4
Submitted by He-Yen 15 GazeGen: Gaze-Driven User Interaction for Visual Content Generation · 8 authors 2
Submitted by ShuhongZheng 13 Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models · 5 authors 2