Where do Large Vision-Language Models Look at when Answering Questions? • Paper • 2503.13891 • Published 9 days ago • 7 • 2
TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization • Paper • 2204.00097 • Published Mar 31, 2022 • 1
TopNet: Transformer-based Object Placement Network for Image Compositing • Paper • 2304.03372 • Published Apr 6, 2023 • 1
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts • Paper • 2405.05949 • Published May 9, 2024 • 3
Multi-Reward as Condition for Instruction-based Image Editing • Paper • 2411.04713 • Published Nov 6, 2024 • 1
stabilityai/stable-video-diffusion-img2vid-xt-1-1 • Image-to-Video • Updated Jul 10, 2024 • 95.4k • 879