Edit Transfer: Learning Image Editing via Vision In-Context Relations
Abstract
We introduce a new setting, Edit Transfer, in which a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based image editing (TIE) methods excel at semantic manipulations driven by textual prompts, they often struggle with precise geometric details (e.g., pose and viewpoint changes). Reference-based image editing (RIE), on the other hand, typically focuses on style or appearance and fails at non-rigid transformations. By explicitly learning the editing transformation from a source-target pair, Edit Transfer mitigates the limitations of both text-only and appearance-centric references. Drawing inspiration from in-context learning in large language models, we propose a visual relation in-context learning paradigm built upon a DiT-based text-to-image model. We arrange the edited example and the query image into a unified four-panel composite, then apply lightweight LoRA fine-tuning to capture complex spatial transformations from minimal examples. Despite using only 42 training samples, Edit Transfer substantially outperforms state-of-the-art TIE and RIE methods on diverse non-rigid scenarios, demonstrating the effectiveness of few-shot visual relation learning.
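To make the four-panel arrangement concrete, the sketch below shows one plausible way to assemble the in-context composite (example source, example target, query, and a blank slot for the model to fill in) and to declare a lightweight LoRA configuration for the DiT backbone. This is a minimal illustration, not the authors' released code: the panel resolution, panel ordering, and LoRA hyperparameters are assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation) of the
# four-panel in-context composite and a lightweight LoRA config.
from PIL import Image
from peft import LoraConfig

PANEL = 512  # assumed per-panel resolution


def make_composite(example_src, example_tgt, query, result=None):
    """Arrange up to four images into a 2x2 composite.

    Top row: the source-target editing example.
    Bottom row: the query image and (optionally) its edited result;
    the bottom-right cell is left blank at inference time.
    """
    grid = Image.new("RGB", (2 * PANEL, 2 * PANEL), color="white")
    panels = [example_src, example_tgt, query, result]
    positions = [(0, 0), (PANEL, 0), (0, PANEL), (PANEL, PANEL)]
    for img, pos in zip(panels, positions):
        if img is not None:
            grid.paste(img.resize((PANEL, PANEL)), pos)
    return grid


# Hypothetical LoRA settings for fine-tuning the DiT attention layers;
# rank, alpha, and target module names are illustrative only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
    lora_dropout=0.0,
)
```

At inference, the same composite would be built with the bottom-right panel left blank, and the fine-tuned model asked to complete it so that the query undergoes the transformation demonstrated in the top row.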
Community
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data (2025)
- InteractEdit: Zero-Shot Editing of Human-Object Interactions in Images (2025)
- FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models (2025)
- MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing (2025)
- GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing (2025)
- Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think (2025)
- RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models (2025)