ICLR 2026 Workshop

When Test-Time Guidance Is Enough:
Fast Image and Video Editing with Diffusion Guidance

DInG-Editor is a training-free toolkit for posterior-sampling inpainting and editing, unifying image, video, and audio workflows with pluggable denoisers and samplers.

A. Ghorbel, B. Moufad, N. B. Shouraki, A. O. Durmus, T. Hirtz, E. Moulines, J. Olsson, Y. Janati

Accepted as a workshop paper at ReALM-GEN, ICLR 2026

Abstract

Text-driven image and video editing can be naturally cast as inpainting problems, where masked regions are reconstructed to remain consistent with both the observed content and the editing prompt. Recent advances in test-time guidance for diffusion and flow models provide a principled framework for this task; however, existing methods rely on costly vector-Jacobian product (VJP) computations to approximate the intractable guidance term, limiting their practical applicability. Building upon the recent work of Moufad et al. (2025), we provide theoretical insights into their VJP-free approximation and substantially extend their empirical evaluation to large-scale image and video editing benchmarks. Our results demonstrate that test-time guidance alone can achieve performance comparable to, and in some cases surpass, training-based methods.

Method Overview

DInG method diagram

DInG separates editing control from denoiser implementation: samplers operate over a shared denoiser interface (pred_velocity, encode, decode) while runners handle Hydra-based orchestration for single samples and dataset pipelines.

  • Unified runners: inpaint_img, inpaint_vid, inpaint_audio
  • Dataset workflows: inpaint and evaluate image/video benchmarks.
  • Supported denoisers: Flux, SD3 variants, LTX, Wan, Stable Audio 1.
  • Supported samplers: Ding, Flair, FlowChef, DiffPIR, DDNM, Blended Diffusion.
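The shared denoiser interface described above can be sketched as a Python protocol. The method names (pred_velocity, encode, decode) come from the text; the signatures, tensor types, and the ToyDenoiser/euler_step helpers are illustrative assumptions, not the toolkit's actual API.

```python
from typing import Any, Protocol


class Denoiser(Protocol):
    """Interface a sampler sees, regardless of backbone (Flux, SD3, LTX, ...).

    Signatures are illustrative: the real toolkit presumably passes tensors
    and conditioning objects rather than plain floats and lists.
    """

    def encode(self, pixels: Any) -> Any:
        """Lift pixel-space input (image/video/audio) to latent space."""
        ...

    def pred_velocity(self, latent: Any, t: float, prompt: str) -> Any:
        """Predict the flow/diffusion velocity for a latent at time t."""
        ...

    def decode(self, latent: Any) -> Any:
        """Map a latent back to pixel space."""
        ...


class ToyDenoiser:
    """Minimal stand-in showing a sampler needs only the three methods."""

    def encode(self, pixels):
        return [p * 0.5 for p in pixels]

    def pred_velocity(self, latent, t, prompt):
        # A real model would run a network; here, drift toward zero.
        return [-z for z in latent]

    def decode(self, latent):
        return [z * 2.0 for z in latent]


def euler_step(denoiser: Denoiser, latent, t, dt, prompt):
    """One Euler integration step a sampler might take over the interface."""
    v = denoiser.pred_velocity(latent, t, prompt)
    return [z + dt * vi for z, vi in zip(latent, v)]
```

Because samplers only touch this interface, swapping Flux for an SD3 variant (or an audio model) requires no sampler changes, which is the decoupling the paragraph describes.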

Data Flow (Video Editing)

Overview of the editing pipeline for video modalities
Overview of the editing pipeline for video modalities. The input video and mask are lifted to the latent space for inpainting. A pre-trained and frozen diffusion model is used with a posterior sampler to guide the generation toward prompt-aligned reconstructions, which is then decoded back to pixel space.
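The pipeline in the caption can be summarized as a loop: encode the video and mask into latent space, run the frozen model's sampler, and keep the observed (unmasked) latents consistent with the input while the masked region is regenerated. The sketch below uses a simple replacement-style consistency step as a stand-in for the guidance term; it is a common inpainting baseline (as in blended diffusion), not necessarily DInG's exact posterior-sampling rule, and all names here are illustrative.

```python
def inpaint_latents(velocity_fn, z_obs, mask, z_init, n_steps=4):
    """Flow-style latent inpainting sketch on flat lists of floats.

    velocity_fn(z, t) -> predicted velocity at time t (frozen model)
    z_obs : encoded (latent) observation
    mask  : 1.0 where content must be regenerated, 0.0 where observed
    z_init: initial noise latent
    """
    z = list(z_init)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt  # integrate from noise (t=1) toward data (t=0)
        v = velocity_fn(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
        # Consistency step: pin observed entries to the observation.
        # Posterior samplers such as DInG instead apply a guidance term;
        # hard replacement is shown only as the simplest baseline.
        z = [m * zi + (1 - m) * oi for m, zi, oi in zip(mask, z, z_obs)]
    return z  # decode(z) would then map the result back to pixel space
```

The key point the figure makes is that everything happens in latent space with a frozen model: only the sampling loop changes, so no training or fine-tuning is involved.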

Resources

Quickstart

python -m venv .venv
source .venv/bin/activate
pip install -e .

# Single-sample image inpainting (Hydra-style key=value overrides)
python -m ding.runner.inpaint_img \
  image_path=/path/to/input_image.png \
  mask_path=/path/to/mask.png

# Dataset workflow: inpaint a video benchmark, then score the results
python -m ding.runner.inpaint_vid_dataset
python -m ding.runner.evaluate_vid_dataset

Citation

DInG is part of a series of publications exploring training-free approaches to guiding pre-trained diffusion models. If you use it, please cite:

@article{moufad2026ding,
  title={Efficient Zero-Shot Inpainting with Decoupled Diffusion Guidance},
  author={Moufad, Badr and Shouraki, Navid Bagheri and 
          Durmus, Alain Oliviero and Hirtz, Thomas and 
          Moulines, Eric and Olsson, Jimmy and Janati, Yazid},
  journal={ICLR 2026},
  year={2026}
}

@article{ghorbal2026ding-editor,
  title={When Test-Time Guidance Is Enough:
         Fast Image and Video Editing with Diffusion Guidance},
  author={Ghorbel, Ahmed and Moufad, Badr and Shouraki, Navid Bagheri 
          and Durmus, Alain Oliviero and Hirtz, Thomas and 
          Moulines, Eric and Olsson, Jimmy and Janati, Yazid},
  journal={ICLR 2026, ReALM-GEN Workshop},
  year={2026}
}