About Step1X-Edit

Step1X-Edit is a unified, practical framework for general image editing. Designed to follow real user instructions, it delivers high-quality results that rival advanced closed-source models such as GPT-4o and Gemini Flash. By pairing a multimodal large language model (MLLM) with a diffusion-based image decoder, Step1X-Edit understands both your reference image and your editing prompt and then generates the desired output.

How Does Step1X-Edit Work?

The system processes your image and editing instruction together, extracting a latent embedding that guides the edit. A diffusion decoder then uses this embedding to generate the edited image. Step1X-Edit is trained on a high-quality, diverse dataset produced by a custom data pipeline, and its performance is evaluated on GEdit-Bench, a benchmark built from real-world user requests.
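Conceptually, the two stages can be sketched as below. This is a minimal toy illustration of the design described above (a multimodal encoder producing a conditioning embedding, followed by an iterative diffusion decoder); the class names, dimensions, and update rule are assumptions made for illustration and are not the actual Step1X-Edit code.

    import torch
    import torch.nn as nn

    # Toy stand-ins for the two stages; names and sizes are illustrative only.
    class ToyMultimodalEncoder(nn.Module):
        """Maps (image, instruction) features to a joint editing embedding."""
        def __init__(self, dim=64):
            super().__init__()
            self.proj = nn.Linear(2 * dim, dim)

        def forward(self, image_feats, text_feats):
            return self.proj(torch.cat([image_feats, text_feats], dim=-1))

    class ToyDiffusionDecoder(nn.Module):
        """Iteratively denoises a latent, conditioned on the editing embedding."""
        def __init__(self, dim=64):
            super().__init__()
            self.step = nn.Linear(2 * dim, dim)

        def forward(self, latent, cond, num_steps=28):
            for _ in range(num_steps):  # 28 denoising steps, as in the timings below
                latent = latent - 0.1 * self.step(torch.cat([latent, cond], dim=-1))
            return latent

    # Dummy inputs standing in for encoded image and instruction features.
    image_feats = torch.randn(1, 64)
    text_feats = torch.randn(1, 64)

    cond = ToyMultimodalEncoder()(image_feats, text_feats)            # stage 1: joint embedding
    edited_latent = ToyDiffusionDecoder()(torch.randn(1, 64), cond)   # stage 2: guided denoising
    print(edited_latent.shape)  # torch.Size([1, 64])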

Key Features

  • Unified model for a wide range of image editing tasks using natural language instructions.
  • Multimodal LLM backbone for understanding both images and text prompts.
  • Diffusion-based image decoder for high-quality, realistic edits.
  • Open-source and accessible, with code and models available for research and development.
  • Evaluated on GEdit-Bench, a benchmark rooted in real user editing needs.

System Requirements

To run Step1X-Edit efficiently, a modern GPU is recommended; the full-precision configuration performs best with 80GB of GPU memory. The model supports several configurations that trade speed for memory, including FP8-quantized weights and CPU offloading.

Model                       Peak GPU memory (image size 512 / 786 / 1024)   Speed, 28 steps with flash-attn (512 / 786 / 1024)
Step1X-Edit                 42.5GB / 46.5GB / 49.8GB                        5s / 11s / 22s
Step1X-Edit-FP8             31GB / 31.5GB / 34GB                            6.8s / 13.5s / 25s
Step1X-Edit + offload       25.9GB / 27.3GB / 29.1GB                        49.6s / 54.1s / 63.2s
Step1X-Edit-FP8 + offload   18GB / 18GB / 18GB                              35s / 40s / 51s

Tested on H800 GPUs. Lower-memory GPUs are supported with quantization or offloading.
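Offloading trades speed for memory by keeping large modules in CPU RAM and moving them onto the GPU only while they run, which is why the offload rows above are slower but lighter. The snippet below is a generic sketch of that pattern in plain PyTorch, assuming a single large submodule; it is not the project's actual offloading implementation.

    import torch
    import torch.nn as nn

    # Generic CPU-offload pattern: keep a large module in CPU memory and move it
    # to the GPU only for the duration of its forward pass. Illustrative only.
    def run_offloaded(module: nn.Module, x: torch.Tensor, device="cuda") -> torch.Tensor:
        module.to(device)              # copy weights to the GPU just before use
        try:
            with torch.no_grad():
                out = module(x.to(device))
        finally:
            module.to("cpu")           # free GPU memory for the next module
            torch.cuda.empty_cache()
        return out

    if torch.cuda.is_available():
        big_layer = nn.Linear(4096, 4096)        # stand-in for a large submodule
        y = run_offloaded(big_layer, torch.randn(1, 4096))
        print(y.shape)  # torch.Size([1, 4096])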

Installation & Usage

  1. Requirements: Python 3.10+, PyTorch 2.2+ (tested with torch==2.3.1 or 2.5.1 and CUDA 12.1), and corresponding torchvision.
  2. Install dependencies:
    pip install -r requirements.txt
  3. Install FlashAttention: Use the provided script to find the correct wheel for your system:
    python scripts/get_flash_attn.py
    Download the suggested wheel from the FlashAttention release page and install it with pip (a manual environment check is sketched after this list).
  4. Download model weights: Get the weights from ModelScope.
  5. Run inference: Use the provided script to edit images:
    bash scripts/run_examples.sh
    For lower memory usage, use FP8 weights with --quantized or enable --offload to move some modules to CPU.
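
FlashAttention wheels are built per Python version, PyTorch version, CUDA version, and C++ ABI, which is what the helper script in step 3 matches for you. If you would rather check by hand, the small snippet below prints those values so you can compare them against the wheel filenames on the FlashAttention release page; the example filename in the comment is for illustration only.

    import sys
    import torch

    # Print the values that determine which FlashAttention wheel fits your setup.
    # Compare them against the wheel filenames on the release page, e.g.
    # flash_attn-*+cu121torch2.3cxx11abiFALSE-cp310-*.whl (illustrative name).
    print("python tag :", f"cp{sys.version_info.major}{sys.version_info.minor}")
    print("torch      :", torch.__version__)
    print("cuda       :", torch.version.cuda)
    print("cxx11 abi  :", torch.compiled_with_cxx11_abi())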

For more details, visit the official GitHub repository.
This page is for informational purposes only. Please refer to the official documentation for the latest updates and instructions.