About Step1X-Edit
Step1X-Edit is a unified, practical framework for general image editing. Designed to follow real user instructions, it delivers high-quality results that rival those of advanced closed-source models such as GPT-4o and Gemini Flash. By leveraging a multimodal large language model (MLLM), Step1X-Edit understands both your reference image and your editing instruction, then generates the desired output with a diffusion-based image decoder.
How Does Step1X-Edit Work?
The system processes your image and editing instruction together and extracts a latent embedding that captures the requested edit. A diffusion decoder then uses this embedding to generate the edited image. Step1X-Edit is trained on a high-quality, diverse dataset produced by a custom data pipeline, and its performance is evaluated on GEdit-Bench, a benchmark built from real-world user requests.
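The flow can be pictured as two stages: the multimodal LLM encodes the reference image together with the instruction into a conditioning embedding, and the diffusion decoder denoises toward the edited image under that condition. The sketch below only illustrates this flow; the names (Step1XEditEncoder-style encoder, decoder, edit_image) are hypothetical placeholders, not the project's actual API.

```python
# Illustrative sketch of the two-stage Step1X-Edit flow.
# All names here (encoder, decoder, edit_image) are hypothetical
# placeholders, not the real Step1X-Edit API.
from PIL import Image
import torch


def edit_image(encoder, decoder, image_path: str, instruction: str) -> Image.Image:
    reference = Image.open(image_path).convert("RGB")

    # Stage 1: the multimodal LLM reads the image and the instruction
    # and produces a latent embedding describing the requested edit.
    with torch.no_grad():
        edit_embedding = encoder(image=reference, text=instruction)

    # Stage 2: the diffusion decoder generates the edited image,
    # conditioned on that embedding (e.g. 28 denoising steps).
    edited = decoder.sample(condition=edit_embedding, num_steps=28)
    return edited
```

In practice, the released inference script (see Installation & Usage below) drives both stages for you.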
Key Features
- Unified model for a wide range of image editing tasks using natural language instructions.
- Multimodal LLM backbone for understanding both images and text prompts.
- Diffusion-based image decoder for high-quality, realistic edits.
- Open-source and accessible, with code and models available for research and development.
- Evaluated on GEdit-Bench, a benchmark rooted in real user editing needs.
System Requirements
To run Step1X-Edit efficiently, a modern GPU is recommended; for the full-precision model without offloading, an 80GB card is ideal. The model also supports configurations that trade speed for lower memory usage, including FP8-quantized weights and CPU offloading (see the table and the sketch below).
| Model | Peak GPU Memory (512 / 786 / 1024 px) | Inference time (28 steps, flash-attn) |
| --- | --- | --- |
| Step1X-Edit | 42.5GB / 46.5GB / 49.8GB | 5s / 11s / 22s |
| Step1X-Edit-FP8 | 31GB / 31.5GB / 34GB | 6.8s / 13.5s / 25s |
| Step1X-Edit + offload | 25.9GB / 27.3GB / 29.1GB | 49.6s / 54.1s / 63.2s |
| Step1X-Edit-FP8 + offload | 18GB / 18GB / 18GB | 35s / 40s / 51s |
Tested on H800 GPUs. Lower-memory GPUs are supported with quantization or offloading.
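If you are unsure which configuration fits your card, comparing available VRAM against the peak figures above is enough to decide. The helper below is a minimal sketch written for this page, using the 1024 px column of the table; the mode labels are simply the table's row names, not command-line options, and the 2GB headroom is an arbitrary safety margin.

```python
# Rough helper: pick a Step1X-Edit configuration that fits the local GPU,
# based on the peak-memory figures from the table above (1024 px column).
# The mode labels are the table's row names, not flags of the official scripts.
import torch

# (mode label, approximate peak GPU memory in GB at 1024 px)
CONFIGS = [
    ("Step1X-Edit", 49.8),
    ("Step1X-Edit-FP8", 34.0),
    ("Step1X-Edit + offload", 29.1),
    ("Step1X-Edit-FP8 + offload", 18.0),
]


def suggest_config() -> str:
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    for name, peak_gb in CONFIGS:
        if total_gb >= peak_gb + 2:  # keep ~2 GB of headroom
            return name
    return "No listed configuration fits this GPU; consider a smaller resolution."


if __name__ == "__main__":
    print(suggest_config())
```

On an 80GB card such as the H800 used for the measurements, every listed configuration fits; the check matters mainly for smaller GPUs.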
Installation & Usage
- Requirements: Python 3.10+, PyTorch 2.2+ (tested with torch==2.3.1 or 2.5.1 and CUDA 12.1), and corresponding torchvision.
- Install dependencies:
pip install -r requirements.txt
- Install FlashAttention: Use the provided script to find the correct wheel for your system:
python scripts/get_flash_attn.py
Then download the suggested wheel from the FlashAttention release page and install it (a quick environment sanity check is sketched after this list).
- Download model weights: Get the weights from ModelScope.
- Run inference: Use the provided script to edit images:
bash scripts/run_examples.sh
For lower memory usage, use the FP8 weights with --quantized, or enable --offload to move some modules to the CPU.
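Before running inference, it can help to confirm that the environment matches the requirements above (Python 3.10+, PyTorch 2.2+ with CUDA, and FlashAttention). The snippet below is a standalone sanity check written for this page; it is not part of the Step1X-Edit repository.

```python
# Quick environment sanity check for the requirements listed above.
# This is a standalone helper, not part of the Step1X-Edit repository.
import sys

import torch

assert sys.version_info >= (3, 10), "Python 3.10+ is required"
assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"torch {torch.__version__}, CUDA available, {vram_gb:.1f} GB VRAM")

try:
    import flash_attn  # installed from the wheel suggested by get_flash_attn.py
    print(f"flash-attn {flash_attn.__version__} found")
except ImportError:
    print("flash-attn not installed; run scripts/get_flash_attn.py and install the wheel")
```

If flash-attn is reported missing, revisit the FlashAttention step above before running the example script.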
For more details, visit the official GitHub repository.
This page is for informational purposes only. Please refer to the official documentation for the latest updates and instructions.