---
license: mit
library_name: diffusers
pipeline_tag: image-to-video
---

🔥🔥🔥 News!!

  • Mar 17, 2025: 👋 We release the inference code and model weights of Step-Video-Ti2V. Download
  • Mar 17, 2025: 🎉 We have made our technical report available as open source. Read

🚀 Inference Scripts

  • We decouple the text encoder, VAE decoding, and the DiT so that the DiT can fully utilize its GPUs. A dedicated GPU is therefore needed to host the API services for the text encoder's embeddings and for VAE decoding.
```bash
## Start the remote services for the caption (text-encoder) API and the VAE API.
## We assume you have more than 4 GPUs available. This command prints the URL
## for both the caption API and the VAE API; use that URL in the command below.
python api/call_remote_server.py --model_dir where_you_download_dir &

parallel=4  # or parallel=8
url='127.0.0.1'
model_dir=where_you_download_dir

# The example prompt "男孩笑起来" means "The boy smiles."
torchrun --nproc_per_node $parallel run_parallel.py \
    --model_dir $model_dir \
    --vae_url $url \
    --caption_url $url \
    --ulysses_degree $parallel \
    --prompt "男孩笑起来" \
    --first_image_path ./assets/demo.png \
    --infer_steps 50 \
    --save_path ./results \
    --cfg_scale 9.0 \
    --motion_score 5 \
    --time_shift 12.573
```

Motion Control

Motion Amplitude Control

(Example videos comparing motion amplitudes: motion_score = 2, 5, 10, and 20.)

🎯 Tips: The default motion_score = 5 is suitable for general use. If you need more stability, set motion_score = 2, though it may be less responsive to certain movements. For greater movement flexibility, use motion_score = 10 or motion_score = 20 to enable more intense actions. Customize motion_score to fit your use case.
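
If you want to compare these settings side by side, the following is an unofficial batch-runner sketch (not part of the repository) that reuses the flags from the command above and sweeps --motion_score; the per-score output directories are just a convention of this sketch.

```python
# Unofficial helper: sweep --motion_score with the same flags as the command
# above. Adjust MODEL_DIR, URL, and PARALLEL to your setup.
import subprocess

MODEL_DIR = "where_you_download_dir"
URL = "127.0.0.1"
PARALLEL = 4

for motion_score in (2, 5, 10, 20):
    subprocess.run(
        [
            "torchrun", f"--nproc_per_node={PARALLEL}", "run_parallel.py",
            "--model_dir", MODEL_DIR,
            "--vae_url", URL,
            "--caption_url", URL,
            "--ulysses_degree", str(PARALLEL),
            "--prompt", "男孩笑起来",  # "The boy smiles."
            "--first_image_path", "./assets/demo.png",
            "--infer_steps", "50",
            "--save_path", f"./results/motion_{motion_score}",
            "--cfg_scale", "9.0",
            "--motion_score", str(motion_score),
            "--time_shift", "12.573",
        ],
        check=True,
    )
```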

Camera Control

(Example videos of camera movements: 镜头环绕 "camera orbit", 镜头推进 "dolly in", 镜头拉远 "dolly out", 镜头固定 "static shot", 镜头左移 "truck left", 镜头右摇 "pan right".)

Table of Contents

  1. Introduction
  2. Model Summary
  3. Model Download
  4. Model Usage
  5. Benchmark
  6. Online Engine
  7. Citation
  8. Acknowledgement

1. Introduction

We present Step-Video-T2V, a state-of-the-art (SoTA) text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames. To enhance both training and inference efficiency, we propose a deep compression VAE for videos, achieving 16x16 spatial and 8x temporal compression ratios. Direct Preference Optimization (DPO) is applied in the final stage to further enhance the visual quality of the generated videos. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its SoTA text-to-video quality compared to both open-source and commercial engines.

2. Model Summary

In Step-Video-T2V, videos are represented by a high-compression Video-VAE, achieving 16x16 spatial and 8x temporal compression ratios. User prompts are encoded using two bilingual pre-trained text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames, with text embeddings and timesteps serving as conditioning factors. To further enhance the visual quality of the generated videos, a video-based DPO approach is applied, which effectively reduces artifacts and ensures smoother, more realistic video outputs.

2.1. Video-VAE

A deep compression Variational Autoencoder (Video-VAE) is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. This compression not only accelerates training and inference but also aligns with the diffusion process's preference for condensed representations.
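
As a concrete illustration of these ratios, the minimal Python sketch below (not from the official codebase) estimates the latent grid for the 544px × 992px × 204-frame setting used elsewhere in this card; the real Video-VAE's padding/rounding rule may differ.

```python
# Illustrative only: approximate latent-grid size under 16x16 spatial and
# 8x temporal compression. The real Video-VAE's padding/rounding may differ.
SPATIAL_RATIO = 16
TEMPORAL_RATIO = 8

def latent_shape(height: int, width: int, frames: int) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width), rounding up."""
    lt = -(-frames // TEMPORAL_RATIO)   # ceil division
    lh = -(-height // SPATIAL_RATIO)
    lw = -(-width // SPATIAL_RATIO)
    return lt, lh, lw

lt, lh, lw = latent_shape(height=544, width=992, frames=204)
print(lt, lh, lw, lt * lh * lw)  # 26 34 62 -> 54808 latent positions
```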

2.2. DiT w/ 3D Full Attention

Step-Video-T2V is built on the DiT architecture, which has 48 layers, each containing 48 attention heads, with each head’s dimension set to 128. AdaLN-Single is leveraged to incorporate the timestep condition, while QK-Norm in the self-attention mechanism is introduced to ensure training stability. Additionally, 3D RoPE is employed, playing a critical role in handling sequences of varying video lengths and resolutions.
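
For orientation, the quoted numbers imply the rough shapes below. This is a back-of-the-envelope sketch, not the repository's actual configuration object.

```python
# Back-of-the-envelope shapes implied by the quoted architecture numbers.
NUM_LAYERS = 48
NUM_HEADS = 48
HEAD_DIM = 128

hidden_size = NUM_HEADS * HEAD_DIM  # 6144-dim token embeddings
print(f"hidden size per token: {hidden_size}")

# With 3D full attention, every latent position attends to every other one,
# so attention cost grows quadratically with the latent sequence length.
seq_len = 26 * 34 * 62  # latent grid from the Video-VAE sketch above
print(f"sequence length: {seq_len}, attention matrix entries per head: {seq_len ** 2:,}")
```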

2.3. Video-DPO

In Step-Video-T2V, we incorporate human feedback through Direct Preference Optimization (DPO) to further enhance the visual quality of the generated videos. DPO leverages human preference data to fine-tune the model, ensuring that the generated content aligns more closely with human expectations. The overall DPO pipeline is shown below, highlighting its critical role in improving both the consistency and quality of the video generation process.
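
The exact loss is defined in the technical report; as a rough sketch of the general DPO idea (a human-preferred sample and a rejected sample scored against a frozen reference model), the standard pairwise objective looks like the following. The video-based variant used by Step-Video-T2V may differ.

```python
# A generic sketch of the standard pairwise DPO objective, NOT the exact
# video-based formulation used by Step-Video-T2V.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta: float = 0.1) -> torch.Tensor:
    """Pairwise preference loss: push the policy to rate the human-preferred
    sample higher than the rejected one, relative to a frozen reference."""
    win_ratio = policy_logp_win - ref_logp_win
    lose_ratio = policy_logp_lose - ref_logp_lose
    return -F.logsigmoid(beta * (win_ratio - lose_ratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-4.0, -3.5]), torch.tensor([-5.0, -4.2]),
                torch.tensor([-4.5, -3.8]), torch.tensor([-4.8, -4.0]))
print(loss.item())
```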

3. Model Download

| Models | 🤗 Huggingface | 🤖 Modelscope |
|--------|----------------|---------------|
| Step-Video-T2V | download | download |
| Step-Video-T2V-Turbo (Inference Step Distillation) | download | download |

4. Model Usage

📜 4.1 Requirements

The following table shows the requirements for running the Step-Video-T2V model (batch size = 1, w/o cfg distillation) to generate videos:

| Model | height/width/frames | Peak GPU Memory | 50 steps w/ flash-attn | 50 steps w/o flash-attn |
|-------|---------------------|-----------------|------------------------|-------------------------|
| Step-Video-T2V | 544px × 992px × 204f | 77.64 GB | 743 s | 1232 s |
| Step-Video-T2V | 544px × 992px × 136f | 72.48 GB | 408 s | 605 s |
  • An NVIDIA GPU with CUDA support is required.
    • The model has been tested on four GPUs.
    • Recommended: GPUs with 80 GB of memory for better generation quality.
  • Tested operating system: Linux
  • The self-attention in the text encoder (step_llm) only supports CUDA compute capabilities sm_80, sm_86, and sm_90 (a quick pre-flight check is sketched after this list).
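
The sketch below (not part of the repository) uses PyTorch to verify the points above before launching a run: CUDA availability, compute capability sm_80/sm_86/sm_90, and total GPU memory.

```python
# Quick pre-flight check (unofficial): CUDA availability, compute capability,
# and total memory of each visible GPU.
import torch

assert torch.cuda.is_available(), "An NVIDIA GPU with CUDA support is required."

SUPPORTED_CAPABILITIES = {(8, 0), (8, 6), (9, 0)}  # sm_80, sm_86, sm_90

for idx in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(idx)
    total_gb = torch.cuda.get_device_properties(idx).total_memory / 1024**3
    ok = (major, minor) in SUPPORTED_CAPABILITIES
    print(f"GPU {idx}: sm_{major}{minor}, {total_gb:.0f} GB, "
          f"{'supported' if ok else 'NOT supported by step_llm self-attention'}")
    if total_gb < 80:
        print("  note: 80 GB is recommended for better generation quality")
```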

🔧 4.2 Dependencies and Installation

```bash
git clone https://github.com/stepfun-ai/Step-Video-TI2V.git
conda create -n stepvideo python=3.10
conda activate stepvideo

cd Step-Video-TI2V
pip install -e .
```

🚀 4.3 Inference Scripts

  • We decouple the text encoder, VAE decoding, and the DiT so that the DiT can fully utilize its GPUs. A dedicated GPU is therefore needed to host the API services for the text encoder's embeddings and for VAE decoding.
```bash
## Start the remote services for the caption (text-encoder) API and the VAE API.
## We assume you have more than 4 GPUs available. This command prints the URL
## for both the caption API and the VAE API; use that URL in the command below.
python api/call_remote_server.py --model_dir where_you_download_dir &

parallel=4  # or parallel=8
url='127.0.0.1'
model_dir=where_you_download_dir

# The example prompt "男孩笑起来" means "The boy smiles."
torchrun --nproc_per_node $parallel run_parallel.py \
    --model_dir $model_dir \
    --vae_url $url \
    --caption_url $url \
    --ulysses_degree $parallel \
    --prompt "男孩笑起来" \
    --first_image_path ./assets/demo.png \
    --infer_steps 50 \
    --save_path ./results \
    --cfg_scale 9.0 \
    --motion_score 5 \
    --time_shift 12.573
```

🚀 4.4 Best-Practice Inference Settings

Step-Video-T2V exhibits robust performance in inference settings, consistently generating high-fidelity and dynamic videos. However, our experiments reveal that variations in inference hyperparameters can have a substantial effect on the trade-off between video fidelity and dynamics. To achieve optimal results, we recommend the following best practices for tuning inference parameters:

| Models | infer_steps | cfg_scale | time_shift | num_frames |
|--------|-------------|-----------|------------|------------|
| Step-Video-T2V | 30-50 | 9.0 | 13.0 | 204 |
| Step-Video-T2V-Turbo (Inference Step Distillation) | 10-15 | 5.0 | 17.0 | 204 |
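
For convenience, the table can be expressed as a small lookup. This is an illustrative snippet only: the dictionary and helper are not part of the repository, and whether every key (e.g. num_frames) maps to a CLI flag of run_parallel.py should be checked against the script's arguments.

```python
# Illustrative presets taken from the table above (infer_steps uses the upper
# end of each range). Verify flag names against run_parallel.py before use.
RECOMMENDED = {
    "Step-Video-T2V":       {"infer_steps": 50, "cfg_scale": 9.0, "time_shift": 13.0, "num_frames": 204},
    "Step-Video-T2V-Turbo": {"infer_steps": 15, "cfg_scale": 5.0, "time_shift": 17.0, "num_frames": 204},
}

def to_cli_flags(model_name: str) -> str:
    """Render one preset as command-line flags."""
    return " ".join(f"--{key} {value}" for key, value in RECOMMENDED[model_name].items())

print(to_cli_flags("Step-Video-T2V-Turbo"))
# --infer_steps 15 --cfg_scale 5.0 --time_shift 17.0 --num_frames 204
```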

5. Benchmark

We are releasing Step-Video-T2V Eval as a new benchmark, featuring 128 Chinese prompts sourced from real users. This benchmark is designed to evaluate the quality of generated videos across 11 distinct categories: Sports, Food, Scenery, Animals, Festivals, Combination Concepts, Surreal, People, 3D Animation, Cinematography, and Style.

6. Online Engine

The online version of Step-Video-T2V is available on 跃问视频, where you can also explore some impressive examples.

7. Citation

@misc{
}

8. Acknowledgement

  • We would like to express our sincere thanks to the xDiT team for their invaluable support and parallelization strategy.
  • Our code will be integrated into the official repository of Huggingface/Diffusers.
  • We thank the FastVideo team for their continued collaboration and look forward to launching inference acceleration solutions together in the near future.