---
license: openrail++
tags:
- stable-diffusion
- text-to-image
pinned: true
---

# Model Card for flex-diffusion-2-1

<!-- Provide a quick summary of what the model is/does. [Optional] -->
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned with different aspect ratios.

## TLDR:

### There are 2 models in this repo:
- One based on stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for 6k steps.
- One based on stable-diffusion-2-base (stabilityai/stable-diffusion-2-base) finetuned for 6k steps, on the same dataset.

For usage, see - [How to Get Started with the Model](#how-to-get-started-with-the-model)

### It aims to solve the following issues:
1. Generated images look like they are cropped from a larger image.

2. Generating non-square images produces distorted results, because the base model was trained only on square images.

Examples:

| resolution      | model   |   stable diffusion           |   flex diffusion              |
|:---------------:|:-------:|:----------------------------:|:-----------------------------:|
| 576x1024 (9:16) | v2-1    | ![img](imgs/21-576-1024.png) | ![img](imgs/21f-576-1024.png) |
| 576x1024 (9:16) | v2-base | ![img](imgs/2b-576-1024.png) | ![img](imgs/2bf-576-1024.png) |
| 1024x576 (16:9) | v2-1    | ![img](imgs/21-1024-576.png) | ![img](imgs/21f-1024-576.png) |
| 1024x576 (16:9) | v2-base | ![img](imgs/2b-1024-576.png) | ![img](imgs/2bf-1024-576.png) |

### Limitations:
1. It was trained on a small dataset, so its improvements may be limited.
2. For each aspect ratio, it was trained at only one fixed resolution, so it may not generalize to other resolutions.
For the 1:1 aspect ratio it was finetuned at 512x512, even though stable-diffusion-2-1 (the base of flex-diffusion-2-1) was last trained at 768x768.

### Potential improvements:
1. Train on a larger dataset.
2. Train on different resolutions even for the same aspect ratio.
3. Train on specific aspect ratios, instead of a range of aspect ratios.


# Table of Contents

- [Model Card for flex-diffusion-2-1](#model-card-for-flex-diffusion-2-1)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
  - [Results](#results)
- [Model Card Authors](#model-card-authors)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)


# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for dynamic aspect ratios.

finetuned resolutions:
|    |   width |   height | aspect ratio   |
|---:|--------:|---------:|:---------------|
|  0 |     512 |     1024 | 1:2            |
|  1 |     576 |     1024 | 9:16           |
|  2 |     576 |      960 | 3:5            |
|  3 |     640 |     1024 | 5:8            |
|  4 |     512 |      768 | 2:3            |
|  5 |     640 |      896 | 5:7            |
|  6 |     576 |      768 | 3:4            |
|  7 |     512 |      640 | 4:5            |
|  8 |     640 |      768 | 5:6            |
|  9 |     640 |      704 | 10:11          |
| 10 |     512 |      512 | 1:1            |
| 11 |     704 |      640 | 11:10          |
| 12 |     768 |      640 | 6:5            |
| 13 |     640 |      512 | 5:4            |
| 14 |     768 |      576 | 4:3            |
| 15 |     896 |      640 | 7:5            |
| 16 |     768 |      512 | 3:2            |
| 17 |    1024 |      640 | 8:5            |
| 18 |     960 |      576 | 5:3            |
| 19 |    1024 |      576 | 16:9           |
| 20 |    1024 |      512 | 2:1            |
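
These are the only resolutions the model saw during finetuning, so generating at one of them works best. Below is a minimal, hypothetical helper (not part of the released code) for snapping a desired size to the closest finetuned resolution, using the same nearest-aspect-ratio rule as the training preprocessing:

```python
# The finetuned (width, height) buckets from the table above.
FINETUNED_RESOLUTIONS = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
]


def closest_resolution(width, height):
    """Return the finetuned resolution whose aspect ratio is closest to width/height."""
    target = width / height
    return min(FINETUNED_RESOLUTIONS, key=lambda wh: abs(wh[0] / wh[1] - target))


print(closest_resolution(1920, 1080))  # -> (1024, 576), i.e. 16:9
```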

- **Developed by:** Jonathan Chang
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s)**: English
- **License:** CreativeML Open RAIL++-M (openrail++)
- **Parent Model:** https://huggingface.co/stabilityai/stable-diffusion-2-1
- **Resources for more information:** More information needed

# Uses

- see https://huggingface.co/stabilityai/stable-diffusion-2-1


# Training Details

## Training Data

- LAION Aesthetics dataset, the subset with aesthetic score 6 or higher
  - https://laion.ai/blog/laion-aesthetics/
  - https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
- I only used a small portion of it; see [Preprocessing](#preprocessing)


- most common aspect ratios in the dataset (before preprocessing)

|    | aspect_ratio   |   counts |
|---:|:---------------|---------:|
|  0 | 1:1            |   154727 |
|  1 | 3:2            |   119615 |
|  2 | 2:3            |    61197 |
|  3 | 4:3            |    52276 |
|  4 | 16:9           |    38862 |
|  5 | 400:267        |    21893 |
|  6 | 3:4            |    16893 |
|  7 | 8:5            |    16258 |
|  8 | 4:5            |    15684 |
|  9 | 6:5            |    12228 |
| 10 | 1000:667       |    12097 |
| 11 | 2:1            |    11006 |
| 12 | 800:533        |    10259 |
| 13 | 5:4            |     9753 |
| 14 | 500:333        |     9700 |
| 15 | 250:167        |     9114 |
| 16 | 5:3            |     8460 |
| 17 | 200:133        |     7832 |
| 18 | 1024:683       |     7176 |
| 19 | 11:10          |     6470 |

- predefined aspect ratios: the same 21 resolution buckets listed under [Model Description](#model-description)


## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

### Preprocessing


1. Download the files with URL & caption from https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
    - I only used the first file, `train-00000-of-00007-29aec9150af50f9f.parquet`
2. Use img2dataset to convert to the webdataset format
    - https://github.com/rom1504/img2dataset
    - I put `train-00000-of-00007-29aec9150af50f9f.parquet` in a folder called `first-file`
    - the output folder is `/mnt/aesthetics6plus`; change this to your own folder

```bash
export INPUT_FOLDER=first-file
export OUTPUT_FOLDER=/mnt/aesthetics6plus
img2dataset --url_list $INPUT_FOLDER --input_format "parquet"\
        --url_col "URL" --caption_col "TEXT" --output_format webdataset\
        --output_folder $OUTPUT_FOLDER --processes_count 3 --thread_count 6 --image_size 1024 --resize_only_if_bigger --resize_mode=keep_ratio_largest \
        --save_additional_columns '["WIDTH","HEIGHT","punsafe","similarity"]' --enable_wandb True
```

3. The data-loading code does the preprocessing on the fly, so nothing else needs to be done. However, it is not optimized for speed (GPU utilization fluctuates between 80% and 100%), and it is not written for multi-GPU training, so use it with caution. It does the following (see the sketch after this list):
- load the data with webdataset
- calculate the aspect ratio of each image
- find the closest predefined aspect ratio and its associated resolution: `argmin(abs(aspect_ratio - predefined_aspect_ratios))`. E.g. if the aspect ratio is 1:3, the closest predefined aspect ratio is 1:2, and its associated resolution is 512x1024.
- keeping the aspect ratio, resize the image so that each side is larger than or equal to the associated resolution. E.g. resize to 512x(512*3) = 512x1536
- randomly crop the image to the associated resolution. E.g. crop to 512x1024
- if more than 10% of the image is lost in the crop, discard the example
- batch examples by aspect ratio, so all examples in a batch share the same resolution
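
Below is a minimal sketch of the resize / random-crop / filter steps above, assuming the bucket resolution has already been chosen with the `argmin` rule and that PIL is used for image handling. It is only an illustration of the description above, not the actual data-loading code; all names here are made up.

```python
import random

from PIL import Image


def resize_and_crop(img: Image.Image, bucket_w: int, bucket_h: int, max_loss: float = 0.10):
    """Resize `img` (keeping its aspect ratio) so it covers the bucket, then random-crop.

    Returns None when more than `max_loss` of the resized image would be thrown away.
    """
    w, h = img.size

    # Scale so both sides end up >= the bucket resolution, preserving the aspect ratio.
    scale = max(bucket_w / w, bucket_h / h)
    new_w, new_h = round(w * scale), round(h * scale)
    img = img.resize((new_w, new_h), Image.LANCZOS)

    # Random crop to exactly the bucket resolution.
    left = random.randint(0, new_w - bucket_w)
    top = random.randint(0, new_h - bucket_h)
    cropped = img.crop((left, top, left + bucket_w, top + bucket_h))

    # Discard the example if the crop loses more than `max_loss` of the resized image.
    lost_fraction = 1 - (bucket_w * bucket_h) / (new_w * new_h)
    return None if lost_fraction > max_loss else cropped
```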


### Speeds, Sizes, Times

- Dataset size: 100k image-caption pairs, before filtering.
  - I didn't wait for the whole dataset to download; I copied the first 10 tar files and their index files to a new folder called `aesthetics6plus-small`, with 100k image-caption pairs in total. The full dataset is much bigger.

- Hardware: 1x RTX 3090 GPU

- Optimizer: 8bit Adam

- Batch size: 32 (effective)
  - per-device batch size: 2
  - gradient_accumulation_steps: 16

- Learning rate: 2e-6, warmed up over 500 steps and then kept constant (see the sketch after this list)
- Training steps: 6k
- Epochs (approximate): 32 * 6k / 100k = 1.92 (not accounting for the filtering)
  - Each example is seen about 1.92 times on average.

- Training time: approximately 1 day
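
For reference, here is a minimal sketch of how the optimizer and learning-rate schedule above could be set up with diffusers and bitsandbytes. It is not the actual training script, and the exact optimizer class (`AdamW8bit` vs. `Adam8bit`) is an assumption; it only illustrates the hyperparameters listed above.

```python
import bitsandbytes as bnb
from diffusers import UNet2DConditionModel
from diffusers.optimization import get_scheduler

# The UNet being finetuned; loading from the parent model here is just an example.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)

# 8-bit Adam at a learning rate of 2e-6.
optimizer = bnb.optim.AdamW8bit(unet.parameters(), lr=2e-6)

# Warm up to 2e-6 over 500 steps, then hold the learning rate constant (6k steps total).
lr_scheduler = get_scheduler(
    "constant_with_warmup",
    optimizer=optimizer,
    num_warmup_steps=500,
)

# Effective batch size 32 = per-device batch size 2 * 16 gradient-accumulation steps.
per_device_batch_size = 2
gradient_accumulation_steps = 16
```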

## Results

More information needed

# Model Card Authors

Jonathan Chang


# How to Get Started with the Model

Use the code below to get started with the model.


```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler, UNet2DConditionModel

def use_DPM_solver(pipe):
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-1/unet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
)
# for v2-base, use the following lines instead
# pipe = StableDiffusionPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-2-base",
#     unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-base/unet", torch_dtype=torch.float16),
#     torch_dtype=torch.float16,
# )
pipe = use_DPM_solver(pipe).to("cuda")

prompt = "a professional photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")
```
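
Since the point of this finetune is non-square output, pass `width` and `height` explicitly, ideally one of the finetuned resolutions from the table in [Model Description](#model-description). Continuing from the snippet above:

```python
# Generate at a finetuned 16:9 resolution instead of the default square size.
image = pipe(prompt, width=1024, height=576).images[0]
image.save("astronaut_rides_horse_16x9.png")
```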