From Selfie to AI: How to Teach Stable Diffusion Your Face and More

Demo time!

Training

You will need:

- some Python + a good NVIDIA card (at least 16 GB VRAM)

- kohya-ss/sd-scripts with or without some extensions (e.g. cocktailpeanut/fluxgym)

- 1-2 hours, depending on the number of images

- the demo should take around 10-15 minutes

source env/bin/activate
./outputs/train.sh 

----
nvidia-smi

Before we start

Fake but good enough

Contract

  • Complex math will be presented
  • LaTeX will be presented
  • Inappropriate content will be presented
  • Adult content will be presented
  • Stupid "selfies" will be presented

Agenda

  • Complex math will be presented
  • LaTeX will be presented
  • Inappropriate content will be presented
  • Adult content will be presented
  • Stupid "selfies" will be presented

Why?

Curiosity!

Welcome

Name and Surname:

Piotr Stapp

Experience in IT:

18 years

Position:
System Principal Architect

Specialization:

Cleaner, Cloud, Code, Infra

Distinguishing marks:

Don't Stapp me now!

Theory

Core Concepts

Diffusion models learn to reverse a gradual noising process applied to training images.

The two key processes are:

  • Forward Process (Diffusion):
    Noise is incrementally added to an image over several steps until it's indistinguishable from pure noise.

  • Reverse Process (Denoising):
    The model learns how to undo this noising process by predicting and removing noise at each step to recover the original image.

Core Concepts - math

Formally, they model:

  • A Markov chain that adds Gaussian noise:

    q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t \mathbf{I})

  • A neural network learns the reverse denoising:

    p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

Source:

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2006.11239, 2020
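To make the forward process concrete, here is a minimal sketch (not from the talk) of sampling x_t directly from x_0 via the closed form √ᾱ_t·x_0 + √(1−ᾱ_t)·ε; the schedule and tensor shapes are illustrative:

import torch

def forward_diffusion(x0, t, betas):
    # Closed-form sample from q(x_t | x_0): sqrt(alpha_bar_t)*x0 + sqrt(1 - alpha_bar_t)*noise
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)[t]
    noise = torch.randn_like(x0)
    return alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise, noise

betas = torch.linspace(1e-4, 0.02, 1000)  # linear schedule from the DDPM paper
x0 = torch.randn(1, 3, 64, 64)            # stand-in "image" tensor
xt, eps = forward_diffusion(x0, 500, betas)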

Core Concepts - image

Text to image

  1. Text Encoding (e.g., with CLIP, T5 or BERT)
  2. Image Denoising with guidance

Image Denoising

During image generation:

  • Start with pure Gaussian noise.
  • The model (typically a U-Net) uses the text embedding to conditionally predict what the less-noisy version of the image should look like.
  • This process repeats for 25–1000 timesteps.

 

There are different types of conditioning:

  • Classifier-free guidance (used in Stable Diffusion) allows the model to balance between purely random outputs and tightly matching the text prompt (see the sketch below).
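A minimal sketch of what classifier-free guidance does at each denoising step, assuming a hypothetical denoiser interface model(x, t, cond) that predicts noise:

import torch

def cfg_noise(model, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    # Run the denoiser twice: once unconditioned, once conditioned on the prompt
    eps_uncond = model(x_t, t, null_emb)
    eps_cond = model(x_t, t, text_emb)
    # Push the prediction away from "anything" and toward the prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

With guidance_scale = 1 you get the plain conditional model; larger values trade diversity for prompt adherence.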

U-Net

A U-Net is a type of convolutional neural network (CNN) originally developed for biomedical image segmentation, but it’s now widely used in many image-to-image tasks, including diffusion models like Stable Diffusion.

U-Net

Dae-Young Kang, Hieu Pham Duong, and Jung-Chul Park.
Application of Deep Learning in Dentistry and Implantology. Implantology, vol. 24, no. 3, 2020, pp. 148–181.

U-Net

M. J. Cardoso, A. Li, A. D. McClure, et al. Deep learning tools for the cancer clinic: an open-source framework with head and neck contour validation. Radiation Oncology, 2022. Available at: https://www.researchgate.net/publication/358442721

LoRA

Avi Chawla. Full Model Fine-Tuning vs LoRA vs QLoRA. Daily Dose of Data Science, 2025. Available at: https://blog.dailydoseofds.com/p/full-model-fine-tuning-vs-lora-vs
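The reference above compares full fine-tuning, LoRA and QLoRA; the core LoRA idea (freeze the base weights, learn a low-rank update scaled by alpha/rank) can be sketched in a few lines. This is a toy illustration, not the talk's training code:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Toy LoRA layer: frozen base weight plus a trainable low-rank update B @ A
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base model stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale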

Practice 

Theory vs practice

Theory is when we know everything, but nothing works!
Practice is when everything works, and no one knows why.
In this room, we combine theory with practice.
Nothing works, and no one knows why.

 

- prof. Jan Miodek

Demo time!
Let's see how it goes!

Practice is hard

Requirements

  • Model - the best (according to the Internet) model for generating images is FLUX.1-dev, created by Black Forest Labs:
    • This model CANNOT be used in commercial projects, but there are versions which allow that
  • Training - the Internet says that LoRA (whatever 😉) is good enough
  • NVIDIA card with 16-24 GB VRAM
  • Disk:
    • 24 GB for FLUX.1-dev
    • 10-30 GB for "python & stuff"

FLUX.1 family

Elo score from 01-10-2024

FLUX.1-dev

Requirements

Requirements

PyTorch - a crucial framework

How to start? (1/3)

kohya-ss/sd-scripts:

  • frequent updates (latest from 2025-03-01)
  • Supports:
    • DreamBooth training
    • Fine-tuning (native training)
    • LoRA training
    • Textual Inversion training
    • Image generation
    • Model conversion (supports 1.x and 2.x, Stable Diffusion ckpt/safetensors and Diffusers)
  • Works with low VRAM

How to start? (2/3)

ostris/ai-toolkit:

  • similar to the previous one
  • TBH: I failed with this one, but I didn't spend much time on it - VRAM problems (16 GB vs 24 GB)
  • only FLUX models

How to start? (3/3)

cocktailpeanut/fluxgym:

  • Simple UI over Kohya Scripts
  • Generates shell script (see demo)
  • Supports only Flux models

My machine

Results on my machine

  • After 14 hours – 80 iterations out of 1600
  • First sample images (one every 20 iterations) – poor rather than good
  • I stopped and decided to use cloud instead

Cloud FTW

  • A lot of options including GPU per minute
  • In my case - Standard NC24ads A100 v4
    • Azure - Sweden Central (Zone 3)
    • 24 vcpus, 220 GiB memory
    • Nvidia PCIe A100 GPU (80GB)
    • ~ $5 per hour - on demand
    • ~ $1 per hour - spot
    • VS Code remote 😎

NVIDIA <3 Linux?

NVIDIA <3 Linux?

Demo

fluxgym

Safetensors files

A .safetensors file is a secure and efficient format for storing tensors - multi-dimensional arrays used in machine learning models.

 

It was developed as a safer alternative to the traditional PyTorch .pt or .pth files.

Safetensors files

Key features of .safetensors (illustrated below):

  • Safe: It avoids arbitrary code execution during loading, which can be a risk with .pt files. This makes it more secure, especially when downloading models from the internet.
  • Fast: It supports memory-mapped loading, which allows models to be loaded faster and with lower memory usage.
  • Portable: It’s a binary format that can be used across different platforms and frameworks.
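A quick round trip, as a minimal sketch using the safetensors library's PyTorch helpers (the file name is illustrative):

import torch
from safetensors.torch import load_file, save_file

tensors = {"weight": torch.randn(2, 3)}
save_file(tensors, "example.safetensors")  # plain tensor dump, no pickled Python code
loaded = load_file("example.safetensors")  # fast loading, no code execution on load
print(loaded["weight"].shape)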

Data is the key!

First training

  • I prepared 10 images of myself
  • Different poses
  • Different lights
  • [...]
  • Everything done just as YouTube, websites, etc. suggested

Testing prompts

  • stapp holding a sign that says 'I LOVE PROMPTS!'
  • photo of a stapp, white background, profile photo, sun in the background
  • photo of a stapp in superman costume, flying in the sky

Some results

Only one question
???WTF???

My first dataset 

Generating

Code FTW

from diffusers import FluxPipeline
import torch, os, datetime
from dotenv import load_dotenv

def load_model(model_name, safetensor_path=None):
    load_dotenv()  # reads HF_TOKEN from a local .env file
    pipeline = FluxPipeline.from_pretrained(
        model_name, token=os.getenv("HF_TOKEN"), torch_dtype=torch.float16)
    if safetensor_path:  # attach the trained LoRA weights
        pipeline.load_lora_weights(".", weight_name=safetensor_path)
    pipeline = pipeline.to("cuda" if torch.cuda.is_available() else "cpu")
    return pipeline

def main():
    input_text = "A stapp as superman."
    pipeline = load_model("black-forest-labs/FLUX.1-dev", "./lora/stapp-v1.safetensors")
    os.makedirs("output", exist_ok=True)  # make sure the target folder exists
    for i in range(3):
        time = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
        output_image_path = f"output/generated_image_{time}_{i}.png"
        pipeline(input_text).images[0].save(output_image_path)
        print(f"Image generated and saved to {output_image_path}")

if __name__ == "__main__":
    main()

Examples

Prompting is still the key!

Stapp as Captain America

Prompting is still the key!

Stapp as Captain America

Prompting is still the key!

Stapp wearing glasses as Captain America, in a dynamic full-body pose, holding a large shield. Dark-blue, high-tech costume with star-patterned chest, hints of bronze. Helmeted, serious expression. Dramatic sunset sky with dark clouds. Shield is wooden-toned with cream bands and central star. Battlefield setting. Realistic lighting, photorealistic style. Heroic pose, strong posture, confident demeanor. Highly detailed, dramatic lighting.

 

Prompting is still the key!

Stapp wearing glasses as Captain America, in a dynamic full-body pose, holding a large shield. Dark-blue, high-tech costume with star-patterned chest, hints of bronze. Helmeted, serious expression. Dramatic sunset sky with dark clouds. Shield is wooden-toned with cream bands and central star. Battlefield setting. Realistic lighting, photorealistic style. Heroic pose, strong posture, confident demeanor. Highly detailed, dramatic lighting.

 

How to create a prompt?

  • Using LLMs :)
  • Or online: https://imageprompt.org/describe-image

Demo time

input_text_short = "Stapp with Barack Obama"

input_text_long = "Stapp with Barack Obama in the White House, both smiling and shaking hands. The entire scene should be visible, capturing the moment of camaraderie and friendship."

----
python src/main-simple.py

External stuff

Different options

  • A LoRA can contain a style instead of a face :)
  • There are a lot of ready-made LoRAs available to download and use

Built-in Styles

IKEA style

Demo mixing
stapp + ikea

# stapp LoRA
pipeline.load_lora_weights(
  "./lora", weight_name="stapp-v1.safetensors", adapter_name="stapp")

# IKEA LoRA
pipeline.load_lora_weights(
  "./lora", weight_name="ikea.safetensors", adapter_name="ikea")

# Blend both adapters; the weights control each LoRA's influence
pipeline.set_adapters(["stapp", "ikea"], adapter_weights=[0.9, 0.7])
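With both adapters active, generation works as before; for example (the prompt is just an illustration):

image = pipeline("stapp sitting on an IKEA sofa, catalog style").images[0]
image.save("stapp-ikea.png")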

Studio Ghibli style

Demo
Studio Ghibli

# Ghibli LoRA + stapp LoRA
python src/main-ghibli-mix.py

# plain LoRA Ghibli
python src/main-ghibli-plain.py

What about XXX?

A Critical Warning:

The Ease of Misuse and Societal Harm

Uncensored - FLUX + LoRA

Uncensored - plain FLUX

Demo? NO!


You can try it, but it won't be presented

# lora
python src/main-uncensored-lora.py

# plain flux
python src/main-uncensored-plain.py

Not enough time for this 😉

Fill or inpainting

  • A way to replace or fill areas in existing images
  • Uses different models (e.g. FLUX.1-Fill-[dev] instead of FLUX.1-[dev])

DEMO - Fill or inpainting

python main_fill.py

import torch
from diffusers import FluxFillPipeline
from diffusers.utils import load_image
from dotenv import load_dotenv

load_dotenv()  # loads HF_TOKEN into the environment for the gated model download

# Source image plus a mask marking the region to repaint
image = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/cup.png")
mask = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/cup_mask.png")

pipe = FluxFillPipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16).to("cuda")
image = pipe(
    prompt="a cup with a handle and NIKE logo",
    image=image,
    mask_image=mask,
    height=1632,
    width=1232,
    guidance_scale=30,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu"),  # add .manual_seed(0) for reproducible output
).images[0]
image.save("flux-fill-dev.png")
print("Successfully inpainted image")

Image variations

  • Flux models: FLUX.1-Redux - dev, schnell or pro
  • But there are other options too :) - see the sketch below
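A rough sketch of the Redux flow, assuming the FluxPriorReduxPipeline API from the diffusers documentation (the source file name is illustrative):

import torch
from diffusers import FluxPipeline, FluxPriorReduxPipeline
from diffusers.utils import load_image

# Redux turns a source image into prompt embeddings for the base pipeline
redux = FluxPriorReduxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Redux-dev", torch_dtype=torch.bfloat16).to("cuda")
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=None, text_encoder_2=None,  # Redux supplies the embeddings
    torch_dtype=torch.bfloat16).to("cuda")

source = load_image("input.png")  # hypothetical source image
embeds = redux(source)
image = pipe(**embeds, guidance_scale=2.5, num_inference_steps=50).images[0]
image.save("flux-redux-variation.png")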

Structural conditioning

  • Uses canny edge or depth detection to maintain precise control during image transformations
  • Users can make text-guided edits while keeping the core composition intact (see the sketch below)
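A minimal sketch, assuming the FluxControlPipeline API in a recent diffusers release (the control image file and prompt are hypothetical):

import torch
from diffusers import FluxControlPipeline
from diffusers.utils import load_image

pipe = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Canny-dev", torch_dtype=torch.bfloat16).to("cuda")
control = load_image("control-canny.png")  # pre-computed canny-edge map of the source image
image = pipe(
    prompt="stapp as an astronaut, photorealistic",
    control_image=control,
    guidance_scale=30.0,
    num_inference_steps=50,
).images[0]
image.save("flux-canny.png")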

Multiple faces

  • I'm still working on it, but the results are just poor
  • Combining LoRAs is a bad idea
  • Fill uses a different model
  • It is easier to generate an alternative version of my children than of me with my wife :)

Alternatives

GPT-image-1 supports multiple modalities and features:

  • Text-to-image: Generate images from text prompts, similar to text2im in ChatGPT DALL-E.
  • Image-to-image: Create new images from user-uploaded images and text prompts, a feature not available in ChatGPT DALL-E.
  • Text transformation: Edit images using text prompts, akin to the transform feature in ChatGPT DALL-E.
  • Inpainting: Edit images with text prompts and user-drawn bounding boxes, similar to inpainting with DALL-E.

GPT-image-1

Google Nano Banana

Video swap :)

The end?

References

 The cost of a lie:

$5 and a few photos

  • With just $5–$10 and a few poor-quality images, anyone - even teens - can make fake content.

 

  • That’s why it’s so important to talk to your kids and other young people in your life, like cousins, nieces, nephews, or your friends’ kids.

 

  • A quick conversation can make a big difference.

👇Slides, links & feedback 👇

  • https://www.linkedin.com/feed/update/urn:li:activity:7322896120173543424?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAFoowoBfnkasuJGcMK36PzbSnYIZLkUE4A
  • https://www.linkedin.com/posts/genai-works_artificialintelligence-promptengineering-activity-7324342617985413120-Ve-8/?utm_source=share&utm_medium=member_ios&rcm=ACoAAAFoowoBfnkasuJGcMK36PzbSnYIZLkUE4A
  • https://x.com/godofprompt/status/1911839239823581467
  • https://www.brainyquote.com/quotes/brian_odriscoll_680037

From Selfie to AI

By Piotr Stapp
