ai_art

Note that this wiki page liberally reinterprets things I refer to, and may unintentionally misunderstand or misrepresent them.

To Do: add a links and resources section, for example to things linked to by the How To Geek article.

Stable Diffusion

Stable Diffusion is an open-source text-to-image synthesis AI that can take any natural-language English prompt and create compelling photorealistic, artistic, or otherwise imaginative results. It uses a synthesis technique called Latent Diffusion, as do at least a few other image synthesizers. It gets its own first-level heading here because it is free and open source.

Here is a demo page, good for toying with weird or cool ideas.

Prompt and parameter engineering, along with the various techniques Stable Diffusion affords, can turn out astonishing results. The quality of those images has implications for a sea change in the visual arts industries, and raises concerns about authorship, plagiarism, propaganda, crime, (mis)representation of truth or reality itself, censorship, and the displacement or obsolescence of working visual artists.

I mention those concerns only to emphasize how impactful AI image synthesis has become; I make no comment on them here. Here I'm after how to make great images using AI. Incidentally, some of my AI art, often including the prompt that made it in the displayed metadata, is over here.

AUTOMATIC1111 stable-diffusion-webui

The Stable Diffusion WebUI automates installation and running of Stable Diffusion on the desktop across many computer platforms, provided you have a graphics card that can run it. An NVIDIA CUDA card with at least 4GB of VRAM may be required for some or all of it.

Stable Diffusion General Notes and Usage

Fundamental parameters that Stable Diffusion uses for image synthesis, and their attributes (a code sketch follows this list):

  • A text prompt. This is a natural (or unnatural!) language description of an intended image to synthesize. This is explored in the “Prompt Engineering” section, but the other parameters are necessary to explore first for best effect.
  • Sampling steps. How many iterations it runs: starting from noise, it generates a new image each step, navigating its internal latent space to find something nearer and nearer to what it thinks your prompt means.
  • Sampling method. The algorithm it uses to generate noise.
  • Width and height of intended image.
  • CFG Scale. How much the AI adheres to what it thinks your prompt means. Low values allow it to very liberally interpret things, high values make it conform more strictly.
  • Seed. This is a number between 0 and about 4 billion which it uses as a basis for making a noise image to start sampling. The same seed will always produce the same start noise with a given diffusion method.
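
For reference, here is a minimal sketch of those same parameters expressed through the Hugging Face diffusers library instead of the webui. The checkpoint name and all of the values are just placeholder assumptions, not recommendations.

import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint (assumed name; any SD 1.x checkpoint works similarly).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Seed: the same seed always reproduces the same starting noise.
generator = torch.Generator(device="cuda").manual_seed(1234567890)

image = pipe(
    prompt="a watery acrylic abstract painting, vivid colors",  # text prompt
    num_inference_steps=25,   # sampling steps
    guidance_scale=7.5,       # CFG scale
    width=512,                # intended image width
    height=512,               # intended image height
    generator=generator,      # seed
).images[0]
# The sampling method corresponds to pipe.scheduler (see the Samplers section below).

image.save("result.png")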

Samplers (Sample Methods)

See this breakdown of sampler methods at Reddit (the thread has other useful Stable Diffusion technique comments as well).

Note: These notes don't cover sampler methods that were added to the webui after I did my experimenting and exploring. Those additional samplers are: LMS Karras, DPM2 Karras, DPM2 a Karras

All available samplers in the webui are: Euler a, Euler, LMS, Heun, DPM2, DPM2 a, DPM fast, DPM adaptive, DDIM, PLMS

Key to these; also note that where more than one of them is listed on a bullet point, the subsequent ones were designed as improvements:

  • Bold = good for quick exploration at the start of prompt engineering.
  • Italic = good for seed travel animation (you can do seed variation from one seed to the next to animate between them)
  • Bold Italic = both

General note: non-ancestral samplers (ones without an “a” in their name) generally produce very similar images from the same seed.

Samplers and their history and observed characteristics:

  • Euler, Heun: simple and related; the latter makes “more accurate” noise more slowly.
  • LMS, PLMS: cousins of those, producing yet “more accurate” noise through averaging. I think PLMS produces clearer and more expressively human results. Untested here: K LMS (LMS Karras), which purportedly tends to correct anatomy at ~64 steps.
  • DDIM, DPM2: adapted for neural networks and may take many iterations to get good results. DPM2 is a fancier version of DDIM designed to get “more accurate” noise in fewer steps.
  • <method name> a: samplers that end with a in the name (for ancestral) are really more like one another than the earlier methods they're named after. They add more noise per step. Results with them can vary drastically from one step count to the next, and keep changing drastically seemingly forever. After about 10 sampling steps, Euler a and DPM2 a give you the most variety, even radical variety from as little as a 5-step difference! In previews I've seen, they produced more heartfelt portraits at 25 or 40 steps. K Euler produces a lot of variety even at 16-20 step values but tends to produce spaghettified/disproportioned humans. DPM2 a produces results very like Euler a, but not as much to my liking for some (which?) applications.

“Weirdly, in some comparisons DPM2-A generates very similar images as Euler-A… on the previous seed. Might be due to it being a second-order method vs first-order, might be an experiment muck-up.”

The sampling steps parameter operates with the sampler to add noise back into the image on each iteration (I think).
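
In diffusers terms, choosing a sampler means swapping the pipeline's scheduler. The sketch below renders the same prompt and seed with several samplers for comparison; the mapping from webui sampler names to scheduler classes is my rough correspondence, and the prompt, seed, and checkpoint name are arbitrary assumptions.

import torch
from diffusers import (
    StableDiffusionPipeline,
    EulerDiscreteScheduler,
    EulerAncestralDiscreteScheduler,
    LMSDiscreteScheduler,
    HeunDiscreteScheduler,
    DDIMScheduler,
)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

samplers = {
    "Euler": EulerDiscreteScheduler,
    "Euler_a": EulerAncestralDiscreteScheduler,
    "LMS": LMSDiscreteScheduler,
    "Heun": HeunDiscreteScheduler,
    "DDIM": DDIMScheduler,
}

for name, scheduler_cls in samplers.items():
    # Swap the sampler, then reuse the exact same seed so only the sampler differs.
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    generator = torch.Generator(device="cuda").manual_seed(1234567890)
    image = pipe(
        "a weathered lighthouse, thin watercolor",
        num_inference_steps=25,
        generator=generator,
    ).images[0]
    image.save(f"compare_{name}.png")

The non-ancestral results should come out very similar to one another; Euler a will likely diverge, as described above.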

Prompt Engineering

While an effective overall goal may be to be as specific and detailed as possible, depending on the visual feedback you get you may find it's more effective to be general and let the AI interpret and do things you don't write explicitly. To that end:

  • start simple, short and flexible, to get fundamental aspects of your idea working first. Let this evolve with feedback. Start out with Euler sampling for speed.
  • maybe look up hypernyms (more generic or broad versions) of the words you come up with, via Wordnik, or alternately find more specific or similar words nearer your intended meaning, via Wordnik's report of words used in the “same context.”
  • find a good starter seed for your core concept, by testing prompts with many renders and then keeping good seeds (for example with webui's batch slider at 9)
  • To explore a prompt's possible outputs, use the XY Plot script for example like this: X, sampler: Euler,LMS,DDIM and Y, steps: 20,25,30,40
  • When you have a good seed see if it's also good when sampling with Heun.
  • click “save” in webui as you go to save images and their info you might want to refer back to
  • when you have a good base prompt and seed, do XY Plot scripts with steps for X and CFG for Y (see the steps suggestion above). If you go with a (ancestral) samplers, try 20,35,50,100,150 for the step values.
  • from the resulting rendered grid, you can try running more tests with values between your favorite step and CFG columns/rows. If you don't need to, save again and:
  • set the script input to “prompt matrix” and choose 3 concepts for further detail or additions to compare, to toy with making the prompt more elaborate. With this script you append a bar | and ideas separated by bars to the end of the prompt, and it renders variations of all possible combinations of those additions (it automatically replaces the bars with commas when it runs). See the combination sketch after this list.
  • repeat, refine, and further experiment until results are great or it's clear your idea won't work.
  • when you're sure you have an overall great prompt, it can be a good time to add things to it like fine details, overall style, or (falsely!) attribute the work to a specific artist, art movement or style, for example like this:
    • by Wassily Kandinsky
    • by an abstract expressionist
    • abstract expressionism
    • by a mid-20th century French abstract expressionist
    • in the style of (insert art movement)
    • a generic descriptor of where it is “from,” for example: from ancient Mesopotamia
    • a media description, like “thin watercolor” or “spattered watercolor”
    • sigh: trending on ArtStation
    • from a bland stock photo collection
    • 1980's polaroid
    • 1990's vaporwave
    • data bending
    • etc. etc. etc. – virtually anything you can imagine – and you will find that describing essentially the same thing a slightly different way will produce a different result! Even including misspelling it in a way that a natural language processor can still understand, or making a tiny syntactical change!
  • you can also try negative prompts for details or even drastic changes; a negative prompt tells it what not to create.
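
The prompt matrix script described above is essentially combinatorics. Here is a small sketch of that expansion idea; the webui script's actual internals may differ, and the example prompt is arbitrary.

from itertools import combinations

def expand_prompt_matrix(prompt):
    # Everything before the first "|" is the base prompt; every combination of the
    # remaining |-separated parts gets appended, joined with commas.
    base, *extras = [part.strip() for part in prompt.split("|")]
    variants = []
    for n in range(len(extras) + 1):
        for combo in combinations(extras, n):
            variants.append(", ".join([base, *combo]))
    return variants

for p in expand_prompt_matrix(
    "a lighthouse at dusk | thin watercolor | by an abstract expressionist | 1980's polaroid"
):
    print(p)
# Prints 8 variants: the base prompt alone, plus every combination of the three additions.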

Prompt Weighting, Modifiers and Experiments

Weighting

Many Stable Diffusion interfaces allow expressing a percent weight for any token (idea or part of a prompt).

(Note: temporary edit, pending experiments to confirm: I may have details wrong here. Here is a reference on attention control in stable-diffusion-webui.)

Here are ways that can be done:

encaustic:0.1

Or if it's a phrase, surround that with quotes:

"colorful encaustic:0.5"

Purportedly, if you use multiple percent weights, they should add up to less than 1. Supposedly, Stable Diffusion internally distributes any percent you don't use among the remaining unweighted tokens.
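
Read literally as arithmetic, that claim would work like the sketch below. This only illustrates the description above; I have not confirmed it against what any implementation actually does.

def distribute_weights(tokens):
    # tokens: {token: explicit weight or None}. Explicit weights are kept, and
    # whatever remains of 1.0 is split evenly among the unweighted tokens.
    explicit_total = sum(w for w in tokens.values() if w is not None)
    implicit = [t for t, w in tokens.items() if w is None]
    share = max(0.0, 1.0 - explicit_total) / len(implicit) if implicit else 0.0
    return {t: (w if w is not None else share) for t, w in tokens.items()}

print(distribute_weights({"colorful encaustic": 0.5, "seascape": None, "dusk": None}))
# {'colorful encaustic': 0.5, 'seascape': 0.25, 'dusk': 0.25}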

You can also emphasize or de-emphasize tokens with syntax:

  • by exclamation marks: colorful!
  • emphasize by surrounding a phrase with parentheses: (varied) – or even nested parentheses: ((opalescent iridescent))
  • de-emphasize by surrounding a phrase with square brackets: [rustic]. You can also nest square brackets
  • you can even use negative numbers for weights to incline the AI to do what it thinks is the opposite of the token. Here's a fascinating but (fair warning) disturbing and sometimes very inappropriate Twitter thread about that, in which a user discovered that the AI rendered a specific word in response to a negative prompt, and when they used that word as a prompt, it always produced a particular disturbing image.
  • shout it in all caps to emphasize: WEATHERED
  • keep an attribute more connected to a word by hyphenating: yellow-car
  • move phrases that you want to emphasize more toward the front of the prompt, to make them more essentially the main subject of it, or toward the end of the prompt to de-emphasize them.

Other Experimental Modifiers

stable-diffusion-webui supports a logical style blend operator, "AND". A phrase that includes this will cause Stable Diffusion to try to do a stylistic and/or object blend of the tokens on the left and right of AND. For example:

  • cat AND dog will cause it to try to create something that is a hybrid of a cat and dog.
  • Kandinsky AND Hundertwasser will cause it to try to create something that blends the styles of those artists.

Findings from my own experiments:

  • adding randomization:0.000000013 to the end of a prompt and animating the number in very small increments at that scale does small-detail randomization in the image. With ancestral samplers it may (will?) also radically alter the overall effect of a prompt.
  • It seems that adding “closeup of..” eliminates the matte, art frame, and/or wall that can show up in prompts intended to render art.
  • “closeup 25%” does not actually do that (it may still show an art frame in a render, for example), but it changes the image or, in the case of a render featuring art, it changes the art itself! Expressing percents in natural language seems to be a weird, wild area where it does whatever it interprets as a percent in association with any language token.
  • A negative prompt of “very dark shade,” “dark shade,” or “quite dark shade” or similar effectively somewhat solarizes/photoshops out great darkness that can result from prompts declaring a photography style
  • some Stable Diffusion distributions can create outright renderings of things that are a hard NO (NSFW) when you are innocently trying to create other things, like abstract art or descriptions of abstract shapes. To avoid that, add NSFW as a negative prompt (see the sketch after this list).
  • watery acrylic or watercolor in a prompt may tend to make things stylized toward a cartoon. to avoid that, add “cartoon” to the negative prompt
  • a positive prompt directing it to crop will tend to stylize things toward photography or photography of art. That can be cool, but it tends away from the abstract and (as mentioned) can make things darker. Adding “photograph” to the negative prompt (logically “not photograph”) will get more of the abstraction and vividness/brightness back.
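
For reference, in diffusers the negative prompt is a single argument rather than the webui's dedicated box. Here is a minimal sketch applying the findings above; the prompt text, values, and checkpoint name are only illustrative.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="closeup of abstract marbled watery acrylic and watercolor shapes",
    negative_prompt="cartoon, photograph, dark shade, nsfw",  # what NOT to render
    num_inference_steps=30,
    guidance_scale=7.5,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]
image.save("abstract.png")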

Seed

Because the same seed always produces the same starting noise, for sampling methods that don't radically alter the noise between steps (that is, anything other than the a / ancestral samplers), you can keep the same seed and tweak other things, like step count and the prompt, to tweak the results. Also, it seems that some seeds just tend to produce better results for some types of prompts.

Variant seed and variant seed strength parameters allow you to blend one seed with another. The variant seed strength is a percent expressed as a decimal: 0 means don't interpolate toward the variant seed at all, 0.5 means interpolate halfway, 1 means use the variant seed and not the base seed.

Variant seed and var.strength parameters can give subtle changes if you change the value of the latter a little bit.

The Seed Travel script does exactly that: it sets a second seed as the variant and increments the variation strength over time from 0 to 1. “Samplers that work well are: Euler, LMS, Heun, DPM2 & DDIM.” Specifically, it won't work with ancestral samplers (the ones with a in the name): you'll get some animation, but it will abruptly jump between variations.
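
As I understand it, the base/variant blend works by interpolating between the starting-noise tensors of the two seeds. The sketch below shows that idea in plain PyTorch; it reflects my reading, not the script's actual code, and the seeds and frame count are arbitrary.

import torch

def noise_for_seed(seed, shape=(1, 4, 64, 64)):
    # Starting latent noise for a 512x512 SD 1.x image (4 latent channels, 512/8 = 64).
    gen = torch.Generator().manual_seed(seed)
    return torch.randn(shape, generator=gen)

def slerp(a, b, t):
    # Spherical interpolation between two noise tensors, t in [0, 1].
    a_flat, b_flat = a.flatten(), b.flatten()
    omega = torch.acos(torch.clamp(
        torch.dot(a_flat / a_flat.norm(), b_flat / b_flat.norm()), -1.0, 1.0))
    so = torch.sin(omega)
    return (torch.sin((1.0 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b

base, variant = noise_for_seed(1111), noise_for_seed(2222)
frames = [slerp(base, variant, t) for t in torch.linspace(0, 1, steps=30).tolist()]
# Feed each interpolated tensor in as the starting latents for one frame,
# e.g. via the latents= argument of a diffusers pipeline call.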

Inpainting

Inpainting allows you to change part of an image using a mask and text to image synthesis prompt.

Forgiving the shameless (though ultimately PG-rated) juvenile male gaze in this demo video, it's clear that (at this writing) the Stable Diffusion 1.5 model devoted to inpainting is superior.
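
For reference, here is a minimal inpainting sketch using that dedicated SD 1.5 inpainting checkpoint through the diffusers library. The file names are placeholders; white areas of the mask are the regions to repaint.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init = Image.open("photo.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("RGB").resize((512, 512))  # white = repaint

result = pipe(
    prompt="a weathered wooden door, thin watercolor",
    image=init,
    mask_image=mask,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
result.save("inpainted.png")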

Image to Image

The same video demonstrates that that model is also superior for AI generation of an image based on another image.
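
And a minimal image-to-image sketch via diffusers, where strength plays the role the webui calls denoising strength (older diffusers versions named the image argument init_image). File names, values, and the checkpoint name are only illustrative.

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("source.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="abstract marbled acrylic, vivid colors",
    image=init,
    strength=0.46,        # low = stay close to the input, high = depart from it
    guidance_scale=7.5,
).images[0]
result.save("img2img.png")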

AI image upscaling

My review of upscalers for purposes of upscaling abstract art that is like marbled watery acrylic + watercolor:

Potentially good enough by themselves:

  • good: R-ESRGAN General WDN 4xV3
  • decent overall: SwinIR_4x
  • fair: R-ESRGAN 4x+

Possibly useful in combination or layering / processing with the above:

  • great contiguous color, horrible detail: ScuNET
  • good contiguous color, horrible detail: R-ESRGAN General 4xV3
  • potentially good if partially used in a process; very highly/rough textured: ESRGAN_4x
  • poor overall: R-ESRGAN 4x+ Anime6B
  • very poor overall: ScuNET PSNR
  • terrible: R-ESRGAN AnimeVideo

Couldn't try LDSR, as I get a certificate error when trying. Is BSRGAN new since I wrote this? I get a memory allocation error when trying to use it.

From toying with Stable Diffusion upscaling settings, I get really cool details with Sampling Steps 8, CFG Scale 13, Denoising Strength 0.46. But Denoising Strength isn't accessible in regular Stable Diffusion use, I think? Also, this is working on derived (upscaled) images.

References and Resources

Comments: That's an amazing work of reference and generated images. Also, the claim of “all” is probably extremely far-fetched. With a bit of searching and off the top of my head I come up with these prompts, for styles of artists not on that list, which produce images representative of their technique and style:

  • “A whimsical creature, by Shel Silverstein”
  • “A fantasy whimsical architecture landscape, by Dr. Seuss”
  • “A landscape by Hishikawa Moronobu”

If it's even possible to define or know “all” artists represented in Stable Diffusion, my guess is that list doesn't even remotely approach “all.”
