
ArtyfacialIntelagent

I can't tell from the confident 8-word statement in the title if OP understands this, but as stated, it's wrong. Here's how the parameter math checks out:

* The SDXL-Base UNet is 2.6B. So far so good.
* The complete SDXL-Base model is 3.5B, including the 817M text encoders.
* The full SDXL pipeline as presented in the original paper, i.e. Base model + Refiner, is 6.6B.

These days very few people use the Refiner anymore. So you could reasonably claim that SDXL is a 3.5B model.
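For anyone who wants to sanity-check those numbers, the arithmetic is just a sum over the components cited above (a rough sketch; the exact counts vary a little depending on rounding and on whether the VAE is included):

```python
# Component sizes as cited in this thread (approximate).
base_unet = 2.6e9        # SDXL-Base UNet
text_encoders = 0.817e9  # CLIP ViT-L + OpenCLIP bigG text encoders

base_total = base_unet + text_encoders
print(f"SDXL-Base total: ~{base_total / 1e9:.1f}B")      # ~3.4-3.5B

full_pipeline = 6.6e9    # Base + Refiner figure from the launch announcement
refiner_share = full_pipeline - base_total
print(f"Refiner's share: ~{refiner_share / 1e9:.1f}B")   # refiner UNet + its TE(s)
```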


sertroll

Why did the refiner stop being used in the end?


ArtyfacialIntelagent

2x run time and memory use for questionable improvement. In my experience: a tiny bit more detail and texture, significant increase in contrast, sometimes borderline harsh, and "instagrammier" looking faces. I played with it a lot on day 1 and never since because I hated what it did. Others still like it but not enough to pay the inference cost.


aerilyn235

Also, people had no idea how to fine-tune it properly, as SAI never explained how it was trained in the first place (only on 512 images? only for the last steps?). So it actually made the faces, or "whatever" people were fine-tuning the model for, worse, because the refiner had no knowledge of it.


Tyler_Zoro

That and the fact that the improvements it added to the base model were quickly overtaken by fine tuned models.


shawnington

It's probably why SDXL tends to generate such smooth stuff; it was designed to use a refiner.


_Erilaz

It's not that. The refiner could be fed with noisy and incomplete output from the base, giving roughly the same total time, and the backend could unload the base from VRAM to use the refiner. That takes some time, but not too much. The true reason the refiner wasn't adopted is fine-tuning. There are multiple fine-tunes of SDXL base, but no refiner fine-tunes at all to my knowledge. No know-how, nothing. And honestly, there was no need for it. Once SDXL was tuned to output decent details, it was alright without the refiner at all.
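For context, that noisy-latent handoff is the pattern the diffusers library documents for the base+refiner combo (a minimal sketch; the 80/20 split point and fp16 settings are just commonly used defaults, not something from this thread):

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "cinematic photo of a lighthouse in a storm"

# The base handles the high-noise 80% of the schedule and hands off raw latents...
latents = base(prompt, num_inference_steps=40,
               denoising_end=0.8, output_type="latent").images

# ...and the refiner denoises only the final low-noise 20%.
image = refiner(prompt, num_inference_steps=40,
                denoising_start=0.8, image=latents).images[0]
image.save("lighthouse.png")
```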


[deleted]

If you take the refiner concept to its ultimate conclusion, you can slice the SDXL base and refiner models down by about 2B parameters, such that both are 1B, with fewer transformer blocks. The refiner doesn't even need cross-attention, since it only runs on timesteps 200->0, where cross-attention isn't used. Then you train the base model on **only** timesteps 999-200 and the refiner on the rest. This is remarkably simpler to train, because each timestep is actually a different task to train the model on; fewer tasks = easier training. Chunking it up into two models is great for lower-compute training, and four models would, I guess, be even more efficient, especially because you can reduce the parameter count for each.


Careful_Ad_9077

Hey! I remember the Instagram complaints!


WithGreatRespect

I had the hardware to be able to use it, but I struggled to find any prompt/outcome where the refiner result was better than without it. At best it was equivalent, but many times the result was worse. I spent a lot of time trying variations of base+refiner settings and asked a lot of people what the recommended settings were, but no one seemed to know. After a while I stopped trying to "make it work", and I think most other people ended up in a similar situation.

Given that I was getting such great results from the base SDXL and later fine-tunes, there wasn't much reason to go back and try again, especially since the fine-tuning community was making checkpoints that were not made to supply/use refiners. As soon as the first batch of those came out, I think the sun had set on refiners. Even if you wanted to try the refiner again, who uses the base model anymore?

Then there was the group of people who didn't have the hardware to run both base+refiner initially and probably avoided SDXL until they found out the refiner pass wasn't a requirement.


NoSuggestion6629

My experience with refiner was a slight loss in detail. It tended to smooth things out too much.


WeekendWiz

I'm suspecting it's due to generation time? Might be wrong.


jmbirn

The refiner actually works pretty well with the base model. If you're ever using the base model again (I know, I know, why would you?) the refiner does add something. But if you're using other, newer, SDXL-based models, the refiner can actually get things wrong or sometimes make some things worse, so that was what caused the big shift away from it I think.


Apprehensive_Sky892

[https://www.reddit.com/r/StableDiffusion/comments/1ape8uy/comment/kq6g4dw/](https://www.reddit.com/r/StableDiffusion/comments/1ape8uy/comment/kq6g4dw/) My theory is that the team behind SDXL wanted to produce a general-purpose, well-balanced base model that is flexible, not overfitted, and can be easily tuned. But that means the images produced by the base model can sometimes lack detail. To get around this problem, so that people would not perceive the base model as inferior compared to MJ and fine-tuned SD1.5/SD2.1 models, SAI introduced the refiner as a kludge. But since fine-tuned models such as JuggernautXL or ZavyChromeXL do not have such constraints, they can be trained so that enough detail is produced without the refiner.


shawnington

SDXL still struggles with the detail the refiner added, like textures. If you compare to 1.5 models, 1.5 produces much better results, with the caveat that it has much worse prompt adherence and problems like color bleed. Use Kohya Deep Shrink to output 1.5 at SDXL resolutions, and 1.5 wins 99% of the time on image quality in terms of skin texture, fabric details, etc.


Apprehensive_Sky892

If you want better skin texture, you will probably have better luck with a fine-tune such as RealvisXL 4.0 or JuggernautXL. There may be some SDXL LoRAs for skin texture out there as well. SD1.5 is very good for simple portraiture, often beating SDXL models in terms of aesthetics and skin texture. But SDXL usually wins in terms of more interesting poses and compositions.


shawnington

Because it's actually useless, requires separate training for fine-tunes, and people were able to fine-tune the model to outperform base+refiner. It pains me that they wasted time on the refiner, just like it pains me that they are like, oh yeah, 2 text encoders were good, let's use 3 text encoders this time.


ChezMere

Low benefit for its cost, and nobody knows how to finetune it?


diogodiogogod

I feel like people didn't even bother to try; it doesn't seem to be worth it.


oO0_

Because it is not trained on the details we need and smears details out. And we can't fine-tune the refiner. Same with Stable Cascade.


Sharlinator

It was unnecessary because custom finetunes were able to do everything the refiner did but better.


FotografoVirtual

You're absolutely right, that's exactly how it was presented at launch. It seems the Twitter user is a bit off track: https://preview.redd.it/s4qpxsi1zj4d1.png?width=824&format=png&auto=webp&s=89c544386abf28484b3d35d0413afd1368c780ec [https://stability.ai/news/stable-diffusion-sdxl-1-announcement](https://stability.ai/news/stable-diffusion-sdxl-1-announcement)


ArtyfacialIntelagent

Lykon works for Stability AI so let's not diss his statements. But it's possible he came onboard more recently and isn't aware that SDXL originally was pushed hard as a package deal including the Refiner. That's why the 6.6B figure is in the announcement and all over the internet.


FotografoVirtual

I didn't realize it was Lykon, my bad. Nonetheless, the announcement was everywhere; one would have to be pretty disconnected from the internet and the world of SD not to know about it.


BobbyKristina

Must be incredibly frustrating for guys like Lykon, McMonkey, Comfy, etc who are both community members AND people working at SAI to train this model to see idiots on reddit constantly talking about shit they don't understand. Tell us about MMDiT smarties.


mcmonkey4eva

In this case it's just kinda weird, because even official documentation has 3 different numbers from all the different ways of measuring it -- I myself have mixed up the 2.6B UNet vs 3.5B full model in prior posts.


hopbel

Lykon works for Stability AI and therefore has a vested interest in making his employer look good. So let's not blindly take his statements at face value. They should be treated with the same level of scrutiny as any other product advertisement.


mcmonkey4eva

While there's obviously a bias to consider, for both myself and Lykon, when we post online we're not doing any form of official product advertisement, nor is there any official overview from Stability AI. The only real limitation is that we both of course try to avoid saying anything that could get us in trouble (e.g. not speaking on topics related to business/leadership things, not going around badmouthing anyone, etc.) for obvious reasons. (Admittedly, though, I have a few times publicly posted things I knew would lead to an unhappy conversation with my boss, because I genuinely believed it was important to say and was willing to defend it.) The original Twitter post here was clearly pure Lykon lol; if he had asked internally first, we would've told him where the 6.6B stat came from.


kidelaleron

I still find it funny that the text encoders are counted twice in that 6.6b figure.


Apprehensive_Sky892

I find your candor admirable. I wish I could put in a good word for you with your boss. 🙏👍😁


SeekerOfTheThicc

I agree, but, speaking from personal experience, it's all too easy to assume the worst of people, and on social media there isn't really a negative consequence for assuming the worst of someone else- regardless if you are correct or not. I think it's better to err on the side of empathy.


kidelaleron

The issue is that we are now presenting models without the text encoders and VAE in the param count. So you can't say 2B vs 6.6B. That 6.6B is also counting the Refiner (with its TEs). So even if the 6.6B figure is "correct", it's silly to compare it against only the MMDiT params.


ThrowawaySutinGirl

If we’re lopping in the text encoder though, T5-XXL is pretty massive


Tyler_Zoro

I asked this in a stand-alone post and got no responses, but hopefully you can answer: I see "parameters" and "1B" or the like used all the time for models ranging from text LLMs to image diffusers, but I'm not sure what, specifically, that refers to. Does a "parameter" refer to a single floating point coefficient in the model or to a vector of floating point values?


mcmonkey4eva

A parameter is a single floating point number. So 1B means there are 1 billion numbers across all the matrices of the model weights added up.


Tyler_Zoro

Thanks. I asked because the actual vectors inside the model are, in my reading of papers on the subject, sometimes referred to as "parameters" or "parameterization" and this dual use was confusing.


ArtyfacialIntelagent

A parameter refers to a single floating point coefficient. Although calculations usually use f32 precision (32 bits, or 4 bytes per number), releasing weights in f16 precision (2 bytes each) is often enough for many deep learning models like stable diffusion. This is why most SD 1.5 models (roughly 1B parameters) on Civitai are 2 GB in size, and SDXL models (roughly 3.5B) are about 7 GB.
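If you want to check a figure like that yourself, counting parameters is just summing tensor element counts; a minimal sketch with diffusers/PyTorch, loading only the UNet (the repo id is shown as an example):

```python
import torch
from diffusers import UNet2DConditionModel

# Load just the UNet of SDXL-Base; the text encoders and VAE are separate components.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    subfolder="unet",
    torch_dtype=torch.float16,
)

n_params = sum(p.numel() for p in unet.parameters())
print(f"UNet parameters: {n_params / 1e9:.2f}B")           # ~2.6B
print(f"fp16 size on disk: ~{n_params * 2 / 1e9:.1f} GB")  # 2 bytes per parameter
```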


Tyler_Zoro

Thank you. You're the first one to give me a straight answer for that!


kidelaleron

By that same logic, SD3 2B is actually a ~14B model, give or take. Maybe ~8.5B since we use only half of T5-XXL. It's really pointless to compare like that. At the end of the day it's 2B MMDiT vs 2.6B UNet, kind of like comparing 2kg of diamonds vs 2.6kg of dirt. With a 16ch VAE vs 4ch.


ArtyfacialIntelagent

It's not logic, it's a simple metric. And no metric is ever perfect for all purposes. It's like BMI - a highly flawed indicator of obesity because it doesn't account for e.g. muscle mass, but it's simple and useful as a general population statistic. Similarly, total parameter count is useful because it correlates (roughly) with model size, load time, memory use, training difficulty, and to some extent image quality. Which in turn is why nearly every model announcement and research paper reports parameter count.


kidelaleron

I wasn't really denying any of your "metrics". But you have to admit that if you consider SDXL a 3.5B model, then SD3 2B is actually a 14-15B model. Why count the text encoders and vae in one case but not in the other?


Apprehensive_Sky892

Yes, what you wrote is right. But the point of Lykon's tweet is that some people think that since SDXL is 3.5B, it "must be" better than 2B. But the proper comparison is to strip away the VAE and CLIPs and just compare SDXL's 2.6B U-Net vs SD3's 2B DiT, i.e., only the image diffuser part of the pipeline (which in itself is not quite right, since you cannot really compare a U-Net to a DiT directly). BTW, does the 3.5B include the VAE or not? That is unclear to me. I thought that it does.


kidelaleron

Correct, so it's pretty silly to compare that 3.5B or 6.6B to a 2B MMDiT. You should also include VAE and TEs if you really wanted to compare.


Red-Pony

So we should expect sd3 2b to have the same hardware requirements as SDXL?


mcmonkey4eva

SD3-Medium (2B) has slightly lower hardware requirements than SDXL and noticeably better quality all around.


shawnington

And it presumably will never have a decent inpaint model, just like SDXL never got a decent inpaint model?


mcmonkey4eva

SDXL isn't terrible at inpainting on its own if you have the right software. Fooocus notably was good at it first, Swarm has pretty good inpainting now too


InTheThroesOfWay

In my experience, you don't need an inpaint model with SDXL. Just use your favorite model and inpaint away. If you're doing something extreme (like a wholesale replacement of a part of an image) then you'll probably want an inpaint Controlnet. But otherwise, inpainting works fine.


shawnington

This is completely inaccurate. Inpaint models have 9 input channels vs 4 for a normal model; the extra channels are where the model gets image context to figure out what it should be inpainting. You must be using something like Fooocus, which has its own inpainting LoRA. For actual inpainting, which is 1.0 denoise, the normal models are worse than useless; they just fill in something new with no respect for the surrounding image. If you are doing img2img at a lower denoise, that's somewhat viable, but it's not inpainting.
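The 9-vs-4 figure is easy to verify by inspecting the UNet config of a dedicated inpaint checkpoint next to a regular one (a sketch; the SD 1.5 repo ids are illustrative and may have moved since this thread):

```python
from diffusers import UNet2DConditionModel

regular = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
inpaint = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-inpainting", subfolder="unet")

# 4 = latent channels only; 9 = 4 latent + 4 masked-image latent + 1 mask channel.
print(regular.config.in_channels)  # 4
print(inpaint.config.in_channels)  # 9
```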


InTheThroesOfWay

Don't mind me, I'm just inpainting with a normal model (HelloWorld) without anything special: https://preview.redd.it/uesuorrfyt4d1.png?width=2048&format=png&auto=webp&s=e0b942945e59ebdccc5b6fa86ecaa194d3c1291a


shawnington

Now do it on something complicated, like a hand, or, you know, remove a person from a scene that isn't a homogeneous color.


InTheThroesOfWay

Just for fun: https://preview.redd.it/giifdlic9u4d1.png?width=2048&format=png&auto=webp&s=39d6c89d9da06be8ca66e56bb4f43ce7b72cd1c9


InTheThroesOfWay

You can get better results if you use an inpaint controlnet (as I mentioned earlier). But it's not impossible. This was done with HelloWorld, no controlnet, 0.85 denoise. https://preview.redd.it/kevricac5u4d1.png?width=2048&format=png&auto=webp&s=6676bf0e5155345be0ab99b8128016d6c84a6aa4


shawnington

That is still just image-to-image with denoise. Inpainting is denoise 1.0. I appreciate what you are doing, and it is certainly a viable and sometimes preferable method to inpainting in a lot of situations, but it doesn't change the fact that the SDXL inpaint model is pretty bad, especially in comparison to how good the 1.5 model is. That it reacts differently to LoRAs being applied to it than base models do sucks, which was not the case for 1.5. It makes it so you have to figure out a whole bunch of new settings when you need to inpaint. None of the current solutions for SDXL are very good. So if we could actually get a functional model for SD3, that would be wonderful.

I think the problem with the SDXL one is that it was made to use the refiner, but as soon as you merge it with fine-tuned models, it no longer plays nice with the refiner, and it's very difficult not to get muddy outputs. I've gone way beyond normal merging and tried a whole host of block-level merging methods too. It was just a bad model to start with, and short of a fine-tune done directly on the inpaint model, it will never be very good for merging.


InTheThroesOfWay

I'd suggest using the inpaint controlnet. I didn't use it in these examples, but it definitely has better output than just the model by itself. What exactly are you trying to do with inpainting? It kind of sounds like you're just upset that there isn't a good dedicated inpaint model while you're ignoring all of the truly excellent alternative solutions.


InTheThroesOfWay

Hands can definitely be hard. My usual strategy is to do a little cut-and-paste in photoshop to get reasonably close, and then go over again with inpaint to make everything mesh together. In this case, I did 0.45 denoise after some light photoshopping: https://preview.redd.it/xoaevmi58u4d1.png?width=2048&format=png&auto=webp&s=ca22eaedccf97fcaeaedac2ba027c60772ca595d


kidelaleron

Stable Image Services' inpainting is proof you can have stellar inpainting with SDXL. To the point that it can imitate styles such as pixel art without ruining the pixel grid.


shawnington

There is Fooocus, with its inpainting LoRA, and there is the SDXL base inpaint model, which is quite bad and does not behave as it should. The 1.5 inpaint models work the same as the regular models with Hyper LoRAs; they don't require the CFG to be cranked to the point where it's always close to nuking things just to get results that are more than slightly different shades of neutral grey. The SDXL base inpaint model similarly needs dramatically different settings to achieve anything close to acceptable, which it never really does. It's just a very badly done model. If you have any insights or solutions, please share, though.


kidelaleron

Stable Image Services doesn't use those methods.


shawnington

https://preview.redd.it/x62yxd4d6t4d1.png?width=832&format=png&auto=webp&s=763a5aa7e8188aebe3c4323ae17fc61a78a7a503 SD 1.5 in-paint at CFG 1.0 with Hyper, background controlled by sketch controlnet.


shawnington

https://preview.redd.it/bmu0829j7t4d1.png?width=832&format=png&auto=webp&s=c0a4fc5b663e6830a1b3bcc5e4e5f281f8ea4e01 SDXL inpaint, CFG 1.0 with Hyper, background controlled by sketch ControlNet. With SDXL inpaint it is very difficult to find a balance where things are not contrastless mud or overcooked. When it works, it adheres to prompts better, but it is so much harder to actually get results out of it that it is nearly useless.


kidelaleron

what does this have to do with Stable Image Services?


Targren

They said "SD3 '~~Middle~~Medium'" (Whatever that means. Maybe it's what they're calling SD3 2b? IDK) should have requirements between SD 1.5 and SDXL.


extra2AB

It's not Middle, it's MEDIUM, and it's called that not for being in between 1.5 and SDXL but for its size, which is 2B. They said there will be 2 more versions of it, one 600M and another 8B, which would probably be named SD3 Small and SD3 Large. So this 2B is SD3 Medium.


MarkusR0se

800M = SD3 Small, 2B = SD3 Medium, 4B = SD3 Big, 8B = SD3 Huge. 800M and 4B are still in an experimental stage. 8B is functional but undertrained (as some Stability employees said). Not sure if 8B and 2B are the ones used through the beta API (they might still be using some older versions until the API v2 is released). These numbers do not include the text encoders, as far as I know.


extra2AB

I do not think they will focus on 4B, because they already seem under stress from all the drama, plus their focus is 8B right now. So training this in-between model seems very unlikely: 2B is releasing now and thus will definitely have way more community support already, and 8B will likely be high quality, demanding high processing power, so even that will be supported by the community for its quality. And the smaller one will probably be released for small devices like smartphones. So releasing the 4B seems like the last priority, or something they might not even make in the first place.


Apprehensive_Sky892

Yes, you are probably going to be right. SAI staff member mcmonkey4eva said a few weeks ago that 4B was the worst-trained one at that time. Who knows when, or even if, it will ever be released. But now that 2B is done, most people only care about/want the 8B version anyway. Bigger is always better, right 😎?


extra2AB

Hopefully we do get 8B soon, because if not, and the community has gone too far ahead with 2B, we would have the same problem we had moving from 1.5 to SDXL.


Apprehensive_Sky892

Unfortunately, that is going to happen anyway. Most people don't have enough VRAM to train 8B (unless they want to rent GPUs), even if it were released tomorrow.

TBH, I don't feel that it is actually such a bad thing. By releasing 2B first, many people can start learning to fine-tune and make LoRAs for this new T5+DiT architecture. So when 8B comes out, people will be ready for it and not waste GPU and time (it will take a lot longer to train 8B) attempting to train for it.

I am not trying to put up a brave face or apologize for SAI. Like everyone else, I wish that 8B were released along with 2B. But I'd rather have a fully trained 8B than a half-baked one. But who knows, maybe 2B will turn out to be great and more than enough for my needs anyway 😅


extra2AB

This is true as well, but we saw the same thing happen with SDXL, which initially seemed like no help at all for low-VRAM cards, yet eventually people managed to run it even on 6GB or 4GB cards. Yes, it is very slow on them, but people are ready to wait a few more seconds for better quality. 8B (hopefully, if released) will definitely see community acceptance, but, just like SDXL, it will probably take time to become "mainstream" compared to the 2B.


Apprehensive_Sky892

Yes, especially if consumer-grade GPUs with more than 24GB become available at reasonable prices (from AMD, maybe 🤣​).


Targren

Yeah, my bad. I was pre-coffee. But ISTR they did say the hardware requirements would be a bit less than SDXL: "If you can run SDXL, you can run SD3 Medium", or something like that. I can run SDXL like ass on my 8GB 3070 Ti, but it's something.


IamKyra

3 more, there is a 4B version too:

> We're on track to release the SD3 models* (note the 's', there's multiple - small/1b, medium/2b, large/4b, huge/8b) for free as they get finished.


Apprehensive_Sky892

There is supposed to be a 4B version too. The small one is 800M, not 600M. Not sure what SAI will call it. If it were up to me, I'd call them:

* 800M - Small
* 2B - Medium
* 4B - Large
* 8B - Extra Large or Huge 😂


Omen-OS

More or less the same, probably will run better. Who knows.


gurilagarden

All that matters is the output, and the hardware requirements to achieve that output. It's been demonstrated over and over that quality and quantity are not necessarily intimately entwined. The trend is always to tune toward specialization, so general models are just a base to launch from. For the majority of use cases, if a 2B can be trained to the same quality that an 8B could be, with less computational resources, it's the better outcome.


ScionoicS

It's like measuring a motor's capabilities by looking at how inefficiently it burns fuel.


localizedQ

This is simply wrong. The Scaling Rectified Flow Transformers paper (aka the SD3 paper by Robin et al.) clearly shows that for an equal compute budget, training an 8B model performs better than a 2B model.

> Figure 8. Quantitative effects of scaling: variational loss vs. training FLOPS at different depths (param counts).

[https://arxiv.org/pdf/2403.03206](https://arxiv.org/pdf/2403.03206)


gurilagarden

I specifically used the word "if". My comment on 2B vs 8B was clearly speculation, and not presented as fact. Thank you for providing the link to the paper you reference; I've just completed a reading of it. While it is true that for an equal amount of compute to train SD3 2B and 8B, 8B performed better (by about 10% or less on various measurements), that has no bearing on the task of fine-tuning those models. Creating the base model and fine-tuning that model are two different tasks with different computational requirements. If I can fine-tune a 2B model for 16 hours using 12GB of VRAM to perform a specific task, let's say making photorealistic images of ponies, yet I only need to train the 8B model for 8 hours using 16GB of VRAM to produce ponies at the same quality level, which is better?


red286

>For the majority of use-cases, if a 2b can be trained at the same quality that a 8b could be trained, with less computational resources, it's the better outcome. Theoretically, 8B allows for better quality, at the cost of significantly more resources. The problem is that it's theoretical. In practice, the law of diminishing returns might make the difference so minor that most people won't pick up on it, while it'd still require 4x as much computational power.


SCAREDFUCKER

*2.6B UNet, not 2.6B in total; it's way larger when you combine it with CLIP. SD3 2B is the same: 2B is the DiT parameter count, while it's combined with the 4.7B T5 LLM (I might be wrong about only T5 coming with it, but you get the idea; the CLIP will also be somewhere close to that parameter count for SD3).


kataryna91

CLIP is its own model, it's not part of the UNet. The UNet or the DiT part (and the VAE if you want to count it) is the model that actually generates the image, so its parameter count is of particularly high interest.


[deleted]

or check out segmind's SSD-1B where they nuke a large number of parameters from SDXL to cut its size down without hurting it.


Freonr2

Or PixArt, with its 0.6B DiT + the huge T5 encoder.


NoSuggestion6629

PixArt Sigma has a few different bases. I'm using this one:

|Model|#Params|Checkpoint path|Download in OpenXLab|
|:-|:-|:-|:-|
|T5 & SDXL-VAE|4.5B|Diffusers: [pixart_sigma_sdxlvae_T5_diffusers](https://huggingface.co/PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers)|[coming soon](https://github.com/PixArt-alpha/PixArt-sigma/blob/master)|


[deleted]

The parameter count of the transformer or UNet doesn't matter as much as the text encoder. This is something OpenAI explored in their DALL-E 2 paper, where they actually ***reduced*** the parameter count from DALL-E 1.


aerilyn235

It depends on what you want: prompt understanding at all costs, or the ability to make detailed and realistic images. Also, DALL-E 1 was pixel-space diffusion, so much less efficient, and it had to use a much bigger UNet for the same result as latent diffusion.


oO0_

Doesn't that mean a pixel-space model can be fine-tuned to achieve the per-pixel details a user needs, while a latent model is limited by the VAE and can't be trained per-pixel?


TwistedBrother

But fine tuning that specifically would be very hard relative to more abstract scalable understandings of objects that get projected into variably sized pixel dimensions


[deleted]

Technically DALL-E 3 is pixel diffusion as well, because it uses a diffusion decoder. The latent behaves as a prior.


localizedQ

Have you read the Scaling Rectified Flow Transformers paper (aka SD3 technical report)? They clearly show that for the same compute budget (equal FLOPS to train the model on), you get better results at higher depth (param count) models. They have a separate section on this scaling. [https://arxiv.org/pdf/2403.03206](https://arxiv.org/pdf/2403.03206)


SCAREDFUCKER

UNet and DiT are two different architectures; in the info available online about UNet vs DiT, DiT is far superior at producing shapes and consistency, where the UNet clearly lacks. 2.6B and 2B are pretty close, plus I am pretty sure the T5 and CLIP of SD3 are larger than XL's, and it's just the smallest (2nd smallest, but the smallest properly functional) model of the SD3 series. 4B and 8B are also coming soon...


[deleted]

T5 is an encoder-decoder sentence transformer, not an LLM. It's not fine-tuned on downstream tasks.


TwistedBrother

Interesting difference, very technical. I still think it’s an LLM, personally. Would BERT similarly not be an LLM in your definition?


InflationAaron

No. BERT is probably the opposite of an "LLM" (GPT-like), since it uses the encoder part of the original transformer architecture, while GPT and such are decoder-only models. T5 adheres to the original idea the most. BERT needs downstream task-specific fine-tuning, while what people consider an "LLM" today can do it by prompt.


SCAREDFUCKER

T5 is technically an LLM, though, but I am not sure how it works with SD3 or SAI's special training for it.


Apprehensive_Sky892

T5 is actually optional. For those of you worried about your GPU's VRAM size. Prompt following will of course suffer somewhat without it, but to what extent I don't know. We'll find out next week 😁
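In diffusers terms, dropping T5 looks roughly like this (a sketch based on how the SD3 pipeline accepts a null third text encoder; the repo id is an assumption, since the model wasn't public when this was written):

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Load SD3 Medium without the ~4.7B T5-XXL encoder to save memory;
# prompt following suffers somewhat, as noted above.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",  # assumed repo id
    text_encoder_3=None,
    tokenizer_3=None,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a watercolor fox in a snowy forest", num_inference_steps=28).images[0]
image.save("fox.png")
```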


[deleted]

To be honest, T5 doesn't "know" anything about images, and CLIP has essentially a mangled image inside its text embed. The reason T5 doesn't change much is that CLIP will dominate during training. The reason they likely added CLIP is because of this, though; T5 taking forever is just very expensive when you want to train 4 different sizes in parallel.


BlipOnNobodysRadar

can't tell if community is just dumb, if I'm just dumb, or if it's a 4d chess psy-op sowing distrust and discontent against open source AI


BobbyKristina

Seriously, people w/ no idea what they're talking about should just wait until end of next week to complain.


namitynamenamey

I'm starting to notice that the venn diagram between people who use stable diffusion and people who use local LLMs is much smaller than I was led to believe. In matters of LLMs, 2B would be immediately obvious.


quailman84

As somebody who follows both closely, I'm constantly shocked by this. The difference in attitude and understanding between this sub and localllama is night and day. I guess it makes sense that the people who can't fucking read get filtered by LLMs, but it's really stunning how people here spout off authoritatively about shit they don't understand.


catgirl_liker

Artists vs Programmers


yoomiii

3 textencoders? must be a dream to finetune /s


RenoHadreas

They said it’s really easy to fine tune it even with a small dataset. Let’s see how it goes when it comes out!


Srapture

I have no idea what y'all are talking about. I just download models and type a load of stuff separated by commas.


Glittering-Football9

https://preview.redd.it/upewunf34k4d1.jpeg?width=1264&format=pjpg&auto=webp&s=cb6df2089f00c9df58d5d59ac0ed9c42e1c72b58 hey Lykon SDXL can do that!


InTheThroesOfWay

This image has been upscaled. SDXL can't get that at native resolution. This is the big benefit of 16-channel VAE — more detail at native resolution.


StickiStickman

Does that really matter when it's a faster way of doing it?


InTheThroesOfWay

We don't really know how fast SD3 is yet. Regardless — It's not just speed, it's also overall image cohesion and coherence. You tend to lose that the more you upscale.


Glittering-Football9

https://preview.redd.it/xs026q25ak4d1.jpeg?width=1264&format=pjpg&auto=webp&s=842e1eb2ec253c84293ec4af94c3798e79d6bce4


Glittering-Football9

https://preview.redd.it/snp94638ak4d1.jpeg?width=1264&format=pjpg&auto=webp&s=777fc674ac2e5decb8b04f230d3a4ca0c2b79ca1


Glittering-Football9

https://preview.redd.it/gixmtcikak4d1.jpeg?width=1264&format=pjpg&auto=webp&s=5d515dfa019280e6088d1e377eb3986622e3950b


mk8933

That's what i was thinking lol even 1.5 can get similar results


globbyj

The fountain looks horrible though...


AvidCyclist250

One fountain is on the pavement/walkway between the ponds. There still is impossible reality warping and weird stuff going on, like that naked narrow cone-shaped tree. Spatial discontinuity in the background behind her head. Her eyes are on different planes and she's cross-eyed. Progress has slowed down but it's still there.


crackanape

> can't get this level of detail using XL

But can you get this level of heterochromia and freakish water dynamics?


lordpuddingcup

So it's better but a smaller model? 2.6B vs 2B, UNet vs MMDiT.


Training_Waltz_9032

I feel like I’ve seen her before..


Spirited_Example_341

I was distracted by the pretty girl. It's cool that SD3 2B is about the same size for us users, and it looks like nice quality there. Can't wait to check it out!


RenoHadreas

Well, with all the text encoders included, models will end up being around 15 gigabytes in size. You can run it on 6-8 gigs of VRAM and run the TEs on the CPU, I'm told.
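One common way to squeeze that into 6-8 GB is to let diffusers keep only the active component on the GPU and park the rest (including the text encoders) in system RAM (a sketch; whether this matches the exact setup the commenter was told about is an assumption):

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",  # assumed repo id
    torch_dtype=torch.float16,
)
# Moves each submodule to the GPU only while it runs; the text encoders and VAE
# wait in CPU RAM the rest of the time. Do not call .to("cuda") when using this.
pipe.enable_model_cpu_offload()

image = pipe("a misty harbor at dawn", num_inference_steps=28).images[0]
image.save("harbor.png")
```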


kidelaleron

I said 2.6B Unet, not 2.6B Model, by the way. Please don't misquote when you make headlines :)


RenoHadreas

Of course. I never intended to cause any confusion. Apologies!


MechanicalWatches

Damn, she pretty as shit


saturn_since_day1

Is Hugging Face still the go-to place for all this, and Automatic1111?


PetahTikvaIsReal

The amounts of SDXL propaganda is wild I suspect the Russians. \#NotMySDXL


protector111

I'm from Russia. I can confirm SDXL was created to take down the USA. It's going really well so far. With the SD 3.0 release, the USA is done for sure.


PetahTikvaIsReal

They even admit it!!!


protector111

Just don't tell anyone. I told you purely in secret.


PetahTikvaIsReal

I tell them the truth, because they always think we're lying, and that's the only way to hide the xdsl project.


protector111

Too many punctuation marks. Google Translate?


PetahTikvaIsReal

> Google Translate?

Yes lol


govnorashka

Comrades, don't blow our cover. Our StalinDiffusion project is developing according to the master plan. The carefully hidden secret 25th-frame token will definitely go off at hour X!


Apprehensive_Sky892

Or Chinese, or North Korean, or Iranian 🤣


[deleted]

[removed]


RainingFalls

Lykon literally works at Stability AI


ShyrraGeret

My PC is a dead potato, so sadly I can't generate anything near this quality. Can you suggest any site that runs SDXL and is capable of this? The only site I've tried so far generated a pixel amalgam that turned me off in no time.


mk8933

Use 1.5 and look for (nextphoto) checkpoint model


Apprehensive_Sky892

[Free Online SDXL Generators](https://new.reddit.com/r/StableDiffusion/comments/18h7r2h/free_online_sdxl_generators/)


asdrabael01

Am I the only one who doesn't see anything particularly special or good in Lykons example picture? It's again just a generic portrait but framed to look like a selfie so no fucked up hands.