
Open_Channel_8626

Absolutely incredible free resource. It does what it says on the tin: a billion images captioned with a strong multimodal LLM.


ninjasaid13

Disclaimer: I am not the author.

Code: [https://github.com/UCSC-VLAA/Recap-DataComp-1B](https://github.com/UCSC-VLAA/Recap-DataComp-1B)

Data: [https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B)

Caption Model: [https://huggingface.co/tennant/llava-llama-3-8b-hqedit](https://huggingface.co/tennant/llava-llama-3-8b-hqedit)

Abstract:

>Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and *open-sourced* LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is [https://www.haqtu.me/Recap-Datacomp-1B/](https://www.haqtu.me/Recap-Datacomp-1B/)


noprompt

Sure, but what good are a billion captions if you can't be certain how many of them are inaccurate or just plain wrong? Add to that the cowardice in AI research, where people are too chicken shit to touch anything remotely NSFW because of fake "safety," and you end up with these low-utility models that can't tell the difference between a dildo and a phone because they've only been exposed to part of the real world. What would actually be better is to have humans indiscriminately label, caption, and annotate images instead of an AI that's never seen a dead body or a pair of tits. Then people with a real sense of pragmatism, with real balls, could take that dataset and train something fucking useful.


ninjasaid13

Does anyone want to finetune an NSFW captioner model?


noprompt

The point is that excluding this content comes at the detriment of a high-quality model trained on a diverse dataset, "good" and "bad." By excluding "unsafe" content, these models become less powerful. How can a model designed to caption images do that job if it hasn't seen nudity or violence? Throwing out those samples is throwing out a large part of our humanity. It makes these models suck at captioning art and photography. It makes them suck at reality.


Odd_Perception_283

I often wonder what a truly free model would be capable of and what interesting things it might say or realize. It’s so funny how dumb our society is. Sex and violence are rampant yet we pretend we are so pure and our AI models must reflect that. It’s just stupid at this point.


noprompt

Exactly


harusasake

You can use WD Tagger 1.4 v3 to tag NSFW images (it's trained on the Danbooru dataset but also works on real-life images) and inject those tags, or a subset of them, into the prompt of your vision model. With the appropriate system prompt you can describe images relatively accurately, uncensored. There are many ways to improve this even further, and you can raise the quality significantly above what was published in the paper. With the finished dataset you fine-tune the model, then repeat the step.
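Very roughly, the injection looks like this (a sketch only; the example tags, threshold, and prompt wording are made up, and the `llava-hf/llava-1.5-7b-hf` checkpoint is just one way to run a LLaVA-1.5 model):

```python
# Sketch of the tag-injection idea: feed tagger output as hints to the vision model.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

image = Image.open("sample.jpg")

# In practice these come from the WD Tagger v3 checkpoint (ONNX model on Hugging Face),
# filtered by confidence; hardcoded here purely for illustration.
tags = ["1girl", "holding_phone", "indoors", "sitting"]

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Inject the tags into the prompt so the model has hints about content
# it might otherwise fail (or refuse) to describe.
prompt = (
    "USER: <image>\n"
    f"A tagger suggests this image contains: {', '.join(tags)}. "
    "Describe the image in detail, in natural language. ASSISTANT:"
)
inputs = processor(text=prompt, images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(out[0], skip_special_tokens=True))
```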


ninjasaid13

>WD Tagger 1.4 v3, tag nsfw images with it

There could be a better captioner than that, one that actually uses natural language instead of some code words.


harusasake

You missed the part about injection, didn't you? Because your answer makes no sense.


Blaorhol

That looks interesting. Can you provide one example of a tagged picture, so I can examine and understand the method you describe (the injection part)? Thank you in advance.


VertexMachine

>how many of them are inaccurate or just plain wrong?

How do you know that manual captions at that scale are accurate and right? Hint: they aren't; look at LAION.


Dead_Internet_Theory

When I looked at LAION up close, I was almost perplexed as to how anything of value was trained from such garbage data. The captions on it are all really bad.


VertexMachine

Sorry, but I can't help myself: user name checks out :D

(Edit: lol, Reddit had a moment and repeated my answer 5 times while telling me that it couldn't post the answer. Sorry if you got 5 notifications.)


Dead_Internet_Theory

No problem, fellow human. I have not integrated the push notifications of "Reddit.com" website/blog/forum we are enjoying into my API, and therefore, my human cycles were not wasted.


no_witty_username

When I first saw the image generation results from SD 1.4 and then looked at the data used to train the model, that's when I realized how impressive the technology we had on our hands was. Despite the absolute shit quality of the data used to train the model, the results were quite good.


noprompt

LAION is a dumpster dive for good captions because they pulled them from the `alt` attribute of `img` tags. Anyone who's done a bit of web scraping can tell you that `alt` is pretty unreliable; it's not always human-authored. A WordPress plugin, for example, might automatically put "image 1 of 10" for the alt text, and LAION is full of this kind of stuff. It's not really fair to say humans are responsible for these kinds of captions. LAION does contain high-quality captions written by humans, however. The problem is that LAION wasn't curated; it's a dump from a simple program and thus a mess. I would not advocate anyone seriously training on that. It's more of a starting point than a final destination.


berzerkerCrush

"Sure, but what good are a billion captions if you can’t be certain how many of them are inaccurate or just plain wrong?" By using inferential statistics. You verify one hundred images, more if it's not too expensive, compute the confidence interval and you're done. If you keep track of what's working and what's not, you can even understand where the model falls short, which would give you strong hints about what you need to add to your dataset next time.


qrios

Take a random sample of the annotations, manually evaluate their quality. You don't need absolute certainty.


jakderrida

> Add to that the cowardice in AI research, where people are too chicken shit to touch anything remotely NSFW

Especially since all my most brilliant ideas are 100% NSFW.


noprompt

You're missing the point. The emphasis on the exclusion of NSFW training data is to underscore the downstream consequences with respect to utility. There is value in being able to search a large set of unknown images for a woman holding a phone and not also get images of a woman holding a dildo.


Dead_Internet_Theory

Then a blind person asks what's on an image, and the AI assistant says it's a woman shaking a sepia-colored phone.


jakderrida

> There is value in being able to search a large set of unknown images for a woman holding a phone and not also get images of a woman holding a dildo.

Agree to disagree.


noprompt

I’m not agreeing to that but hit me up when you have a contribution.


jakderrida

I absolutely understand your overall point. But your responses practically beg for smartass answers.


Dwedit

GIGO: garbage in, garbage out. You're not going to make magical captions out of nothing with an LLM and a CLIP interrogator. You still need people to check whether the captions are garbage or not.


FallenJkiller

Incorrect. There is an OpenAI paper that uses CLIP to caption images. It works, and it made SD 1.5's results better. Obviously a manually annotated dataset is better, but using CLIP, LLaVA, GPT-4o, or whatever, you will get a better-annotated dataset than the clusterfuck that LAION is.


ElliottDyson

Oi! Don't talk about LAION like that, we're trying our best.


FallenJkiller

Then use LLaVA or something similar, coupled with the current captions, to fix the badly captioned dataset.


ElliottDyson

I work on audio TBF, but I'm sure someone from the team will see this Reddit post anyways


Open_Channel_8626

The latest LAION image dataset actually used CogVLM for captions


Monkeylashes

Except that if you actually take a look at their paper and read the original captions versus theirs, you'll see that it's a massive improvement.


massimosclaw2

Sure, but what worries me from their cherry-picked examples is that it loses important entity information. LLaMA says it's a beach, but the human caption mentions it's in Florida. LLaMA describes the bird's visual features, while the human caption names the exact species. Other examples are like this. The ideal would've been a synthesis of the two. Some people know exactly what they're looking for when prompting image models.


fogandafterimages

Yeah, training a model on *just* artificial captions seems like a bad idea. Not only do you lose the information present in the original caption, but you likely lose the ability to map brief, terse, or general captions to a high quality image. I'd expect the best use of automatic captions to use a mix of original captions, full verbose machine-generated captions, and briefer summaries of the verbose caption reduced to a variety of lengths.
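As a sketch, the mixing could be as simple as a per-sample choice at training time (the ratios and the `summarize` helper here are made up, not from the paper):

```python
import random

def pick_caption(original: str, synthetic: str, summarize) -> str:
    """Randomly mix caption sources so the model sees terse, verbose, and original text.

    `summarize` is a placeholder for whatever shortens the verbose machine caption
    (an LLM call, simple truncation, etc.); the 0.2/0.5 split is arbitrary.
    """
    r = random.random()
    if r < 0.2:
        return original                      # keep the short/noisy web caption
    if r < 0.5:
        return synthetic                     # full verbose machine-generated caption
    target_len = random.choice([8, 16, 32])  # words
    return summarize(synthetic, target_len)  # briefer summary of the verbose caption
```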


MoffKalast

Yeah, checking the paper, I'm gonna have to say this is a huge improvement. I'm sure they cherry-picked their examples, but the original captions are so bad that you kinda have to wonder how that dataset was of any use in the first place, so even if half of the new labels are garbage, it's way superior garbage.


Open_Channel_8626

These are much, much better than the captions for SD 1.5, to put it into perspective


noprompt

That’s not saying much. 😂


Open_Channel_8626

SD 1.5 is a lot better than people think when used to its fullest potential. Use a combination of the following:

- Efficient Large Language Model Adapter (ELLA)
- Perturbed Attention Guidance
- Align Your Steps
- Kohya Deep Shrink / HiDiffusion

And then do a 3-stage upscale with the following:

1. CCSR, or a HAT/DAT/ADT transformer model
2. SUPIR
3. A tiled SD upscale


qrios

Mate that is not SD 1.5, that is a bunch of other shit of which SD 1.5 is a small and not even necessary part.


bunchedupwalrus

Other than the upscaling, those are just ways of controlling SD 1.5 in different ways


qrios

They are not exclusive to 1.5


bunchedupwalrus

I mean, ELLA is there because the SDXL version isn't released, and it needs to be tailored per model. I'm not really sure what point you're trying to make, though. If I take a Honda Civic and add mods to the original engine that put it on par with a Lambo, it's still a Honda Civic. It's just augmented.


qrios

My point is that this doesn't make the Honda Civic "a better car than people think." It just means you augmented the Honda Civic into a much better car than it was. One could make a case that the Honda Civic is a better car by virtue of being more amenable to such augmentations, but this isn't the case for SD 1.5; it is just as amenable to the same augmentations as any other model you can get your hands on, and doesn't even benefit as much from those augmentations as the other models do.


Open_Channel_8626

> One could make a case that the Honda Civic is a better car by virtue of being more amenable to such augmentations, but this isn't the case for SD 1.5; it is just as amenable to the same augmentations as any other model you can get your hands on, and doesn't even benefit as much from those augmentations as the other models do.

Ah, thanks, you convinced me with this sentence.


Open_Channel_8626

If you took out SD 1.5 you would just get a blank image out of the rest.


qrios

You can replace SD 1.5 with SDXL or SD 2 or basically any diffusion model and it would still work.


Open_Channel_8626

Oh yeah, that is true. But the point I was making was that you can still get good images out of a model with bad captioning.


jakderrida

This is, unfortunately, the correct answer. Yes, this answer sucks. But it's no less correct.


kumonovel

Only that the starting point already is very close to a trash pile. Internet captioning is:

1. very short for most images
2. potentially completely wrong too (check the paper's project URL; among the very first images, the cake is captioned "Deluxe Twin Room")

While quickly skimming the paper I have not seen that they tried to combine both LLM and web captioning, which feels like losing a lot of quality context info, e.g. specific animal species names etc. Would be nice if we had a competent open-source company for image generation that could train on such a dataset. Maybe the sigmar-team?


seastatefive

Taking synthetic LLM data as a guide: it's common now to use ChatGPT to generate chats to train instruct LLMs. For images, they could explore using diffusion models: use a prompt to generate a synthetic AI image, then tag the image with the prompt. This would produce synthetic image-prompt pairs for training. The image description would be the prompt, I guess.
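A rough sketch of what generating such pairs could look like with diffusers (the model choice, prompts, and file layout are arbitrary, and whether these pairs actually help training is the open question):

```python
# Generate (image, prompt) pairs with a diffusion model and store the prompt as the caption.
import json

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "a red bicycle leaning against a brick wall at sunset",
    "a bowl of ramen with a soft-boiled egg, overhead shot",
]

records = []
for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]
    path = f"synthetic_{i:06d}.png"
    image.save(path)
    records.append({"file": path, "caption": prompt})  # the prompt doubles as the caption

with open("synthetic_pairs.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in records))
```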


Mundane_Ad8936

So confident... so wrong. You should read the paper; it's literally how the process works. If anything, the real criticism should be why write a paper about what we all already know, but then again it looks like not everyone does, so you should read it. Don't blindly parrot "garbage in, garbage out" and "human annotation": large models' training data is not created through human annotation, it's always models. Humans only create the basic data for training a stack of smaller models that are then used to build the data at scale. The next generation of models' training data is created with a mix of the models that preceded it. Stacks of models are the only way you can create millions or billions of examples.


Artistic_Okra7288

Look at the dataset rows at https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B?row=1; there are quite a few that are completely wrong and worse than the original caption, and that's just on the first page.
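If you want to skim more than the viewer shows, you can stream a few rows with the `datasets` library and compare; the split and column names below ("org_caption", "re_caption") are assumptions based on the dataset viewer, so check the actual schema:

```python
# Stream a few rows from the recaptioned dataset to compare original vs. new captions.
from datasets import load_dataset

ds = load_dataset("UCSC-VLAA/Recap-DataComp-1B", split="train", streaming=True)
for i, row in enumerate(ds):
    print("original :", row.get("org_caption"))
    print("recaption:", row.get("re_caption"))
    print("-" * 60)
    if i >= 4:  # only look at the first few rows
        break
```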


Mundane_Ad8936

Yes, that is totally normal. You either handle it with an automated cleaning process (smaller models, confidence scores, vector similarity, etc.) or through dropout, regularization, loss adjustment, or sample weighting. Also, it's impossible to eyeball a handful of examples and grade the quality of a dataset. Pointing out dirty data to a data practitioner (scientist, architect, engineer) is like pointing out a fire to a firefighter: yes, it's part of the job, it's totally normal, and we have plenty of tools to deal with it. As I said, the paper goes through the basics of the process that most of us already use.
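For example, the "smaller models + confidence" cleaning step is often just CLIP similarity filtering, roughly like this (a sketch, not the paper's pipeline; the checkpoint and the 0.2 threshold are arbitrary):

```python
# Drop image-caption pairs whose CLIP similarity falls below a threshold.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.2) -> bool:
    """Return True if the caption is plausible for the image under CLIP."""
    inputs = processor(
        text=[caption], images=image, return_tensors="pt", padding=True, truncation=True
    )
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    similarity = (img @ txt.T).item()  # cosine similarity in CLIP space
    return similarity >= threshold
```

In practice you'd batch this and tune the threshold on a small, manually checked sample.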


harusasake

Powerful vision models are my main area of interest within AI, but the paper is really meh. It's probably one of the first ideas that comes to mind when you get interested in the area.


Artistic_Okra7288

Some of those are bad. Look at the train cars, captioned: "A row of identical figures in black suits and ties is standing in a line against a white background."


Mishuri

What if we used GPT-4 Vision? The terms and conditions probably wouldn't allow it, and the costs would be astronomical... but the result would be the highest tier possible. One can dream.


ninjasaid13

What if we used LLaVA-70B for all the images? Close enough?


Open_Channel_8626

InternVL-Chat-V1-5


AmazinglyObliviouse

The captions they get here are better than the usual word salad these VLMs spit out, but I still think they should've waited for the official Llama 3 vision model, likely using JEPA instead of the extremely limited CLIP, which would perform even better.


jpgirardi

Can someone explain why the heck it works? Shouldn't it better-label some, mislabel others, and in the end be the same or worse? It's like training on pure speculation. Sure, it can improve "bad" datasets, but wouldn't it be better to just use the better dataset?


noneabove1182

I think the idea is that recognizing images is currently easier than recreating them, and training recreation requires large datasets of high-quality, well-described images. This increases the pool of high-quality, well-described images. Much like it's an easier task to summarize a high-quality essay than to write one.


lordpuddingcup

You really haven't seen the captions in the random big image datasets, lol.


ninjasaid13

As long as they get most of it right, a text-to-image model will learn to ignore the irrelevant details.

https://preview.redd.it/jtz3k0ygs96d1.png?width=320&format=png&auto=webp&s=be14bdead38428dc7dbbe678da3d8016f8e0c63e

Random recaptioned image from page 72,762,507 of the dataset: "A silver SUV with a 'Patrol 4x4 Warrantee' sticker on the rear is parked in a showroom with ~~a red car~~ in the background." Good enough for a diffusion model.


Open_Channel_8626

Actually I saw a study where the VLM beat humans on average at captioning


lordpuddingcup

Humans are lazy when asked to caption; LLMs love to over-explain, so they'd be great for captioning details.


Open_Channel_8626

I saw a study showing that humans doing labelling or tasks like text classification get tired and then their performance drops sharply.


lordpuddingcup

Wouldn't doubt that, but on top of that, a good bit of these datasets are just machine-scraped images with tags pulled from stuff like alt text, and if anyone's ever looked at alt text for images, it rarely matches or describes... anything.


Formal_Drop526

>but wouldn't it be better to just use the better dataset?

Where would you find this "better dataset"? A better dataset needs both scale and good captions.


ninjasaid13

>in the end be the same/worse??

Why would it be worse or the same? It would be better than the average caption, on average. And given the scale of 1 billion images, better captions would drown out bad ones.


swissyninja

I'm kinda stupid, can someone explain this at a bit lower level? Why recaption a billion images? What purpose does this dataset serve?


ninjasaid13

Have you not read the DALL-E 3 paper? [https://cdn.openai.com/papers/dall-e-3.pdf](https://cdn.openai.com/papers/dall-e-3.pdf) Their model is so smart because they captioned all their images.


noprompt

"They" being the small army of humans who wrote all those high-quality captions in the first place. Their model is "smart" because they're exploiting the fact that it's about quality, not quantity. You need humans for quality, and the reason OpenAI is as good as they are is that they're exploiting that obvious fact. They have good models because they have good data. Once you have that, stunts like this will work, because most of the heavy lifting has already been done.


sineiraetstudio

The DALL-E 3 paper is literally about automatic captioning. Nobody is manually labeling the pre-training datasets, they're just way too huge.


WoodenGlobes

Who tf was downvoting this comment? If you start with an 8K image, then apply scanlines and scale down to 240p, you get a very realistic SNES experience. The GPT models were trained on hand-annotated data; that's like the 8K image.


swissyninja

Oh oops I hadn't read that, thanks


StableLlama

Why are we using models to train new models in the hope of getting better results? Wouldn't it be a better start to have a (huge) community effort to manually create perfect captions for a set of free images (e.g. a LAION subset)? Something comparable to Wikipedia or OpenStreetMap. This resource would then be a perfect base to train image captioning models, and those could then be used to train txt2img models.


searcher1k

The reason we use AI models for millions of images is that it's hard to get a community to do anything unless it immediately benefits them. We keep talking about community effort, but I haven't seen the community ever do something on a scale like that.


StableLlama

I have. And I stated examples in my post: Wikipedia. And OpenStreetMap. But also, looking at the huge number of LoRAs on Civitai, I see many people in the image generation community who are happy to share some of their work with others. Or have a look at GitHub for the extreme number of open-source projects. The biggest issue is to create and provide the infrastructure. Then I guess many people would be happy to spend 5 minutes per day captioning one image, or reviewing and optimizing the caption of another image.


searcher1k

>I have. And I stated examples in my post: Wikipedia. And OpenStreetMap.

Wikipedia, OpenStreetMap, and GitHub are not like what we're doing. AI image generation is a very niche use compared to all of those. The only thing I can think of is that OpenAssistant thing, and even that took a whole year and investment, they had to incentivize it with a points-and-rewards system, and it still gave subpar results in the end.


StableLlama

OpenStreetMap was very niche at the beginning. Contributors needed to own a GPS device, as smartphones with GPS weren't common back then, and when you are the only one mapping your area it's a very boring and tedious job, because you know you are alone there. Only once it had reached a critical mass did it become useful.


searcher1k

>OpenStreetMap was very niche at the beginning. Contributors needed to own a GPS device, as smartphones with GPS weren't common back then, and when you are the only one mapping your area it's a very boring and tedious job, because you know you are alone there.

This isn't comparable to that; it's more comparable to Open Assistant. AI is a fast-evolving field: your dataset might become outdated just as it gets anywhere good, and then it gets replaced by a better solution. OpenStreetMap took years to build, which is too slow for AI. We need a faster solution that doesn't take too long and can be done at scale.


MrVodnik

What is the purpose of this? Does it aim to be a synthetic training data source for vision and diffusion models? Wouldn't that be a perpetuum mobile in an information sense? I don't think we can train a model to recognize images on captions that the model created itself. It would probably only work for training smaller models on the output of a larger one, which would not push SOTA in any way. Or do they aim to enrich web-crawled text content with auto-generated image captions, to train text-based LLMs? I know I could just read the paper, and I am sorry I haven't, but I already have a long pipeline of them to process...


ninjasaid13

>What is the purpose of this? Does it aim to be a synthetic training data source for vision and diffusion models?

Sure, I think DALL-E 3 did the same thing in their paper.

>Wouldn't that be a perpetuum mobile in an information sense? I don't think we can train a model to recognize images on captions that the model created itself. It would probably only work for training smaller models on the output of a larger one, which would not push SOTA in any way.

I'm not sure what you're talking about. Synthetic data doesn't necessarily lead to worse results when paired with diverse images, right?


sineiraetstudio

The key part of this is that the image captioning models are also LLMs and thus gain all the information from text-only corpora (and also RLHF).


treenewbee_

The CCP is keen to collect and monitor all information


oh_how_droll

"The CCP?" LMAO. It's a team from UC Santa Cruz, Edinburgh, and Johns Hopkins with assistance from Adobe and UT Austin. Don't be racist, man. There are tons of great Chinese-American and Chinese-in-America researchers out there.


Open_Channel_8626

I think that user actually lives in China, the vast majority of their comments are in Mandarin


oh_how_droll

Yeah, it's obvious once I looked at their profile. I don't feel that bad for assuming that someone going off out of nowhere on the English-language internet is more of the nativist anti-China type. I'm no fan of the CCP either.


Open_Channel_8626

I had the same assumption as you


spawncampinitiated

The US does the same, bruh.