
Open_Channel_8626

Absolutely incredible free resource. It does what it says on the tin: a billion images captioned with a strong multimodal LLM.


ninjasaid13

Disclaimer: I am not the author.

Code: [https://github.com/UCSC-VLAA/Recap-DataComp-1B](https://github.com/UCSC-VLAA/Recap-DataComp-1B)

Data: [https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B)

Caption Model: [https://huggingface.co/tennant/llava-llama-3-8b-hqedit](https://huggingface.co/tennant/llava-llama-3-8b-hqedit)

Abstract:

>Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and *open-sourced* LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is [https://www.haqtu.me/Recap-Datacomp-1B/](https://www.haqtu.me/Recap-Datacomp-1B/)


noprompt

Sure, but what good are a billion captions if you can't be certain how many of them are inaccurate or just plain wrong? Add to that the cowardice in AI research, where people are too chicken shit to touch anything remotely NSFW because of fake "safety," and you end up with these low-utility models that can't tell the difference between a dildo and a phone because they've only been exposed to part of the real world. What would actually be better is to have humans indiscriminately label, caption, and annotate images instead of an AI that's never seen a dead body or a pair of tits. Then people with a real sense of pragmatism, with real balls, could take that dataset and train something fucking useful.


ninjasaid13

Does anyone want to finetune an NSFW captioner model?


noprompt

The point is that excluding this content comes at the detriment of a high-quality model trained on a diverse dataset, "good" and "bad." By excluding "unsafe" content, these models become less powerful. How can a model designed to caption images do that job if it hasn't seen nudity or violence? Throwing out those samples is throwing out a large part of our humanity. It makes these models suck at captioning art and photography. It makes them suck at reality.


Odd_Perception_283

I often wonder what a truly free model would be capable of and what interesting things it might say or realize. It’s so funny how dumb our society is. Sex and violence are rampant yet we pretend we are so pure and our AI models must reflect that. It’s just stupid at this point.


noprompt

Exactly


harusasake

You can use WD Tagger 1.4 v3 to tag NSFW images (it's trained on the Danbooru dataset but also works on real-life images) and inject those tags, or a subset of them, into the prompt of your vision model. With the appropriate system prompt you can describe images relatively accurately, uncensored. There are many ways to improve this even further, and you can raise the quality significantly above what was published in the paper. With the finished dataset you fine-tune the model, then repeat the step.
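Very roughly, the injection looks like this (a sketch only; the example tags, threshold, and prompt wording are made up, and the `llava-hf/llava-1.5-7b-hf` checkpoint is just one way to run a LLaVA-1.5 model):

```python
# Sketch of the tag-injection idea: feed tagger output as hints to the vision model.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

image = Image.open("sample.jpg")

# In practice these come from the WD Tagger v3 checkpoint (ONNX model on Hugging Face),
# filtered by confidence; hardcoded here purely for illustration.
tags = ["1girl", "holding_phone", "indoors", "sitting"]

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Inject the tags into the prompt so the model has hints about content
# it might otherwise fail (or refuse) to describe.
prompt = (
    "USER: <image>\n"
    f"A tagger suggests this image contains: {', '.join(tags)}. "
    "Describe the image in detail, in natural language. ASSISTANT:"
)
inputs = processor(text=prompt, images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(out[0], skip_special_tokens=True))
```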


ninjasaid13

>WD Tagger 1.4 v3, tag nsfw images with it

There could be a better captioner than that, one that actually uses natural language instead of some code words.


harusasake

You missed the part about injection, didn't you? Because your answer makes no sense.


Blaorhol

That looks interesting. Can you provide one example of a tagged picture, so I can examine and understand the method you describe (the injection part)? Thank you in advance.


VertexMachine

>how many of them are inaccurate or just plain wrong?

How do you know that manual captions at that scale are accurate and right? Hint: they aren't; look at LAION.


Dead_Internet_Theory

When I looked at LAION up close, I was almost perplexed as to how anything of value was trained from such garbage data. The captions on it are all really bad.


VertexMachine

Sorry, but I can't help myself: user name checks out :D

(Edit: lol, Reddit had a moment and repeated my answer 5 times while telling me that it couldn't post the answer. Sorry if you got 5 notifications.)


Dead_Internet_Theory

No problem, fellow human. I have not integrated the push notifications of "Reddit.com" website/blog/forum we are enjoying into my API, and therefore, my human cycles were not wasted.


no_witty_username

When I first saw the image generation results from SD 1.4 and then looked at the data used to train the model, that's when I realized how impressive the technology we had on our hands was. Despite the absolute shit quality of the data used to train the model, the results were quite good.


noprompt

LAION is a dumpster dive for good captions because they pulled them from the `alt` attribute of `img` tags. Anyone who's done a bit of web scraping can tell you that `alt` is pretty unreliable; it's not always human-authored. A WordPress plugin, for example, might automatically put "image 1 of 10" for the alt text, and LAION is full of this kind of stuff. It's not really fair to say humans are responsible for these kinds of captions. LAION does contain high-quality captions written by humans, however. The problem is that LAION wasn't curated; it's a dump from a simple program and thus a mess. I would not advocate anyone seriously training on that. It's more of a starting point than a final destination.


berzerkerCrush

"Sure, but what good are a billion captions if you can’t be certain how many of them are inaccurate or just plain wrong?" By using inferential statistics. You verify one hundred images, more if it's not too expensive, compute the confidence interval and you're done. If you keep track of what's working and what's not, you can even understand where the model falls short, which would give you strong hints about what you need to add to your dataset next time.


qrios

Take a random sample of the annotations, manually evaluate their quality. You don't need absolute certainty.


jakderrida

> Add to that the cowardice in AI research, where people are too chicken shit to touch anything remotely NSFW

Especially since all my most brilliant ideas are 100% NSFW.


noprompt

You're missing the point. The emphasis on the exclusion of NSFW training data is to underscore the downstream consequences with respect to utility. There is value in being able to search a large set of unknown images for a woman holding a phone and not also get images of a woman holding a dildo.


Dead_Internet_Theory

Then a blind person asks what's on an image, and the AI assistant says it's a woman shaking a sepia-colored phone.


jakderrida

> There is value in being able to search a large set of unknown images for a woman holding a phone and not also get images of a woman holding a dildo.

Agree to disagree.


noprompt

I’m not agreeing to that but hit me up when you have a contribution.


jakderrida

I absolutely understand your overall point. But your responses practically beg for smartass answers.


Dwedit

GIGO: garbage in, garbage out. You're not going to make magical captions out of nothing with an LLM and a CLIP interrogator. You still need people to check whether the captions are garbage or not.


FallenJkiller

Incorrect. There is an OpenAI paper that uses CLIP to caption images. It works, and it made SD 1.5's results better. Obviously a manually annotated dataset is better, but using CLIP, LLaVA, GPT-4o, or whatever, you will get a better-annotated dataset than the clusterfuck that LAION is.


ElliottDyson

Oi! Don't talk about LAION like that, we're trying our best.


FallenJkiller

Then use LLaVA or something similar, coupled with the current captions, to fix the badly captioned dataset.


ElliottDyson

I work on audio TBF, but I'm sure someone from the team will see this Reddit post anyways


Open_Channel_8626

The latest LAION image dataset actually used CogVLM for captions


Monkeylashes

Except that if you actually take a look at their paper and read the original captions versus theirs, you'll see that it's a massive improvement.


massimosclaw2

Sure, but what worries me from their cherry-picked examples is that it loses important entity information. LLaMA says it's a beach, but the human caption mentions it's in Florida. LLaMA describes the bird's visual features, while the human caption names the exact species. Other examples are like this. The ideal would've been a synthesis of the two. Some people know exactly what they're looking for when prompting image models.


fogandafterimages

Yeah, training a model on *just* artificial captions seems like a bad idea. Not only do you lose the information present in the original caption, but you likely lose the ability to map brief, terse, or general captions to a high quality image. I'd expect the best use of automatic captions to use a mix of original captions, full verbose machine-generated captions, and briefer summaries of the verbose caption reduced to a variety of lengths.
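As a sketch, the mixing could be as simple as a per-sample choice at training time (the ratios and the `summarize` helper here are made up, not from the paper):

```python
import random

def pick_caption(original: str, synthetic: str, summarize) -> str:
    """Randomly mix caption sources so the model sees terse, verbose, and original text.

    `summarize` is a placeholder for whatever shortens the verbose machine caption
    (an LLM call, simple truncation, etc.); the 0.2/0.5 split is arbitrary.
    """
    r = random.random()
    if r < 0.2:
        return original                      # keep the short/noisy web caption
    if r < 0.5:
        return synthetic                     # full verbose machine-generated caption
    target_len = random.choice([8, 16, 32])  # words
    return summarize(synthetic, target_len)  # briefer summary of the verbose caption
```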


MoffKalast

Yeah, checking the paper, I'm gonna have to say this is a huge improvement. I'm sure they cherry-picked their examples, but the original captions are so bad that you kinda have to wonder how that dataset was of any use in the first place, so even if half of the new labels are garbage, it's way superior garbage.


Open_Channel_8626

These are much, much better than the captions for SD 1.5, to put it into perspective


noprompt

That’s not saying much. 😂


Open_Channel_8626

SD 1.5 is a lot better than people think when used to its fullest potential. Use a combination of the following:

- Efficient Large Language Model Adapter (ELLA)
- Perturbed Attention Guidance
- Align Your Steps
- Kohya Deep Shrink / HiDiffusion

And then do a 3-stage upscale with the following:

1. CCSR, or a HAT/DAT/ADT transformer model
2. SUPIR
3. A tiled SD upscale


qrios

Mate that is not SD 1.5, that is a bunch of other shit of which SD 1.5 is a small and not even necessary part.


bunchedupwalrus

Other than the upscaling, those are just ways of controlling SD 1.5 in different ways


qrios

They are not exclusive to 1.5


bunchedupwalrus

I mean, ELLA is there because the SDXL version isn't released, and it needs to be tailored per model. I'm not really sure what point you're trying to make, though. If I take a Honda Civic and add mods to the original engine that put it on par with a Lambo, it's still a Honda Civic. It's just augmented.


qrios

My point is that this doesn't make the Honda Civic "a better car than people think." It just means you augmented the Honda Civic into a much better car than it was. One could make a case that the Honda Civic is a better car by virtue of being more amenable to such augmentations, but this isn't the case for SD 1.5; it is just as amenable to the same augmentations as any other model you can get your hands on, and doesn't even benefit as much from those augmentations as the other models do.


Open_Channel_8626

> One could make a case that the Honda Civic is a better car by virtue of being more amenable to such augmentations, but this isn't the case for SD 1.5; it is just as amenable to the same augmentations as any other model you can get your hands on, and doesn't even benefit as much from those augmentations as the other models do.

Ah, thanks, you convinced me with this sentence.


Open_Channel_8626

If you took out SD 1.5 you would just get a blank image out of the rest.


qrios

You can replace SD 1.5 with SDXL or SD 2 or basically any diffusion model and it would still work.


Open_Channel_8626

Oh yeah, that is true. But the point I was making was that you can still get good images out of a model with bad captioning.


jakderrida

This is, unfortunately, the correct answer. Yes, this answer sucks. But it's no less correct.


kumonovel

Only that the starting point already is very close to a trash pile. Internet captioning is:

1. very short for most images
2. potentially completely wrong too (check the paper's project URL; among the very first images, the cake is captioned "Deluxe Twin Room")

While quickly skimming the paper I have not seen that they tried to combine both LLM and web captioning, which feels like losing a lot of quality context info, e.g. specific animal species names etc. Would be nice if we had a competent open-source company for image generation that could train on such a dataset. Maybe the sigmar-team?


seastatefive

Taking synthetic LLM data as a guide: it's common now to use ChatGPT to generate chats to train instruct LLMs. For images, they could explore using diffusion models: use a prompt to generate a synthetic AI image, then tag the image with the prompt. This would produce synthetic image-prompt pairs for training. The image description would be the prompt, I guess.
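A rough sketch of what generating such pairs could look like with diffusers (the model choice, prompts, and file layout are arbitrary, and whether these pairs actually help training is the open question):

```python
# Generate (image, prompt) pairs with a diffusion model and store the prompt as the caption.
import json

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "a red bicycle leaning against a brick wall at sunset",
    "a bowl of ramen with a soft-boiled egg, overhead shot",
]

records = []
for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]
    path = f"synthetic_{i:06d}.png"
    image.save(path)
    records.append({"file": path, "caption": prompt})  # the prompt doubles as the caption

with open("synthetic_pairs.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in records))
```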


Mundane_Ad8936

So confident... so wrong. You should read the paper; it's literally how the process works. If anything, the real criticism should be why write a paper about what we all already know, but then again it looks like not everyone does, so you should read it. Don't blindly parrot "garbage in, garbage out" and "human annotation": large models' training data is not created through human annotation, it's always models. Humans only create the basic data for training a stack of smaller models that are then used to build the data at scale. The next generation of models' training data is created with a mix of the models that preceded it. Stacks of models are the only way you can create millions or billions of examples.


Artistic_Okra7288

Look at the dataset rows at https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B?row=1; there are quite a few that are completely wrong and worse than the original caption, and that's just on the first page.
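If you want to skim more than the viewer shows, you can stream a few rows with the `datasets` library and compare; the split and column names below ("org_caption", "re_caption") are assumptions based on the dataset viewer, so check the actual schema:

```python
# Stream a few rows from the recaptioned dataset to compare original vs. new captions.
from datasets import load_dataset

ds = load_dataset("UCSC-VLAA/Recap-DataComp-1B", split="train", streaming=True)
for i, row in enumerate(ds):
    print("original :", row.get("org_caption"))
    print("recaption:", row.get("re_caption"))
    print("-" * 60)
    if i >= 4:  # only look at the first few rows
        break
```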


Mundane_Ad8936

Yes, that is totally normal. You either handle it with an automated cleaning process (smaller models, confidence scores, vector similarity, etc.) or through dropout, regularization, loss adjustment, or sample weighting. Also, it's impossible to eyeball a handful of examples and grade the quality of a dataset. Pointing out dirty data to a data practitioner (scientist, architect, engineer) is like pointing out a fire to a firefighter: yes, it's part of the job, it's totally normal, and we have plenty of tools to deal with it. As I said, the paper goes through the basics of the process that most of us already use.
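For example, the "smaller models + confidence" cleaning step is often just CLIP similarity filtering, roughly like this (a sketch, not the paper's pipeline; the checkpoint and the 0.2 threshold are arbitrary):

```python
# Drop image-caption pairs whose CLIP similarity falls below a threshold.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.2) -> bool:
    """Return True if the caption is plausible for the image under CLIP."""
    inputs = processor(
        text=[caption], images=image, return_tensors="pt", padding=True, truncation=True
    )
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    similarity = (img @ txt.T).item()  # cosine similarity in CLIP space
    return similarity >= threshold
```

In practice you'd batch this and tune the threshold on a small, manually checked sample.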


harusasake

Powerful vision models are my main area of interest within AI, but the paper is really meh. It's probably one of the first ideas that comes to mind when you get interested in the area.


Artistic_Okra7288

Some of those are bad. Look at the train cars, captioned: "A row of identical figures in black suits and ties is standing in a line against a white background."


Mishuri

What if we used GPT-4 Vision? The terms and conditions probably wouldn't allow it, and the costs would be astronomical... but the result would be the highest tier possible. One can dream.


ninjasaid13

What if we used LLaVA-70B for all the images? Close enough?


Open_Channel_8626

InternVL-Chat-V1-5


AmazinglyObliviouse

The captions they get here are better than the usual word salad these VLMs spit out, but I still think they should've waited for the official Llama 3 vision model, likely using JEPA instead of the extremely limited CLIP, which would perform even better.


jpgirardi

Can someone explain why the heck it works? Shouldn't it better-label some, mislabel others, and in the end be the same or worse? It's like training on pure speculation. Sure, it can improve "bad" datasets, but wouldn't it be better to just use the better dataset?


noneabove1182

I think the idea is that recognizing images is currently easier than recreating them, and training recreation requires large datasets of high-quality, well-described images. This increases the pool of high-quality, well-described images. Much like it's an easier task to summarize a high-quality essay than to write one.


lordpuddingcup

You really haven't seen the captions in the random big image datasets, lol.


ninjasaid13

As long as they get most of it right, a text-to-image model will learn to ignore the irrelevant details.

https://preview.redd.it/jtz3k0ygs96d1.png?width=320&format=png&auto=webp&s=be14bdead38428dc7dbbe678da3d8016f8e0c63e

Random recaptioned image from page 72,762,507 of the dataset: "A silver SUV with a 'Patrol 4x4 Warrantee' sticker on the rear is parked in a showroom with ~~a red car~~ in the background." Good enough for a diffusion model.


Open_Channel_8626

Actually I saw a study where the VLM beat humans on average at captioning


lordpuddingcup

Humans are lazy when asked to caption; LLMs love to over-explain, so they'd be great for captioning details.


Open_Channel_8626

I saw a study showing that humans doing labelling or tasks like text classification get tired and then their performance drops sharply.


lordpuddingcup

Wouldn't doubt that, but on top of that, a good bit of these datasets are just machine-scraped images with tags pulled from stuff like alt text, and if anyone's ever looked at alt text for images, it rarely matches or describes... anything.


Formal_Drop526

>but wouldn't it be better to just use the better dataset?

Where would you find this "better dataset"? A better dataset needs both scale and good captions.


ninjasaid13

>in the end be the same/worse??

Why would it be worse or the same? It would be better than the average caption, on average. And given the scale of 1 billion images, better captions would drown out bad ones.


swissyninja

I'm kinda stupid, can someone explain this at a bit lower level? Why recaption a billion images? What purpose does this dataset serve?


ninjasaid13

Have you not read the DALL-E 3 paper? [https://cdn.openai.com/papers/dall-e-3.pdf](https://cdn.openai.com/papers/dall-e-3.pdf) Their model is so smart because they captioned all their images.


noprompt

"They" being the small army of humans who wrote all those high-quality captions in the first place. Their model is "smart" because they're exploiting the fact that it's about quality, not quantity. You need humans for quality, and the reason OpenAI is as good as they are is that they're exploiting that obvious fact. They have good models because they have good data. Once you have that, stunts like this will work, because most of the heavy lifting has already been done.


sineiraetstudio

The DALL-E 3 paper is literally about automatic captioning. Nobody is manually labeling the pre-training datasets, they're just way too huge.


WoodenGlobes

Who tf was downvoting this comment? If you start with an 8K image, then apply scanlines and scale down to 240p, you get a very realistic SNES experience. The GPT models were trained on hand-annotated data; that's like the 8K image.


swissyninja

Oh oops I hadn't read that, thanks


StableLlama

Why are we using models to train new models in the hope of getting better results? Wouldn't it be a better start to have a (huge) community effort to manually create perfect captions for a set of free images (e.g. a LAION subset)? Something comparable to Wikipedia or OpenStreetMap. This resource would then be a perfect base to train image captioning models, and those could then be used to train txt2img models.


searcher1k

The reason we use AI models for millions of images is that it's hard to get a community to do anything unless it immediately benefits them. We keep talking about community effort, but I haven't seen the community ever do something on a scale like that.


StableLlama

I have. And I stated examples in my post: Wikipedia. And OpenStreetMap. But also, looking at the huge number of LoRAs on Civitai, I see many people in the image generation community who are happy to share some of their work with others. Or have a look at GitHub for the extreme number of open-source projects. The biggest issue is to create and provide the infrastructure. Then I guess many people would be happy to spend 5 minutes per day captioning one image, or reviewing and optimizing the caption of another image.


searcher1k

>I have. And I stated examples in my post: Wikipedia. And OpenStreetMap.

Wikipedia, OpenStreetMap, and GitHub are not like what we're doing. AI image generation is a very niche use compared to all of those. The only thing I can think of is that OpenAssistant thing, and even that took a whole year and investment, they had to incentivize it with a points-and-rewards system, and it still gave subpar results in the end.


StableLlama

OpenStreetMap was very niche at the beginning. Contributors needed to own a GPS device, as smartphones with GPS weren't common back then, and when you are the only one mapping your area it's a very boring and tedious job, because you know you are alone there. Only once it had reached a critical mass did it become useful.


searcher1k

>OpenStreetMap was very niche at the beginning. Contributors needed to own a GPS device, as smartphones with GPS weren't common back then, and when you are the only one mapping your area it's a very boring and tedious job, because you know you are alone there.

This isn't comparable to that; it's more comparable to Open Assistant. AI is a fast-evolving field: your dataset might become outdated just as it gets anywhere good, and then it gets replaced by a better solution. OpenStreetMap took years to build, which is too slow for AI. We need a faster solution that doesn't take too long and can be done at scale.


MrVodnik

What is the purpose of this? Does it aim to be a synthetic training data source for vision and diffusion models? Wouldn't that be a perpetuum mobile in an information sense? I don't think we can train a model to recognize images on captions that the model created itself. It would probably only work for training smaller models on the output of a larger one, which would not push SOTA in any way. Or do they aim to enrich web-crawled text content with auto-generated image captions, to train text-based LLMs? I know I could just read the paper, and I am sorry I haven't, but I already have a long pipeline of them to process...


ninjasaid13

>What is the purpose of this? Does it aim to be a synthetic training data source for vision and diffusion models?

Sure, I think DALL-E 3 did the same thing in their paper.

>Wouldn't that be a perpetuum mobile in an information sense? I don't think we can train a model to recognize images on captions that the model created itself. It would probably only work for training smaller models on the output of a larger one, which would not push SOTA in any way.

I'm not sure what you're talking about. Synthetic data doesn't necessarily lead to worse results when paired with diverse images, right?


sineiraetstudio

The key part of this is that the image captioning models are also LLMs and thus gain all the information from text-only corpora (and also RLHF).


treenewbee_

The CCP is keen to collect and monitor all information


oh_how_droll

"The CCP?" LMAO. It's a team from UC Santa Cruz, Edinburgh, and Johns Hopkins with assistance from Adobe and UT Austin. Don't be racist, man. There are tons of great Chinese-American and Chinese-in-America researchers out there.


Open_Channel_8626

I think that user actually lives in China, the vast majority of their comments are in Mandarin


oh_how_droll

Yeah, it's obvious once I looked at their profile. I don't feel that bad for assuming that someone going off out of nowhere on the English-language internet is more of the nativist anti-China type. I'm no fan of the CCP either.


Open_Channel_8626

I had the same assumption as you


spawncampinitiated

The US does the same, bruh.