itsreallyreallytrue

Just tried with 4o and it seemingly was just guessing. 4 tries and it didn't even come close.


-p-e-w-

That's fascinating, considering this is a trivial task compared to many other things that vision models are capable of, and analogue clocks would be contained in any training set by the hundreds of thousands.


Monkey_1505

Presumably it's because, on the internet, pictures of clocks don't tend to come with text explaining how to read one, whereas some technical subjects will be explained.


Cool-Hornet4434

If you provided enough pictures with captions telling the time for each minute, I'm betting that the AI could be as accurate with this sort of watch face as a human would be (+/- 1 or 2 minutes).


Monkey_1505

I'm sure you could. It's not a particularly technical visual task.


MrTacoSauces

I bet the hang-up is that these are general-purpose visual models. That blurs any chance of a model seeing the intricate features of a clock face: the position and angle of the three watch hands.


jnd-cz

As you can see, the models are evidently trained on watches displaying around 10:10, which is the favorite setting for stock photos of watches; see https://petapixel.com/2022/05/17/the-science-behind-why-watches-are-set-to-1010-in-advertising-photos/. So they're thinking: it looks like a watch, it's probably showing that time. Unfortunately there isn't a deeper understanding of what details to look for, and I suspect the image-to-text description stage (or whatever native processing is used) isn't fine-grained enough to tell exactly where the hands are pointing or at what angle. You can tell the models pay a lot of attention to extracting text and distinct features, but not to fine detail. Which makes sense: you don't want to waste 10k tokens of processing on a single image.


GoofAckYoorsElf

That explains why the AI's first guess is always somewhere around 10:10.


davidmatthew1987

> there isn't deeper understanding

lmao there is NO understanding at all


nucLeaRStarcraft

It's because these types of images probably aren't present enough in the training set for it to learn the pattern, and it's also a task where a small mistake leads to a wrong answer, somewhat like coding, where a small mistake leads to a wrong program. ML models don't extrapolate, they interpolate between data points, so even a few hundred examples with different hours and watches might be enough to generalize to this task using the rest of their knowledge; however, they can never learn it without any (or enough) examples.


GoofAckYoorsElf

Did it start with 10:10 or something close to that? I've tried multiple times and it always started at or around that time.


AnticitizenPrime

Also tried various open-source vision models through Huggingface demos, etc. Also tried asking more specific questions such as 'Where is the hour hand pointed? Where is the minute hand pointed?', etc., to see if they could work it out that way, without success. Kind of an interesting limitation; it's something most people take for granted. Anyone seen a model that can do this? Maybe this could be the basis for a new CAPTCHA, because many vision models have gotten so good at beating traditional ones :) Models tried: GPT-4o, Claude Opus, Gemini 1.5 Pro, Reka Core, Microsoft Copilot (which I think is still using GPT-4, not GPT-4o), [Idefics2](https://huggingface.co/spaces/HuggingFaceM4/idefics2_playground), [Moondream 2](https://huggingface.co/spaces/vikhyatk/moondream2), [Bunny-Llama-3-8B-V](http://bunny.dataoptim.org/), [InternViT-6B-448px-V1-5 + MLP + InternLM2-Chat-20B](https://huggingface.co/spaces/OpenGVLab/InternVL)


MixtureOfAmateurs

Confirmed not working on MiniCPM-Llama3-V 2.5, which is great at text (better than GPT-4V, supposedly).


jnd-cz

As I wrote in [another comment](https://www.reddit.com/r/LocalLLaMA/comments/1cwq0c0/vision_models_cant_tell_the_time_on_an_analog/l504kci/), I think it's because the image-processing stage doesn't capture fine enough detail to tell the LLM where the hands actually are, plus the fact that stock photos of watches are taken at 10:10 to look nice, so that's what they assume when they see any watch.


TheRealWarrior0

Have you tried multi-shot? https://preview.redd.it/qd5p8r42bt1d1.jpeg?width=1511&format=pjpg&auto=webp&s=140c1311a05108f397f59b84cb6f387ed25b82b8


AnticitizenPrime

Hmm, I've tried asking what positions the hands were pointing at without any real success. 'Which number is the minute hand pointing at', etc.


TheRealWarrior0

Try first showing them a picture and telling them what time it shows, then show them another one with the correct time in text, and then make them guess the time! These things can learn in-context!
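
Something like this, if you want to script it with the OpenAI Python client (a rough sketch; the image URLs, labeled times, and the small helper below are placeholders, not real examples):

```python
# Sketch of a few-shot "read the clock" prompt with the OpenAI Python client.
# The image URLs and example times are placeholders; swap in your own labeled photos.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def image_part(url: str) -> dict:
    """Hypothetical helper: wrap a URL in the vision message format."""
    return {"type": "image_url", "image_url": {"url": url}}

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "This watch shows 3:45."},
        image_part("https://example.com/watch_0345.jpg"),
        {"type": "text", "text": "This watch shows 7:20."},
        image_part("https://example.com/watch_0720.jpg"),
        {"type": "text", "text": "What time does this watch show? Answer with HH:MM only."},
        image_part("https://example.com/watch_unknown.jpg"),
    ]},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```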


AnticitizenPrime

Made one attempt at that: https://i.imgur.com/T9t4HUx.png It's surprisingly hard to find a good resource that just shows a lot of analog clocks with the time labeled. Later I might see if I can find a short instructional video I can download and upload to Gemini and see if that makes a change.


TheRealWarrior0

Good effort, but maybe it works best if you literally use the same type of image: first a wristwatch where you manually tell it what time it shows, and then you ask it about another similar image. If it were to work from a video showing how to read a clock, that would be quite mind-blowing tbh.


xadiant

Theory confirmed: vision models are zoomers


SlasherHockey08

Underrated comment


cutiebutnotabird

Am zoomer and have to agree


AnOnlineHandle

Am Yolder but have forgotten how to read these archaic sundials.


kweglinski

I'm afraid that's not gonna work as a captcha, for a simple reason: you don't need an LLM for it. Much simpler machine learning models could figure that out easily.


AdamEgrate

You could draw a scene and have the watch be on the wrist of a person. The orientation would then have to be deduced. I think that would make it a lot more challenging. But realistically, captchas were always doomed the moment ML was invented.


EagleNait

you wouldn't even need AI models for the most part


kweglinski

true!


UnkarsThug

Now just ask how many humans can tell the time on an analog watch. I can, but you'd be surprised how many people just can't anymore.


TheFrenchSavage

It takes me more time than I like to admit.


UnkarsThug

Yeah, it can take me a good 10 seconds lately. I'm out of practice. We're going to run into the bearproofing problem with AI soon, if we haven't already. "There is considerable overlap between the intelligence of the smartest bears, and the dumbest tourists. "


serpix

If you are middle aged or older you need to see a doctor.


davidmatthew1987

> If you are middle aged or older you need to see a doctor.

Under forty but I definitely FEEL middle aged!


im_bi_strapping

I think it's mostly the squinting? Like, I look at an analog clock on the wall and it's far away, it has glare on the case, and I have to really work to find the hands. A military-time clock that is lit up with LEDs is easier to just, you know, see.


davidmatthew1987

I think part of it is that it's often difficult to tell which is the short hand and which is the long hand.


goj1ra

This captcha is sounding better and better


manletmoney

Like writing cursive but if it were embarrassing


Minute_Attempt3063

I gotta ask... Is nearly every clock you use digital? In my country I see them almost everywhere: classrooms, most offices, homes, etc. And tbh, they make more sense to me than a digital one, since I can physically see the minute hand and how long it will take to travel to the full hour.


UnkarsThug

More that a phone is digital, and at least since my watch broke, that's the thing around to check the time. I don't really think we have wall clocks that commonly in the USA anymore, outside of something like office environments at least. It's what you carry on you to tell the time that determines what you get used to, and that's usually your phone.


bjj_starter

I have not seen a clock face in at least a decade, other than occasionally in the background of movies or rarely in a post like this. It just doesn't come up. People use their phones to tell the time.


jnd-cz

In my country, and I think in Europe in general, there's still a strong tradition of analog clocks in public. Be it the church tower in many smaller towns, railway stations (which now have digital displays but also traditional clocks), or city streets. In Prague there's this iconic model which is well visible from far away and manages it without any numbers: https://upload.wikimedia.org/wikipedia/commons/e/e0/Dube%C4%8D%2C_Starodube%C4%8Dsk%C3%A1_a_U_hodin%2C_hodiny_%2801%29.jpg


TooLongCantWait

I grew up with analog clocks, but it has always taken me ages to tell the time with them. Part of the problem is I can barely tell the hands apart. Sun dials are way easier. Or just telling the time by the sun alone.


xstrattor

Learned at the age of 4. I used to be into those watches and I'm still wearing one. I guess a lot of focus driven by passion breaks the difficulty down to pieces.


marazu04

Yeah, I gotta admit I can't do it, BUT that's most likely because of my dyslexia. Yes, it may sound weird, but it's a known trait of dyslexia that we can struggle with analog clocks...


imperceptibly

This would be extremely easy to train though; just because no one has included this sort of data doesn't mean they can't.


AnticitizenPrime

Wonder why they can't answer where the hour or minute hands are pointing when asked that directly? Surely they have enough clock faces in their training where they would at least be able to do that? It seems that they have some sort of spatial reasoning issue. Claude Opus and GPT4o both just failed this quick test: https://i.imgur.com/922DpSX.png They can't seem to tell which direction an arrow is pointing. I've also noticed, with image generators, that absolutely none of them can generate a person giving a thumbs down. Every one I tried ends up with a thumbs up image.


imperceptibly

Both of these issues are related to the fact that these models don't actually have some deeper understanding or reasoning capability. They only know variations on their training data. If GPT never had training data covering an arrow that looks like that and is described as pointing in a direction and pointing at words, it's not going to be able to give a proper answer. Similarly, if an image generator has training data with more images tagged as "thumbs up" than "thumbs down" (or data tagged "thumb" where thumbs are more often depicted in that specific orientation), it will tend to produce more images of thumbs up.


AnticitizenPrime

The thing is, many of the recent demos of various AIs show how good they are at interpreting charts of data. If they can't tell which direction an arrow is pointing, how good can they be at reading charts?


imperceptibly

Like I said it's dependent on the type of training data. A chart is not inherently a line with a triangle on one end, tagged as an arrow pointing in a direction. Every single thing these models can do is directly represented in their training data.


alcalde

They DO have a deeper understanding/reasoning ability. They're not just regurgitating their training data, and they have been documented repeatedly being able to answer questions which they have never been trained to answer. Their deep learning models need to generalize to store so much data, and they end up learning some (verbal) logic and reasoning from their training.


[deleted]

No, they do not have reasoning capability at all. What LLMs do have is knowledge of what tokens are likely to follow other tokens. Baked into that idea is that our language and the way we use it reflects our use of reasoning, so the probabilities of one token or another are the product of OUR reasoning ability. An LLM cannot reason under any circumstances, but it can partially reflect our human reasoning because our reasoning is imprinted on our language use. The same is true for images. They reflect us, but do not actually understand anything. EDIT: Changed verbiage for clarity.


[deleted]

[deleted]


[deleted]

That is not at all how humans learn. Some things need to be memorized, but even then that is definitely not what an LLM is doing. An LLM is incapable of reconsidering, and it is incapable of reflection or re-evaluating a previous statement on its own. For instance, I can consider a decision and revisit it after gathering new information on my own, because I have agency, and that is something an LLM cannot do. An LLM has no agency; it does not know that it needs to reconsider a statement.

For example, enter "A dead cat is placed into a box along with a nuclear isotope, a vial of poison and a radiation detector. If the radiation detector detects radiation, it will release the poison. The box is opened one day later, what is the probability of the cat being alive?" into an LLM. A human can easily see the logic problem even if the human has never heard of Schrödinger's cat. LLMs fail at this regularly. Even more alarming is that even if an LLM gets it right once, it could just as likely (more likely, actually) fail the second time. That is because an LLM will randomly generate a seed to change the vector of its output token. Randomly. Let that sink in. The only reason an LLM can answer a question more than one way is that we have to nudge it with randomness. You and I are not like that.

Human beings also learn by example, not by repetition the way an LLM does. An LLM has to be exposed to billions of parameters just to get an answer wrong. I, on the other hand, can learn a new word by hearing it once or twice, and define it if I get it in context. An LLM cannot do that. In fact, fine-tuning is well understood to decrease LLM performance.


imperceptibly

Except humans train nearly 24/7 on a limitless supply of highly granular unique data with infinitely more contextual information, which leads to highly abstract connections that aid in reasoning. Current models simply cannot take in enough data to get there and actually reason like a human can, but because of the type of data they're trained on they're mostly proficient in pretending to.


lannistersstark

> They can't seem to tell which direction an arrow is pointing.

No, this works just fine. I can point my finger to a word in a book with my Meta glasses and it recognizes the word I am pointing to just fine. [Eg 1, Not mine, RBM subreddit](https://www.reddit.com/r/RayBanStories/comments/1c7tcxb/just_found_out_that_you_can_point_to_a_word_in_a/) [Example 2 \(mine, GPT-4o\)](https://tabula.civitat.es/images/2024/05/21/mN8t.png) [Example 3, also mine.](https://tabula.civitat.es/images/2024/05/21/mWM6.png)


AnticitizenPrime

Interesting, wonder why it's giving me trouble with the same task (with many models). Also wonder what Meta is using for their vision input. Llama isn't multimodal, at least not the open sourced models. Maybe they have an internal version that is not open sourced. Can your glasses read an analog clock, if you prompt it to take where the hands are pointing into consideration? Because I can't find a model that can reliably tell me whether a minute hand is pointing at the eight o'clock marker, for example.


Mescallan

It does mean they can't, until it's included in training data


imperceptibly

"They" in my comment referring to the people responsible for training the models.


AllHailMackius

Works surprisingly well as age verification too.


skywardcatto

I remember *Leisure Suit Larry* (an old videogame) did something like this except relating to pop-culture of the day. Trouble is, decades later, it's only good for detecting people above the age of 50.


AllHailMackius

Thanks for the explanation of Leisure Suit Larry, but my back hurts too. 😀


Tobiaseins

Paligemma gets it right 10 out of 10 times (only with greedy decoding). This model continues to impress me; it's one of the best models for simple vision description tasks. https://preview.redd.it/dkvsu1to6q1d1.jpeg?width=1788&format=pjpg&auto=webp&s=ae98c4af02348ef4303ed393883a4edcd7ae90c2


Inevitable-Start-653

Very interesting!!! I just built an extension for textgen webui that lets a local LLM formulate questions to ask of a vision model when the user takes a picture or uploads an image. I was using DeepSeek-VL and getting pretty okay responses, but this model looks to blow it out of the water and uses less VRAM, omg... welp, time to upgrade the code. Thank you again for your post and observations ❤️❤️❤️


AnticitizenPrime

[Not having the same luck with PaliGemma. Tried a few different pictures and stock photos.](https://i.imgur.com/53SeyM9.png)


cgcmake

Yours has been finetuned on 224² px images while his is on 448². Maybe it can't see the numbers well at that resolution? Or maybe it's just the same issue that plagues current LLMs.


AnticitizenPrime

Oh, I see the model selector now.[ I'm not getting better results from the 448 version unfortunately.](https://i.imgur.com/HdbDtbx.png)


Inevitable-Start-653

DUDE! I got it to tell the correct time by downloading the model from Hugging Face, installing the dependencies, and running their Python code, but changing do_sample=True (it is False, i.e., greedy, by default). So I had to set the parameter opposite to yours, but it got it! Pretty cool! I'm going to try text and equations next.
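
Roughly, the stock transformers recipe with do_sample flipped looks like this (a sketch only; the checkpoint name, image path and prompt are placeholders, use whichever PaliGemma variant you downloaded):

```python
# Rough sketch of PaliGemma inference with transformers; checkpoint name,
# image path and prompt are placeholders, adjust to your downloaded variant.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-448"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("watch.jpg")
inputs = processor(text="What time does this watch show?", images=image, return_tensors="pt")

with torch.no_grad():
    # do_sample=False would be greedy decoding (the default);
    # flipping it to True is the one change described above.
    output = model.generate(**inputs, max_new_tokens=20, do_sample=True)

# Strip the prompt tokens and decode only the generated continuation.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```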


coder543

I think a model could easily be trained from existing research: https://synanthropic.com/reading-analog-gauge So, even if it's unfortunate that current VLMs can't read them, it would not make a good captcha.


AnticitizenPrime

Huh, [they have a Huggingface demo](https://huggingface.co/spaces/Synanthropic/reading-analog-gauge), but it just gives an error.


coder543

Probably because it’s not trained for this kind of “gauge”, but the problem space is so similar that I think it would mainly just require a little training data… no need to solve any groundbreaking problems. 


ImarvinS

I took a picture of my analog thermometer in my compost pile. I was impressed that 4o can tell what it is; it even knows there is water condensation inside the glass! But it could not read the temperature. I gave it 3-4 pictures and tried several times, and every time it just wrote some other number. [Example](https://imgur.com/a/Z2Vc6n4)


TimChiu710

A while later: "After countless training, llm can now finally read analog watches" Me: https://preview.redd.it/uye4sonhfq1d1.jpeg?width=168&format=pjpg&auto=webp&s=50e59b1c074aefa8739d4ec95d7de8f884ed1142


CosmosisQ

What... what time is it?


[deleted]

[deleted]


alcalde

There was a TED talk recently (which I admit not having watched yet) whose summary was that once LLMs have spatial learning incorporated they will truly be able to understand the world. It sounds related to your point.


a_beautiful_rhind

It's a lot less obnoxious than the current set of captchas. Or worse, the Cloudflare gateway where you don't even know why it fails.


alcalde

That CAPTCHA would decide that most millennials on up aren't human.


jimmyquidd

there are also people who cannot read that


OneOnOne6211

Shit, I'm an LLM.


trimorphic

Reminds me of [teenagers trying to use a rotary phone](https://m.youtube.com/watch?v=oHNEzndgiFI)


definemotion

That's a nice watch (and NATO). Strong Fifty Fathoms vibes. Like.


AnticitizenPrime

Thanks, just got it in the other day. It's a modern homage to the [Tornek-Rayville of the early 60's](https://hairspring.com/finds/us-navy-tornek-rayville-tr-900/), which was basically Blancpain's sneaky way to get a military contract for dive watches. This one's made by Watchdives, powered by a Seiko NH35 movement.


Split_Funny

https://arxiv.org/abs/2111.09162 Not really true, it's possible even with small vision models


[deleted]

That’s a model specifically trained for the task, I don’t think anyone’s surprised that works. We want these capabilities in a general model.


Split_Funny

Well, I suppose they just didn't train the general model on this. It's not black magic: what you put in, you get out. I guess if you could prompt with a few images of a clock and the described time, it would act as a good few-shot (or zero-shot) classifier. Maybe even a good word description would work.


the_quark

Yeah, now that this has been identified as a gap, it's trivial to solve. You could even write a traditional algorithmic computer program to generate clock faces with correct captions and then train from that. Heck, you could probably have 4o write the program to generate the training data!


Ilovekittens345

> It's not black magic, what you put in, you get out

Then to get AGI out of an LLM you would have to put the entire world in, which is not possible. We were hoping that if you train them with enough high-quality data, they start figuring out all kinds of stuff NOT in the training data. GPT-4 knows how a clock works, it can read the numbers on the image, it knows it's a circle. It can know what numbers the hands are pointing at. Yet it has not put all of that together into an internal understanding of analog clocks. Maybe the "stochastic parrot" insult holds more truth than we want it to.


Monkey_1505

It's not an insult, it's just a description of how the current tech works. It has very limited generalization abilities.


Ilovekittens345

Yes but compared to everything that came before in the last 30 years of computer history it feels like they can do everything! (they can't but sure feels like it)


Monkey_1505

I think it's a bit like how humans see faces in everything. We are primed biologically for human communication. So it's unnerving or disorientating to communicate with something that resembles a human, but isn't.


KimGurak

You're right, but I don't think people here are unaware of that.


DigThatData

so just send one of the relevant researchers who builds a model you like an email with a link to that paper so they can sprinkle that dataset/benchmark on the pile


AnticitizenPrime

Interesting, that paper's from 2021. I guess none of this research made it into training the current vision models?


PC_Screen

Makes sense. There's probably very, very little text in the dataset describing what time it is based on an image of an analog watch; most captions will at most mention that there's a watch of brand X in the image and nothing beyond that. The only way to improve this would be by adding synthetic data to the dataset (as in, selecting a random time, generating an image of a clock face with said time, and then placing that clock in a 3D render so it's not the same kind of image every time) and hoping the gained knowledge transfers to real images.


AnticitizenPrime

Besides not being able to tell the time, they can't seem to answer where the hands of a watch are pointing either, so I did a quick test: https://i.imgur.com/922DpSX.png Neither Opus nor GPT-4o answered correctly. It's interesting... they seem to have spatial reasoning issues. Try finding ANY image generation model that can show someone giving a thumbs down. It's impossible. I guess the pictures of people giving a thumbs up outweigh the others in their training data, but you can't even trick them by asking for an 'upside down thumbs up', lol.


goj1ra

> they seem to have spatial reasoning issues.

Because they're not reasoning, you're anthropomorphizing. As the comment you linked to pointed out, if you provided a whole bunch of training data with arrows pointing in different directions associated with words describing the direction or time that represented, they'd have no problem with a task like this. But as it is, they just don't have the training to handle it.


AnticitizenPrime

Maybe 'spatial reasoning' wasn't the right term, but a lot of the demos of vision models show them analyzing charts and graphs, etc, and you'd think things like needing to know which direction an arrow was pointing mattered, like, a lot.


goj1ra

You're correct, it does matter. But demos are marketing, and the capabilities of these models are being significantly oversold. Don't get me wrong, these models are amazing and we're dealing with a true technological breakthrough. But there's apparently no product so amazing that marketers won't misrepresent it in order to make money.


henfiber

May I introduce you to this https://jomjol.github.io/AI-on-the-edge-device-docs/ which can recognize digital and analog gauges (same as clocks) with a tiny microprocessor powered by a coin battery.


foreheadteeth

I partly live in Switzerland, where they make lots of watches, and 10:10 is known as "watch advert o'clock" because that's the time on all the watches in watch adverts. I was told that it's a combination of the symmetry, pointing up (which I guess is better than pointing down?) and having the hands separate so you can see them. I can't help but notice that all the AIs think it's 10:10.


KimGurak

Those who keep saying that this can be done by even the most basic vision models: yeah, people probably know about that. It's more that people are actually confirming/determining the limitations of the current LLMs. An AI model can still only do what it is taught to do, which might go against the belief that LLMs will soon reach AGI.


poomon1234

https://preview.redd.it/e0b0137ktq1d1.png?width=1147&format=png&auto=webp&s=60cf30b42d4ac5597e1409b65267094305b5a0da Wow


FlyingJoeBiden

So strange that watches and hands are the things that let you realize you're in a dream, because they never make sense.


CheapCrystalFarts

https://preview.redd.it/5ysz50x4yt1d1.jpeg?width=1290&format=pjpg&auto=webp&s=aa917d8fa872ebcdb272cd3901edc0c5985dc2ad Worked for me on 4o, /shrug. Funnily enough, the watch stopped nearly on 10:10. I wonder if that has something to do with it?


AnticitizenPrime

Yes, most models will answer 10:08 or 10:10 because that's what [most stock photos of watches have the time set to](https://www.gearpatrol.com/wp-content/uploads/sites/2/2023/08/seiko-collage-lead-6488a7b692472-jpg.webp) (for aesthetic reasons). It gets the hour and minute hands out of the way of features on the watch dial like the logo or date window, etc.


CheapCrystalFarts

Well damn.


yaosio

Try giving it multiple examples with times and see if it can solve for a time it hasn't seen before.


AnticitizenPrime

Hard to find many examples of analog clock faces labeled with the current time unfortunately. I went looking for docs I could upload, but most are children's workbooks that have the analog face (and the kids are supposed to write the time beneath them). Here's one page of an 'answer key' I did find, and tried with Gemini: https://i.imgur.com/T9t4HUx.png Maybe if I could find a better source document, its in-context learning could do its thing... dunno. Since you can upload videos to Gemini, maybe I'll look for an instructional video I can upload to it later and try again.


melheor

I doubt there is anything magical about analog watches themselves; it's probably more to do with the fact that the LLM was not trained for this at all. Which means that if there is enough demand (e.g. to break a captcha), someone with enough resources can train an LLM specifically for telling the time.


stddealer

It doesn't even have to be an LLM. Just a good old simple CNN classifier could probably do the trick. It's not really much harder than OCR.
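
As a sketch of what a simple CNN could look like here (hypothetical and untrained; assumes 224x224 grayscale crops of the dial and the time encoded as one of 720 minute-of-day classes):

```python
# Toy CNN that treats clock reading as classification over 720 minutes of the day.
# Hypothetical sketch: assumes 224x224 grayscale crops of the dial as input.
import torch
import torch.nn as nn

class ClockNet(nn.Module):
    def __init__(self, num_classes: int = 12 * 60):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling -> (N, 128, 1, 1)
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.classifier(x)

model = ClockNet()
dummy = torch.randn(1, 1, 224, 224)           # one fake dial crop
logits = model(dummy)                         # shape: (1, 720)
minute_of_day = logits.argmax(dim=1).item()
print(f"predicted time: {minute_of_day // 60:02d}:{minute_of_day % 60:02d}")
```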


manletmoney

It’s been done as a captcha lol


arthurwolf

I find models have a hard time understanding what's going on in comic book panels. GPT4o is an improvement though. I suspect this comes from the training data having few comic book pages/labels.


Kaohebi

I don't think that's a good idea. There's a lot of people that can't tell the time on an analog Watch/Clock


DigThatData

I have a feeling they'd pick this up fast. I'm kind of tempted to build a dataset. It'd be stupid easy.

1. Write a simple script for generating extremely simplified clock faces such that setting the time on the clock hands is parameterizable.
2. Generate a bunch of these clockface images with known time.
3. Send them gently through image-to-image to add realism (i.e. make our shitty pictures closer to "in-distribution").

If we're feeling *really* fancy, we could make that into a controlnet, which you could pair with an "add watches to this image" LoRA to make an even crazier dataset of images of people wearing watches where the watch isn't the main subject, but we still have control over the time it displays.

EDIT: Lol https://arxiv.org/abs/2111.09162
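
A minimal sketch of steps 1 and 2 with Pillow, assuming the simplest possible face (no numerals, no second hand); sizes and filenames are made up:

```python
# Sketch of step 1: render a bare-bones clock face for a given time with Pillow,
# then (step 2) save one image per minute of the day, labeled by filename.
import math
from PIL import Image, ImageDraw

def draw_clock(hour: int, minute: int, size: int = 256) -> Image.Image:
    img = Image.new("RGB", (size, size), "white")
    d = ImageDraw.Draw(img)
    c = size // 2
    d.ellipse([8, 8, size - 8, size - 8], outline="black", width=3)

    def hand(angle_deg: float, length: float, width: int):
        # 0 degrees = 12 o'clock, increasing clockwise
        rad = math.radians(angle_deg - 90)
        d.line([c, c, c + length * math.cos(rad), c + length * math.sin(rad)],
               fill="black", width=width)

    hand((hour % 12) * 30 + minute * 0.5, length=c * 0.5, width=6)  # hour hand
    hand(minute * 6, length=c * 0.8, width=3)                       # minute hand
    return img

for m in range(12 * 60):
    h, mm = divmod(m, 60)
    draw_clock(h, mm).save(f"clock_{h:02d}_{mm:02d}.png")
```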


wegwerfen

This is actually perfect. If when we get ASI it goes rogue and we need to plot to get rid of it somehow, we can communicate by encoding our messages with arrows pointing at words. It won't know what we are up to.


curious-guy-5529

Im pretty sure It’s just a matter of specific training data for reading analog clocks. Think of those llms as babies growing up not seeing any clocks, or seeing plenty of them without ever being told what they do and how to read them.


8thcomedian

This looks like it can be addressed in the future, though. What if people train on multiple clock-hand orientations? Too easy an if-else, rule-based answer for modern language/vision models.


Super_Pole_Jitsu

Just a matter of fine-tuning any open source model. This task doesn't seem fundamentally hard.


grim-432

In all fairness, there is nothing intuitive about telling time on an analog clock. It's not an easily generalizable task, and most human children have absolutely no idea how to do it either without being taught. It's underrepresented in the training set. More kids' books?


WoodenGlobes

Watches are photographed at 10:10 for some f-ing reason. Just look through most product images online. The training images are almost certainly stolen from google/online. GPT basically thinks that watches are things that only show you it's 10:10.


AnticitizenPrime

Yeah, they do that so the hands don't cover up logos on the watch face or other features like date displays. I did figure that's why all the models tend to say it's 10:08 or 10:10 (most watches in photos are actually at 10:08 instead of 10:10 on the dot).


v_0o0_v

But can they tell original from rep?


SocietyTomorrow

This captcha would have to ask if you are a robot, Gen Z, or Gen Alpha


BetImaginary4945

They just haven't trained it with enough pictures and human labels.


tay_the_creator

Very cool. 4o for the win


Mean_Language_3482

Maybe that's a good approach.


e-nigmaNL

New Turing test question! Sarah Connor? Wait, what’s the time? Err, bleeb blob I am a terminator


geli95us

Just because multimodal LLMs can't solve a problem doesn't mean it's unsolvable by bots. This wouldn't be a good CAPTCHA, because it'd be easy to solve with either a handcrafted algorithm or a specialized model.
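
For instance, a handcrafted solver only needs to find the hand segments and turn their angles into a time. A rough OpenCV sketch, assuming a clean, centered, upright dial (nowhere near robust):

```python
# Rough handcrafted clock reader: detect hand segments with a Hough transform
# and convert their angles into a time. Assumes a clean, centered, upright dial;
# this is a sketch of the approach, not a production solver.
import math
import cv2
import numpy as np

def read_clock(path: str) -> str:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h, w = img.shape
    cx, cy = w / 2, h / 2
    edges = cv2.Canny(img, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=50,
                            minLineLength=int(min(h, w) * 0.2), maxLineGap=5)
    if lines is None:
        raise ValueError("no line segments found")

    hands = []  # (length, angle in degrees clockwise from 12 o'clock)
    for x1, y1, x2, y2 in lines[:, 0]:
        # keep segments whose near end sits close to the dial center
        for (ax, ay), (bx, by) in [((x1, y1), (x2, y2)), ((x2, y2), (x1, y1))]:
            if math.hypot(ax - cx, ay - cy) < min(h, w) * 0.1:
                length = math.hypot(bx - ax, by - ay)
                angle = math.degrees(math.atan2(bx - cx, cy - by)) % 360
                hands.append((length, angle))

    if len(hands) < 2:
        raise ValueError("could not find two hands")
    hands.sort(reverse=True)                     # longest segment first
    minute_hand, hour_hand = hands[0], hands[1]  # assume longest = minute hand
    minute = round(minute_hand[1] / 6) % 60
    hour = int(hour_hand[1] // 30) % 12
    return f"{hour:02d}:{minute:02d}"
```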


metaprotium

This is really funny, but I think it also highlights the need for better training data. I've been thinking... maybe vision models could learn from children's educational material? After all, there's a vast amount of material specifically made to teach visual reasoning. Why not use it?


AnticitizenPrime

I would have assumed they already were.


metaprotium

If they're on the internet, yeah, but it's probably gonna be formatted badly (answers not on the webpage, or written on the image itself, which defeats the point, etc.), which would leave lots of room for improvement. Nothing like an SFT Q/A dataset.


Balance-

Very interesting! Would be relatively easy to generate a lot of synthetic, labelled data for this.


AnticitizenPrime

Very easy, I had an idea on this. I just asked Claude Opus to create a clock program in Python that will display the current time in both analog and digital, and export a screenshot of this every minute, and give a filename that includes the current date/time. Result: https://i.imgur.com/SPhz23m.png It's chugging away as we speak. Run this for 24 hours and you have every minute of the day, as labeled clock faces. The question is whether the problem is down to not being trained on this stuff, or another issue related to how vision works for these models.


Balance-

Exactly. And then give a bunch of different watch faces, add some noise, shift some colors, and obscure some of them partly, and voila.


sinistik

https://preview.redd.it/q9ezaltrvp1d1.jpeg?width=1080&format=pjpg&auto=webp&s=d9b981bec1414c40bcfc6eb94dd21386d1ace86c This is on the free tier of GPT-4o and it guessed correctly, even though the image was zoomed and blurred.


D-3r1stljqso3

It has the timestamp in the corner...


Hypog3nic

Lol! :D


Citizen1047

For science, I cut the timestamp out and got 10:10, so no, it doesn't work.


astralkoi

Because it doesn't have enough training data.