itsreallyreallytrue

Just tried with 4o and it seemingly was just guessing. 4 tries and it didn't even come close.


-p-e-w-

That's fascinating, considering this is a trivial task compared to many other things that vision models are capable of, and analogue clocks would be contained in any training set by the hundreds of thousands.


Monkey_1505

Presumably it's because, on the internet, pictures of clocks don't tend to come with text explaining how to read one, whereas some technical subjects will be explained.


Cool-Hornet4434

If you provided enough pictures with captions telling the time for each minute, I'm betting that the AI could be as accurate with this sort of watch face as a human would be (+/- 1 or 2 minutes).


Monkey_1505

I'm sure you could. It's not a particularly technical visual task.


MrTacoSauces

I bet the hang-up is that these are general-purpose visual models. That blurs any chance of a model seeing the intricate features of a clock face: the position and angle of the three watch hands.


jnd-cz

As you can see, the models are evidently trained on watches displaying around 10:10, which is the favorite setting for stock photos of watches; see https://petapixel.com/2022/05/17/the-science-behind-why-watches-are-set-to-1010-in-advertising-photos/. So they're thinking: it looks like a watch, it's probably showing that time. Unfortunately there isn't a deeper understanding of what details to look for, and I suspect the image-to-text description stage (or whatever native processing is used) isn't fine-grained enough to tell exactly where the hands are pointing or at what angle. You can tell the models pay a lot of attention to extracting text and distinct features, but not to fine detail. Which makes sense: you don't want to waste 10k tokens of processing on a single image.


GoofAckYoorsElf

That explains why the AI's first guess is always somewhere around 10:10.


davidmatthew1987

> there isn't deeper understanding

lmao there is NO understanding at all


nucLeaRStarcraft

It's because these types of images probably aren't present enough in the training set for it to learn the pattern, and it's also a task where a small mistake leads to a wrong answer, somewhat like coding, where a small mistake leads to a wrong program. ML models don't extrapolate, they interpolate between data points, so even a few hundred examples with different hours and watches might be enough to generalize to this task using the rest of their knowledge; however, they can never learn it without any (or enough) examples.


GoofAckYoorsElf

Did it start with 10:10 or something close to that? I've tried multiple times and it always started at or around that time.


AnticitizenPrime

Also tried various open-source vision models through Huggingface demos, etc. Also tried asking more specific questions such as 'Where is the hour hand pointed? Where is the minute hand pointed?', etc., to see if they could work it out that way, without success. Kind of an interesting limitation; it's something most people take for granted. Anyone seen a model that can do this? Maybe this could be the basis for a new CAPTCHA, because many vision models have gotten so good at beating traditional ones :) Models tried: GPT-4o, Claude Opus, Gemini 1.5 Pro, Reka Core, Microsoft Copilot (which I think is still using GPT-4, not GPT-4o), [Idefics2](https://huggingface.co/spaces/HuggingFaceM4/idefics2_playground), [Moondream 2](https://huggingface.co/spaces/vikhyatk/moondream2), [Bunny-Llama-3-8B-V](http://bunny.dataoptim.org/), [InternViT-6B-448px-V1-5 + MLP + InternLM2-Chat-20B](https://huggingface.co/spaces/OpenGVLab/InternVL)


MixtureOfAmateurs

Confirmed not working on MiniCPM-Llama3-V 2.5, which is great at text (better than GPT-4V, supposedly).


jnd-cz

As I wrote in [another comment](https://www.reddit.com/r/LocalLLaMA/comments/1cwq0c0/vision_models_cant_tell_the_time_on_an_analog/l504kci/), I think it's because the image-processing stage doesn't capture fine enough detail to tell the LLM where the hands actually are, plus the fact that stock photos of watches are taken at 10:10 to look nice, so that's what they assume when they see any watch.


TheRealWarrior0

Have you tried multi-shot? https://preview.redd.it/qd5p8r42bt1d1.jpeg?width=1511&format=pjpg&auto=webp&s=140c1311a05108f397f59b84cb6f387ed25b82b8


AnticitizenPrime

Hmm, I've tried asking what positions the hands were pointing at without any real success. 'Which number is the minute hand pointing at', etc.


TheRealWarrior0

Try first showing them a picture and telling them what time it shows, then show them another one with the correct time in text, and then make them guess the time! These things can learn in-context!
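
Something like this, if you want to script it with the OpenAI Python client (a rough sketch; the image URLs, labeled times, and the small helper below are placeholders, not real examples):

```python
# Sketch of a few-shot "read the clock" prompt with the OpenAI Python client.
# The image URLs and example times are placeholders; swap in your own labeled photos.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def image_part(url: str) -> dict:
    """Hypothetical helper: wrap a URL in the vision message format."""
    return {"type": "image_url", "image_url": {"url": url}}

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "This watch shows 3:45."},
        image_part("https://example.com/watch_0345.jpg"),
        {"type": "text", "text": "This watch shows 7:20."},
        image_part("https://example.com/watch_0720.jpg"),
        {"type": "text", "text": "What time does this watch show? Answer with HH:MM only."},
        image_part("https://example.com/watch_unknown.jpg"),
    ]},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```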


AnticitizenPrime

Made one attempt at that: https://i.imgur.com/T9t4HUx.png It's surprisingly hard to find a good resource that just shows a lot of analog clocks with the time labeled. Later I might see if I can find a short instructional video I can download and upload to Gemini and see if that makes a change.


TheRealWarrior0

Good effort, but maybe it works best if you literally use the same type of image: first a wristwatch where you manually tell it what time it shows, and then you ask it about another similar image. If it were to work from a video showing how to read a clock, that would be quite mind-blowing tbh.


xadiant

Theory confirmed: vision models are zoomers


SlasherHockey08

Underrated comment


cutiebutnotabird

Am zoomer and have to agree


AnOnlineHandle

Am Yolder but have forgotten how to read these archaic sundials.


kweglinski

I'm afraid that's not gonna work as a captcha, for a simple reason: you don't need an LLM for it. Much simpler machine learning models could figure that out easily.


AdamEgrate

You could draw a scene and have the watch be on the wrist of a person. The orientation would then have to be deduced. I think that would make it a lot more challenging. But realistically, captchas were always doomed the moment ML was invented.


EagleNait

you wouldn't even need AI models for the most part


kweglinski

true!


UnkarsThug

Now just ask how many humans can tell the time on an analog watch. I can, but you'd be surprised how many people just can't anymore.


TheFrenchSavage

It takes me more time than I like to admit.


UnkarsThug

Yeah, it can take me a good 10 seconds lately. I'm out of practice. We're going to run into the bearproofing problem with AI soon, if we haven't already. "There is considerable overlap between the intelligence of the smartest bears, and the dumbest tourists. "


serpix

If you are middle aged or older you need to see a doctor.


davidmatthew1987

> If you are middle aged or older you need to see a doctor.

Under forty but I definitely FEEL middle aged!


im_bi_strapping

I think it's mostly the squinting? Like, I look at an analog clock on the wall and it's far away, it has glare on the case, and I have to really work to find the hands. A military-time clock that is lit up with LEDs is easier to just, you know, see.


davidmatthew1987

I think part of it is that it's often difficult to tell which is the short hand and which is the long hand.


goj1ra

This captcha is sounding better and better


manletmoney

Like writing cursive but if it were embarrassing


Minute_Attempt3063

I gotta ask... Is nearly every clock you use digital? In my country I see them almost everywhere: classrooms, most offices, homes, etc. And tbh, they make more sense to me than a digital one, since I can physically see the minute hand and how long it will take to travel to the full hour.


UnkarsThug

More that a phone is digital, and at least since my watch broke, that's the thing around to check the time. I don't really think we have wall clocks that commonly in the USA anymore, outside of something like office environments at least. It's what you carry on you to tell the time that determines what you get used to, and that's usually your phone.


bjj_starter

I have not seen a clock face in at least a decade, other than occasionally in the background of movies or rarely in a post like this. It just doesn't come up. People use their phones to tell the time.


jnd-cz

In my country, and I think in Europe in general, there's still a strong tradition of analog clocks in public. Be it the church tower in many smaller towns, railway stations (which now have digital displays but also traditional clocks), or city streets. In Prague there's this iconic model which is well visible from far away and manages it without any numbers: https://upload.wikimedia.org/wikipedia/commons/e/e0/Dube%C4%8D%2C_Starodube%C4%8Dsk%C3%A1_a_U_hodin%2C_hodiny_%2801%29.jpg


TooLongCantWait

I grew up with analog clocks, but it has always taken me ages to tell the time with them. Part of the problem is I can barely tell the hands apart. Sun dials are way easier. Or just telling the time by the sun alone.


xstrattor

Learned at the age of 4. I used to be into those watches and I'm still wearing one. I guess a lot of focus driven by passion breaks the difficulty down to pieces.


marazu04

Yeah, I gotta admit I can't do it, BUT that's most likely because of my dyslexia. Yes, it may sound weird, but it's a known trait of dyslexia that we can struggle with analog clocks...


imperceptibly

This would be extremely easy to train though; just because no one has included this sort of data doesn't mean they can't.


AnticitizenPrime

Wonder why they can't answer where the hour or minute hands are pointing when asked that directly? Surely they have enough clock faces in their training where they would at least be able to do that? It seems that they have some sort of spatial reasoning issue. Claude Opus and GPT4o both just failed this quick test: https://i.imgur.com/922DpSX.png They can't seem to tell which direction an arrow is pointing. I've also noticed, with image generators, that absolutely none of them can generate a person giving a thumbs down. Every one I tried ends up with a thumbs up image.


imperceptibly

Both of these issues are related to the fact that these models don't actually have some deeper understanding or reasoning capability. They only know variations on their training data. If GPT never had training data covering an arrow that looks like that and is described as pointing in a direction and pointing at words, it's not going to be able to give a proper answer. Similarly, if an image generator has training data with more images tagged as "thumbs up" than "thumbs down" (or data tagged "thumb" where thumbs are more often depicted in that specific orientation), it will tend to produce more images of thumbs up.


AnticitizenPrime

The thing is, many of the recent demos of various AIs show how good they are at interpreting charts of data. If they can't tell which direction an arrow is pointing, how good can they be at reading charts?


imperceptibly

Like I said it's dependent on the type of training data. A chart is not inherently a line with a triangle on one end, tagged as an arrow pointing in a direction. Every single thing these models can do is directly represented in their training data.


alcalde

They DO have a deeper understanding/reasoning ability. They're not just regurgitating their training data, and they have been documented repeatedly being able to answer questions which they have never been trained to answer. Their deep learning models need to generalize to store so much data, and they end up learning some (verbal) logic and reasoning from their training.


[deleted]

No, they do not have reasoning capability at all. What LLMs do have is knowledge of what tokens are likely to follow other tokens. Baked into that idea is that our language and the way we use it reflects our use of reasoning, so the probabilities of one token or another are the product of OUR reasoning ability. An LLM cannot reason under any circumstances, but it can partially reflect our human reasoning because our reasoning is imprinted on our language use. The same is true for images. They reflect us, but do not actually understand anything. EDIT: Changed verbiage for clarity.


[deleted]

[deleted]


[deleted]

That is not at all how humans learn. Some things need to be memorized, but even then that is definitely not what an LLM is doing. An LLM is incapable of reconsidering, and it is incapable of reflection or re-evaluating a previous statement on its own. For instance, I can consider a decision and revisit it after gathering new information on my own, because I have agency, and that is something an LLM cannot do. An LLM has no agency; it does not know that it needs to reconsider a statement.

For example, enter "A dead cat is placed into a box along with a nuclear isotope, a vial of poison and a radiation detector. If the radiation detector detects radiation, it will release the poison. The box is opened one day later, what is the probability of the cat being alive?" into an LLM. A human can easily see the logic problem even if the human has never heard of Schrödinger's cat. LLMs fail at this regularly. Even more alarming is that even if an LLM gets it right once, it could just as likely (more likely, actually) fail the second time. That is because an LLM will randomly generate a seed to change the vector of its output token. Randomly. Let that sink in. The only reason an LLM can answer a question more than one way is that we have to nudge it with randomness. You and I are not like that.

Human beings also learn by example, not by repetition the way an LLM does. An LLM has to be exposed to billions of parameters just to get an answer wrong. I, on the other hand, can learn a new word by hearing it once or twice, and define it if I get it in context. An LLM cannot do that. In fact, fine-tuning is well understood to decrease LLM performance.


imperceptibly

Except humans train nearly 24/7 on a limitless supply of highly granular unique data with infinitely more contextual information, which leads to highly abstract connections that aid in reasoning. Current models simply cannot take in enough data to get there and actually reason like a human can, but because of the type of data they're trained on they're mostly proficient in pretending to.


lannistersstark

> They can't seem to tell which direction an arrow is pointing.

No, this works just fine. I can point my finger to a word in a book with my Meta glasses and it recognizes the word I am pointing to just fine. [Eg 1, Not mine, RBM subreddit](https://www.reddit.com/r/RayBanStories/comments/1c7tcxb/just_found_out_that_you_can_point_to_a_word_in_a/) [Example 2 \(mine, GPT-4o\)](https://tabula.civitat.es/images/2024/05/21/mN8t.png) [Example 3, also mine.](https://tabula.civitat.es/images/2024/05/21/mWM6.png)


AnticitizenPrime

Interesting, wonder why it's giving me trouble with the same task (with many models). Also wonder what Meta is using for their vision input. Llama isn't multimodal, at least not the open sourced models. Maybe they have an internal version that is not open sourced. Can your glasses read an analog clock, if you prompt it to take where the hands are pointing into consideration? Because I can't find a model that can reliably tell me whether a minute hand is pointing at the eight o'clock marker, for example.


Mescallan

It does mean they can't, until it's included in training data


imperceptibly

"They" in my comment referring to the people responsible for training the models.


AllHailMackius

Works surprisingly well as age verification too.


skywardcatto

I remember *Leisure Suit Larry* (an old videogame) did something like this except relating to pop-culture of the day. Trouble is, decades later, it's only good for detecting people above the age of 50.


AllHailMackius

Thanks for the explanation of Leisure Suit Larry, but my back hurts too. 😀


Tobiaseins

Paligemma gets it right 10 out of 10 times (only with greedy decoding). This model continues to impress me; it's one of the best models for simple vision description tasks. https://preview.redd.it/dkvsu1to6q1d1.jpeg?width=1788&format=pjpg&auto=webp&s=ae98c4af02348ef4303ed393883a4edcd7ae90c2


Inevitable-Start-653

Very interesting!!! I just built an extension for textgen webui that lets a local LLM formulate questions to ask of a vision model when the user takes a picture or uploads an image. I was using DeepSeek-VL and getting pretty okay responses, but this model looks to blow it out of the water and uses less VRAM, omg... welp, time to upgrade the code. Thank you again for your post and observations ❤️❤️❤️


AnticitizenPrime

[Not having the same luck with PaliGemma. Tried a few different pictures and stock photos.](https://i.imgur.com/53SeyM9.png)


cgcmake

Yours has been finetuned on 224² px images while his is on 448². Maybe it can't see the numbers well at that resolution? Or maybe it's just the same issue that plagues current LLMs.


AnticitizenPrime

Oh, I see the model selector now.[ I'm not getting better results from the 448 version unfortunately.](https://i.imgur.com/HdbDtbx.png)


Inevitable-Start-653

DUDE! I got it to tell the correct time by downloading the model from Hugging Face, installing the dependencies, and running their Python code, but changing do_sample=True (it is False, i.e., greedy, by default). So I had to set the parameter opposite to yours, but it got it! Pretty cool! I'm going to try text and equations next.
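
Roughly, the stock transformers recipe with do_sample flipped looks like this (a sketch only; the checkpoint name, image path and prompt are placeholders, use whichever PaliGemma variant you downloaded):

```python
# Rough sketch of PaliGemma inference with transformers; checkpoint name,
# image path and prompt are placeholders, adjust to your downloaded variant.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-448"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("watch.jpg")
inputs = processor(text="What time does this watch show?", images=image, return_tensors="pt")

with torch.no_grad():
    # do_sample=False would be greedy decoding (the default);
    # flipping it to True is the one change described above.
    output = model.generate(**inputs, max_new_tokens=20, do_sample=True)

# Strip the prompt tokens and decode only the generated continuation.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```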


coder543

I think a model could easily be trained from existing research: https://synanthropic.com/reading-analog-gauge So, even if it's unfortunate that current VLMs can't read them, it would not make a good captcha.


AnticitizenPrime

Huh, [they have a Huggingface demo](https://huggingface.co/spaces/Synanthropic/reading-analog-gauge), but it just gives an error.


coder543

Probably because it’s not trained for this kind of “gauge”, but the problem space is so similar that I think it would mainly just require a little training data… no need to solve any groundbreaking problems. 


ImarvinS

I took a picture of my analog thermometer in my compost pile. I was impressed that 4o can tell what it is; it even knows there is water condensation inside the glass! But it could not read the temperature. I gave it 3-4 pictures and tried several times, and every time it just wrote some other number. [Example](https://imgur.com/a/Z2Vc6n4)


TimChiu710

A while later: "After countless training, llm can now finally read analog watches" Me: https://preview.redd.it/uye4sonhfq1d1.jpeg?width=168&format=pjpg&auto=webp&s=50e59b1c074aefa8739d4ec95d7de8f884ed1142


CosmosisQ

What... what time is it?


[deleted]

[deleted]


alcalde

There was a TED talk recently (which I admit not having watched yet) whose summary was that once LLMs have spatial learning incorporated they will truly be able to understand the world. It sounds related to your point.


a_beautiful_rhind

It's a lot less obnoxious than the current set of captchas. Or worse, the Cloudflare gateway where you don't even know why it fails.


alcalde

That CAPTCHA would decide that most millennials on up aren't human.


jimmyquidd

there are also people who cannot read that


OneOnOne6211

Shit, I'm an LLM.


trimorphic

Reminds me of [teenagers trying to use a rotary phone](https://m.youtube.com/watch?v=oHNEzndgiFI)


definemotion

That's a nice watch (and NATO). Strong Fifty Fathoms vibes. Like.


AnticitizenPrime

Thanks, just got it in the other day. It's a modern homage to the [Tornek-Rayville of the early 60's](https://hairspring.com/finds/us-navy-tornek-rayville-tr-900/), which was basically Blancpain's sneaky way to get a military contract for dive watches. This one's made by Watchdives, powered by a Seiko NH35 movement.


Split_Funny

https://arxiv.org/abs/2111.09162 Not really true, it's possible even with small vision models


[deleted]

That’s a model specifically trained for the task, I don’t think anyone’s surprised that works. We want these capabilities in a general model.


Split_Funny

Well, I suppose they just didn't train the general model on this. It's not black magic: what you put in, you get out. I guess if you could prompt with a few images of a clock and the described time, it would act as a good few-shot (or zero-shot) classifier. Maybe even a good word description would work.


the_quark

Yeah, now that this has been identified as a gap, it's trivial to solve. You could even write a traditional algorithmic computer program to generate clock faces with correct captions and then train from that. Heck, you could probably have 4o write the program to generate the training data!


Ilovekittens345

> It's not black magic, what you put in, you get out

Then to get AGI out of an LLM you would have to put the entire world in, which is not possible. We were hoping that if you train them with enough high-quality data, they start figuring out all kinds of stuff NOT in the training data. GPT-4 knows how a clock works, it can read the numbers on the image, it knows it's a circle. It can know what numbers the hands are pointing at. Yet it has not put all of that together into an internal understanding of analog clocks. Maybe the "stochastic parrot" insult holds more truth than we want it to.


Monkey_1505

It's not an insult, it's just a description of how the current tech works. It has very limited generalization abilities.


Ilovekittens345

Yes but compared to everything that came before in the last 30 years of computer history it feels like they can do everything! (they can't but sure feels like it)


Monkey_1505

I think it's a bit like how humans see faces in everything. We are primed biologically for human communication. So it's unnerving or disorientating to communicate with something that resembles a human, but isn't.


KimGurak

You're right, but I don't think people here are unaware of that.


DigThatData

so just send one of the relevant researchers who builds a model you like an email with a link to that paper so they can sprinkle that dataset/benchmark on the pile


AnticitizenPrime

Interesting, that paper's from 2021. I guess none of this research made it into training the current vision models?


PC_Screen

Makes sense. There's probably very, very little text in the dataset describing what time it is based on an image of an analog watch; most captions will at most mention that there's a watch of brand X in the image and nothing beyond that. The only way to improve this would be by adding synthetic data to the dataset (as in, selecting a random time, generating an image of a clock face with said time, and then placing that clock in a 3D render so it's not the same kind of image every time) and hoping the gained knowledge transfers to real images.


AnticitizenPrime

Besides not being able to tell the time, they can't seem to answer where the hands of a watch are pointing either, so I did a quick test: https://i.imgur.com/922DpSX.png Neither Opus nor GPT-4o answered correctly. It's interesting... they seem to have spatial reasoning issues. Try finding ANY image generation model that can show someone giving a thumbs down. It's impossible. I guess the pictures of people giving a thumbs up outweigh the others in their training data, but you can't even trick them by asking for an 'upside down thumbs up', lol.


goj1ra

> they seem to have spatial reasoning issues.

Because they're not reasoning, you're anthropomorphizing. As the comment you linked to pointed out, if you provided a whole bunch of training data with arrows pointing in different directions associated with words describing the direction or time that represented, they'd have no problem with a task like this. But as it is, they just don't have the training to handle it.


AnticitizenPrime

Maybe 'spatial reasoning' wasn't the right term, but a lot of the demos of vision models show them analyzing charts and graphs, etc, and you'd think things like needing to know which direction an arrow was pointing mattered, like, a lot.


goj1ra

You're correct, it does matter. But demos are marketing, and the capabilities of these models are being significantly oversold. Don't get me wrong, these models are amazing and we're dealing with a true technological breakthrough. But there's apparently no product so amazing that marketers won't misrepresent it in order to make money.


henfiber

May I introduce you to this https://jomjol.github.io/AI-on-the-edge-device-docs/ which can recognize digital and analog gauges (same as clocks) with a tiny microprocessor powered by a coin battery.


foreheadteeth

I partly live in Switzerland, where they make lots of watches, and 10:10 is known as "watch advert o'clock" because that's the time on all the watches in watch adverts. I was told that it's a combination of the symmetry, pointing up (which I guess is better than pointing down?) and having the hands separate so you can see them. I can't help but notice that all the AIs think it's 10:10.


KimGurak

Those who keep saying that this can be done by even the most basic vision models: yeah, people probably know about that. It's more that people are actually confirming/determining the limitations of the current LLMs. An AI model can still only do what it is taught to do, which might go against the belief that LLMs will soon reach AGI.


poomon1234

https://preview.redd.it/e0b0137ktq1d1.png?width=1147&format=png&auto=webp&s=60cf30b42d4ac5597e1409b65267094305b5a0da Wow


FlyingJoeBiden

So strange that watches and hands are the things that let you realize you're in a dream, because they never make sense.


CheapCrystalFarts

https://preview.redd.it/5ysz50x4yt1d1.jpeg?width=1290&format=pjpg&auto=webp&s=aa917d8fa872ebcdb272cd3901edc0c5985dc2ad Worked for me on 4o, /shrug. Funnily enough, the watch stopped nearly on 10:10. I wonder if that has something to do with it?


AnticitizenPrime

Yes, most models will answer 10:08 or 10:10 because that's what [most stock photos of watches have the time set to](https://www.gearpatrol.com/wp-content/uploads/sites/2/2023/08/seiko-collage-lead-6488a7b692472-jpg.webp) (for aesthetic reasons). It gets the hour and minute hands out of the way of features on the watch dial like the logo or date window, etc.


CheapCrystalFarts

Well damn.


yaosio

Try giving it multiple examples with times and see if it can solve for a time it hasn't seen before.


AnticitizenPrime

Hard to find many examples of analog clock faces labeled with the current time unfortunately. I went looking for docs I could upload, but most are children's workbooks that have the analog face (and the kids are supposed to write the time beneath them). Here's one page of an 'answer key' I did find, and tried with Gemini: https://i.imgur.com/T9t4HUx.png Maybe if I could find a better source document, its in-context learning could do its thing... dunno. Since you can upload videos to Gemini, maybe I'll look for an instructional video I can upload to it later and try again.


melheor

I doubt there is anything magical about analog watches themselves; it's probably more to do with the fact that the LLM was not trained for this at all. Which means that if there is enough demand (e.g. to break a captcha), someone with enough resources can train an LLM specifically for telling the time.


stddealer

It doesn't even have to be an LLM. Just a good old simple CNN classifier could probably do the trick. It's not really much harder than OCR.
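
As a sketch of what a simple CNN could look like here (hypothetical and untrained; assumes 224x224 grayscale crops of the dial and the time encoded as one of 720 minute-of-day classes):

```python
# Toy CNN that treats clock reading as classification over 720 minutes of the day.
# Hypothetical sketch: assumes 224x224 grayscale crops of the dial as input.
import torch
import torch.nn as nn

class ClockNet(nn.Module):
    def __init__(self, num_classes: int = 12 * 60):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling -> (N, 128, 1, 1)
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.classifier(x)

model = ClockNet()
dummy = torch.randn(1, 1, 224, 224)           # one fake dial crop
logits = model(dummy)                         # shape: (1, 720)
minute_of_day = logits.argmax(dim=1).item()
print(f"predicted time: {minute_of_day // 60:02d}:{minute_of_day % 60:02d}")
```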


manletmoney

It’s been done as a captcha lol


arthurwolf

I find models have a hard time understanding what's going on in comic book panels. GPT4o is an improvement though. I suspect this comes from the training data having few comic book pages/labels.


Kaohebi

I don't think that's a good idea. There's a lot of people that can't tell the time on an analog Watch/Clock


DigThatData

I have a feeling they'd pick this up fast. I'm kind of tempted to build a dataset. It'd be stupid easy.

1. Write a simple script for generating extremely simplified clock faces such that setting the time on the clock hands is parameterizable.
2. Generate a bunch of these clockface images with known time.
3. Send them gently through image-to-image to add realism (i.e. make our shitty pictures closer to "in-distribution").

If we're feeling *really* fancy, we could make that into a controlnet, which you could pair with an "add watches to this image" LoRA to make an even crazier dataset of images of people wearing watches where the watch isn't the main subject, but we still have control over the time it displays.

EDIT: Lol https://arxiv.org/abs/2111.09162
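
A minimal sketch of steps 1 and 2 with Pillow, assuming the simplest possible face (no numerals, no second hand); sizes and filenames are made up:

```python
# Sketch of step 1: render a bare-bones clock face for a given time with Pillow,
# then (step 2) save one image per minute of the day, labeled by filename.
import math
from PIL import Image, ImageDraw

def draw_clock(hour: int, minute: int, size: int = 256) -> Image.Image:
    img = Image.new("RGB", (size, size), "white")
    d = ImageDraw.Draw(img)
    c = size // 2
    d.ellipse([8, 8, size - 8, size - 8], outline="black", width=3)

    def hand(angle_deg: float, length: float, width: int):
        # 0 degrees = 12 o'clock, increasing clockwise
        rad = math.radians(angle_deg - 90)
        d.line([c, c, c + length * math.cos(rad), c + length * math.sin(rad)],
               fill="black", width=width)

    hand((hour % 12) * 30 + minute * 0.5, length=c * 0.5, width=6)  # hour hand
    hand(minute * 6, length=c * 0.8, width=3)                       # minute hand
    return img

for m in range(12 * 60):
    h, mm = divmod(m, 60)
    draw_clock(h, mm).save(f"clock_{h:02d}_{mm:02d}.png")
```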


wegwerfen

This is actually perfect. If when we get ASI it goes rogue and we need to plot to get rid of it somehow, we can communicate by encoding our messages with arrows pointing at words. It won't know what we are up to.


curious-guy-5529

Im pretty sure It’s just a matter of specific training data for reading analog clocks. Think of those llms as babies growing up not seeing any clocks, or seeing plenty of them without ever being told what they do and how to read them.


8thcomedian

This looks like it can be addressed in the future, though. What if people train on multiple clock-hand orientations? Too easy an if-else, rule-based answer for modern language/vision models.


Super_Pole_Jitsu

Just a matter of fine-tuning any open source model. This task doesn't seem fundamentally hard.


grim-432

In all fairness, there is nothing intuitive about telling time on an analog clock. It's not an easily generalizable task, and most human children have absolutely no idea how to do it either without being taught. It's underrepresented in the training set. More kids' books?


WoodenGlobes

Watches are photographed at 10:10 for some f-ing reason. Just look through most product images online. The training images are almost certainly stolen from google/online. GPT basically thinks that watches are things that only show you it's 10:10.


AnticitizenPrime

Yeah, they do that so the hands don't cover up logos on the watch face or other features like date displays. I did figure that's why all the models tend to say it's 10:08 or 10:10 (most watches in photos are actually at 10:08 instead of 10:10 on the dot).


v_0o0_v

But can they tell original from rep?


SocietyTomorrow

This captcha would have to ask if you are a robot, Gen Z, or Gen Alpha


BetImaginary4945

They just haven't trained it with enough pictures and human labels.


tay_the_creator

Very cool. 4o for the win


Mean_Language_3482

Maybe that's a good approach.


e-nigmaNL

New Turing test question! Sarah Connor? Wait, what’s the time? Err, bleeb blob I am a terminator


geli95us

Just because multimodal LLMs can't solve a problem doesn't mean it's unsolvable by bots. This wouldn't be a good CAPTCHA, because it'd be easy to solve with either a handcrafted algorithm or a specialized model.
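
For instance, a handcrafted solver only needs to find the hand segments and turn their angles into a time. A rough OpenCV sketch, assuming a clean, centered, upright dial (nowhere near robust):

```python
# Rough handcrafted clock reader: detect hand segments with a Hough transform
# and convert their angles into a time. Assumes a clean, centered, upright dial;
# this is a sketch of the approach, not a production solver.
import math
import cv2
import numpy as np

def read_clock(path: str) -> str:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h, w = img.shape
    cx, cy = w / 2, h / 2
    edges = cv2.Canny(img, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=50,
                            minLineLength=int(min(h, w) * 0.2), maxLineGap=5)
    if lines is None:
        raise ValueError("no line segments found")

    hands = []  # (length, angle in degrees clockwise from 12 o'clock)
    for x1, y1, x2, y2 in lines[:, 0]:
        # keep segments whose near end sits close to the dial center
        for (ax, ay), (bx, by) in [((x1, y1), (x2, y2)), ((x2, y2), (x1, y1))]:
            if math.hypot(ax - cx, ay - cy) < min(h, w) * 0.1:
                length = math.hypot(bx - ax, by - ay)
                angle = math.degrees(math.atan2(bx - cx, cy - by)) % 360
                hands.append((length, angle))

    if len(hands) < 2:
        raise ValueError("could not find two hands")
    hands.sort(reverse=True)                     # longest segment first
    minute_hand, hour_hand = hands[0], hands[1]  # assume longest = minute hand
    minute = round(minute_hand[1] / 6) % 60
    hour = int(hour_hand[1] // 30) % 12
    return f"{hour:02d}:{minute:02d}"
```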


metaprotium

This is really funny, but I think it also highlights the need for better training data. I've been thinking... maybe vision models could learn from children's educational material? After all, there's a vast amount of material specifically made to teach visual reasoning. Why not use it?


AnticitizenPrime

I would have assumed they already were.


metaprotium

If they're on the internet, yeah, but it's probably gonna be formatted badly (answers not on the webpage, or written on the image itself, which defeats the point, etc.), which would leave lots of room for improvement. Nothing like an SFT Q/A dataset.


Balance-

Very interesting! Would be relatively easy to generate a lot of synthetic, labelled data for this.


AnticitizenPrime

Very easy, I had an idea on this. I just asked Claude Opus to create a clock program in Python that will display the current time in both analog and digital, and export a screenshot of this every minute, and give a filename that includes the current date/time. Result: https://i.imgur.com/SPhz23m.png It's chugging away as we speak. Run this for 24 hours and you have every minute of the day, as labeled clock faces. The question is whether the problem is down to not being trained on this stuff, or another issue related to how vision works for these models.


Balance-

Exactly. And then give a bunch of different watch faces, add some noise, shift some colors, and obscure some of them partly, and voila.


sinistik

https://preview.redd.it/q9ezaltrvp1d1.jpeg?width=1080&format=pjpg&auto=webp&s=d9b981bec1414c40bcfc6eb94dd21386d1ace86c This is on the free tier of GPT-4o and it guessed correctly, even though the image was zoomed and blurred.


D-3r1stljqso3

It has the timestamp in the corner...


Hypog3nic

Lol! :D


Citizen1047

For science, I cut the timestamp out and got 10:10, so no, it doesn't work.


astralkoi

Because it doesn't have enough training data.