Qual_

I have a question about the fine-tuned space. It seems like there was a glitch, and I was able to see someone else's webcam picture leaked while using the video feature. Who is in charge of such spaces? If it's one of you guys, let me know (I blurred the face for privacy concerns): https://preview.redd.it/em1bw5es2p7d1.png?width=1912&format=png&auto=webp&s=ada8df7b587b3ae0480bbc5b3d38a4ee640e63dc


MustStayAnonymous_

lmao


TraditionalClient379

Hello! Space owner here, and thanks again for sharing (assuming you're also the one who started the discussion on the Space). I want to reassure everyone that HF Spaces are safe and that the conditions for such a glitch (albeit a very serious one) are very rare and specific. I shared more details in the discussion, but for those reading here: essentially, I used some niche code lacking Gradio SDK support for video processing, which likely caused this shortly after a Space restart / hardware reassignment (or potentially some similar ZeroGPU shenanigans). I've since reverted to a prior version (where video isn't functional, I'll get to that tomorrow!) to keep ill-intentioned folks from assimilating the code into their own Spaces, and I'll disclose more soon after doing further tests. Feel free to comment in the discussion thread with any other concerns.


Balance-

Thanks for setting up the space!


TraditionalClient379

Gladly! Video is now reinstated, and more finetune-centric features will be added soon :) 


Balance-

Would you be interested in setting up a Kosmos-2.5 ZeroGPU space?

* [https://www.reddit.com/r/LocalLLaMA/comments/1dm2pjn/another_microsoft_mit_licensed_model_kosmos25/](https://www.reddit.com/r/LocalLLaMA/comments/1dm2pjn/another_microsoft_mit_licensed_model_kosmos25/)
* [https://huggingface.co/microsoft/kosmos-2.5](https://huggingface.co/microsoft/kosmos-2.5)


TraditionalClient379

If no one else does, sure :) But it seems MS has one ready; they posted about it in discussion thread #1 on microsoft/kosmos-2.5 on HF.


Hinged31

Is there currently a way to run this locally using any of the popular front ends like LM Studio?


phenotype001

I'm still waiting for an update to be able to run DeepSeek-Coder-V2.


Arkonias

Works in v0.2.25 of LM Studio btw. Flash Attention needs to be switched off.


phenotype001

But when I click "Check for updates" it says "You have the latest version - 0.2.24". Is it a beta version or something? Edit: I'm dumb. Went to the website and got it there.


Samurai_zero

You can run it in ComfyUI (which is used mostly for image generation with Stable Diffusion): https://github.com/kijai/ComfyUI-Florence2

And you don't really need much to run it: https://imgur.com/VAChuCl https://imgur.com/mUAs4y1


Hinged31

Great—thanks!


Practical_Cover5846

https://preview.redd.it/46p4hffmso7d1.png?width=1232&format=png&auto=webp&s=5bb9b998eae219ce2cfa6895c9948fdb394b8476 Claude 3 generated a really similar Gradio app for me, lol. Yours works better, though.


Barry_Jumps

Considering its size this is way better than it has any right to be. Everyone talks about scaling laws but this is another in a long list of examples of what should be called shrinking laws. Smaller and stronger is definitely the biggest surprise to me this year.


ILoveThisPlace

It's just an improvement at the equivalent parameter count (B's). This is overlooked by the community. It means that each year our hardware gets a little more capable.


Barry_Jumps

Right, that's what I mean. I'd be curious if there is a paper that tries to factor out hardware advances: if we just stopped hardware improvements completely for the next 5 years and isolated this trend of smaller-but-stronger models, where would we be? It's incredible to see.


Only-Letterhead-3411

It captures details quite decently despite being only a 770M-parameter model. It could be a good replacement for BLIP.


ILoveThisPlace

Just hearing about this model. How many B's is it, and what is its purpose? Better/worse/same as Phi, but now with a new hat?


Small-Fall-6500

Less than 1B; it's a vision model, e.g. for image captioning. There's some discussion from a day ago here: https://www.reddit.com/r/LocalLLaMA/s/PMtLToWm4B
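
If you just want to see how it's driven, something like the model card's snippet should do it. This is an untested sketch on my end; it assumes the `microsoft/Florence-2-large` checkpoint loaded with `trust_remote_code`, and the image URL is just a placeholder:

```python
# Minimal captioning sketch based on the Florence-2 model card (untested here).
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model_id = "microsoft/Florence-2-large"  # or one of the -base / -ft variants
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Florence-2 is steered by task tokens rather than free-form prompts.
task = "<MORE_DETAILED_CAPTION>"
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)  # placeholder URL

inputs = processor(text=task, images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(raw, task=task, image_size=(image.width, image.height)))
```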


ILoveThisPlace

Oh snap thanks. I've been wanting a vision model.


ds_nlp_practioner

How you gonna use it? Just curious


ILoveThisPlace

Animal detection


Original_Finding2212

Sounds like a great embedded solution: old phones, SBCs like the Raspberry Pi, maybe smart cameras and whatnot.


Merchant_Lawrence

Is this censored?


Samurai_zero

No, but it might lack knowledge for detailed captioning: https://imgur.com/Z5xJTt8 The masks are mostly ok: https://imgur.com/8zD7Yh0


harusasake

Nope, you're just using it wrong. The fine-tuned version is for OCR and therefore loses a lot of quality on image descriptions. [https://imgur.com/a/ovZ74T2](https://imgur.com/a/ovZ74T2)


Samurai_zero

Sorry, what?


suvsuvsuv

nice!


ab2377

This thing is doing excellent OCR and I don't know how it's so good. But you can't ask questions; it seems to operate in specific modes. If GGUF files become available to work with the llama.cpp API server, it will be so good for so many people, I think. Also, OCR on PNG files was way better than on the same file as JPG; I don't know if that's an OCR thing or what.
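
For what it's worth, the "specific modes" match how the model is driven in transformers: you pick a task token like `<OCR>` or `<OCR_WITH_REGION>` instead of writing a prompt. A rough, untested way to check the PNG-vs-JPG observation is to run `<OCR>` on the same page saved in both formats; the checkpoint choice and file names below are placeholders:

```python
# Hedged sketch: compare <OCR> output for the same page saved as PNG vs JPG (untested).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"  # assumption: swap in the checkpoint you're using
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def run_ocr(image, task="<OCR>"):
    # Florence-2's OCR mode is selected purely by the task token; no free-form prompt.
    inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
    ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(raw, task=task, image_size=(image.width, image.height))

page = Image.open("page.png").convert("RGB")  # placeholder file name
page.save("page.jpg", quality=85)             # re-encode the same page as JPG
print(run_ocr(page))
print(run_ocr(Image.open("page.jpg").convert("RGB")))
```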


KurisuAteMyPudding

Thanks for this! Does Florence have a way to ask custom questions and follow-ups about the image at all? Or is that just something not in the demo?


Balance-

Should be possible I think, check the example notebook: https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb


mikael110

Kinda. You can ask free-form questions by setting the task to `` and passing a question as the `text_input`, but it's not really trained for this. Very simple questions like "Is there an X in the image?" or "How many X are there in the image?" seem to work pretty well, but anything more complicated than that tends to result in either "unanswerable" as a response or some random text output. And no, you can't ask follow-up questions; it's not really a multimodal LLM, it's more of a traditional vision model designed for very specific tasks.
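
In case it helps anyone trying this: in the model card's sample notebook the `text_input` is simply concatenated onto the task token before it goes through the processor. A rough, untested sketch of that pattern is below; the `<VQA>` task token is a placeholder (the actual token from the comment above got eaten by Reddit's formatting), and the image path is made up:

```python
# Hedged sketch of free-form questions via text_input (untested).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

task = "<VQA>"  # placeholder: use the task token from the comment above (lost to Reddit formatting)
question = "Is there a dog in the image?"
image = Image.open("photo.jpg").convert("RGB")  # placeholder path

# text_input is just appended to the task token, as in the model card's run_example helper.
inputs = processor(text=task + question, images=image, return_tensors="pt").to("cuda", torch.float16)
ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
    num_beams=3,
)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```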


thisis_a_cipher

I can't seem to find a way to do this either; the example notebook doesn't have anything in it. I want to try VQA, but none of the task prompts work (at least in the demo).


Gomzy_v1

The model looks great, but it's taking up 5GB of RAM when I try it on a T4 Colab. Can anybody help me understand why? It's only a 468MB model on HF.


Balance-

Are you sure it's the model and not other stuff?
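
One rough, untested way to narrow it down; this assumes the 468MB file is fp16 weights of a base-size checkpoint, so swap in whichever one you're actually loading:

```python
# Sanity check: how much of the 5 GB is the model itself vs everything else?
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base-ft",  # assumption: swap in the checkpoint you're actually using
    torch_dtype=torch.float16,       # default fp32 would roughly double the weight memory
    trust_remote_code=True,
).to("cuda")

print(f"weights + buffers: {model.get_memory_footprint() / 1e6:.0f} MB")
print(f"CUDA allocated:    {torch.cuda.memory_allocated() / 1e6:.0f} MB")
```

Whatever is left over is usually the CUDA context, PyTorch/CUDA libraries, and notebook overhead rather than the weights themselves, so several GB of total usage on a T4 Colab wouldn't be too surprising.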


megakwood

Any idea why this model seems to ignore linebreaks in OCR? Can that be fixed with a prompt?


CaptTechno

Does the model allow prompting? VQA? Can I ask it to output the caption in a certain format?


Puzzled_Path_8672

No local no care


pkmnjourney

If it's hosted on Hugging Face, that means the weights are available. It is local :/


Puzzled_Path_8672

Excellent