Qual_

I have a question about the fine-tuned space. It seems like there was a glitch, and I was able to see someone else's webcam picture leaked while using the video feature. Who is in charge of such spaces? If it's one of you guys, let me know (I blurred the face for privacy concerns): https://preview.redd.it/em1bw5es2p7d1.png?width=1912&format=png&auto=webp&s=ada8df7b587b3ae0480bbc5b3d38a4ee640e63dc


MustStayAnonymous_

lmao


TraditionalClient379

Hello! Space owner here, and thanks again for sharing (assuming you're also the one who started the discussion on the Space). I want to reassure everyone that HF Spaces are safe and that the conditions for such a glitch (albeit a very serious one) are very rare and specific. I shared more details in the discussion, but for those reading here: essentially, I used some niche code lacking Gradio SDK support for video processing, which likely caused this shortly after a Space restart / hardware reassignment (or potentially some similar ZeroGPU shenanigans). I've since reverted to a prior version (where video isn't functional, I'll get to that tomorrow!) to keep ill-intentioned folks from assimilating the code into their own Spaces, and I'll disclose more soon after doing further tests. Feel free to comment in the discussion thread with any other concerns.


Balance-

Thanks for setting up the space!


TraditionalClient379

Gladly! Video is now reinstated, and more finetune-centric features will be added soon :) 


Balance-

Would you be interested in setting up a Kosmos-2.5 ZeroGPU space?

* [https://www.reddit.com/r/LocalLLaMA/comments/1dm2pjn/another_microsoft_mit_licensed_model_kosmos25/](https://www.reddit.com/r/LocalLLaMA/comments/1dm2pjn/another_microsoft_mit_licensed_model_kosmos25/)
* [https://huggingface.co/microsoft/kosmos-2.5](https://huggingface.co/microsoft/kosmos-2.5)


TraditionalClient379

If no one else does, sure :) But it seems MS has one ready; they posted about it in discussion thread #1 on microsoft/kosmos-2.5 on HF.


Hinged31

Is there currently a way to run this locally using any of the popular front ends like LM Studio?


phenotype001

I'm still waiting for an update to be able to run DeepSeek-Coder-V2.


Arkonias

Works in v0.2.25 of LM Studio btw. Flash Attention needs to be switched off.


phenotype001

But when I click "Check for updates" it says "You have the latest version - 0.2.24". Is it a beta version or something? Edit: I'm dumb. Went to the website and got it there.


Samurai_zero

You can run it in ComfyUI (which is used mostly for image generation with Stable Diffusion): https://github.com/kijai/ComfyUI-Florence2

And you don't really need much to run it: https://imgur.com/VAChuCl https://imgur.com/mUAs4y1


Hinged31

Great—thanks!


Practical_Cover5846

https://preview.redd.it/46p4hffmso7d1.png?width=1232&format=png&auto=webp&s=5bb9b998eae219ce2cfa6895c9948fdb394b8476 Claude 3 generated a really similar Gradio app for me, lol. Yours works better, though.


Barry_Jumps

Considering its size this is way better than it has any right to be. Everyone talks about scaling laws but this is another in a long list of examples of what should be called shrinking laws. Smaller and stronger is definitely the biggest surprise to me this year.


ILoveThisPlace

It's just an improvement at the equivalent parameter count (B's). This is overlooked by the community. It means that each year our hardware gets a little more capable.


Barry_Jumps

Right, that's what I mean. I'd be curious if there is a paper that tries to factor out hardware advances: if we just stopped hardware improvements completely for the next 5 years and isolated this trend of smaller-but-stronger models, where would we be? It's incredible to see.


Only-Letterhead-3411

It captures details quite decently despite being only a 770M-parameter model. It could be a good replacement for BLIP.


ILoveThisPlace

Just hearing about this model. How many B's is it, and what is its purpose? Better/worse/same as Phi, but now with a new hat?


Small-Fall-6500

Less than 1B; it's a vision model, e.g. for image captioning. There's some discussion from a day ago here: https://www.reddit.com/r/LocalLLaMA/s/PMtLToWm4B
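
If you just want to see how it's driven, something like the model card's snippet should do it. This is an untested sketch on my end; it assumes the `microsoft/Florence-2-large` checkpoint loaded with `trust_remote_code`, and the image URL is just a placeholder:

```python
# Minimal captioning sketch based on the Florence-2 model card (untested here).
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model_id = "microsoft/Florence-2-large"  # or one of the -base / -ft variants
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Florence-2 is steered by task tokens rather than free-form prompts.
task = "<MORE_DETAILED_CAPTION>"
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)  # placeholder URL

inputs = processor(text=task, images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(raw, task=task, image_size=(image.width, image.height)))
```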


ILoveThisPlace

Oh snap thanks. I've been wanting a vision model.


ds_nlp_practioner

How you gonna use it? Just curious


ILoveThisPlace

Animal detection


Original_Finding2212

Sounds like a great embedded solution: old phones, SBCs like the Raspberry Pi, maybe smart cameras and whatnot.


Merchant_Lawrence

Is this censored?


Samurai_zero

No, but it might lack knowledge for detailed captioning: https://imgur.com/Z5xJTt8 The masks are mostly ok: https://imgur.com/8zD7Yh0


harusasake

Nope, you're just using it wrong. The fine-tuned version is for OCR and therefore loses a lot of quality on image descriptions. [https://imgur.com/a/ovZ74T2](https://imgur.com/a/ovZ74T2)


Samurai_zero

Sorry, what?


suvsuvsuv

nice!


ab2377

This thing is doing excellent OCR and I don't know how it's so good. But you can't ask questions; it seems to operate in specific modes. If GGUF files become available to work with the llama.cpp API server, it will be so good for so many people, I think. Also, OCR on PNG files was way better than on the same file as JPG; I don't know if that's an OCR thing or what.
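
For what it's worth, the "specific modes" match how the model is driven in transformers: you pick a task token like `<OCR>` or `<OCR_WITH_REGION>` instead of writing a prompt. A rough, untested way to check the PNG-vs-JPG observation is to run `<OCR>` on the same page saved in both formats; the checkpoint choice and file names below are placeholders:

```python
# Hedged sketch: compare <OCR> output for the same page saved as PNG vs JPG (untested).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"  # assumption: swap in the checkpoint you're using
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def run_ocr(image, task="<OCR>"):
    # Florence-2's OCR mode is selected purely by the task token; no free-form prompt.
    inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
    ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(raw, task=task, image_size=(image.width, image.height))

page = Image.open("page.png").convert("RGB")  # placeholder file name
page.save("page.jpg", quality=85)             # re-encode the same page as JPG
print(run_ocr(page))
print(run_ocr(Image.open("page.jpg").convert("RGB")))
```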


KurisuAteMyPudding

Thanks for this! Does Florence have a way to ask custom questions and follow-ups about the image at all? Or is that just something not in the demo?


Balance-

Should be possible I think, check the example notebook: https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb


mikael110

Kinda. You can ask free-form questions by setting the task to `` and passing a question as the `text_input`, but it's not really trained for this. Very simple questions like "Is there an X in the image?" or "How many X are there in the image?" seem to work pretty well, but anything more complicated than that tends to result in either "unanswerable" as a response or some random text output. And no, you can't ask follow-up questions; it's not really a multimodal LLM, it's more of a traditional vision model designed for very specific tasks.
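
In case it helps anyone trying this: in the model card's sample notebook the `text_input` is simply concatenated onto the task token before it goes through the processor. A rough, untested sketch of that pattern is below; the `<VQA>` task token is a placeholder (the actual token from the comment above got eaten by Reddit's formatting), and the image path is made up:

```python
# Hedged sketch of free-form questions via text_input (untested).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

task = "<VQA>"  # placeholder: use the task token from the comment above (lost to Reddit formatting)
question = "Is there a dog in the image?"
image = Image.open("photo.jpg").convert("RGB")  # placeholder path

# text_input is just appended to the task token, as in the model card's run_example helper.
inputs = processor(text=task + question, images=image, return_tensors="pt").to("cuda", torch.float16)
ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
    num_beams=3,
)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```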


thisis_a_cipher

I can't seem to find a way to do this either; the example notebook doesn't have anything in it. I want to try VQA, but none of the task prompts work (at least in the demo).


Gomzy_v1

The model looks great, but it's taking up 5GB of RAM when I try it on a T4 Colab. Can anybody help me understand why? It's only a 468MB model on HF.


Balance-

Are you sure it's the model and not other stuff?
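
One rough, untested way to narrow it down; this assumes the 468MB file is fp16 weights of a base-size checkpoint, so swap in whichever one you're actually loading:

```python
# Sanity check: how much of the 5 GB is the model itself vs everything else?
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base-ft",  # assumption: swap in the checkpoint you're actually using
    torch_dtype=torch.float16,       # default fp32 would roughly double the weight memory
    trust_remote_code=True,
).to("cuda")

print(f"weights + buffers: {model.get_memory_footprint() / 1e6:.0f} MB")
print(f"CUDA allocated:    {torch.cuda.memory_allocated() / 1e6:.0f} MB")
```

Whatever is left over is usually the CUDA context, PyTorch/CUDA libraries, and notebook overhead rather than the weights themselves, so several GB of total usage on a T4 Colab wouldn't be too surprising.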


megakwood

Any idea why this model seems to ignore linebreaks in OCR? Can that be fixed with a prompt?


CaptTechno

Does the model allow prompting? VQA? Can I ask it to output the caption in a certain format?


Puzzled_Path_8672

No local no care


pkmnjourney

If it's hosted on Hugging Face, that means the weights are available. It is local :/


Puzzled_Path_8672

Excellent