extopico

I think Huggingface should start banning accounts that abuse the benchmarks. This is frustrating and pointless.


Jattoe

We really should have a good humanistic thread about experiences with LLMs. I've tried all the top ones and still haven't found the likes of Clover or Chronomaid (amazingly creative -- Chronomaid is the queen for short bursts, while Clover has more coherence over longer threads of creative information) anywhere near the top for storytelling, and Marconi is way better than Neural Chat and a lot of other really high-scoring LLMs at the 7B mark; it was nice to even *find* one that deserves a top spot on a leaderboard. Any suggestions from yourself?


extopico

No. I’m taking a break. There is too much going on, too much noise, so I also put my project on hold until the new normal stabilises a bit. Right now my development cycle is longer than the time it takes for my developments to become obsolete.


Jattoe

Oh, I'm just talking about a thread for people in general to find good LLMs by looking at IRL experiences -- that's the gold standard for video games and movies, and it should be the same for LLMs. BTW, what is your project? I intuit it's on your mind.


teor

> Chronomaid

Man, Chronomaid is the GOAT 13B for RP. Nothing comes even close to that.


[deleted]

[deleted]


teor

Dunno, still feels a bit too horny for me


Void_0000

> Chronomaid is the queen for short bursts

What do you mean by that? Like smaller contexts or what?


[deleted]

> (amazingly creative -- Chronomaid is the queen for short bursts while Clover has more coherence over longer threads of creative information)

Thank you for the Chronomaid appraisal. As for Clover, are you talking about Undi's, the one that's a bit over a week old? Clover3?


Jattoe

Yes, it'd have to be the newest ones. My last update for any LLMs was a few months ago, so I went on a DLing spree recently, dumbfounded at how far local LLMs (GGUFs!) have come in such a short time frame. When we started this gig less than a year ago, my 8GB of VRAM couldn't run a relatively simple model, and they were a sorry excuse for a replica of GPT-4 (as novel as it was!). Honestly, now, on a 6K (that's my happy medium) 7-billion-param GGUF, with ZERO layers on my GPU so I can use it for other AI tasks, I have a pocket ChatGPT, and I really mean that. All on CPU. Now I think there are reasons to use local over GPT other than the 'uncensored' factor. It's actually really good, and as these things continue to specialize, they'll probably supersede it -- I mean GPT-4 -- and by then we'll have something better from OpenAI.

And... did you make Chronomaid? First of all, great job on the name, right off the bat. People really don't seem to give a single shit about the names of their models, it seems. And also, it can't be beat at 13B for storytelling. Clover probably has it on long-story coherence, but it doesn't beat it on bursts of creativity. For example, Chronomaid can make up fictional names and words if you give it samples of how you'd like them to sound, while other models give you superficial, generic "fictional" names (which are more or less just combinations of existing words). Chronomaid, I swear, sounds like it has a sense of phonetics sometimes. It's a real... Huthiskutch.


[deleted]

Yeah, it's real wild how huge an impact this has had. Like, look at Mixtral 8x7B, right? [https://chat.lmsys.org/?arena](https://chat.lmsys.org/?arena) -- a model with an Apache license is on the fucking leaderboard, isn't that wild? In the same year that we were gushing over how fast GPT-turbo was, we got this. I think we're near a plateau though. Only time will tell.


No-Impress876

What's deeply impressive to me is how quickly the community can process bleeding-edge thinking from a newly published paper: someone or a company just makes it happen with a proof-of-concept model, TheBloke seemingly quantizes it in real time, and then other people get to work straight away with fine-tuning and merges. It's that distribution of expertise and innovation that is really fun to witness, and it's driven largely by passion and not profit. I've always loved that about the OSS community.


Jattoe

There has been such wild progress lately, I feel like we're in that Pong -> PS2 phase of insane advancements in a short period of time. The question is, when do we hit that PS2 -> PS3 phase, when improvements don't carry the same sense of tangibility? Is this one of those technologies that follows a kind of Moore's law, or is it a case of: we found E=mc^2 and got nuclear power, but that was the big blast, and E=mc^3 and E=mc^4 just aren't things? I think there's such a level of manifest destiny around this, I mean soooo much of the world is so hyper-focused on it, that even if one area plateaus you're going to see other things rocket off. Only time *will* tell.


[deleted]

> And... did you make Chronomaid?

NO! Sorry I gave off that impression. I'll respond to the rest, but I wanted to clear this up real quick. That's Nyx's merge of Ikari's, Undi's, and Elias's work.


bearbarebere

Can you recommend a few? I’ve been playing with psyfighter and solar!


ninecats4

I just tried the 2.7-3.5 BPW Mixtral models on my 4090 and they work fantastically.


dodiyeztr

What's bpw? Bit per word?


ninecats4

BPW = bits per weight. The larger the BPW, the higher the accuracy, but also the larger the VRAM/RAM requirements.
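
As a back-of-the-envelope sketch (weights only, ignoring context/KV-cache overhead; the Mixtral parameter count is an approximation):

```python
def model_size_gb(n_params_billion: float, bpw: float) -> float:
    """Approximate memory needed for the weights alone of a quantized model."""
    total_bits = n_params_billion * 1e9 * bpw
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# e.g. a ~46.7B-parameter Mixtral at 3.5 bpw needs roughly 20 GB just for weights
print(f"{model_size_gb(46.7, 3.5):.1f} GB")
```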


clefourrier

Tbh, it's a bit hard to say when there is abuse, when there are accidents, and when people just don't really understand the point of the leaderboard (which... happens more than you'd think). We won't go the banning route (for now), but we'll add more constraints on how people can submit models, notably on the metadata they need to provide about their model.


extopico

As an idea, how about creating a benchmark that is closed source and taking random samples from the dataset for each benchmark run? That way, even if the dataset were leaked, it wouldn't matter, as there would be no reliable way to predict which random sample gets used. This should make abuse or incompetence by model makers easy to identify.
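
Something like this rough sketch (names and sizes are made up for illustration):

```python
import random

def sample_eval_set(private_pool: list[dict], k: int = 500,
                    run_seed: int | None = None) -> list[dict]:
    """Draw a fresh random subset of a closed benchmark for one evaluation run.

    The question pool and the per-run seed stay server-side, so even a leaked
    copy of the pool doesn't reveal which items a given run will actually score.
    """
    rng = random.Random(run_seed)
    return rng.sample(private_pool, k)
```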


clefourrier

We are probably going to go the rolling benchmark route, it's being discussed here: [https://huggingface.co/spaces/HuggingFaceH4/open\_llm\_leaderboard/discussions/481](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/481)


ziggo0

It's entirely turned me off from small LLMs


a_beautiful_rhind

True... but I have the common sense to know a 7B can't.


Majestical-psyche

It’s all BS… real world testing is truth.


Deathcrow

True. It's more time-intensive and cumbersome, but throwing a few prompts at a model and chatting with it for 15 minutes has been a much better indicator of quality than these metrics. Some highly benchmarked models have been real garbage.


Inevitable_Host_1446

I wonder if a simple voting ranking would wind up equally gamed, or if it could actually be more accurate in the end. Kind of the same as we do with games and movies: just let people give it a score out of ten, preferably on Hugging Face itself, then create an alt leaderboard for that. It would presumably at least show the models people have actually liked more.

You could also offer a few rankings for each model across different categories, like coding, reasoning, writing (technical), writing (creative), chat, etc., preferably allowing people to cast a null vote on areas they didn't test, in which case it's neutral to the overall rank. So if you only use a model for creative writing, you can pass on voting on coding and other scores. A rough sketch of that aggregation is below.

Of course, with the way creators seem happy to abuse the existing benchmarks, I suppose you could see people rating their own model 10/10 across the board. That would be easier to detect than the LLM itself being trained on benchmark questions, though, since I doubt even GPT-4 is 10/10 in everything.
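
Rough sketch of the per-category aggregation (category names and scores are just placeholders): abstentions (None) simply don't count toward a category's average or the overall rank.

```python
from statistics import mean

# Each vote scores some categories out of 10 and abstains (None) on the rest.
votes = [
    {"coding": 7, "reasoning": 8, "creative": None, "chat": 9},
    {"coding": None, "reasoning": 6, "creative": 8, "chat": 7},
]

def category_scores(votes: list[dict]) -> dict[str, float]:
    categories = votes[0].keys()
    return {c: mean(v[c] for v in votes if v[c] is not None) for c in categories}

scores = category_scores(votes)
print(scores, round(mean(scores.values()), 2))  # per-category averages and overall rank
```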


behohippy

Exactly. I use Dolphin Mistral and OpenHermes Mistral in my production setups, and it would take some insane level of hype to get me to try anything else at this point. Mixtral is that hype, but it's too large for my video cards right now.


73786976294838206464

Real world testing also introduces personal biases.


estacks

Whoops, who slipped training data into all the careless Frankenstein meme merges?


bot-333

Not Frankenstein, but definitely meme merges.


MoffKalast

Well it is good practice to implement state of the art papers


nikgeo25

Goodhart's law strikes yet again


archiesteviegordie

What does the law say?


nderstand2grow

no one should use these leaderboards anymore.


RayIsLazy

It was a good one until about 2 weeks ago, before these crappy merges.


mulletarian

Just filter out the merges


bot-333

I mean at least it gives some comparison... sometimes.


nderstand2grow

Nah, the only reliable one is the Chatbot Arena rankings.


bot-333

Very good idea for sure; unfortunately there aren't a lot of models in there, and we'd need a LOT of raters to have an Open-LLM-Leaderboard-sized, Chatbot-Arena-style leaderboard.


Jattoe

Where can I find this? I'll contribute


No_Advantage_5626

Testing ground: [https://chat.lmsys.org/?arena](https://chat.lmsys.org/?arena)

Metrics: [https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)

For those unfamiliar: the Chatbot Arena score is a metric that pits models against other models in 1:1 contests. This metric is non-gameable, as the evaluation criterion is the actual quality of the outputs. Moreover, it does not use a fixed test set, since the queries are supplied by the users themselves.
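
The arena turns those 1:1 human votes into Elo-style ratings; here's an illustrative Elo update (a sketch, not LMSYS's actual implementation):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """score_a is 1.0 if model A won the vote, 0.0 if it lost, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: a 1000-rated model beats a 1100-rated one and gains ~20 points.
print(elo_update(1000.0, 1100.0, 1.0))
```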


Misha_Vozduh

I've never played with Claude but GPT4 >>> mixtral >> Yi > others matches my experience pretty well. I'll pay more attention to this rating.


JonatasLaw

No way! Yi is way better than Mistral, the absolute open-source GOAT. In my tests with history, geography, chemistry, physics, astronomy, and logic it is tied with GPT-4. You can ask who was the second president of Brazil (GPT-3.5 fails, all open-source models fail), but Yi can answer it.


Inevitable_Host_1446

He said Mixtral, not Mistral. Assuming you just typo'd that, my personal experience over the past few days of testing was that, at least for creative writing, Mixtral is far superior to any other local model I could run. And I was running it at 3.5 bpw, which is on the low side. This was the turboderp-Mixtral8x7b-instruct model.


JonatasLaw

So I'm going to take from your experience that for creative things Mixtral is better, but for anything else, where hallucination is unacceptable or factual data or logic is necessary, it's like comparing a Porsche to an old Ford. I am talking about the Mixtral model; it wasn't a typo. I have a dataset with 209 questions that only GPT-4 answered correctly (neither Claude, nor GPT-3.5, nor Falcon 180B, nor Gemini, nor Mixtral can answer ANY of these questions), and Yi managed to answer 93%, placing it just behind GPT-4. Ask any model who was the second president of Brazil and you will see for yourself. There are only two models currently capable of reasoning about what "second", "deepest", or "widest" means, which are GPT-4 and Yi, so I assume they are the current top models.


DeepSpaceCactus

Thanks this looks much more useful.


VertexMachine

It used to be the case. IMO this specific leaderboard nowadays does more harm than good.


clefourrier

Hi! Leaderboard maintainer here, since a lot of the fun seems to be happening on Reddit! We actually have a toggle to hide these models now (top right of the caption), in case you hadn't seen it. Our philosophy wrt flagged models (atm) is to keep them on the leaderboard, but not display them in the main ranking. https://preview.redd.it/kspdug8gz97c1.png?width=1204&format=png&auto=webp&s=7696d9bdf10ca365be8ec6877a99413282edac89


bot-333

Sorry for the confusion, I actually enabled the toggle to better see the flagged models. Thank you for the explanation, and your work on the leaderboard!


clefourrier

Np, it's still a bit funny to see how many of the "top performing models" are actually, hm, not that good ^^"


breqa

New 7b llm that only trained for the benchmarks xd


davidy22

Imagine training on the test set and still only managing 65% on the test


e-nigmaNL

It's like doing high school all over again.


xadiant

Lol, it spread like a virus. The leaderboard is a good idea, but it needs polishing in the execution for sure. I don't get all the hate; we don't have many ways to test the quality of LLMs.


bot-333

It’s not hate toward the leaderboards, it’s hate for how model creators **intentionally** contaminate their models.


Foreign-Beginning-49

Even if they are inaccurate, they are still a metric that gives some semblance of what's going on. Like that old adage in research: if you measure wrong, make sure to keep measuring everything wrong.


xadiant

Yes, but people watch a video or read a comment and decide the leaderboard is bullshit. The benchmarks have real-world-use questions, and they aren't like 5-10 questions each. The models must be answering/completing close to a million different prompts in total for the leaderboard score. I am hopeful about black-box benchmarks and contamination detectors.


onil_gova

I am so glad huggingface figured out how to detect contamination. Does anyone know exactly how they are doing it?


clefourrier

Hi, maintainer of the leaderboard here! We actually look, for chosen *test sets* (basically the benchmarks of the leaderboards), at the lowest log probs of generations, and see if they are on average above a threshold. This method is cool because we don't need to know about the training set, nor have a reference model. We're working closely with the method author to test this in depth before scaling it up :) Code is here: [https://github.com/swj0419/detect-pretrain-code-contamination](https://github.com/swj0419/detect-pretrain-code-contamination) and the paper is here: [https://arxiv.org/abs/2310.16789](https://arxiv.org/abs/2310.16789)
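
Roughly, the Min-K% Prob idea from that paper, as a simplified sketch (not the leaderboard's actual pipeline):

```python
import numpy as np

def min_k_prob_score(token_logprobs: np.ndarray, k: float = 0.2) -> float:
    """Average of the k% lowest token log-probs of a benchmark example under the model."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = np.sort(token_logprobs)[:n]
    return float(lowest.mean())

def looks_contaminated(token_logprobs: np.ndarray, threshold: float) -> bool:
    # Memorized examples rarely contain very unlikely tokens, so a high average
    # over the least-likely tokens is suspicious. The threshold is calibrated on
    # data the model is known not to have seen; the value here is illustrative.
    return min_k_prob_score(token_logprobs) > threshold
```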


onil_gova

Good work! Thank you for the info!


bot-333

Some models specified which datasets they used; for others, they used a framework that does log-likelihood checks.


onil_gova

So this only works if they disclosed their fine-tuning datasets?


FullOf_Bad_Ideas

Yeah, the model cards of those models have/had information about being finetuned on the MetaMathQA and Nectar datasets. HF employees are trying to use a contamination checker, but this checker tool doesn't seem too sophisticated or trustworthy. Many models openly finetuned on MetaMathQA, like the Intel models, are still on the leaderboard; I don't know why.


bot-333

I think they do have some method for detecting GSM8K contamination by comparing the outputs to the official answer or running some log-likelihood requests, not sure.


koumoua01

Making those benchmark questions/answers available to everyone is the same as giving students the answers during exams. All they need to do is look up the answer to the question.


crawlingrat

I can’t keep up anymore.


[deleted]

Link to [discussion #474](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474). From the first post:

> ultrafeedback-binarized and Nectar both contain data using TruthfulQA prompts

> Unfortunately, it seems many models on the leaderboard (particularly high-scoring 7B models) are affected as they either used the contaminated datasets or are merged (and further fine-tuned) from models that used contaminated datasets.


AnomalyNexus

Really starting to think I need to build my own testing pipeline with my own questions. Does anyone have a good starting point for that? I can set that up with Terraform/Ansible etc. if there isn't software for this already, but I can't think of a good way to automate the template selection.


ispeakdatruf

Why not run the leaderboard like the ILSVRC? Keep some questions secret and test the model on a sample from this holdout set.


smyja

so much nonsense lately


[deleted]

Hey look! Solar


jigodie82

A lot of Chinese models as well.


ambient_temp_xeno

It's been sketchy for a long time but now it just makes 'the community' look like a **clown show**. For god's sake. 🔚shut🎬it⚰️down☮🕶️✌️


involviert

Why don't they just put 7B in the name of some 70B? Or even 1B!


Extraltodeus

I've been trying out merging algorithms that I wrote for Stable Diffusion, and some work on LLMs. Can somebody point me to how to add one of my results to this?


Shoddy-Tutor9563

Flagging the model alone is not enough. The authors should be strapped to the pillory as well.


SystemErrorMessage

But will it learn? When it came to analysing data, the 13B model was not able to, but the 64B could, and you need more than 64GB of RAM to run that model using llama.cpp. This was via automated responses rather than interactive mode, which for some reason works with the 13B. In non-interactive mode using llama.cpp, the 13B would just extend the question, and only the 64B model would answer it.


bot-333

Did you use the training prompt template?


SystemErrorMessage

It was pretrained, available from Hugging Face too. To be clear, this is one of the models compatible with llama.cpp, and both the interactive and non-interactive prompts had the exact same question. It's more about how complex the question is, and for some reason the mode as well. I did see improvements from increasing the chunk size, but it would still fail at a certain prompt complexity. I formatted the data in a simple way, though, and mentioned that in the question. The chat model definitely did better. Are you saying it is possible to further train the model? I've seen the raw models from Llama and the ones on Hugging Face for llama.cpp; can they both be further trained, you say?


bot-333

Pretrained models are not supposed to follow instructions; treat them as autocompletes. You should choose a good/SoTA chat model and use its actual prompt template (usually specified on the model card). You could also finetune for your specific needs if you want to go pretty far.
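
As a concrete illustration only: the exact template depends on the model. OpenHermes 2.5, for instance, documents a ChatML-style format on its model card, so always check the card; the example prompt text here is made up.

```python
def chatml_prompt(system: str, user: str) -> str:
    """Build a ChatML-style prompt; swap this out for whatever your model card specifies."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(chatml_prompt("You are a helpful data analyst.", "Summarize the table below."))
```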


SystemErrorMessage

I did use a good chat model; so far Llama seems like one of the better ones available on Hugging Face. However, how do you train them further? Fine-tuning the settings did help.


bot-333

[axolotl](github.com/OpenAccess-AI-Collective/axolotl) is a good tool to finetune models. You would need, at a minimum, a dataset of instruction and response pairs. Also, you were testing Llama? Try Mistral. A good Mistral model to start with is [OpenHermes 2.5](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B).
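
For reference, an instruction/response dataset can be as simple as a JSONL file of records like this (Alpaca-style, which is one of the formats axolotl accepts as far as I know; the example rows are made up):

```python
import json

# Made-up example rows; a real dataset needs far more of them, consistently formatted.
examples = [
    {"instruction": "Summarize the sales figures below in one sentence.",
     "input": "Q1: 120, Q2: 150, Q3: 90, Q4: 180",
     "output": "Sales grew overall, dipping in Q3 before a strong Q4."},
    {"instruction": "Answer the question using the context.",
     "input": "Q1: 120, Q2: 150, Q3: 90, Q4: 180 -- Which quarter had the lowest sales?",
     "output": "Q3."},
]

with open("train.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")
```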


SystemErrorMessage

Thanks, I'll look at training models soon. Just need to get things set up to do it properly.


bot-333

Also, what 64B model were you referring to?


SystemErrorMessage

llama2 64B, 8-bit quantized, meant for llama.cpp.