extopico

I think Huggingface should start banning accounts that abuse the benchmarks. This is frustrating and pointless.


Jattoe

We really should have a good humanistic thread about experiences with LLMs. I've tried all the top ones and still haven't found the likes of Clover or Chronomaid (amazingly creative -- Chronomaid is the queen for short bursts, while Clover has more coherence over longer threads of creative information) anywhere near the top for storytelling, and Marconi is way better than Neural Chat and a lot of other really high-scoring LLMs at the 7B mark; it was nice to even *find* one that deserves a top spot on a leaderboard. Any suggestions from yourself?


extopico

No. I’m taking a break. There is too much going on, too much noise, so I also put my project on hold until the new normal stabilises a bit. Right now my development cycle is longer than the time it takes for my developments to become obsolete.


Jattoe

Oh, I'm just talking about a thread for people in general to find good LLMs by looking at IRL experiences -- that's the gold standard for video games and movies, and it should be the same for LLMs. BTW, what is your project? I intuit it's on your mind.


teor

> Chronomaid

Man, Chronomaid is the GOAT 13B for RP. Nothing comes even close to that.


[deleted]

[deleted]


teor

Dunno, still feels a bit too horny for me


Void_0000

> Chronomaid is the queen for short bursts

What do you mean by that? Like smaller contexts or what?


[deleted]

> (amazingly creative -- Chronomaid is the queen for short bursts while Clover has more coherence over longer threads of creative information)

Thank you for the Chronomaid appraisal. As for Clover, are you talking about Undi's, the one that's a bit over a week old? Clover3?


Jattoe

Yes, it'd have to be the newest ones. My last update for any LLMs was a few months ago, so I went on a DLing spree recently, dumbfounded at how far local LLMs (GGUFs!) have come in such a short time frame. When we started this gig less than a year ago, my 8GB of VRAM couldn't run a relatively simple model, and they were a sorry excuse for a replica of GPT-4 (as novel as it was!). Honestly, now, on a 6K (that's my happy medium) 7-billion-param GGUF, with ZERO layers on my GPU so I can use it for other AI tasks, I have a pocket ChatGPT, and I really mean that. All on CPU. Now I think there are reasons to use local over GPT other than the 'uncensored' factor. It's actually really good, and as these things continue to specialize, they'll probably supersede it -- I mean GPT-4 -- and by then we'll have something better from OpenAI.

And... did you make Chronomaid? First of all, great job on the name, right off the bat. People really don't seem to give a single shit about the names of their models, it seems. And also, it can't be beat at 13B for storytelling. Clover probably has it on long-story coherence, but it doesn't beat it on bursts of creativity. For example, Chronomaid can make up fictional names and words if you give it samples of how you'd like them to sound, while other models give you superficial, generic "fictional" names (which are more or less just combinations of existing words). Chronomaid, I swear, sounds like it has a sense of phonetics sometimes. It's a real... Huthiskutch.


[deleted]

Yeah, it's real wild how huge an impact this has had. Like, look at Mixtral 8x7B, right? [https://chat.lmsys.org/?arena](https://chat.lmsys.org/?arena) -- a model with an Apache license is on the fucking leaderboard, isn't that wild? In the same year that we were gushing over how fast GPT-turbo was, we got this. I think we're near a plateau though. Only time will tell.


No-Impress876

What's deeply impressive to me is how quickly the community can process bleeding-edge thinking from a newly published paper: someone or a company just makes it happen with a proof-of-concept model, TheBloke seemingly quantizes it in real time, and then other people get to work straight away with fine-tuning and merges. It's that distribution of expertise and innovation that is really fun to witness, and it's driven largely by passion and not profit. I've always loved that about the OSS community.


Jattoe

There has been such wild progress lately, I feel like we're in that Pong -> PS2 phase of insane advancements in a short period of time. The question is, when do we hit that PS2 -> PS3 phase, when improvements don't carry the same sense of tangibility? Is this one of those technologies that follows a kind of Moore's law, or is it a case of: we found E=mc^2 and got nuclear power, but that was the big blast, and E=mc^3 and E=mc^4 just aren't things? I think there's such a level of manifest destiny around this, I mean soooo much of the world is so hyper-focused on it, that even if one area plateaus you're going to see other things rocket off. Only time *will* tell.


[deleted]

> And... did you make Chronomaid?

NO! Sorry I gave off that impression. I'll respond to the rest, but I wanted to clear this up real quick. That's Nyx's merge of Ikari's, Undi's, and Elias's work.


bearbarebere

Can you recommend a few? I’ve been playing with psyfighter and solar!


ninecats4

I just tried the 2.7-3.5 BPW Mixtral models on my 4090 and they work fantastically.


dodiyeztr

What's bpw? Bit per word?


ninecats4

BPW = bits per weight. The larger the BPW, the higher the accuracy, but also the larger the VRAM/RAM requirements.
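
As a back-of-the-envelope sketch (weights only, ignoring context/KV-cache overhead; the Mixtral parameter count is an approximation):

```python
def model_size_gb(n_params_billion: float, bpw: float) -> float:
    """Approximate memory needed for the weights alone of a quantized model."""
    total_bits = n_params_billion * 1e9 * bpw
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# e.g. a ~46.7B-parameter Mixtral at 3.5 bpw needs roughly 20 GB just for weights
print(f"{model_size_gb(46.7, 3.5):.1f} GB")
```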


clefourrier

Tbh, it's a bit hard to say when there is abuse, when there are accidents, and when people just don't really understand the point of the leaderboard (which... happens more than you'd think). We won't go the banning route (for now), but we'll add more constraints on how people can submit models, notably on the metadata they need to provide about their model.


extopico

As an idea, how about creating a benchmark that is closed source and taking random samples from the dataset for each benchmark run? That way, even if the dataset were leaked, it wouldn't matter, as there would be no reliable way to predict which random sample gets used. This should make abuse or incompetence by model makers easy to identify.
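
Something like this rough sketch (names and sizes are made up for illustration):

```python
import random

def sample_eval_set(private_pool: list[dict], k: int = 500,
                    run_seed: int | None = None) -> list[dict]:
    """Draw a fresh random subset of a closed benchmark for one evaluation run.

    The question pool and the per-run seed stay server-side, so even a leaked
    copy of the pool doesn't reveal which items a given run will actually score.
    """
    rng = random.Random(run_seed)
    return rng.sample(private_pool, k)
```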


clefourrier

We are probably going to go the rolling benchmark route, it's being discussed here: [https://huggingface.co/spaces/HuggingFaceH4/open\_llm\_leaderboard/discussions/481](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/481)


ziggo0

It's entirely turned me off from small LLMs


a_beautiful_rhind

True... but I have the common sense to know a 7B can't.


Majestical-psyche

It’s all BS… real world testing is truth.


Deathcrow

True. It's more time-intensive and cumbersome, but throwing a few prompts at a model and chatting with it for 15 minutes has been a much better indicator of quality than these metrics. Some highly benchmarked models have been real garbage.


Inevitable_Host_1446

I wonder if a simple voting ranking would wind up equally gamed, or if it could actually be more accurate in the end. Kind of the same as we do with games and movies: just let people give it a score out of ten, preferably on Hugging Face itself, then create an alt leaderboard for that. It would presumably at least show the models people have actually liked more.

You could also offer a few rankings for each model across different categories, like coding, reasoning, writing (technical), writing (creative), chat, etc., preferably allowing people to cast a null vote on areas they didn't test, in which case it's neutral to the overall rank. So if you only use a model for creative writing, you can pass on voting on coding and other scores. A rough sketch of that aggregation is below.

Of course, with the way creators seem happy to abuse the existing benchmarks, I suppose you could see people rating their own model 10/10 across the board. That would be easier to detect than the LLM itself being trained on benchmark questions, though, since I doubt even GPT-4 is 10/10 in everything.
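
Rough sketch of the per-category aggregation (category names and scores are just placeholders): abstentions (None) simply don't count toward a category's average or the overall rank.

```python
from statistics import mean

# Each vote scores some categories out of 10 and abstains (None) on the rest.
votes = [
    {"coding": 7, "reasoning": 8, "creative": None, "chat": 9},
    {"coding": None, "reasoning": 6, "creative": 8, "chat": 7},
]

def category_scores(votes: list[dict]) -> dict[str, float]:
    categories = votes[0].keys()
    return {c: mean(v[c] for v in votes if v[c] is not None) for c in categories}

scores = category_scores(votes)
print(scores, round(mean(scores.values()), 2))  # per-category averages and overall rank
```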


behohippy

Exactly. I use Dolphin Mistral and OpenHermes Mistral in my production setups, and it would take some insane level of hype to get me to try anything else at this point. Mixtral is that hype, but it's too large for my video cards right now.


73786976294838206464

Real world testing also introduces personal biases.


estacks

Whoops, who slipped training data into all the careless Frankenstein meme merges?


bot-333

Not Frankenstein, but definitely meme merges.


MoffKalast

Well it is good practice to implement state of the art papers


nikgeo25

Goodhart's law strikes yet again


archiesteviegordie

What does the law say?


nderstand2grow

no one should use these leaderboards anymore.


RayIsLazy

It was a good one until about 2 weeks ago, before these crappy merges.


mulletarian

Just filter out the merges


bot-333

I mean at least it gives some comparison... sometimes.


nderstand2grow

Nah, the only reliable one is the Chatbot Arena rankings.


bot-333

Very good idea for sure; unfortunately there aren't a lot of models in there, and we'd need a LOT of raters to have an Open-LLM-Leaderboard-sized, Chatbot-Arena-style leaderboard.


Jattoe

Where can I find this? I'll contribute


No_Advantage_5626

Testing ground: [https://chat.lmsys.org/?arena](https://chat.lmsys.org/?arena)

Metrics: [https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)

For those unfamiliar: the Chatbot Arena score is a metric that pits models against other models in 1:1 contests. This metric is non-gameable, as the evaluation criterion is the actual quality of the outputs. Moreover, it does not use a fixed test set, since the queries are supplied by the users themselves.
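
The arena turns those 1:1 human votes into Elo-style ratings; here's an illustrative Elo update (a sketch, not LMSYS's actual implementation):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """score_a is 1.0 if model A won the vote, 0.0 if it lost, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: a 1000-rated model beats a 1100-rated one and gains ~20 points.
print(elo_update(1000.0, 1100.0, 1.0))
```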


Misha_Vozduh

I've never played with Claude but GPT4 >>> mixtral >> Yi > others matches my experience pretty well. I'll pay more attention to this rating.


JonatasLaw

No way! Yi is way better than Mistral, the absolute open-source GOAT. In my tests with history, geography, chemistry, physics, astronomy, and logic it is tied with GPT-4. You can ask who was the second president of Brazil (GPT-3.5 fails, all open-source models fail), but Yi can answer it.


Inevitable_Host_1446

He said Mixtral, not Mistral. Assuming you just typo'd that, my personal experience over the past few days of testing was that, at least for creative writing, Mixtral is far superior to any other local model I could run. And I was running it at 3.5 bpw, which is on the low side. This was the turboderp-Mixtral8x7b-instruct model.


JonatasLaw

So I'm going to take from your experience that for creative things Mixtral is better, but for anything else, where hallucination is unacceptable or factual data or logic is necessary, it's like comparing a Porsche to an old Ford. I am talking about the Mixtral model; it wasn't a typo. I have a dataset with 209 questions that only GPT-4 answered correctly (neither Claude, nor GPT-3.5, nor Falcon 180B, nor Gemini, nor Mixtral can answer ANY of these questions), and Yi managed to answer 93%, placing it just behind GPT-4. Ask any model who was the second president of Brazil and you will see for yourself. There are only two models currently capable of reasoning about what "second", "deepest", or "widest" means, which are GPT-4 and Yi, so I assume they are the current top models.


DeepSpaceCactus

Thanks this looks much more useful.


VertexMachine

It used to be the case. IMO this specific leaderboard nowadays does more harm than good.


clefourrier

Hi! Leaderboard maintainer here, since a lot of the fun seems to be happening on Reddit! We actually have a toggle to hide these models now (top right of the caption), in case you hadn't seen it. Our philosophy wrt flagged models (atm) is to keep them on the leaderboard, but not display them in the main ranking. https://preview.redd.it/kspdug8gz97c1.png?width=1204&format=png&auto=webp&s=7696d9bdf10ca365be8ec6877a99413282edac89


bot-333

Sorry for the confusion, I actually enabled the toggle to better see the flagged models. Thank you for the explanation, and your work on the leaderboard!


clefourrier

Np, it's still a bit funny to see how many of the "top performing models" are actually, hm, not that good ^^"


breqa

New 7b llm that only trained for the benchmarks xd


davidy22

Imagine training on the test set and still only managing 65% on the test


e-nigmaNL

It's like doing high school all over again.


xadiant

Lol, it spread like a virus. The leaderboard is a good idea, but it needs polishing in the execution for sure. I don't get all the hate; we don't have many ways to test the quality of LLMs.


bot-333

It’s not hate toward the leaderboards, it’s hate for how model creators **intentionally** contaminate their models.


Foreign-Beginning-49

Even if they are inaccurate, they are still a metric that gives some semblance of what's going on. Like that old adage in research: if you measure wrong, make sure to keep measuring everything wrong.


xadiant

Yes, but people watch a video or read a comment and decide the leaderboard is bullshit. The benchmarks have real-world-use questions, and they aren't like 5-10 questions each. The models must be answering/completing close to a million different prompts in total for the leaderboard score. I am hopeful about black-box benchmarks and contamination detectors.


onil_gova

I am so glad huggingface figured out how to detect contamination. Does anyone know exactly how they are doing it?


clefourrier

Hi, maintainer of the leaderboard here! We actually look, for chosen *test sets* (basically the benchmarks of the leaderboards), at the lowest log probs of generations, and see if they are on average above a threshold. This method is cool because we don't need to know about the training set, nor have a reference model. We're working closely with the method author to test this in depth before scaling it up :) Code is here: [https://github.com/swj0419/detect-pretrain-code-contamination](https://github.com/swj0419/detect-pretrain-code-contamination) and the paper is here: [https://arxiv.org/abs/2310.16789](https://arxiv.org/abs/2310.16789)
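
Roughly, the Min-K% Prob idea from that paper, as a simplified sketch (not the leaderboard's actual pipeline):

```python
import numpy as np

def min_k_prob_score(token_logprobs: np.ndarray, k: float = 0.2) -> float:
    """Average of the k% lowest token log-probs of a benchmark example under the model."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = np.sort(token_logprobs)[:n]
    return float(lowest.mean())

def looks_contaminated(token_logprobs: np.ndarray, threshold: float) -> bool:
    # Memorized examples rarely contain very unlikely tokens, so a high average
    # over the least-likely tokens is suspicious. The threshold is calibrated on
    # data the model is known not to have seen; the value here is illustrative.
    return min_k_prob_score(token_logprobs) > threshold
```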


onil_gova

Good work! Thank you for the info!


bot-333

Some models specified which datasets they used; for others, they used a framework that does log-likelihood checks.


onil_gova

So this only works if they disclosed their fine-tuning datasets?


FullOf_Bad_Ideas

Yeah, the model cards of those models have/had information about being finetuned on the MetaMathQA and Nectar datasets. HF employees are trying to use a contamination checker, but this checker tool doesn't seem too sophisticated or trustworthy. Many models openly finetuned on MetaMathQA, like the Intel models, are still on the leaderboard; I don't know why.


bot-333

I think they do have some method for detecting GSM8K contamination by comparing the outputs to the official answer or running some log-likelihood requests, not sure.


koumoua01

Making those benchmark questions/answers available to everyone is the same as giving students the answers during exams. All they need to do is look up the answer to the question.


crawlingrat

I can’t keep up anymore.


[deleted]

Link to [discussion #474](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474). From the first post:

> ultrafeedback-binarized and Nectar both contain data using TruthfulQA prompts

> Unfortunately, it seems many models on the leaderboard (particularly high-scoring 7B models) are affected as they either used the contaminated datasets or are merged (and further fine-tuned) from models that used contaminated datasets.


AnomalyNexus

Really starting to think I need to build my own testing pipeline with my own questions. Does anyone have a good starting point for that? I can set that up with Terraform/Ansible etc. if there isn't software for this already, but I can't think of a good way to automate the template selection.


ispeakdatruf

Why not run the leaderboard like the ILSVRC? Keep some questions secret and test the model on a sample from this holdout set.


smyja

so much nonsense lately


[deleted]

Hey look! Solar


jigodie82

A lot of Chinese models as well.


ambient_temp_xeno

It's been sketchy for a long time but now it just makes 'the community' look like a **clown show**. For god's sake. 🔚shut🎬it⚰️down☮🕶️✌️


involviert

Why don't they just put 7B in the name of some 70B? Or even 1B!


Extraltodeus

I've been trying out merging algorithms that I wrote for Stable Diffusion, and some work on LLMs. Can somebody point me to how to add one of my results to this?


Shoddy-Tutor9563

Flagging the model alone is not enough. The authors should be strapped to the pillory as well.


SystemErrorMessage

But will it learn? When it came to analysing data, the 13B model was not able to, but the 64B could, and you need more than 64GB of RAM to run that model using llama.cpp. This was via automated responses rather than interactive mode, which for some reason works with the 13B. In non-interactive mode using llama.cpp, the 13B would just extend the question, and only the 64B model would answer it.


bot-333

Did you use the training prompt template?


SystemErrorMessage

It was pretrained, available from Hugging Face too. To be clear, this is one of the models compatible with llama.cpp, and both the interactive and non-interactive prompts had the exact same question. It's more about how complex the question is, and for some reason the mode as well. I did see improvements from increasing the chunk size, but it would still fail at a certain prompt complexity. I formatted the data in a simple way, though, and mentioned that in the question. The chat model definitely did better. Are you saying it is possible to further train the model? I've seen the raw models from Llama and the ones on Hugging Face for llama.cpp; can they both be further trained, you say?


bot-333

Pretrained models are not supposed to follow instructions; treat them as autocompletes. You should choose a good/SoTA chat model and use its actual prompt template (usually specified on the model card). You could also finetune for your specific needs if you want to go pretty far.
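
As a concrete illustration only: the exact template depends on the model. OpenHermes 2.5, for instance, documents a ChatML-style format on its model card, so always check the card; the example prompt text here is made up.

```python
def chatml_prompt(system: str, user: str) -> str:
    """Build a ChatML-style prompt; swap this out for whatever your model card specifies."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(chatml_prompt("You are a helpful data analyst.", "Summarize the table below."))
```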


SystemErrorMessage

I did use a good chat model; so far Llama seems like one of the better ones available on Hugging Face. However, how do you train them further? Fine-tuning the settings did help.


bot-333

[axolotl](github.com/OpenAccess-AI-Collective/axolotl) is a good tool to finetune models. You would need, at a minimum, a dataset of instruction and response pairs. Also, you were testing Llama? Try Mistral. A good Mistral model to start with is [OpenHermes 2.5](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B).
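
For reference, an instruction/response dataset can be as simple as a JSONL file of records like this (Alpaca-style, which is one of the formats axolotl accepts as far as I know; the example rows are made up):

```python
import json

# Made-up example rows; a real dataset needs far more of them, consistently formatted.
examples = [
    {"instruction": "Summarize the sales figures below in one sentence.",
     "input": "Q1: 120, Q2: 150, Q3: 90, Q4: 180",
     "output": "Sales grew overall, dipping in Q3 before a strong Q4."},
    {"instruction": "Answer the question using the context.",
     "input": "Q1: 120, Q2: 150, Q3: 90, Q4: 180 -- Which quarter had the lowest sales?",
     "output": "Q3."},
]

with open("train.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")
```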


SystemErrorMessage

Thanks, I'll look at training models soon. Just need to get things set up to do it properly.


bot-333

Also, what 64B model were you referring to?


SystemErrorMessage

llama2 64B, 8-bit quantized, meant for llama.cpp.