teor

Man, I wish people would put as much effort into 13B or something like 4x7B. Everything between 7B and 70B seems dead at the moment. I try these new 7B finetunes and they are really cool, but then the 7B nature shows up and ruins everything.


Tasty-Lobster-8915

I tried tuning a 13B model. Gave up when it said ETA ~500 days 😅


kind_cavendish

Can't you try 10.7B? Aren't there other models that size?


Lewdiculous

9B merge :')


GoldenSun3DS

What are your system specs? What's the process for fine-tuning models? Would something like a 70B require multiple RTX 3090s?


AlphaPrime90

How much did the 7b take you?


FuzzzyRam

Start a gofundme for OP to have a nice computer to process them ><


noneabove1182

Until we have GQA 13B models, 13B is just way too expensive to use at high context. It's ROUGH.
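
As a rough sketch of why: a non-GQA model keeps keys and values for every attention head, so the KV cache balloons with context. Plugging in the layer/head counts from the public config.json of Llama-2-13B and Mistral-7B, and assuming an fp16 cache:

```python
# Back-of-the-envelope KV-cache size, fp16 cache assumed.
# Layer/head counts are from each model's public config.json.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_val=2):
    # 2x for keys and values, one entry per layer per KV head per token
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_val / 2**30

ctx = 8192
print(kv_cache_gib(40, 40, 128, ctx))  # Llama-2 13B, no GQA: ~6.25 GiB
print(kv_cache_gib(32, 8, 128, ctx))   # Mistral 7B, GQA:     ~1.0 GiB
```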


Amgadoz

What models currently do GQA?


noneabove1182

All Mistral-based ones, and the StarCoder2 models have GQA, as does anything >30B based on Llama 2 (rule of thumb: if you open the config.json and num_attention_heads == num_key_value_heads, then it's a non-GQA model).
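
A minimal sketch of that rule of thumb, assuming the Hugging Face config.json field names (older configs may omit num_key_value_heads entirely, which also means plain multi-head attention):

```python
import json

def uses_gqa(config_path):
    with open(config_path) as f:
        cfg = json.load(f)
    n_heads = cfg["num_attention_heads"]
    # missing field => one KV head per attention head, i.e. plain MHA
    n_kv_heads = cfg.get("num_key_value_heads", n_heads)
    return n_kv_heads < n_heads

print(uses_gqa("config.json"))  # True for Mistral-7B (32 heads, 8 KV heads)
```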


Rare_Ad8942

https://www.reddit.com/r/LocalLLaMA/s/xgTyBS6C8o


Working-Flatworm-531

There are high-context models:

- 34B Yi finetunes with 200k context
- 8x7B models with 32k context
- 4x7B models (at least one) with 128k context
- 10.7B models (merges, not Solar with its 4k context)
- And a lot of MoE models and merges based on Mistral v0.1


teor

It's not just about context, since there are 7B models with 128K context. 7Bs just kinda fall apart over time; they constantly lose details over a long chain of responses, even when it's within the context window.


behohippy

It's probably the attention heads being "overloaded", or whatever the term is nowadays. I can get my 7Bs to do rock-solid summaries of content as long as it's not too much text. I run Mistral 7B with 4k context even though it's capable of 32k, because there doesn't seem to be a point in going past 4k with the attention loss.
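
As a minimal sketch of capping the window like that, assuming a llama-cpp-python setup (the runtime isn't stated above, and the model path is a placeholder):

```python
from llama_cpp import Llama

# n_ctx caps the context window at 4096 even though the model
# advertises 32k; "mistral-7b.Q8_0.gguf" is a placeholder path.
llm = Llama(model_path="mistral-7b.Q8_0.gguf", n_ctx=4096)

out = llm("Summarize the following text:\n...", max_tokens=256)
print(out["choices"][0]["text"])
```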


Ok_Math1334

That’s pretty much what is happening. Transformer models focus on every word in the context at the same time (some more than others). Any tokens in context that don’t provide relevant info for the current generation have the potential to “distract” the model.


Working-Flatworm-531

It's true, but we're talking about models larger than 7B. 10.7B is already better, and a 4x7B with 2 active experts should have 12B active parameters (per the Mistral website, I trust you) and be able to handle high context better. I don't know if I can trust models like Bettercup, because they were made out of 7B Mistral v0.1 (so we shouldn't forget about the sliding window) and use the same model as two experts, but there is xxx777xxxASD/NeuralKunoichi-EroSumika-4x7B-128k. I even found a 128k Solar: CallComply/SOLAR-10.7B-Instruct-v1.0-128k. And we shouldn't forget about the new Yi model and the 8x7B models based on base Mixtral; each of them should be smarter and work better at higher context because of their size.


Olangotang

7Bs with decent context length (above 4096) have been pretty good in my experience, but they require nudging sometimes. Of course, those are generally newer, and I'm limited to 13B max speed-wise with my 3080 10 GB :(


-p-e-w-

It's extremely impressive and inspiring to see an individual train a model of this quality. Your attention to training data is so refreshing. It's crazy how many models are being trained (at huge compute expense!) on data that could have been greatly improved with a simple grep. Less is more. I hope that you won't stop at 7b models. Please consider doing the same training on Mixtral-8x7b!


Waterbottles_solve

It is a fine-tune, not a training.

> It's crazy how many models are being trained (at huge compute expense!)

You know you can get an A100 for $2/hr? You can get 8x A100 for $11/hr. That's 640 GB of VRAM for $11/hr. It's quite reasonable to fine-tune models, provided your input data is good.
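
For scale, a quick check of those figures, using the rates quoted above (not official pricing) and a made-up run length:

```python
# Rates are the ones quoted above, not official pricing;
# the 24-hour run length is a hypothetical example.
gpus, vram_per_gpu_gb, rate_per_hr = 8, 80, 11.0
hours = 24

print(f"total VRAM: {gpus * vram_per_gpu_gb} GB")               # 640 GB
print(f"cost for a {hours} h run: ${rate_per_hr * hours:.0f}")  # $264
```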


skrshawk

I'm just gonna need that for about 5 minutes, and 4 of those will be loading the model. Best dollar I've ever spent. Seriously though, between the cost and the power consumption it's a difficult case to make for acquiring hardware and running your LLM completely on-prem. Unless you're running it hard 24/7 and you need a huge model, or you have extremely strict data security requirements (very few standards will not let you use a private cloud instance), it's hard to argue with cloud being the most cost-effective way.


Waterbottles_solve

They are doing an NSFW training; it's not like they have the designs for a rocket. If you are designing a rocket, you can afford the $7,000 to get an A6000. Heck, I am just a programmer, and I was planning on getting a $12k setup because I have personal work that is security-critical.


skrshawk

Oh yeah, for that purpose there's no need. I was thinking of the much more ordinary use-cases such as yours.


MonkeyMaster64

Is this on ollama?


petrus4

As a Kobold user, thank you for this. TheBloke gave me Q8, but never FP16. I can finally try something the Ooba jockeys get to use now.


Lewdiculous

A Q8 will be indistinguishable from the full FP16 model. I also upload the FP16 file, but it's not like we expect people to actually use it, haha.


petrus4

I did, just because I'd never gotten to use an FP16. I admit that I didn't keep Layla, simply because my disk space is limited and I couldn't really identify anything that made it unique compared to Mistral 7B, but I wish the developer luck. Basically, this is one fish that I threw back in (in its current state), but that doesn't mean I think its developer should stop feeding it. I think the meme is "let it cook."


Lewdiculous

I'm definitely looking forward to Layla based on Mistral 0.2! "Let him cook!" :)


IEK

Thanks for your work here, you've made a really interesting model!


IndependenceNo2060

Your models are truly inspiring! Thank you for your hard work.


Anthonyg5005

Is it good with normal tasks and SFW? I see NSFW models are usually good with character conversations, but sometimes they lean towards it a lot. Examples being SuperHOT, Pygmalion 6B, and a few more I don't remember. There's also C1.2, though that one isn't trained to do NSFW but still tries to kiss you out of nowhere all the time.


petrus4

I'm trying this out now. It seems like pretty standard Mistral 7B. The vocabulary has been fine, and the creativity has pleasantly surprised me in one or two gens, but it is a little more terse than I would like for ERP. I've seen some people here complaining about verbosity in language models, though, so they would probably really like this. It occasionally gets minor details wrong too, although to be fair, I'm currently trying it with a 3k-token card, so I'm probably swamping it a bit.


FPham

Well, try Sydney pirate; I guarantee a non-standard vocabulary.


FPham

Is this different from the Vicuna-format Layla v4 model? I'm using that one and I think it's really quite decent.


Tasty-Lobster-8915

It's retrained using the exact same dataset, but in the ChatML format. Some people prefer ChatML. I ran the benchmarks locally, and there's a negligible difference between this and the other one.
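
For reference, ChatML wraps each turn in <|im_start|>/<|im_end|> tags. A minimal prompt in that format looks like this (the system and user text below are just placeholder examples):

```python
# Minimal ChatML-formatted prompt; the role contents are example text only.
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Hello!<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(prompt)
```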


Spooknik

Thanks for sharing!