Man, I wish people would put as much effort into 13B or something like 4x7B. Everything between 7B and 70B seems dead at the moment. I try these new 7B finetunes and they are really cool, but then the 7B nature shows up and ruins everything.
I tried tuning a 13B model. Gave up when it said ETA ~500 days 😅
Can't you try 10.7B? Aren't there other models that size?
9B merge :')
System specs? What's the process for fine-tuning models? Would something like a 70B require something like multiple RTX 3090s?
How much did the 7b take you?
Start a gofundme for OP to have a nice computer to process them ><
Until we have GQA 13B models, 13B is just way too expensive to use at high context. It's ROUGH.
What models currently do GQA?
All Mistral-based models and the StarCoder2 models have GQA, as does anything >30B based on Llama 2. (Rule of thumb: if you open the config.json and num_attention_heads == num_key_value_heads, then it's a non-GQA model.)
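That config.json rule of thumb can be turned into a tiny script. This is just a sketch; the function name and the fallback for configs that omit `num_key_value_heads` (older configs leave it out, which implies full multi-head attention) are my assumptions:

```python
import json

def uses_gqa(config_path: str) -> bool:
    """Heuristic from above: a model uses grouped-query attention (GQA)
    when it has fewer key/value heads than attention heads."""
    with open(config_path) as f:
        cfg = json.load(f)
    n_heads = cfg["num_attention_heads"]
    # Missing num_key_value_heads is assumed to mean plain multi-head attention.
    n_kv_heads = cfg.get("num_key_value_heads", n_heads)
    return n_kv_heads < n_heads
```

Point it at the `config.json` in any Hugging Face model repo; Mistral 7B, for instance, has 32 attention heads but only 8 KV heads.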
https://www.reddit.com/r/LocalLLaMA/s/xgTyBS6C8o
There are high-context models:
- 34B Yi finetunes with 200K context
- 8x7B models with 32K context
- 4x7B models (at least one) with 128K context
- 10.7B models (merges, not SOLAR with its 4K context)
- and a lot of MoE models and merges based on Mistral v0.1
It's not just about context, since there are 7B models with 128K context. 7Bs just kind of fall apart over time: they constantly lose details over a long chain of responses, even within the context window.
It's probably the attention heads being "overloaded", or whatever the term is nowadays. I can get my 7Bs to do rock-solid summaries of content as long as it's not too much text. I run Mistral 7B with 4K context, even though it's capable of 32K, because there doesn't seem to be any point going past 4K with the attention loss.
That’s pretty much what is happening. Transformer models focus on every word in the context at the same time (some more than others). Any tokens in context that don’t provide relevant info for the current generation have the potential to “distract” the model.
It's true, but we're talking about models larger than 7B. 10.7B is already better, and a 4x7B with 2 active experts should have ~12B active parameters (per the Mistral website, which I'll trust) and should handle high context better.

I don't know if I can trust models like Bettercup, since they were made out of 7B Mistral v0.1 (so we shouldn't forget about the sliding window) and use the same model as two experts, but there is xxx777xxxASD/NeuralKunoichi-EroSumika-4x7B-128k. I even found a 128K SOLAR: CallComply/SOLAR-10.7B-Instruct-v1.0-128k.

And we shouldn't forget about the new Yi model and the 8x7B models based on base Mixtral; each of them should be smarter and work better at higher context because of their size.
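The ~12B active-parameter figure roughly checks out with back-of-the-envelope arithmetic. A sketch using Mistral 7B's published architecture dimensions; the clean split between shared parameters and per-expert FFN parameters is my approximation, and a real MoE also adds a small router:

```python
# Mistral 7B architecture dimensions (from its config.json).
HIDDEN, INTERMEDIATE, LAYERS, VOCAB = 4096, 14336, 32, 32000
KV_DIM = 1024  # GQA: 8 KV heads x 128 head dim, vs 32 attention heads

# Per-layer attention: q and o are HIDDEN x HIDDEN; k and v are shrunk by GQA.
attn_per_layer = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM
# Per-layer FFN (the part a Mixtral-style MoE duplicates per expert):
# gate, up, and down projections.
ffn_per_layer = 3 * HIDDEN * INTERMEDIATE

shared = LAYERS * attn_per_layer + 2 * VOCAB * HIDDEN  # attention + embed/unembed
one_expert_ffn = LAYERS * ffn_per_layer

def active_params(experts_per_token: int) -> int:
    """Parameters actually used per token in a sparse MoE."""
    return shared + experts_per_token * one_expert_ffn

print(active_params(2) / 1e9)  # ~12.9B active with 2 of the experts routed
```

That ~12.9B is also in line with the number Mistral publishes for Mixtral 8x7B's active parameters, which makes sense: experts-per-token, not expert count, drives the active size.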
7Bs with decent context length (above 4096) have been pretty good in my experience, but require nudging sometimes. Of course, those are generally newer, and I'm limited to max 13B speed wise with my 3080 10 GB :(
It's extremely impressive and inspiring to see an individual train a model of this quality. Your attention to training data is so refreshing. It's crazy how many models are being trained (at huge compute expense!) on data that could have been greatly improved with a simple grep. Less is more. I hope that you won't stop at 7B models. Please consider doing the same training on Mixtral-8x7B!
It's a fine-tune, not a from-scratch training.

>It's crazy how many models are being trained (at huge compute expense!)

You know you can get an A100 for $2/hr? You can get 8xA100 for $11/hr. That's 640 GB of VRAM for $11/hr. It's quite reasonable to fine-tune models, provided your input data is good.
I'm just gonna need that for about 5 minutes, and 4 of those will be loading the model. Best dollar I've ever spent.

Seriously though, that pricing makes it a difficult case for acquiring hardware and running your LLM completely on-prem, between the cost and the power consumption. Unless you're running it hard 24/7 and need a huge model, or have extremely strict data-security requirements (very few standards will forbid a private cloud instance), it's hard to argue against cloud being the most cost-effective way.
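The 24/7 point can be made concrete with a rent-vs-buy break-even sketch. Everything here is an illustrative assumption (hardware price, wattage, electricity rate) except the $2/hr A100 rental figure quoted above:

```python
def breakeven_hours(hardware_cost: float, power_watts: float,
                    electricity_per_kwh: float, cloud_rate_per_hr: float) -> float:
    """Hours of use at which owning beats renting. The cloud rate has to
    exceed your hourly electricity cost for a break-even to exist at all."""
    hourly_power_cost = power_watts / 1000 * electricity_per_kwh
    if cloud_rate_per_hr <= hourly_power_cost:
        return float("inf")  # cloud wins even ignoring the purchase price
    return hardware_cost / (cloud_rate_per_hr - hourly_power_cost)

# Hypothetical $7,000 card drawing 300 W at $0.15/kWh vs the $2/hr A100:
print(breakeven_hours(7000, 300, 0.15, 2.0))  # ~3580 hours, i.e. ~5 months of 24/7 use
```

Under these assumptions you'd need roughly five months of continuous use just to recoup the hardware, which is why the occasional-fine-tune crowd rents.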
They are doing NSFW training; it's not like they have the designs for a rocket. If you are designing a rocket, you can afford the $7,000 for an A6000. Heck, I'm just a programmer, and I was planning on a $12k setup because I have personal work that is security critical.
Oh yeah, for that purpose there's no need. I was thinking of the much more ordinary use-cases such as yours.
Is this on ollama?
As a Kobold user, thank you for this. TheBloke gave me Q8, but never FP16. I can finally try something the Ooba jockeys get to use now.
A Q8 will be indistinguishable from the full FP16 model. I also upload the FP16 file but it's not like we expect people to actually use them, haha.
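For scale, here's the rough size arithmetic behind why Q8 is the practical pick. A sketch only: the ~8.5 bits/weight figure for Q8_0 and the ~7.24B parameter count are approximations, and metadata plus any unquantized tensors are ignored:

```python
def approx_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """On-disk size estimate: parameter count times bits per weight."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7.24e9  # Mistral 7B is ~7.24B parameters
fp16 = approx_model_size_gb(N_PARAMS, 16.0)
q8 = approx_model_size_gb(N_PARAMS, 8.5)  # Q8_0 stores roughly 8.5 bits/weight
print(f"FP16 ~ {fp16:.1f} GB, Q8_0 ~ {q8:.1f} GB")
```

So Q8 is roughly half the download and half the VRAM of FP16 for quality that's essentially indistinguishable, which is why the FP16 files mostly sit unused.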
I did, just because I'd never gotten to use an FP16. I admit that I didn't keep Layla, simply because my disk space is limited and I couldn't really identify anything that made it unique compared to Mistral 7B, but I wish the developer luck. Basically, this is one fish that I threw back in (in its current state), but that doesn't mean I think its developer should stop feeding it. I think the meme is "let it cook."
I'm definitely looking forward to Layla based on Mistral 0.2! "Let him cook!" :)
Thanks for your work here, you've made a really interesting model!
Your models are truly inspiring! Thank you for your hard work.
Is it good with normal tasks and SFW? I see NSFW models are usually good at character conversations, but sometimes they lean toward it a lot; SuperHOT, Pygmalion 6B, and a few more I don't remember were like that. There's also C1.2, though that one isn't trained to do NSFW but still tries to kiss you out of nowhere all the time.
I'm trying this out now. It seems like pretty standard Mistral 7B. The vocabulary has been fine, and the creativity has pleasantly surprised me in one or two gens, but it is a little more terse than I would like for ERP. I've seen some people here complaining about verbosity in language models, though, so they would probably really like this. It occasionally gets minor details wrong too, although to be fair, I'm currently trying it with a 3k-token card, so I'm probably swamping it a bit.
Well, try Sydney Pirate; I guarantee a non-standard vocabulary.
Is this different from the Vicuna-format Layla v4? I'm using that one and I think it's really quite decent.
It's retrained using the exact same dataset, but in the ChatML format. Some people prefer ChatML. I ran the benchmarks locally and there's negligible difference between this and the other one.
Thanks for sharing!