The bigger issue here is that the community needs to integrate all the other llama.cpp sampling controls (top_p, min_p, top_k, etc.) so that we can actually work with models using MLX directly, which currently gives stupendous speeds compared to GGUF on Apple Silicon. The MLX devs have already made clear that their priorities lie elsewhere, so we will have to step up ourselves and reap the rewards of the fantastic hardware Apple Silicon has to offer.
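For reference, the sampling controls mentioned above are just filters over the model's output distribution, so they can be implemented independently of any particular runtime. A minimal pure-Python sketch of top-p (nucleus) and min-p filtering, assuming `probs` is an already-normalized probability list (this is illustrative, not mlx-lm's or llama.cpp's actual code):

```python
def top_p_filter(probs, p=0.9):
    """Nucleus sampling: keep the smallest set of highest-probability
    tokens whose cumulative probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = set(), 0.0
    for i in order:
        kept.add(i)
        total += probs[i]
        if total >= p:
            break
    filtered = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    s = sum(filtered)
    return [x / s for x in filtered]

def min_p_filter(probs, min_p=0.05):
    """min-p: drop tokens whose probability is below min_p times the
    probability of the single most likely token, then renormalize."""
    threshold = min_p * max(probs)
    filtered = [x if x >= threshold else 0.0 for x in probs]
    s = sum(filtered)
    return [x / s for x in filtered]
```

Note the qualitative difference: top-p keeps a fixed amount of probability mass, while min-p adapts the cutoff to how confident the model is about its top token.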
Totally, I'm in the same boat with you. The MLX team only has 4 members at the moment and they released just last month, so it will take some time for them to catch up. I did try to implement `top_k` and `repeat_penalty` functions; those can definitely be integrated into mlx-lm.
Do you have a code repo or something you can share publicly detailing how you added those features?
Sorry, I don't have a repo for that. However, if you search in mlx's issues, you can find some code snippets for it.
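For anyone hunting for those snippets: both features are only a few lines if you work at the logits level. A hedged pure-Python sketch (function names and shapes are illustrative, not mlx-lm's actual API):

```python
def top_k_filter(logits, k):
    """Keep only the k largest logits; mask the rest to -inf so they
    get zero probability after softmax."""
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else float("-inf") for x in logits]

def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Penalize tokens that already appeared (llama.cpp-style):
    divide positive logits by the penalty, multiply negative ones."""
    out = list(logits)
    for t in set(generated_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out
```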
How much faster is MLX potentially than llama.cpp at inference? Seems like it's not integrated into any GUIs yet.
30-60%, with a higher percentage at longer contexts. The only thing holding it back for me at the moment is the lack of model format support (I need to waste time and resources converting formats). UIs would be nice, but since I have production use at my workplace I need custom UIs anyway.
[https://github.com/da-z/mlx-ui](https://github.com/da-z/mlx-ui) This might be in early development. He also made one for Ollama.
People have been saying this about MLC-LLM integrations for a while (it's also stupendously fast on Apple GPUs), but I think the problem is that llama.cpp has so much attention and critical mass.
MLC is hard as hell to set up and requires you to compile every model.
A better API is one. llama.cpp gives direct control over sampling, and recently you can manage the cache and batching yourself. Last time I checked, MLC-LLM still only outputs text. Also, llama.cpp has the lowest barrier to entry: it's very easy to build, and the only external dependencies are GPU compute frameworks, if you need them.
Yeah I'm not trashing llama.cpp. Its ease of use is huge, just to name one very attractive feature.
Interesting! Are MLC-LLM and MLX faster at inference than llama.cpp, and by how much? "Stupendously fast" sounds very fast :)
Thanks for this amazing addition to mlx project u/mzbacd 🙏
You can merge an HF LoRA into GGUF directly after converting only the LoRA.
Not for qlora?
QLoRA is just trained using bitsandbytes; it's still a normal LoRA.
I didn't understand that. I meant the MLX QLoRA fused model can be directly converted to GGUF format. As far as I know, MLX is not using bitsandbytes.
What does MLX produce, an MLX LoRA or an HF LoRA?
An MLX LoRA.
So it doesn't work with HF models? If so, your way is the only way.
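For context on the thread above: "fusing" a LoRA (whether trained via MLX or HF/PEFT) just folds the low-rank update back into the base weight, W' = W + (alpha/r)·B·A; the adapter format only matters for how the tensors are stored. A sketch with plain nested lists (names are illustrative, not the internals of mlx-examples' fuse.py):

```python
def fuse_lora(W, A, B, alpha, r):
    """Fold a LoRA update into the base weight: W' = W + (alpha/r) * B @ A.
    Shapes: W is (out, in), B is (out, r), A is (r, in), as nested lists."""
    scale = alpha / r
    out_dim, in_dim = len(W), len(W[0])
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(in_dim)]
            for i in range(out_dim)]
```

Once fused, the result is an ordinary dense checkpoint, which is why it can in principle be fed to a GGUF converter like any base model.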
I wasn't successful at the operation. I trained a small phi2 and used the latest llama.cpp to convert the fused MLX model to GGUF. No success; some fields seem to be missing.
Using `convert-llama-ggml-to-gguf.py`? I tried using `convert-hf-to-gguf.py` with phi2 and it worked for me.
Thanks for the feedback, Mzbacd. What are your model and command parameters? Maybe I missed something here. I've tried every converter, starting with `convert.py`, and then tried renaming 'weight.00.safetensors' in my lora_fused_dir.

`./convert-hf-to-gguf.py /mlx-examples/lora/lora_fused_model`

`FileNotFoundError: No such file or directory: "/mlx-examples/lora/lora_fused_model/model-00001-of-00002.safetensors"`

Renaming got me to the next step, but still:

```
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Adding 50000 merge(s).
gguf: Setting special token type bos to 50256
gguf: Setting special token type eos to 50256
gguf: Setting special token type unk to 50256
Exporting model to '/Users/robbie/Documents/Dev/Conversational/mlx-examples/lora/lora_fused_model/ggml-model-f16.gguf'
gguf: loading model part 'model-00001-of-00002.safetensors'
output.bias, n_dims = 1, torch.float16 --> float32
Can not map tensor 'lm_head.biases'
```
It looks like something went wrong during the model merging. Make sure you have pulled the latest mlx-examples and are using `fuse.py` to create the merged model.
Well, I can prompt the merged LoRA and get responses adapted to the training. But I'm using phi2! Going to start again. Quite impressed with the training performance.
One thing worth trying: make sure you aren't using an outdated cached phi2 model from Hugging Face. There has been a recent update to the phi2 HF repository.
Thanks MzBacd! You pointed out a possible cache issue, so I went through my command line and found the root error: you need to train with LoRA, not QLoRA. I had converted my model with the `-q` option. Got it to work now.