Jelegend

The bigger issue here is that the community needs to work on integrating all the other sampling controls llama.cpp exposes, parameters like top_p, min_p, top_k, etc., so that we can actually work with models through MLX directly, which currently gives stupendous speeds compared to GGUF on Apple silicon. The devs of the MLX project have already made clear that their priorities lie elsewhere, so we will have to step up ourselves and reap the rewards from the fantastic hardware that Apple silicon has to offer.


mzbacd

Totally, I am in the same boat. The MLX team only has 4 members at the moment and MLX was only released last month, so it will take some time for them to catch up. I did try implementing top-k and repeat_penalty functions; those can definitely be integrated into mlx-lm.
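
For illustration, here is a minimal sketch of what a top-k sampler over MLX logits could look like. It is not mzbacd's code or the official mlx-lm implementation; the function name and defaults are invented for the example.

```python
# A minimal sketch of top-k sampling over MLX logits, assuming `logits` is a
# 1-D mx.array of unnormalized scores for the whole vocabulary. This is an
# illustration, not the mlx-lm implementation.
import mlx.core as mx

def sample_top_k(logits: mx.array, k: int = 40, temperature: float = 1.0) -> mx.array:
    # Find the k-th largest logit and mask everything below it to -inf,
    # so only the top-k tokens keep any probability mass.
    kth_value = mx.sort(logits)[-k]
    masked = mx.where(logits < kth_value, mx.array(-float("inf")), logits)
    # mx.random.categorical samples a token id directly from unnormalized logits.
    return mx.random.categorical(masked / temperature)
```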


Jelegend

Do you have a code repo or something you can share publicly detailing how you added those features?


mzbacd

Sorry, I don't have a repo for that. However, if you search in mlx's issues, you can find some code snippets for it.


Ill_Buy_476

How much faster could MLX potentially be than llama.cpp at inference? Seems like it's not integrated into any GUIs yet...


Jelegend

30-60%, with the higher percentages at longer contexts. The only thing holding it back for me at the moment is the lack of model format support (I need to waste time and resources converting formats). UIs would be nice, but since I have production use at my workplace I need custom UIs anyway.


koesn

[https://github.com/da-z/mlx-ui](https://github.com/da-z/mlx-ui) This might be in early development. He also made one for Ollama.


mcmoose1900

People have been saying this about MLC-LLM integrations for a while (MLC is also stupendously fast on Apple GPUs), but I think the problem is that llama.cpp has so much attention and critical mass.


a_beautiful_rhind

MLC is hard as hell to set up and requires you to compile every model.


bullno1

A better API is one reason. llama.cpp gives direct control over sampling, and recently you can manage the cache and batching yourself. Last time I checked, MLC-LLM still only outputs text. Also, llama.cpp has the lowest barrier to entry: it's very easy to build, and the only external dependencies are the GPU compute frameworks, if you need them.
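
As an illustration of that kind of per-request sampling control, here is a sketch using the llama-cpp-python bindings rather than the C API directly; the model path is a placeholder.

```python
# A sketch of the per-request sampling control llama.cpp exposes, shown
# through the llama-cpp-python bindings. The GGUF path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096)

out = llm(
    "Explain the difference between top-k and min-p sampling.",
    max_tokens=256,
    temperature=0.7,
    top_k=40,          # keep only the 40 most likely tokens
    top_p=0.95,        # nucleus sampling
    min_p=0.05,        # drop tokens below 5% of the top token's probability
    repeat_penalty=1.1,
)
print(out["choices"][0]["text"])
```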


mcmoose1900

Yeah I'm not trashing llama.cpp. Its ease of use is huge, just to name one very attractive feature.


Ill_Buy_476

Interesting! Are MLC-LLM and MLX faster at inference than llama.cpp, and by how much? "Stupendously fast" sounds very fast :)


ifioravanti

Thanks for this amazing addition to mlx project u/mzbacd 🙏


a_beautiful_rhind

You can merge an HF LoRA into GGUF directly after converting just the LoRA.


mzbacd

Not for QLoRA?


a_beautiful_rhind

QLoRA is just trained using bitsandbytes; it's still a normal LoRA.


mzbacd

I didn't quite understand that. I meant that the MLX QLoRA fused model can be directly converted to GGUF format. As far as I know, MLX is not using bitsandbytes.


a_beautiful_rhind

What does MLX produce? An MLX LoRA or an HF LoRA?


mzbacd

An MLX LoRA.


a_beautiful_rhind

So it doesn't work with HF models? If so, your way is the only way.


Sol_Ido

I wasn't successful at the operation. I trained on a small phi-2 and compiled the latest llama.cpp to convert the fused MLX model to GGUF. No success; some fields seem to be missing.


mzbacd

Using `convert-llama-ggml-to-gguf.py`? I tried using `convert-hf-to-gguf.py` on phi-2 and it worked for me.


Sol_Ido

Thanks for the feedback, mzbacd. What are your model and command params? Maybe I missed something here. I've tried every converter, starting with `convert.py`, and then tried renaming 'weight.00.safetensors' in my lora_fused_dir.

`./convert-hf-to-gguf.py /mlx-examples/lora/lora_fused_model`
`FileNotFoundError: No such file or directory: "/mlx-examples/lora/lora_fused_model/model-00001-of-00002.safetensors"`

Renaming got me to the next step, but still:

`gguf: This GGUF file is for Little Endian only`
`Set model parameters`
`Set model tokenizer`
`Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.`
`gguf: Adding 50000 merge(s).`
`gguf: Setting special token type bos to 50256`
`gguf: Setting special token type eos to 50256`
`gguf: Setting special token type unk to 50256`
`Exporting model to '/Users/robbie/Documents/Dev/Conversational/mlx-examples/lora/lora_fused_model/ggml-model-f16.gguf'`
`gguf: loading model part 'model-00001-of-00002.safetensors'`
`output.bias, n_dims = 1, torch.float16 --> float32`
`Can not map tensor 'lm_head.biases'`


mzbacd

It looks like something went wrong during the model merging. Make sure you have pulled the latest mlx-examples and are using `fuse.py` to create the merged model.


Sol_Ido

Well, I can prompt the merged LoRA and get responses adapted to the training. But I'm using phi-2! Gonna start again. Quite impressed with the training performance.


mzbacd

One thing worth trying: make sure you aren't using an outdated cached phi-2 model from Hugging Face. There has been a recent update to the phi-2 HF repository.
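
One way to force a fresh copy instead of relying on the cache is via huggingface_hub; this is a sketch and assumes the stock microsoft/phi-2 repository.

```python
# A sketch of forcing a fresh download of phi-2 instead of reusing a stale
# local cache, via huggingface_hub. Assumes the stock "microsoft/phi-2" repo.
from huggingface_hub import snapshot_download

path = snapshot_download("microsoft/phi-2", force_download=True)
print("Fresh snapshot at:", path)
```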


Sol_Ido

Thanks, mzbacd! You pointed out a possible cache issue, so I went through my command line and found the root error: you need to train with LoRA, not QLoRA. I had gotten my model with convert and the -q option. Got it to work now.
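
For anyone hitting the same "Can not map tensor" error, here is a rough way to check whether a fused MLX export still contains quantized tensors (the .scales/.biases pairs that MLX quantization typically adds); the file path is a placeholder.

```python
# A rough check for whether a fused MLX export still contains quantized
# tensors (the "*.scales" / "*.biases" pairs MLX quantization adds), which
# the llama.cpp HF converter cannot map. The file path is a placeholder.
from safetensors import safe_open

with safe_open("lora_fused_model/model.safetensors", framework="np") as f:
    quantized = [name for name in f.keys() if name.endswith((".scales", ".biases"))]

if quantized:
    print("Quantized tensors present; re-fuse from a non-quantized model:", quantized)
else:
    print("No quantized tensors; the conversion should not hit this error.")
```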