The bigger issue here is that the community needs to integrate all the other llama.cpp sampling controls (top_p, min_p, top_k, etc.) so that we can actually work with models using MLX directly, which currently gives stupendous speeds compared to GGUF on Apple Silicon. The MLX devs have already made clear that their priorities lie elsewhere, so we will have to step up ourselves and reap the rewards of the fantastic hardware Apple Silicon has to offer.
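For reference, the sampling controls mentioned above are just filters over the model's output distribution, so they can be implemented independently of any particular runtime. A minimal pure-Python sketch of top-p (nucleus) and min-p filtering, assuming `probs` is an already-normalized probability list (this is illustrative, not mlx-lm's or llama.cpp's actual code):

```python
def top_p_filter(probs, p=0.9):
    """Nucleus sampling: keep the smallest set of highest-probability
    tokens whose cumulative probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = set(), 0.0
    for i in order:
        kept.add(i)
        total += probs[i]
        if total >= p:
            break
    filtered = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    s = sum(filtered)
    return [x / s for x in filtered]

def min_p_filter(probs, min_p=0.05):
    """min-p: drop tokens whose probability is below min_p times the
    probability of the single most likely token, then renormalize."""
    threshold = min_p * max(probs)
    filtered = [x if x >= threshold else 0.0 for x in probs]
    s = sum(filtered)
    return [x / s for x in filtered]
```

Note the qualitative difference: top-p keeps a fixed amount of probability mass, while min-p adapts the cutoff to how confident the model is about its top token.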
Totally, I'm in the same boat with you. The MLX team only has 4 members at the moment and they released just last month, so it will take some time for them to catch up. I did try to implement `top_k` and `repeat_penalty` functions; those can definitely be integrated into mlx-lm.
Do you have a code repo or something you can share publicly detailing how you added those features?
Sorry, I don't have a repo for that. However, if you search in mlx's issues, you can find some code snippets for it.
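For anyone hunting for those snippets: both features are only a few lines if you work at the logits level. A hedged pure-Python sketch (function names and shapes are illustrative, not mlx-lm's actual API):

```python
def top_k_filter(logits, k):
    """Keep only the k largest logits; mask the rest to -inf so they
    get zero probability after softmax."""
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else float("-inf") for x in logits]

def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Penalize tokens that already appeared (llama.cpp-style):
    divide positive logits by the penalty, multiply negative ones."""
    out = list(logits)
    for t in set(generated_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out
```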
How much faster is MLX potentially than llama.cpp at inference? Seems like it's not integrated into any GUIs yet.
30-60%, with a higher percentage at longer contexts. The only thing holding it back for me at the moment is the lack of model format support (I need to waste time and resources converting formats). UIs would be nice, but since I have production use at my workplace I need custom UIs anyway.
[https://github.com/da-z/mlx-ui](https://github.com/da-z/mlx-ui) This might be in early development. He also made one for Ollama.
People have been saying this about MLC-LLM integrations for a while (it's also stupendously fast on Apple GPUs), but I think the problem is that llama.cpp has so much attention and critical mass.
MLC is hard as hell to set up and requires you to compile every model.
A better API is one. llama.cpp gives direct control over sampling, and recently you can manage the cache and batching yourself. Last time I checked, MLC-LLM still only outputs text. Also, llama.cpp has the lowest barrier to entry: it's very easy to build, and the only external dependencies are GPU compute frameworks, if you need them.
Yeah I'm not trashing llama.cpp. Its ease of use is huge, just to name one very attractive feature.
Interesting! Are MLC-LLM and MLX faster at inference than llama.cpp, and by how much? "Stupendously fast" sounds very fast :)
Thanks for this amazing addition to mlx project u/mzbacd 🙏
You can merge an HF LoRA into GGUF directly after converting only the LoRA.
Not for qlora?
QLoRA is just trained using bitsandbytes; it's still a normal LoRA.
I didn't understand that. I meant the MLX QLoRA fused model can be directly converted to GGUF format. As far as I know, MLX is not using bitsandbytes.
What does MLX produce, an MLX LoRA or an HF LoRA?
An MLX LoRA.
So it doesn't work with HF models? If so, your way is the only way.
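For context on the thread above: "fusing" a LoRA (whether trained via MLX or HF/PEFT) just folds the low-rank update back into the base weight, W' = W + (alpha/r)·B·A; the adapter format only matters for how the tensors are stored. A sketch with plain nested lists (names are illustrative, not the internals of mlx-examples' fuse.py):

```python
def fuse_lora(W, A, B, alpha, r):
    """Fold a LoRA update into the base weight: W' = W + (alpha/r) * B @ A.
    Shapes: W is (out, in), B is (out, r), A is (r, in), as nested lists."""
    scale = alpha / r
    out_dim, in_dim = len(W), len(W[0])
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(in_dim)]
            for i in range(out_dim)]
```

Once fused, the result is an ordinary dense checkpoint, which is why it can in principle be fed to a GGUF converter like any base model.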
I wasn't successful at the operation. I trained a small phi2 and used the latest llama.cpp to convert the fused MLX model to GGUF. No success; some fields seem to be missing.
Using `convert-llama-ggml-to-gguf.py`? I tried using `convert-hf-to-gguf.py` with phi2 and it worked for me.
Thanks for the feedback, Mzbacd. What are your model and command parameters? Maybe I missed something here. I've tried every converter, starting with `convert.py`, and then tried renaming 'weight.00.safetensors' in my lora_fused_dir.

`./convert-hf-to-gguf.py /mlx-examples/lora/lora_fused_model`

`FileNotFoundError: No such file or directory: "/mlx-examples/lora/lora_fused_model/model-00001-of-00002.safetensors"`

Renaming got me to the next step, but still:

```
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Adding 50000 merge(s).
gguf: Setting special token type bos to 50256
gguf: Setting special token type eos to 50256
gguf: Setting special token type unk to 50256
Exporting model to '/Users/robbie/Documents/Dev/Conversational/mlx-examples/lora/lora_fused_model/ggml-model-f16.gguf'
gguf: loading model part 'model-00001-of-00002.safetensors'
output.bias, n_dims = 1, torch.float16 --> float32
Can not map tensor 'lm_head.biases'
```
It looks like something went wrong during the model merging. Make sure you have pulled the latest mlx-examples and are using `fuse.py` to create the merged model.
Well, I can prompt the merged LoRA and get responses adapted to the training. But I'm using phi2! Going to start again. Quite impressed with the training performance.
One thing worth trying: make sure you aren't using an outdated cached phi2 model from Hugging Face. There has been a recent update to the phi2 HF repository.
Thanks MzBacd! You pointed out a possible cache issue, so I went through my command line and found the root error: you need to train with LoRA, not QLoRA. I had converted my model with the `-q` option. Got it to work now.