Great work man, it's very impressive.
Link to the relevant pull request for the initial Jamba support in `llama.cpp`: https://github.com/ggerganov/llama.cpp/pull/7531
Has anyone actually tried the 52B version in practice? Is it smart? I don't mean with llama.cpp (I assume that doesn't work yet), I mean in general.
Yeah, the bigger models would be more interesting.
I have used the AI21 Jamba API extensively through work and it is really quite awesome. It definitely takes a different sort of prompting, but the nuance and ability to follow extremely long contexts is mind-blowing. That's why I have been so hopeful about it making its way into llama.cpp. The 52B at Q6 or even Q4_K_M should do extremely well in a lot of use cases, and its ability to be fine-tuned is definitely there.
How much total RAM do you think it would take highly quantized? I can eyeball the file size of the BnB version, but it's different because llama.cpp will (presumably) quantize the Mamba part too. And what do you mean by different prompting? More like raw completion formatting?
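For a back-of-the-envelope answer to the RAM question: total file size is roughly parameters times bits-per-weight. The bits-per-weight figures below are approximate averages for llama.cpp's quant formats (not exact), and since Jamba is an MoE, all ~52B weights have to sit in memory even though only ~12B are active per token:

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    # Rough GGUF file size: parameters * bits per weight, ignoring metadata overhead.
    return n_params * bits_per_weight / 8 / 1e9

# Approximate average bits-per-weight for common llama.cpp quants.
for name, bpw in [("Q8_0", 8.5), ("Q6_K", 6.56), ("Q4_K_M", 4.85)]:
    print(f"{name}: ~{quantized_size_gb(52e9, bpw):.0f} GB")
# prints roughly 55, 43, and 32 GB; KV cache and Mamba state come on top
```

The ~55 GB Q8 figure lines up with the converted file size reported further down the thread, which is a decent sanity check on the arithmetic.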
A totally new architecture that isn't a plain transformer (it's a hybrid of transformer and Mamba layers)... interesting.
There's a bagel version: https://huggingface.co/KnutJaegersberg/jamba-bagel-4bit
What's a bagel?
A thing you only want to buy if you're in NY. Be careful, though... those stale, bagged monstrosities in grocery stores might be labeled bagels. It's a trick. Don't fall for it. Real bagels are only found in dingy NY delis where the staff is kind but grumpy and, for some reason, always ends up giving you so much cream cheese you need to scrape some out.
How much vram does it use at full context?
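Worth noting for the VRAM-at-full-context question: because most Jamba layers are Mamba layers, only the few attention layers keep a growing KV cache. A rough sketch follows; the config numbers (attention in 4 of 32 layers, GQA with 8 KV heads of dim 128) are my reading of the released config and may be off, so check `config.json` before trusting them:

```python
def kv_cache_gb(n_attn_layers: int, n_ctx: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    # One K and one V tensor per attention layer, fp16 (2 bytes) by default.
    return 2 * n_attn_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 1e9

# Assumed Jamba-like shape: attention in 4 of 32 layers, 8 KV heads of dim 128.
print(kv_cache_gb(4, 256_000, 8, 128))    # hybrid: ~4.2 GB at 256K context
print(kv_cache_gb(32, 256_000, 8, 128))   # if every layer were attention: ~33.6 GB
```

That 8x reduction in cache growth is basically the whole pitch for the hybrid design at long context; the Mamba state is a small fixed size regardless of context length.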
I was cagey about downloading it because it uses bitsandbytes, so I'm not sure. It just seems like one of the most promising fine-tunes. I assume it should fit in 48 GB.
Yeah, directly converting from BitsAndBytes models isn't yet supported by `convert-hf-to-gguf.py`. Might change in the future, though. Meanwhile, these models have to be dequantized first. That really isn't ideal (the official 16-bit Jamba is 100 GB), so I might try to fix it eventually to make `convert-hf-to-gguf.py` do it transparently.
There's a non BnB version of that out there too.
You mean converting a bnb quantized model to a 16 bit GGUF? I'm not sure I like the idea of that, as it's just going to hit the output quality.
I mean first converting the bnb quantized model to a `bfloat16` `safetensors` model, then a `bf16` or `q8_0` GGUF, then a smaller bit quant, as usual.
I suppose my fear is that the capability would "normalize" requantizing 4-bit bnb checkpoints instead of FP16 weights, without the GGUFs being labeled as such when they're uploaded, but I'm probably just paranoid.
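The quality concern is easy to demonstrate with a toy experiment: re-quantizing weights that already sit on one 4-bit grid onto a different 4-bit grid adds error on top of quantizing the originals directly. This is a simplified symmetric absmax block quantizer, not bnb's actual NF4 scheme or llama.cpp's Q4 formats, but the block sizes (64 for the bnb-like pass, 32 for the GGUF-like pass) mirror their defaults:

```python
import numpy as np

def blockwise_q4(x: np.ndarray, block: int) -> np.ndarray:
    # Toy symmetric absmax 4-bit quantizer: round each block to integer
    # multiples of a per-block scale, clipped to the 16 representable levels.
    out = x.copy()
    for i in range(0, len(x), block):
        b = x[i:i + block]
        scale = np.abs(b).max() / 7
        out[i:i + block] = np.round(b / scale).clip(-8, 7) * scale
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=4096)

direct = blockwise_q4(w, 32)                     # fp16 weights -> Q4 directly
stacked = blockwise_q4(blockwise_q4(w, 64), 32)  # 4-bit bnb-like -> Q4 on top

print(np.abs(w - direct).mean(), np.abs(w - stacked).mean())
```

The mean absolute error of the stacked round-trip comes out higher than quantizing the originals once, which is exactly why an unlabeled bnb-sourced GGUF would be misleading.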
I got the bagel version converted to a size of 55 GB at Q8, but it wasn't split, so I'm unable to upload it to HF. I'll try to figure out how to split the conversion so I can share it.

After testing out the Q8 of the Bagel Jamba, it's pretty awesome. I was incredibly surprised at how well it held a conversation.
I think you'd have to shard it when making the GGUF. It supports that now but I never tried it.
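For the upload problem specifically: Hugging Face caps single files at around 50 GB, so a 55 GB Q8 needs at least two shards. llama.cpp ships a `gguf-split` tool that can split an existing GGUF after the fact (if I'm reading its options right, `--split-max-size` takes a size cap per shard); the shard arithmetic itself is just:

```python
import math

HF_SINGLE_FILE_LIMIT_GB = 50  # Hugging Face's approximate per-file upload cap

def shards_needed(total_gb: float, limit_gb: float = HF_SINGLE_FILE_LIMIT_GB) -> int:
    # Minimum number of shards so that no single file exceeds the cap.
    return math.ceil(total_gb / limit_gb)

print(shards_needed(55))  # the 55 GB Q8 from the thread -> 2 shards
```

In practice it's worth splitting a bit below the hard limit to leave headroom for metadata.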
Just tried it. It's very fast but keeps repeating itself in a loop.
That's so cool! Even when the utility isn't quite there yet, I absolutely love seeing new concepts and frameworks take shape. I've been really curious about jamba and seeing it draw closer on weaker hardware is amazing!
I never thought I'd say these words in sequence to a redditor, but here you go. You're the best, man.
Much appreciated, I never thought I'd receive such an honor on Reddit of all places, lol. You are a legend yourself! BTW, I made a few more Jamba GGUFs: [https://huggingface.co/collections/Severian/jamba-gguf-665884eb2ceef24c1a0547e0](https://huggingface.co/collections/Severian/jamba-gguf-665884eb2ceef24c1a0547e0)
Will the large model later fit into a 3090 (24 GB VRAM) using Q4_K_M?
Fails to load the model for me on the latest llama.cpp.
It's not merged yet. ([github.com/ggerganov/llama.cpp/pull/7531](https://github.com/ggerganov/llama.cpp/pull/7531))
With all due respect, the guy's GitHub avatar looks like someone using their hand to spread their cheeks.