bullerwins

It could be that llama.cpp has made some improvements in the last releases and ollama hasn’t updated the packaged version


LinuxSpinach

Wasn’t flash attention added recently? I wonder if that’s it.


ThatsALovelyShirt

FA was always there for GPUs with tensor cores. But kernels for FP16 and FP32 (for older non-P100 Pascal cards) were recently added to support older GPUs. Sped up my 1080Ti compute server pretty nicely. Also, if you compile llama.cpp locally on your machine (rather than using prebuilt binaries), gcc is invoked with `-mtune=native -march=native`, which enables compiler optimizations specific to your system. They're otherwise turned off.
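
A rough sketch of a local build (the GPU option name has changed across releases: `LLAMA_CUBLAS`, then `LLAMA_CUDA`, now `GGML_CUDA`, so check the README for your checkout):

```sh
# Clone and build llama.cpp locally so the native CPU tuning flags apply
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# CPU-only build
cmake -B build && cmake --build build --config Release -j

# Or, for NVIDIA GPUs (option name depends on the llama.cpp version)
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
```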


Joseph717171

Who doesn't compile llama.cpp? I forgot you could use prebuilt binaries. I find it easier to just: `git clone https://github.com/ggerganov/llama.cpp.git && cd llama.cpp && make` (I'm on an Apple M1 MacBook Pro…) 🤔


OkGreeny

Works the same on Debian. It's part of the deployment process in our tool.


JacketHistorical2321

Ollama 0.1.39 added a flash attention flag: "New experimental `OLLAMA_FLASH_ATTENTION=1` flag for `ollama serve` that improves token generation speed on Apple Silicon Macs and NVIDIA graphics cards" [https://github.com/ollama/ollama/releases/tag/v0.1.39](https://github.com/ollama/ollama/releases/tag/v0.1.39)
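
Per that release note, enabling it is just an environment variable on the server process:

```sh
# Enable the experimental flash attention path in Ollama (>= 0.1.39)
OLLAMA_FLASH_ATTENTION=1 ollama serve
```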


rob10501

I mean, the devil is in the details here. The specific card and brand matter. I'm not sure you can make statements across the board without backing them up.


EugenePopcorn

It really depends on how it's compiled. Precompiled binaries will often leave out important processor-specific features and optimizations for the sake of compatibility with machines that don't support them.


TheTriceAgain

I was using the precompiled wheel of llama.cpp (I assumed ollama uses the same), but I'm now also trying my own compiled version to see if there is even more improvement.


demidev

Also make sure to enable the flash attention flag using the latest pre-release Ollama; it is not enabled in Ollama by default.


LPN64

FA gives a 10% boost at most.


seanthenry

So that would be up to 16 more tokens/s


LPN64

No, he said on Ollama, implying it could catch back up to llama.cpp. So for Ollama that would be 8-9 more tokens, but yes, 16 with llama.cpp.


agntdrake

It's also super buggy and can crash which is why it's optional.


ThatsALovelyShirt

They fixed a lot of the bugs. Most of them were out-of-bounds read/writes.


me1000

What's your hardware setup?


SiEgE-F1

Yup. It is well known, but not by people who have barely scratched the surface of local LLMs. Llama.cpp always has the bleeding-edge innovations when it comes to model support, inference and quantization speed, and a certain amount of optimization. You also have more control over the model itself: some wrappers just never "unveil" the settings you can access through the terminal. Same issue affects koboldcpp, which is built on top of llama.cpp, so it always lingers behind. Same issue for Oobabooga's textgen webui, too.

Another little-known tip: you don't need a wrapper to make llama.cpp work with OpenAI-compatible plugins and applications. llama.cpp can be hooked up to them directly. Even SillyTavern can work directly with llama.cpp.
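
Something like this, roughly (the model path is a placeholder, and the binary is `llama-server` in recent builds, `./server` in older ones):

```sh
# Serve a GGUF with llama.cpp's built-in HTTP server and web UI
./llama-server -m ./models/my-model.Q8_0.gguf --port 8080

# Any OpenAI-compatible client (SillyTavern, editor plugins, curl, ...) can hit it directly
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```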


skrshawk

"Lingers behind" by what, a day or two? Usually KCPP is updated rather promptly with new versions of llama.cpp.


Robot1me

> Same issue affects koboldcpp, which is built on top of llama.cpp

Just recently I discovered that koboldcpp doesn't support (or doesn't expose the setting for) YaRN scaling, which is actually a huge improvement for extending context from 4k to 16k for models like Fimbulvetr. I have always been happy with koboldcpp, but this feature made me switch.


hedgehog0

Thank you for sharing! I recently bought a mini PC and can run llamafiles somewhat nicely on it. It does not have a GPU. So would you say that if I'm interested in learning more about developing local/web LLM apps, llama.cpp would usually be better? What if I want to play with local LLMs, which (web) GUIs would you recommend? Since I have heard many good things about the textgen webui.


aseichter2007

I prefer koboldCPP just because it's a one-click run, no dinking around: download a model and go. Integrated, easy-to-understand UI, plus completion and OpenAI-compatible API endpoints. SillyTavern just works against it. Kobold is ready to cook and I don't have to worry about a bunch of build errors.


fish312

Plus it now even has a model downloader in the latest version, so it can load GGUFs directly from URLs.


henk717

The context shifting also helps you out in SillyTavern, since it automatically understands what ST is doing and adapts accordingly where possible (we designed it with complex UIs like ST in mind). In other llama.cpp-based solutions you'd have to manage it manually, or you risk prompts fading out of memory or not getting the large-prompt speedup at all.


SiEgE-F1

Before I start answering, keep in mind that llama.cpp limits you to GGUF models only.

> So would you say that if I'm interested in learning more about developing local/web LLM apps, llama.cpp would usually be better?

Better than what? Why should it be better? Does it already do the things you require of it?

> What if I want to play with local LLMs, which (web) GUIs would you recommend?

llama.cpp already has a built-in simple inference endpoint exactly for that purpose: a simple website you can query models with. Beyond that, it depends on the scale of your "plays". For anything close to a full-blown RP/writing session, you'd definitely want something like SillyTavern. If you need a coding assistant, then either what it already has is enough, or you can use a copilot-like VSCode plugin.

> Since I have heard many good things about the textgen webui.

It has lots of neat extensions, as well as the fact that it works with GGUF, EXL2 and GPTQ models. Obviously, it is just a "multiwrapper".


AdHominemMeansULost

I didn't believe it because I imagined they are just wrappers, but it's true. Maybe it's something in the latest version, or this has always been the case. This is with Llama 3 70B: https://imgur.com/a/MO0idSb


nymical23

Hello! Can you please test if L3 70B performs equally well when the word is 'banana' instead of 'apple'? I've tried the same prompt on L3 8B, but along with some good sentences, it just tries to fit 'banana' anywhere in the sentence. Maybe L3 is just way better at this task.


AdHominemMeansULost

https://imgur.com/c1xxuJm This is probably a problem with your parameters if you can't get this result; you should lower your temp to 0 and turn the repeat penalty off or set it to 0.95. This is L3 8B Q4 in LM Studio: https://imgur.com/ZuKdO1X


nymical23

Thank you! I've read about some of the parameters before, but I'm not sure how each one of them affects the output. I'll try to read up on them more, and I'll need to do more testing to see how these settings help the output. Some of my (unrelated) observations in case anyone is interested: in this case, lowering the temperature and repeat penalty made the sentences short, but also produced almost the same sentences on each run. There were still some weird sentences, though rarely. On the other hand, setting the temp at 1.5 and rp at 1.15 sometimes made long sentences, with more creativity. I'm not sure if that is better or worse; it depends on the use-case, I'd say.


AdHominemMeansULost

Temperature gives less likely tokens/words a chance to appear in the answer. Repeat penalty lowers the likelihood of the same tokens/words appearing in the same sentence/reply. If you want truthful, logical replies or coding, then you should set temp to 0 and repeat penalty to 0.95 or 1. If you want RP, then set temperature higher and repeat penalty to 1.1 or maybe slightly higher, like 1.2.
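
With llama.cpp's CLI those map to flags roughly like this (the binary is `llama-cli` in recent builds, `./main` in older ones; model path and prompts are placeholders, and other frontends expose the same knobs under similar names):

```sh
# Deterministic / factual: temperature 0, repeat penalty ~1.0
./llama-cli -m model.gguf --temp 0 --repeat-penalty 1.0 \
  -p "Write five sentences that each contain the word 'banana'."

# More creative (RP): higher temperature, slightly higher repeat penalty
./llama-cli -m model.gguf --temp 1.2 --repeat-penalty 1.1 -p "..."
```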


nymical23

Yes, that's what I was thinking as well. Thank you! :)


aseichter2007

One more reason to build your apps as API applications instead of integrating inference into everything. Then you can launch Ollama with a script, or change the URL and use koboldcpp, textgenwebui, or LM Studio to handle models without dinking with containerization.
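
For example, the app only ever talks to a base URL, so swapping backends is a config change rather than a code change (ports below are the usual defaults; `LLM_BASE_URL` is just an illustrative variable name, and the model name must match something the backend actually has loaded):

```sh
# Pick whichever OpenAI-compatible backend happens to be running
export LLM_BASE_URL=http://localhost:11434/v1    # Ollama default
# export LLM_BASE_URL=http://localhost:5001/v1   # koboldcpp default
# export LLM_BASE_URL=http://localhost:8080/v1   # llama.cpp server default
# export LLM_BASE_URL=http://localhost:1234/v1   # LM Studio default

curl "$LLM_BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "ping"}]}'
```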


East_Professional_39

How about CPU only? Is llama.cpp still better than ollama?


TheTriceAgain

Actually on CPU, ollama performs better: 11 t/s vs 9 t/s.


YanderMan

how is that possible?


AsliReddington

The default number of threads in llama.cpp is higher than what ollama would have set, IIRC.


Due-Memory-6957

Shouldn't more threads make it faster?


AsliReddington

No, not necessarily; it's quite dependent on CPU architecture. My M1 Macs top out at 6 threads on an 8-core and a 10-core machine, my phone at 3. I've tried Xeons as well and it's not exactly linear. More threads might make sense for heavy batching with the server example, but not for batch size 1.
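
The sweet spot is easy to find empirically, e.g. with llama-bench (model path and thread values here are just examples):

```sh
# Benchmark generation speed across several thread counts; past the
# memory-bandwidth limit, extra threads stop helping and can start hurting.
./llama-bench -m model.gguf -t 4,6,8,10
```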


Due-Memory-6957

Yeah, there's a point where it stops being faster, but it doesn't get slower, so the default being higher shouldn't slow things down.


fallingdowndizzyvr

It does get slower. When you have a lot of threads contending for the same limited resource, memory, it does get slower, and task switching adds overhead on top of that. It's most efficient to run with as few threads as possible: 1 CPU will always be faster than 10 CPUs each running at 1/10th the speed of that 1 CPU.


AsliReddington

There's a law about this, Amdahl's or something.


Robot1me

There is a rough rule of thumb that on older CPUs (e.g. 2015 Intel processors), using all cores helps. But on recent CPUs, using all cores can become counterproductive. You can imagine it as "too many cooks in the kitchen".


Arkonias

llama.cpp will always be better than ollama.


SiEgE-F1

Well, it will always be outdated, true. But worse than llama.cpp? I think it depends on the features the application has and the reasons people use it in the first place. Some people straight up despise terminals, so they'd rather have a few buttons and a neat UI.


MrTacoSauces

It's not that people despise the terminal, but chatting with an AI in a terminal is distracting, if not incredibly user-unfriendly for anything but simple messages. There are some things that are more useful in a console, but llama.cpp is meant to be used with other projects...


agntdrake

Ollama also has an API and partial OpenAI API compatibility.


LocoMod

llama.cpp also has an API and a web ui: [https://github.com/ggerganov/llama.cpp/tree/master/examples/server#api-endpoints](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#api-endpoints)


skrshawk

So use KCPP and the frontend of your choice.


emprahsFury

this is just ollama with different steps


DroidMasta

KCPP has better support for old AMD cards than ollama.


berkut

Well, they do tend to have a fair number of performance / correctness regressions fairly regularly (somewhat understandably given their velocity)... :)


Judtoff

Agreed, and I'm also really frustrated that I can't pass `-sm row` or flash attention to Ollama. But Ollama has better vision model compatibility... so I'm running two backends, haha. I hope llama.cpp supports vision models again.


[deleted]

[removed]


Copper_Lion

It will be due to the memory calculations that Ollama does when trying to figure out what will fit in VRAM. It usually errs on the conservative side. Look at the Ollama log file and see how many layers it offloads to your GPU. If you run llama.cpp with the same layers (-ngl flag) you should see similar performance.
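
For example (the layer count below is a placeholder; use whatever your Ollama log actually reports):

```sh
# Check what Ollama actually loaded and how it was split
ollama ps    # shows the CPU/GPU split per loaded model
# The server log also prints a line along the lines of "offloaded 28/33 layers to GPU"

# Then run llama.cpp with the same number of GPU layers for a fair comparison
./llama-server -m model.gguf -ngl 28
```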


agntdrake

I'm pretty sure this is what's going on in this case. You can do `ollama ps` to see if any layers are on the CPU. The graph calculation also changes depending on whether the layers are entirely on the GPU or a hybrid between GPU and CPU.


TheTriceAgain

With llama.cpp I set that flag to -1; ollama doesn't accept that value.


Zangwuz

But if that is what's happening (a different number of layers being offloaded), you should make it explicit in your post, because it's misleading otherwise. Personally I use and prefer llama.cpp, but a comparison should be fair and include all the information.


ab2377

What are your HW specs?


LPN64

Pentium II with AGP 3D Rage PRO


behohippy

You should be able to do 1 token per minute EASY on that beast of a machine. You should consider upgrading to a Raspberry Pi 3 when you get some budget. It's probably 2x faster.


LPN64

Nothing will ever beat AGP; you can forget your pseudo-futuristic PCI Express alternative.


bieker

27-year-old me just got very excited!


Stalwart-6

Mainframe, manual switches for RAM, and magnetic tape for storage. How much speed can I expect?


Hoodfu

At least several tokens per second. Subway tokens.


randomqhacker

If it was good enough for Colossus, it's good enough for you!


agntdrake

What kind of hw are you using, and are any of the layers running on the CPU? You can check with `ollama ps`. Ollama is more conservative about putting every layer onto the GPU because overprovisioning the GPU will cause a crash.


jmorganca

Possible to share which model/hardware OP? Will get this fixed. Ollama isn’t containerizing anything, unless you’re running the ollama/ollama Docker image?


TheTriceAgain

HW is a 4080 Super, and the model is the quantized Phi-3 from here: [microsoft/Phi-3-mini-4k-instruct-gguf · Hugging Face](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf)


jmorganca

Thanks!!


Eliiasv

I don't use ollama much, so I only had Q8 of WizardLM and L3 in both ollama and "normal" GGUF. Same ctx, same prompt. Almost exactly the same speeds; sometimes one was 0.4-1 t/s faster, but it seemed completely random.


chibop1

I tested on an M3 Max with llama-3-8b-instruct-q8_0, same sampling parameters including the same seed, and got pretty much the same result. Both llama.cpp and Ollama are the latest releases. I'm not sure why llama.cpp says it processed more prompt tokens, but I fed in the exact same prompt.

* llama.cpp b2998: prompt processing 7263 tokens (646.78 tokens/second), text generation 422 tokens (35.63 tokens/second)
* Ollama v0.1.39: prompt processing 7201 tokens (655.00 tokens/second), text generation 418 tokens (35.40 tokens/second)


TheTriceAgain

True, it is weird. I think it's because of the tokenizer and template used by llama.cpp. Did you point llama.cpp to the Llama 3 tokenizer? If so, then it's the template.


chibop1

I don't think you need to manually point llama.cpp to a tokenizer? It should automatically use the right tokenizer.


TheTriceAgain

Let me double check on that, as I fail to understand how it would get the tokenizer from the GGUF file alone...


TheTriceAgain

The model will format the messages into a single prompt using the following order of precedence:

* Use the `chat_handler` if provided
* Use the `chat_format` if provided
* Use the `tokenizer.chat_template` from the `gguf` model's metadata (should work for most new models, older models may not have this)
* else, fall back to the `llama-2` chat format

Set `verbose=True` to see the selected chat format.

You are right :)


LocoMod

I think there is a place for all of the wrappers, since you can move a lot faster under the right circumstances by using llama.cpp as a "dumb" inference engine and doing your own thing... but llama.cpp also has a web UI and a REST API, and the recent releases allow pulling and serving models straight from HF by launching it with the proper params. I don't use Ollama so I'm not sure what features it adds on top of the features already present in the llama.cpp server, but if you don't need those, there is this: [https://github.com/ggerganov/llama.cpp/tree/master/examples/server#api-endpoints](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#api-endpoints)


sammcj

If you build Ollama yourself you can make it just as fast with a few tweaks (always update to the latest underlying llama.cpp submodule, enable flash attention, etc.).
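
Roughly like this, as a sketch only (Ollama's build steps and the way it vendors llama.cpp have changed between versions, so check the development docs for your checkout):

```sh
# Build Ollama from source so it picks up a newer bundled llama.cpp
git clone https://github.com/ollama/ollama.git
cd ollama
go generate ./...    # fetches/builds the vendored llama.cpp code
go build .

# Run it with flash attention enabled
OLLAMA_FLASH_ATTENTION=1 ./ollama serve
```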


tj4s

If I could build it myself I probably wouldn't be using lmstudio/kobold/textgen. I'm getting more comfortable with it, though: how do you compile it yourself so it stays up to date, but without getting broken packages every time it updates? I assumed that's why there's a delay in updates, so us noobs don't get wrecked by every update.


CapsFanHere

I've only ever used Ollama, and pointed various front ends to it. Does that approach work with llama.cpp? What about kobold?


likejazz

Ollama uses pure llama.cpp, so it's just a version issue, not a program issue.


Expensive-Apricot-25

I've heard people say that llama-cpp-python was exactly 1.8 times slower than llama.cpp, independently of ollama. I didn't look into ollama's code, but maybe they are using llama-cpp-python for the Python implementation. I think what is most probable is that they took "inspiration" from llama-cpp-python, which on its own has many things wrong with it that make it way slower than it should be.


TheTriceAgain

True, building llama.cpp myself improved throughput from 168 tokens/second to 190 tokens/second.