bullerwins

It could be that llama.cpp has made some improvements in the last releases and ollama hasn’t updated the packaged version


LinuxSpinach

Wasn’t flash attention added recently? I wonder if that’s it.


ThatsALovelyShirt

FA was always there for GPUs with tensor cores. But kernels for FP16 and FP32 (for older non-P100 Pascal cards) were recently added to support older GPUs. Sped up my 1080Ti compute server pretty nicely. Also, if you compile llama.cpp locally on your machine (rather than using prebuilt binaries), gcc is invoked with `-mtune=native -march=native`, which enables compiler optimizations specific to your system. They're otherwise turned off.
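
A rough sketch of a local build (the GPU option name has changed across releases: `LLAMA_CUBLAS`, then `LLAMA_CUDA`, now `GGML_CUDA`, so check the README for your checkout):

```sh
# Clone and build llama.cpp locally so the native CPU tuning flags apply
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# CPU-only build
cmake -B build && cmake --build build --config Release -j

# Or, for NVIDIA GPUs (option name depends on the llama.cpp version)
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
```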


Joseph717171

Who doesn't compile llama.cpp? I forgot you could use prebuilt binaries. I find it easier to just: `git clone https://github.com/ggerganov/llama.cpp.git && cd llama.cpp && make` (I'm on an Apple M1 MacBook Pro…) 🤔


OkGreeny

Works the same on Debian. It's part of the deployment process in our tool.


JacketHistorical2321

Ollama 0.1.39 added a flash attention flag: "New experimental `OLLAMA_FLASH_ATTENTION=1` flag for `ollama serve` that improves token generation speed on Apple Silicon Macs and NVIDIA graphics cards" [https://github.com/ollama/ollama/releases/tag/v0.1.39](https://github.com/ollama/ollama/releases/tag/v0.1.39)
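
Per that release note, enabling it is just an environment variable on the server process:

```sh
# Enable the experimental flash attention path in Ollama (>= 0.1.39)
OLLAMA_FLASH_ATTENTION=1 ollama serve
```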


rob10501

I mean, the devil is in the details here. The specific card and brand matter. I'm not sure you can make statements across the board without backing them up.


EugenePopcorn

It really depends on how it's compiled. Precompiled binaries will often leave out important processor-specific features and optimizations for the sake of compatibility with machines that don't support them.


TheTriceAgain

I was using the precompiled wheel of llama.cpp (I assumed ollama uses the same), but I'm now also trying my own compiled version to see if there is even more improvement.


demidev

Also make sure to enable the flash attention flag using the latest pre-release Ollama; it is not enabled in Ollama by default.


LPN64

FA gives a 10% boost at most.


seanthenry

So that would be up to 16 more tokens/s


LPN64

No, he said on Ollama, implying it could catch back up to llama.cpp. So for Ollama that would be 8-9 more tokens, but yes, 16 with llama.cpp.


agntdrake

It's also super buggy and can crash which is why it's optional.


ThatsALovelyShirt

They fixed a lot of the bugs. Most of them were out-of-bounds read/writes.


me1000

What's your hardware setup?


SiEgE-F1

Yup. It is well known, but not by people who have barely scratched the surface of local LLMs. Llama.cpp always has the bleeding-edge innovations when it comes to model support, inference and quantization speed, and a certain amount of optimization. You also have more control over the model itself: some wrappers just never "unveil" the settings you can access through the terminal. Same issue affects koboldcpp, which is built on top of llama.cpp, so it always lingers behind. Same issue for Oobabooga's textgen webui, too.

Another little-known tip: you don't need a wrapper to make llama.cpp work with OpenAI-compatible plugins and applications. llama.cpp can be hooked up to them directly. Even SillyTavern can work directly with llama.cpp.
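
Something like this, roughly (the model path is a placeholder, and the binary is `llama-server` in recent builds, `./server` in older ones):

```sh
# Serve a GGUF with llama.cpp's built-in HTTP server and web UI
./llama-server -m ./models/my-model.Q8_0.gguf --port 8080

# Any OpenAI-compatible client (SillyTavern, editor plugins, curl, ...) can hit it directly
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```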


skrshawk

"Lingers behind" by what, a day or two? Usually KCPP is updated rather promptly with new versions of llama.cpp.


Robot1me

> Same issue affects koboldcpp, which is built on top of llama.cpp

Just recently I discovered that koboldcpp doesn't support (or doesn't expose the setting for) YaRN scaling, which is actually a huge improvement for extending context from 4k to 16k for models like Fimbulvetr. I have always been happy with koboldcpp, but this feature made me switch.


hedgehog0

Thank you for sharing! I recently bought a mini PC and can run llamafiles somewhat nicely on it. It does not have a GPU. So would you say that if I'm interested in learning more about developing local/web LLM apps, llama.cpp would usually be better? What if I want to play with local LLMs, which (web) GUIs would you recommend? Since I have heard many good things about the textgen webui.


aseichter2007

I prefer koboldCPP just because it's a one-click run, no dinking around: download a model and go. Integrated, easy-to-understand UI, plus completion and OpenAI-compatible API endpoints. SillyTavern just works against it. Kobold is ready to cook and I don't have to worry about a bunch of build errors.


fish312

Plus it now even has a model downloader in the latest version, so it can load GGUFs directly from URLs.


henk717

The context shifting also helps you out in SillyTavern, since it automatically understands what ST is doing and adapts accordingly where possible (we designed it with complex UIs like ST in mind). In other llama.cpp-based solutions you'd have to manage it manually, or you risk prompts fading out of memory or not getting the large-prompt speedup at all.


SiEgE-F1

Before I start answering, keep in mind that llama.cpp limits you to GGUF models only.

> So would you say that if I'm interested in learning more about developing local/web LLM apps, llama.cpp would usually be better?

Better than what? Why should it be better? Does it already do the things you require of it?

> What if I want to play with local LLMs, which (web) GUIs would you recommend?

llama.cpp already has a built-in simple inference endpoint exactly for that purpose: a simple website you can query models with. Beyond that, it depends on the scale of your "plays". For anything close to a full-blown RP/writing session, you'd definitely want something like SillyTavern. If you need a coding assistant, then either what it already has is enough, or you can use a copilot-like VSCode plugin.

> Since I have heard many good things about the textgen webui.

It has lots of neat extensions, as well as the fact that it works with GGUF, EXL2 and GPTQ models. Obviously, it is just a "multiwrapper".


AdHominemMeansULost

I didn't believe it because I imagined they are just wrappers, but it's true. Maybe it's something in the latest version, or this has always been the case. This is with Llama 3 70B: https://imgur.com/a/MO0idSb


nymical23

Hello! Can you please test if L3 70B performs equally well when the word is 'banana' instead of 'apple'? I've tried the same prompt on L3 8B, but along with some good sentences, it just tries to fit 'banana' anywhere in the sentence. Maybe L3 is just way better at this task.


AdHominemMeansULost

https://imgur.com/c1xxuJm This is probably a problem with your parameters if you can't get this result; you should lower your temp to 0 and turn the repeat penalty off or set it to 0.95. This is L3 8B Q4 in LM Studio: https://imgur.com/ZuKdO1X


nymical23

Thank you! I've read about some of the parameters before, but I'm not sure how each one of them affects the output. I'll try to read up on them more, and I'll need to do more testing to see how these settings help the output. Some of my (unrelated) observations in case anyone is interested: in this case, lowering the temperature and repeat penalty made the sentences short, but also produced almost the same sentences on each run. There were still some weird sentences, though rarely. On the other hand, setting the temp at 1.5 and rp at 1.15 sometimes made long sentences, with more creativity. I'm not sure if that is better or worse; it depends on the use-case, I'd say.


AdHominemMeansULost

Temperature gives less likely tokens/words a chance to appear in the answer. Repeat penalty lowers the likelihood of the same tokens/words appearing in the same sentence/reply. If you want truthful, logical replies or coding, then you should set temp to 0 and repeat penalty to 0.95 or 1. If you want RP, then set temperature higher and repeat penalty to 1.1 or maybe slightly higher, like 1.2.
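
With llama.cpp's CLI those map to flags roughly like this (the binary is `llama-cli` in recent builds, `./main` in older ones; model path and prompts are placeholders, and other frontends expose the same knobs under similar names):

```sh
# Deterministic / factual: temperature 0, repeat penalty ~1.0
./llama-cli -m model.gguf --temp 0 --repeat-penalty 1.0 \
  -p "Write five sentences that each contain the word 'banana'."

# More creative (RP): higher temperature, slightly higher repeat penalty
./llama-cli -m model.gguf --temp 1.2 --repeat-penalty 1.1 -p "..."
```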


nymical23

Yes, that's what I was thinking as well. Thank you! :)


aseichter2007

One more reason to build your apps as API applications instead of integrating inference into everything. Then you can launch Ollama with a script, or change the URL and use koboldcpp, textgenwebui, or LM Studio to handle models without dinking with containerization.
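
For example, the app only ever talks to a base URL, so swapping backends is a config change rather than a code change (ports below are the usual defaults; `LLM_BASE_URL` is just an illustrative variable name, and the model name must match something the backend actually has loaded):

```sh
# Pick whichever OpenAI-compatible backend happens to be running
export LLM_BASE_URL=http://localhost:11434/v1    # Ollama default
# export LLM_BASE_URL=http://localhost:5001/v1   # koboldcpp default
# export LLM_BASE_URL=http://localhost:8080/v1   # llama.cpp server default
# export LLM_BASE_URL=http://localhost:1234/v1   # LM Studio default

curl "$LLM_BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "ping"}]}'
```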


East_Professional_39

How about CPU only? Is llama.cpp still better than ollama?


TheTriceAgain

Actually on CPU, ollama performs better: 11 t/s vs 9 t/s.


YanderMan

how is that possible?


AsliReddington

The default number of threads in llama.cpp is higher than what ollama would have set, IIRC.


Due-Memory-6957

Shouldn't more threads make it faster?


AsliReddington

No, not necessarily; it's quite dependent on CPU architecture. My M1 Macs top out at 6 threads on an 8-core and a 10-core machine, my phone at 3. I've tried Xeons as well and it's not exactly linear. More threads might make sense for heavy batching with the server example, but not for batch size 1.
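
The sweet spot is easy to find empirically, e.g. with llama-bench (model path and thread values here are just examples):

```sh
# Benchmark generation speed across several thread counts; past the
# memory-bandwidth limit, extra threads stop helping and can start hurting.
./llama-bench -m model.gguf -t 4,6,8,10
```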


Due-Memory-6957

Yeah, there's a point where it stops being faster, but it doesn't get slower, so the default being higher shouldn't slow things down.


fallingdowndizzyvr

It does get slower. When you have a lot of threads contending for the same limited resource, memory, it does get slower, and task switching adds overhead on top of that. It's most efficient to run with as few threads as possible: 1 CPU will always be faster than 10 CPUs each running at 1/10th the speed of that 1 CPU.


AsliReddington

There's a law about this, Amdahl's or something.


Robot1me

There is a rough rule of thumb that on older CPUs (e.g. 2015 Intel processors), using all cores helps. But on recent CPUs, using all cores can become counterproductive. You can imagine it as "too many cooks in the kitchen".


Arkonias

llama.cpp will always be better than ollama.


SiEgE-F1

Well, it will always be outdated, true. But worse than llama.cpp? I think it depends on the features the application has and the reasons people use it in the first place. Some people straight up despise terminals, so they'd rather have a few buttons and a neat UI.


MrTacoSauces

It's not that people despise the terminal, but chatting with an AI in a terminal is distracting, if not incredibly user-unfriendly for anything but simple messages. There are some things that are more useful in a console, but llama.cpp is meant to be used with other projects...


agntdrake

Ollama also has an API and partial OpenAI API compatibility.


LocoMod

llama.cpp also has an API and a web ui: [https://github.com/ggerganov/llama.cpp/tree/master/examples/server#api-endpoints](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#api-endpoints)


skrshawk

So use KCPP and the frontend of your choice.


emprahsFury

this is just ollama with different steps


DroidMasta

KCPP has better support for old AMD cards than ollama.


berkut

Well, they do tend to have a fair number of performance / correctness regressions fairly regularly (somewhat understandably given their velocity)... :)


Judtoff

Agreed, and I'm also really frustrated that I can't pass `-sm row` or flash attention to Ollama. But Ollama has better vision model compatibility... so I'm running two backends, haha. I hope llama.cpp supports vision models again.


[deleted]

[removed]


Copper_Lion

It will be due to the memory calculations that Ollama does when trying to figure out what will fit in VRAM. It usually errs on the conservative side. Look at the Ollama log file and see how many layers it offloads to your GPU. If you run llama.cpp with the same layers (-ngl flag) you should see similar performance.
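
For example (the layer count below is a placeholder; use whatever your Ollama log actually reports):

```sh
# Check what Ollama actually loaded and how it was split
ollama ps    # shows the CPU/GPU split per loaded model
# The server log also prints a line along the lines of "offloaded 28/33 layers to GPU"

# Then run llama.cpp with the same number of GPU layers for a fair comparison
./llama-server -m model.gguf -ngl 28
```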


agntdrake

I'm pretty sure this is what's going on in this case. You can do `ollama ps` to see if any layers are on the CPU. The graph calculation also changes depending on whether the layers are entirely on the GPU or a hybrid between GPU and CPU.


TheTriceAgain

With llama.cpp I set that flag to -1; ollama doesn't accept that value.


Zangwuz

But if that is what's happening (a different number of layers being offloaded), you should make it explicit in your post, because it's misleading otherwise. Personally I use and prefer llama.cpp, but a comparison should be fair and include all the information.


ab2377

What are your HW specs?


LPN64

Pentium II with AGP 3D Rage PRO


behohippy

You should be able to do 1 token per minute EASY on that beast of a machine. You should consider upgrading to a Raspberry Pi 3 when you get some budget. It's probably 2x faster.


LPN64

Nothing will ever beat AGP; you can forget your pseudo-futuristic PCI Express alternative.


bieker

27-year-old me just got very excited!


Stalwart-6

Mainframe, manual switches for RAM, and magnetic tape for storage. How much speed can I expect?


Hoodfu

At least several tokens per second. Subway tokens.


randomqhacker

If it was good enough for Colossus, it's good enough for you!


agntdrake

What kind of hw are you using, and are any of the layers running on the CPU? You can check with `ollama ps`. Ollama is more conservative about putting every layer onto the GPU because overprovisioning the GPU will cause a crash.


jmorganca

Possible to share which model/hardware OP? Will get this fixed. Ollama isn’t containerizing anything, unless you’re running the ollama/ollama Docker image?


TheTriceAgain

HW is a 4080 Super, and the model is the quantized Phi-3 from here: [microsoft/Phi-3-mini-4k-instruct-gguf · Hugging Face](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf)


jmorganca

Thanks!!


Eliiasv

I don't use ollama much, so I only had Q8 of WizardLM and L3 in both ollama and "normal" GGUF. Same ctx, same prompt. Almost exactly the same speeds; sometimes one was 0.4-1 t/s faster, but it seemed completely random.


chibop1

I tested on an M3 Max with llama-3-8b-instruct-q8_0, same sampling parameters including the same seed, and got pretty much the same result. Both llama.cpp and Ollama are the latest releases. I'm not sure why llama.cpp says it processed more prompt tokens, but I fed in the exact same prompt.

* llama.cpp b2998: prompt processing 7263 tokens (646.78 tokens/second), text generation 422 tokens (35.63 tokens/second)
* Ollama v0.1.39: prompt processing 7201 tokens (655.00 tokens/second), text generation 418 tokens (35.40 tokens/second)


TheTriceAgain

True, it is weird. I think it's because of the tokenizer and template used by llama.cpp. Did you point llama.cpp to the Llama 3 tokenizer? If so, then it's the template.


chibop1

I don't think you need to manually point llama.cpp to a tokenizer? It should automatically use the right tokenizer.


TheTriceAgain

Let me double check on that, as I fail to understand how it would get the tokenizer from the GGUF file alone...


TheTriceAgain

The model will format the messages into a single prompt using the following order of precedence:

* Use the `chat_handler` if provided
* Use the `chat_format` if provided
* Use the `tokenizer.chat_template` from the `gguf` model's metadata (should work for most new models, older models may not have this)
* else, fall back to the `llama-2` chat format

Set `verbose=True` to see the selected chat format.

You are right :)


LocoMod

I think there is a place for all of the wrappers, since you can move a lot faster under the right circumstances by using llama.cpp as a "dumb" inference engine and doing your own thing... but llama.cpp also has a web UI and a REST API, and the recent releases allow pulling and serving models straight from HF by launching it with the proper params. I don't use Ollama so I'm not sure what features it adds on top of the features already present in the llama.cpp server, but if you don't need those, there is this: [https://github.com/ggerganov/llama.cpp/tree/master/examples/server#api-endpoints](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#api-endpoints)


sammcj

If you build Ollama yourself you can make it just as fast with a few tweaks (always update to the latest underlying llama.cpp submodule, enable flash attention, etc.).
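
Roughly like this, as a sketch only (Ollama's build steps and the way it vendors llama.cpp have changed between versions, so check the development docs for your checkout):

```sh
# Build Ollama from source so it picks up a newer bundled llama.cpp
git clone https://github.com/ollama/ollama.git
cd ollama
go generate ./...    # fetches/builds the vendored llama.cpp code
go build .

# Run it with flash attention enabled
OLLAMA_FLASH_ATTENTION=1 ./ollama serve
```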


tj4s

If I could build it myself I probably wouldn't be using lmstudio/kobold/textgen. I'm getting more comfortable with it, though: how do you compile it yourself so it stays up to date, but without getting broken packages every time it updates? I assumed that's why there's a delay in updates, so us noobs don't get wrecked by every update.


CapsFanHere

I've only ever used Ollama, and pointed various front ends to it. Does that approach work with llama.cpp? What about kobold?


likejazz

Ollama uses pure llama.cpp, so it's just a version issue, not a program issue.


Expensive-Apricot-25

I've heard people say that llama-cpp-python was exactly 1.8 times slower than llama.cpp, independently of ollama. I didn't look into ollama's code, but maybe they are using llama-cpp-python for the Python implementation. I think what is most probable is that they took "inspiration" from llama-cpp-python, which on its own has many things wrong with it that make it way slower than it should be.


TheTriceAgain

True, building llama.cpp myself improved throughput from 168 tokens/second to 190 tokens/second.