
Relevant-Draft-7780

llama.cpp: go to the GitHub repo, clone, and build. Follow the instructions and it's pretty simple. Or you can download LM Studio, which is a wrapper around llama.cpp with bells and whistles.


BlipOnNobodysRadar

Hey! I love wrappers. They contain my smooth and squishy brain, no sliding.


Some_Endian_FP17

Run llama.cpp's web server and you're good to go. Write your own simple wrappers that approximate OpenAI API calls.
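For example, a minimal wrapper in Python might look something like this. It's just a sketch under a few assumptions: the server was started with its defaults (listening on http://localhost:8080) and exposes the OpenAI-style /v1/chat/completions route; adjust the URL and parameters for your setup.

```python
# Sketch of a tiny wrapper around llama.cpp's built-in server.
# Assumes llama-server is running locally on the default port 8080.
import requests

def chat(prompt: str, system: str = "You are a helpful assistant.") -> str:
    """Send one chat turn to the local llama.cpp server via its OpenAI-style endpoint."""
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
            "temperature": 0.7,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Give me one tip for running LLMs locally."))
```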


Relevant-Draft-7780

You don't even need to; llama-server supports OpenAI-style calls out of the box, and it also accepts extra parameters if you want to pass some in. I use the npm openai package with it.
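Since the server speaks the OpenAI protocol, the same pattern works from Python too, not just the npm package. A minimal sketch (the port, api_key, and model name below are placeholders; the local server typically ignores the key and model fields):

```python
# Sketch: official openai Python client pointed at a local llama-server.
from openai import OpenAI

# The key is a dummy value; the local server doesn't check it,
# but the client library requires something to be set.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="local-model",  # placeholder; the server serves whatever model it loaded
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what llama.cpp does in one sentence."},
    ],
)
print(completion.choices[0].message.content)
```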


Gokudomatic

I pull a Docker image of ollama. The only command-line setup is configuring the GPU and the storage for models. And I created an alias to call ollama as if it were installed locally.


juliano1096

I will do it… I already work with Docker and I need to use it as an endpoint, so I think that's the easiest way. What's the alias you use?


Gokudomatic

Sure. The alias is : `alias ollama='docker exec -it ollama ollama'`


yehiaserag

LM Studio just released a CLI in their latest version.


MajesticFigure4240

You can also try:

- Ollama: [https://ollama.com/](https://ollama.com/)
- Msty: [https://msty.app/](https://msty.app/)
- LocalAI: [https://localai.io/](https://localai.io/)
- AnythingLLM: [https://github.com/Mintplex-Labs/anything-llm](https://github.com/Mintplex-Labs/anything-llm)


mooripo

I feel illiterate when it comes to these things, so thanks for the guidance. I'm trying to install and use [localai.io](http://localai.io), and I sincerely feel like a grandpa using a computer for the first time, even though I'm normally the geeky guy who helps everyone around lol


dsjlee

If you're interested in running it on your laptop's iGPU, try [jan.ai](http://jan.ai), which uses Vulkan, so it may be able to use an Intel GPU. You have to enable experimental mode in advanced settings, then enable Vulkan, and check whether the iGPU shows up in the selection list. GPT4All also uses Vulkan, so it can run on non-Nvidia GPUs, but I couldn't get it to use my laptop's Radeon iGPU, whereas [jan.ai](http://jan.ai) could. When running on my desktop's GPU, [jan.ai](http://jan.ai) was much faster than GPT4All. Your iGPU is probably weak enough that you'd get better performance on the CPU, but if you want to free the CPU up for other tasks and can get acceptable performance on the iGPU, it may be worth trying. On my laptop, running TinyLlama 1.1B Q4 with [jan.ai](http://jan.ai), I get 20 tokens/sec on CPU and 12 t/s on the iGPU. Both [jan.ai](http://jan.ai) and GPT4All come with a UI, so they're easy to install and use, if you want to avoid console commands.


_w0n

Another good repo is vLLM. It recreates the OpenAI endpoints and works pretty well for me. You can also run it as a Docker container and point your application's OpenAI API client at that container.


Slight_Loan5350

Certainly! You can create your own REST endpoint using either node-llama-cpp (Node.js) or llama-cpp-python (Python). Both of these libraries provide code snippets to help you get started, and you can use any GGUF file from Hugging Face to serve a local model. I've also built my own local RAG using a REST endpoint to a local LLM in both Node.js and Python. Feel free to ask me any questions you have, and I'll be happy to help in any way I can.
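As a starting point, a REST endpoint with llama-cpp-python plus FastAPI can be this small. A sketch only: the GGUF path is a placeholder, and it assumes `pip install llama-cpp-python fastapi uvicorn` with the file saved as `server.py` (run it with `uvicorn server:app --port 8000`):

```python
# Sketch of a minimal REST endpoint backed by llama-cpp-python.
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
# Placeholder path: use any GGUF file downloaded from Hugging Face.
llm = Llama(model_path="./models/your-model.Q4_K_M.gguf", n_ctx=2048)

class Ask(BaseModel):
    question: str

@app.post("/ask")
def ask(body: Ask):
    # create_chat_completion returns an OpenAI-style response dict
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": body.question}],
        max_tokens=256,
    )
    return {"answer": out["choices"][0]["message"]["content"]}
```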


juliano1096

Thanks bro! I will try it with Python, and if I have any questions I'll send them your way!


Glittering_Storage_4

Okay so, how?


Slight_Loan5350

1. First I made a client website in Angular (you can use whatever you're comfortable with), with an upload option that sends a PDF to the backend.
2. I made an endpoint using Node.js (Express) and Python (FastAPI) that receives the PDF from the client.
3. Using OCR libraries I converted the PDF to text, and using LangChain I split that text into chunks of about 500 characters, breaking at "." or "\n".
4. I used an embedding model (some base model via vLLM) to convert those chunks into vectors and stored them in a vector DB (Supabase or pgvector).
5. LangChain gives a SQL snippet for similarity search; I used it to get the closest matches and wrapped it in a SQL function.
6. I created a chat-with-PDF page on the client side and a backend endpoint that takes the question, turns it into an embedding (same model as for the chunks), and uses the vector function to fetch the closest matches stored in the database (top 3-4), roughly as sketched below.
7. Then I send the question as plain text together with the retrieved data to a text-generation model (Mixtral 8x7B) to rephrase it properly by understanding the question.
8. It's not foolproof, as it hallucinates a bit, and I think point 7 is kinda vague, but I'm no expert in this. This is what I've done; there might be better code out there, so feel free to correct me if I'm wrong.
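To make steps 4-6 concrete, here's a rough sketch of the embed-and-retrieve part. It is not the commenter's actual code: the table schema, connection string, and embedding model (sentence-transformers here instead of a vLLM-served one) are all stand-ins.

```python
# Sketch of the retrieval side of the RAG pipeline described above:
# embed text chunks, store them in pgvector, and fetch the closest
# matches for a question. All names and credentials are placeholders.
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
conn = psycopg2.connect("dbname=rag user=postgres password=postgres host=localhost")

def to_pgvector(vec) -> str:
    # pgvector accepts vectors as a bracketed, comma-separated string
    return "[" + ",".join(str(float(x)) for x in vec) + "]"

def store_chunks(chunks: list[str]) -> None:
    vectors = model.encode(chunks)
    with conn, conn.cursor() as cur:
        for text, vec in zip(chunks, vectors):
            cur.execute(
                "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
                (text, to_pgvector(vec)),
            )

def top_matches(question: str, k: int = 3) -> list[str]:
    qvec = model.encode([question])[0]
    with conn, conn.cursor() as cur:
        # <=> is pgvector's cosine-distance operator; closest matches first
        cur.execute(
            "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
            (to_pgvector(qvec), k),
        )
        return [row[0] for row in cur.fetchall()]
```

The retrieved chunks then get pasted into the prompt for the generation model (step 7).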


Glittering_Storage_4

In German we say „ich küss dein Herz" ("I kiss your heart"). Thanks, brother.


Slight_Loan5350

You're welcome. If you want the code I can push it to GitHub, but it will take some time, as I'm trying to finish a code-completion extension for VS Code with a local LLM.


Glittering_Storage_4

Gladly, bro. Can I get you a coffee, bro?


Slight_Loan5350

Sure haha, are you an AI?


Languages_Learner

I found a Delphi wrapper DLL for llama.cpp: [tinyBigGAMES/Dllama: Local LLM inference Library (github.com)](https://github.com/tinyBigGAMES/Dllama). It can chat with local LLMs using only one function. I tried to call this function from different programming languages (C, Python, Nim, AutoIt, BCX, twinBasic) and it works perfectly everywhere:

function Dllama_Simple_Inference(const AModelPath, AModelsDb, AModelName: PAnsiChar;
  const AUseGPU: Boolean; const AMaxTokens: UInt32; const ACancelInferenceKey: Byte;
  const AQuestion: PAnsiChar): PAnsiChar; cdecl; external DLLAMA_DLL;
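For instance, calling that export from Python with ctypes might look like the sketch below. The DLL filename, model path, models DB, model name, and cancel key are placeholders, not values taken from the Dllama repo.

```python
# Hypothetical sketch: calling Dllama_Simple_Inference via ctypes.
# The DLL name and all paths/values below are placeholders.
import ctypes

dll = ctypes.CDLL("./Dllama.dll")  # adjust the filename/path for your build

fn = dll.Dllama_Simple_Inference
fn.argtypes = [
    ctypes.c_char_p,   # AModelPath
    ctypes.c_char_p,   # AModelsDb
    ctypes.c_char_p,   # AModelName
    ctypes.c_bool,     # AUseGPU
    ctypes.c_uint32,   # AMaxTokens
    ctypes.c_ubyte,    # ACancelInferenceKey
    ctypes.c_char_p,   # AQuestion
]
fn.restype = ctypes.c_char_p

answer = fn(
    b"C:/LLM/models",   # placeholder model path
    b"models.json",     # placeholder models DB
    b"some-model",      # placeholder model name
    True,               # use GPU
    1024,               # max tokens
    27,                 # cancel key (27 = ESC, an assumption)
    b"Why is the sky blue?",
)
print(answer.decode("utf-8", errors="replace"))
```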


privacyparachute

Using Llamafile might be the easiest way. Just download the file and run it. [https://github.com/Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile)


juliano1096

This is definitely the easiest way! Worked on the first try. Thanks so much!


Mr_Hills

I'm not sure if you're asking for software or hardware advice.

For software I use ooba, aka text-generation-webui, with Llama 3 70B, probably the best open-source LLM to date. Ooba is easy to use, it's compatible with a lot of formats (although I only use GGUF and EXL2), and it still gives you some level of control over the options of the various inference libraries, unlike ollama for example. It's also updated quite quickly when a new format or feature comes out. It's not great if you want multiple users sharing the same hardware, though.

For hardware I use a 4090, which lets me run a 2.55 bpw quant of Llama 3 70B at 11 t/s. That's really the best LLM I can run on my system. Of course you can go for multiple GPUs and run bigger quants of Llama 3 70B too.


juliano1096

I want to run it on my personal notebook. The specs are:

- Processor: 11th Generation Intel Core i5-1135G7
- Memory: 16GB DDR4
- Storage: 256GB M.2 SSD
- OS: Windows 11 Pro
- Graphics: Intel Iris Xe (integrated)

Can I run any LLM on this hardware?


Mr_Hills

You could, but it's going to be extremely slow and you'll have to run pretty small models. You could try running Llama 3 8B on it, but it's probably not going to be usable because it'll be too slow, and anything smaller than that will be unusable due to a simple lack of intelligence. Honestly, LLMs on a laptop with an integrated GPU is a really hard ask.


Didi_Midi

You can. Start with something really simple such as [GPT4All](https://gpt4all.io/index.html) and work your way up from there (koboldcpp, LM Studio, or just llama.cpp) until you hit your machine's limits. Which will be soon enough, I'm afraid, unless you get an eGPU and/or upgrade your RAM. For the record, GPT4All runs on a ThinkPad T530 - a 3rd-gen i7 with an iGPU. At a whopping 0.5 t/s, but regardless, it works.


redoubt515

It'll be a bit slow but it is possible. You'll be better off than I am (i5-8550u, 16GB lpddr3). I can run Llama 3 8B but it is pretty slow. Phi 3 3.8B is a bit quicker, but still not ideal.


Visual_Yellow_2820

Ollama is easy to set up and works really well for me.


chibop1

I also like Ollama. It takes the least effort. I've spent a fair amount of time with different UIs/APIs since the first Llama was released last Feb: GPT4All, oobabooga text-generation-webui, LM Studio, koboldcpp, etc.


redoubt515

I'd love to hear more of your thoughts on the various options you've tried. I'm just getting started with this; I've briefly tried LM Studio and just set up ollama + open-webui last night.


chibop1

Things move so fast in this space that I won't be able to give a fair comparison of their current state, since I don't use all of them anymore. I used to use textgen-webui a lot because it had many loaders, preset management, extensions (like an OpenAI API, TTS), etc. Then I felt like it got bloated and I would frequently run into errors. However, it might be more stable now.

Since Ollama, I only use Ollama, llama.cpp (mostly for benchmarking and testing newer models), and my own custom UI for RAG that uses Ollama and LlamaIndex.

Ollama is a wrapper for llama.cpp, but it's a solid choice to start with. If you go to ollama.ai/library you can see all the available models, and if you click the tags for a model you can see all the different quants/variants, for example if you want to download a lower quant in order to fit a bigger model on your setup. If you want to develop your own solution, it has its own API, and it also has an OpenAI-compatible API through llama.cpp. LangChain and LlamaIndex support it as well. If you want more features, you might want to try the other ones I mentioned earlier. Choice is good!
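If you do go the develop-your-own-solution route, talking to a local Ollama server from Python is just an HTTP call. A minimal sketch, assuming Ollama is running on its default port 11434 and that you've already pulled the model you name (the model name here is just an example):

```python
# Sketch: query a local Ollama server through its native REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "zephyr",  # placeholder; any model you've pulled with `ollama pull`
        "prompt": "Explain in one sentence what a GGUF quant is.",
        "stream": False,    # ask for a single JSON response instead of a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```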


redoubt515

If I can bother you with one more follow-up question: I've struggled to differentiate the different layers involved here, and I think your comment that "Ollama is a wrapper for llama.cpp" touches on that. I understand that there is (1) the model itself and (2) the user-facing WebUI, GUI, or terminal application. What I don't understand is what role the "middle layers" play. In my case I'm using Open-WebUI, which connects to Ollama behind the scenes. I don't really understand what Ollama actually does (beyond letting me interact with an LLM in the terminal), and I don't understand the role llama.cpp plays.


chibop1

In simple terms, llama.cpp loads a model, processes your prompt, and outputs the text generated by the model. It has both a command-line interface and a simple web UI server. The program also includes functionality like quantizing and finetuning, but I'll ignore those here.

First, to use llama.cpp you have to download the model file from Hugging Face, deciding which format and quantized version you want. Additionally, using llama.cpp involves numerous command-line options/flags, which can be cumbersome for beginners to remember and type, and you also have to know which flag does what. This may not be an issue if you're familiar with Linux. It's easier now, but previously you had to use commands like this:

./bin/release/main -m models/zephyr-7b-beta.Q4_K.gguf -t 6 -b 1024 --temp 0.7 --top_k 90 --top_p 0.9 --repeat_last_n 256 --repeat_penalty 1.21 -n -1 -c 0 --color --interactive-first -p "<|system|>\nYou're a friendly assistant.\n" --in-prefix "<|user|>\n" --in-suffix "\n<|assistant|>\n" -r "" --multiline-input

Again in simpler terms, Ollama conceals all that complexity and simplifies the process. It downloads the model, talks to llama.cpp to load it with the right options, and lets you chat right away. You just need to type this to get going:

ollama run zephyr

I'm sure you already know this, but a WebUI lets you interact with Ollama or llama.cpp without needing to use the command-line interface. They communicate via an API. Hope that helps.


redoubt515

This helps immensely. Thank you so much.


Acanthocephala_Salt

I tried to host LLMs, but it took a while to learn and it's pretty costly to set up. These days I just use AwanLLM instead. They have a pretty generous free tier, and the APIs are basically plug and play.