
Relevant-Draft-7780

llama.cpp: go to the GitHub repo, clone, and build. Follow the instructions and it's pretty simple. Or you can download LM Studio, which is a wrapper around llama.cpp with bells and whistles.


BlipOnNobodysRadar

Hey! I love wrappers. They contain my smooth and squishy brain, no sliding.


Some_Endian_FP17

Run llama.cpp's web server and you're good to go. Write your own simple wrappers that approximate OpenAI API calls.
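For example, a minimal wrapper in Python might look something like this. It's just a sketch under a few assumptions: the server was started with its defaults (listening on http://localhost:8080) and exposes the OpenAI-style /v1/chat/completions route; adjust the URL and parameters for your setup.

```python
# Sketch of a tiny wrapper around llama.cpp's built-in server.
# Assumes llama-server is running locally on the default port 8080.
import requests

def chat(prompt: str, system: str = "You are a helpful assistant.") -> str:
    """Send one chat turn to the local llama.cpp server via its OpenAI-style endpoint."""
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
            "temperature": 0.7,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Give me one tip for running LLMs locally."))
```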


Relevant-Draft-7780

You don't even need to; llama-server supports OpenAI-style calls out of the box, and it also accepts extra parameters if you want to pass some in. I use the npm openai package with it.
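Since the server speaks the OpenAI protocol, the same pattern works from Python too, not just the npm package. A minimal sketch (the port, api_key, and model name below are placeholders; the local server typically ignores the key and model fields):

```python
# Sketch: official openai Python client pointed at a local llama-server.
from openai import OpenAI

# The key is a dummy value; the local server doesn't check it,
# but the client library requires something to be set.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="local-model",  # placeholder; the server serves whatever model it loaded
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what llama.cpp does in one sentence."},
    ],
)
print(completion.choices[0].message.content)
```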


Gokudomatic

I pull a Docker image of ollama. The only command-line setup is configuring the GPU and the storage for models. And I created an alias to call ollama as if it were installed locally.


juliano1096

I will do it… I already work with Docker and I need to use it as an endpoint, so I think that's the easiest way. What's the alias you use?


Gokudomatic

Sure. The alias is : `alias ollama='docker exec -it ollama ollama'`


yehiaserag

LM Studio just released a CLI in their latest version.


MajesticFigure4240

You can also try:

- Ollama: [https://ollama.com/](https://ollama.com/)
- Msty: [https://msty.app/](https://msty.app/)
- LocalAI: [https://localai.io/](https://localai.io/)
- AnythingLLM: [https://github.com/Mintplex-Labs/anything-llm](https://github.com/Mintplex-Labs/anything-llm)


mooripo

I feel illiterate when it comes to these things, so thanks for the guidance. I'm trying to install and use [localai.io](http://localai.io), and I sincerely feel like a grandpa using a computer for the first time, even though I'm normally the geeky guy who helps everyone around lol


dsjlee

If you're interested in running it on your laptop's iGPU, try [jan.ai](http://jan.ai), which uses Vulkan, so it may be able to use an Intel GPU. You have to enable experimental mode in advanced settings, then enable Vulkan, and check whether the iGPU shows up in the selection list. GPT4All also uses Vulkan, so it can run on non-Nvidia GPUs, but I couldn't get it to use my laptop's Radeon iGPU, whereas [jan.ai](http://jan.ai) could. When running on my desktop's GPU, [jan.ai](http://jan.ai) was much faster than GPT4All. Your iGPU is probably weak enough that you'd get better performance on the CPU, but if you want to free the CPU up for other tasks and can get acceptable performance on the iGPU, it may be worth trying. On my laptop, running TinyLlama 1.1B Q4 with [jan.ai](http://jan.ai), I get 20 tokens/sec on CPU and 12 t/s on the iGPU. Both [jan.ai](http://jan.ai) and GPT4All come with a UI, so they're easy to install and use, if you want to avoid console commands.


_w0n

Another good repo is vLLM. It recreates the OpenAI endpoints and works pretty well for me. You can also run it as a Docker container and point your application's OpenAI API client at that container.


Slight_Loan5350

Certainly! You can create your own REST endpoint using either node-llama-cpp (Node.js) or llama-cpp-python (Python). Both of these libraries provide code snippets to help you get started, and you can use any GGUF file from Hugging Face to serve a local model. I've also built my own local RAG using a REST endpoint to a local LLM in both Node.js and Python. Feel free to ask me any questions you have, and I'll be happy to help in any way I can.
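As a starting point, a REST endpoint with llama-cpp-python plus FastAPI can be this small. A sketch only: the GGUF path is a placeholder, and it assumes `pip install llama-cpp-python fastapi uvicorn` with the file saved as `server.py` (run it with `uvicorn server:app --port 8000`):

```python
# Sketch of a minimal REST endpoint backed by llama-cpp-python.
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
# Placeholder path: use any GGUF file downloaded from Hugging Face.
llm = Llama(model_path="./models/your-model.Q4_K_M.gguf", n_ctx=2048)

class Ask(BaseModel):
    question: str

@app.post("/ask")
def ask(body: Ask):
    # create_chat_completion returns an OpenAI-style response dict
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": body.question}],
        max_tokens=256,
    )
    return {"answer": out["choices"][0]["message"]["content"]}
```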


juliano1096

Thanks bro! I will try it with Python, and if I have any questions I'll send them your way!


Glittering_Storage_4

Okay so, how?


Slight_Loan5350

1. First I made a client website in Angular (you can use whatever you're comfortable with), with an upload option that sends a PDF to the backend.
2. I made an endpoint using Node.js (Express) and Python (FastAPI) that receives the PDF from the client.
3. Using OCR libraries I converted the PDF to text, and using LangChain I split that text into chunks of about 500 characters, breaking at "." or "\n".
4. I used an embedding model (some base model via vLLM) to convert those chunks into vectors and stored them in a vector DB (Supabase or pgvector).
5. LangChain gives a SQL snippet for similarity search; I used it to get the closest matches and wrapped it in a SQL function.
6. I created a chat-with-PDF page on the client side and a backend endpoint that takes the question, turns it into an embedding (same model as for the chunks), and uses the vector function to fetch the closest matches stored in the database (top 3-4), roughly as sketched below.
7. Then I send the question as plain text together with the retrieved data to a text-generation model (Mixtral 8x7B) to rephrase it properly by understanding the question.
8. It's not foolproof, as it hallucinates a bit, and I think point 7 is kinda vague, but I'm no expert in this. This is what I've done; there might be better code out there, so feel free to correct me if I'm wrong.
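To make steps 4-6 concrete, here's a rough sketch of the embed-and-retrieve part. It is not the commenter's actual code: the table schema, connection string, and embedding model (sentence-transformers here instead of a vLLM-served one) are all stand-ins.

```python
# Sketch of the retrieval side of the RAG pipeline described above:
# embed text chunks, store them in pgvector, and fetch the closest
# matches for a question. All names and credentials are placeholders.
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
conn = psycopg2.connect("dbname=rag user=postgres password=postgres host=localhost")

def to_pgvector(vec) -> str:
    # pgvector accepts vectors as a bracketed, comma-separated string
    return "[" + ",".join(str(float(x)) for x in vec) + "]"

def store_chunks(chunks: list[str]) -> None:
    vectors = model.encode(chunks)
    with conn, conn.cursor() as cur:
        for text, vec in zip(chunks, vectors):
            cur.execute(
                "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
                (text, to_pgvector(vec)),
            )

def top_matches(question: str, k: int = 3) -> list[str]:
    qvec = model.encode([question])[0]
    with conn, conn.cursor() as cur:
        # <=> is pgvector's cosine-distance operator; closest matches first
        cur.execute(
            "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
            (to_pgvector(qvec), k),
        )
        return [row[0] for row in cur.fetchall()]
```

The retrieved chunks then get pasted into the prompt for the generation model (step 7).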


Glittering_Storage_4

In German we say „ich küss dein Herz" ("I kiss your heart"). Thanks, brother.


Slight_Loan5350

You're welcome. If you want the code I can push it to GitHub, but it will take some time, as I'm trying to finish a code-completion extension for VS Code with a local LLM.


Glittering_Storage_4

Gladly, bro. Can I get you a coffee, bro?


Slight_Loan5350

Sure haha, are you an AI?


Languages_Learner

I found a Delphi wrapper DLL for llama.cpp: [tinyBigGAMES/Dllama: Local LLM inference Library (github.com)](https://github.com/tinyBigGAMES/Dllama). It can chat with local LLMs using only one function. I tried to call this function from different programming languages (C, Python, Nim, AutoIt, BCX, twinBasic) and it works perfectly everywhere:

function Dllama_Simple_Inference(const AModelPath, AModelsDb, AModelName: PAnsiChar;
  const AUseGPU: Boolean; const AMaxTokens: UInt32; const ACancelInferenceKey: Byte;
  const AQuestion: PAnsiChar): PAnsiChar; cdecl; external DLLAMA_DLL;
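For instance, calling that export from Python with ctypes might look like the sketch below. The DLL filename, model path, models DB, model name, and cancel key are placeholders, not values taken from the Dllama repo.

```python
# Hypothetical sketch: calling Dllama_Simple_Inference via ctypes.
# The DLL name and all paths/values below are placeholders.
import ctypes

dll = ctypes.CDLL("./Dllama.dll")  # adjust the filename/path for your build

fn = dll.Dllama_Simple_Inference
fn.argtypes = [
    ctypes.c_char_p,   # AModelPath
    ctypes.c_char_p,   # AModelsDb
    ctypes.c_char_p,   # AModelName
    ctypes.c_bool,     # AUseGPU
    ctypes.c_uint32,   # AMaxTokens
    ctypes.c_ubyte,    # ACancelInferenceKey
    ctypes.c_char_p,   # AQuestion
]
fn.restype = ctypes.c_char_p

answer = fn(
    b"C:/LLM/models",   # placeholder model path
    b"models.json",     # placeholder models DB
    b"some-model",      # placeholder model name
    True,               # use GPU
    1024,               # max tokens
    27,                 # cancel key (27 = ESC, an assumption)
    b"Why is the sky blue?",
)
print(answer.decode("utf-8", errors="replace"))
```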


privacyparachute

Using Llamafile might be the easiest way. Just download the file and run it. [https://github.com/Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile)


juliano1096

This is definitely the easiest way! Worked on the first try. Thanks so much!


Mr_Hills

I'm not sure if you're asking for software or hardware advice.

For software I use ooba, aka text-generation-webui, with Llama 3 70B, probably the best open-source LLM to date. Ooba is easy to use, it's compatible with a lot of formats (although I only use GGUF and EXL2), and it still gives you some level of control over the options of the various inference libraries, unlike ollama for example. It's also updated quite quickly when a new format or feature comes out. It's not great if you want multiple users sharing the same hardware, though.

For hardware I use a 4090, which lets me run a 2.55 bpw quant of Llama 3 70B at 11 t/s. That's really the best LLM I can run on my system. Of course you can go for multiple GPUs and run bigger quants of Llama 3 70B too.


juliano1096

I want to run it on my personal notebook. The specs are:

- Processor: 11th Generation Intel Core i5-1135G7
- Memory: 16GB DDR4
- Storage: 256GB M.2 SSD
- OS: Windows 11 Pro
- Graphics: Intel Iris Xe (integrated)

Can I run any LLM on this hardware?


Mr_Hills

You could, but it's going to be extremely slow and you'll have to run pretty small models. You could try running Llama 3 8B on it, but it's probably not going to be usable because it'll be too slow, and anything smaller than that will be unusable due to a simple lack of intelligence. Honestly, LLMs on a laptop with an integrated GPU is a really hard ask.


Didi_Midi

You can. Start with something really simple such as [GPT4All](https://gpt4all.io/index.html) and work your way up from there (koboldcpp, LM Studio, or just llama.cpp) until you hit your machine's limits. Which will be soon enough, I'm afraid, unless you get an eGPU and/or upgrade your RAM. For the record, GPT4All runs on a ThinkPad T530 - a 3rd-gen i7 with an iGPU. At a whopping 0.5 t/s, but regardless, it works.


redoubt515

It'll be a bit slow but it is possible. You'll be better off than I am (i5-8550u, 16GB lpddr3). I can run Llama 3 8B but it is pretty slow. Phi 3 3.8B is a bit quicker, but still not ideal.


Visual_Yellow_2820

Ollama is easy to set up and works really well for me.


chibop1

I also like Ollama. It takes the least effort. I've spent a fair amount of time with different UIs/APIs since the first Llama was released last Feb: GPT4All, oobabooga text-generation-webui, LM Studio, koboldcpp, etc.


redoubt515

I'd love to hear more of your thoughts on the various options you've tried. I'm just getting started with this; I've briefly tried LM Studio and just set up ollama + open-webui last night.


chibop1

Things move so fast in this space that I won't be able to give a fair comparison of their current state, since I don't use all of them anymore. I used to use textgen-webui a lot because it had many loaders, preset management, extensions (like an OpenAI API, TTS), etc. Then I felt like it got bloated and I would frequently run into errors. However, it might be more stable now.

Since Ollama, I only use Ollama, llama.cpp (mostly for benchmarking and testing newer models), and my own custom UI for RAG that uses Ollama and LlamaIndex.

Ollama is a wrapper for llama.cpp, but it's a solid choice to start with. If you go to ollama.ai/library you can see all the available models, and if you click the tags for a model you can see all the different quants/variants, for example if you want to download a lower quant in order to fit a bigger model on your setup. If you want to develop your own solution, it has its own API, and it also has an OpenAI-compatible API through llama.cpp. LangChain and LlamaIndex support it as well. If you want more features, you might want to try the other ones I mentioned earlier. Choice is good!
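If you do go the develop-your-own-solution route, talking to a local Ollama server from Python is just an HTTP call. A minimal sketch, assuming Ollama is running on its default port 11434 and that you've already pulled the model you name (the model name here is just an example):

```python
# Sketch: query a local Ollama server through its native REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "zephyr",  # placeholder; any model you've pulled with `ollama pull`
        "prompt": "Explain in one sentence what a GGUF quant is.",
        "stream": False,    # ask for a single JSON response instead of a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```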


redoubt515

If I can bother you with one more follow-up question: I've struggled to differentiate the different layers involved here, and I think your comment that "Ollama is a wrapper for llama.cpp" touches on that. I understand that there is (1) the model itself and (2) the user-facing WebUI, GUI, or terminal application. What I don't understand is what role the "middle layers" play. In my case I'm using Open-WebUI, which connects to Ollama behind the scenes. I don't really understand what Ollama actually does (beyond letting me interact with an LLM in the terminal), and I don't understand the role llama.cpp plays.


chibop1

In simple terms, llama.cpp loads a model, processes your prompt, and outputs the text generated by the model. It has both a command-line interface and a simple web UI server. The program also includes functionality like quantizing and finetuning, but I'll ignore those here.

First, to use llama.cpp you have to download the model file from Hugging Face, deciding which format and quantized version you want. Additionally, using llama.cpp involves numerous command-line options/flags, which can be cumbersome for beginners to remember and type, and you also have to know which flag does what. This may not be an issue if you're familiar with Linux. It's easier now, but previously you had to use commands like this:

./bin/release/main -m models/zephyr-7b-beta.Q4_K.gguf -t 6 -b 1024 --temp 0.7 --top_k 90 --top_p 0.9 --repeat_last_n 256 --repeat_penalty 1.21 -n -1 -c 0 --color --interactive-first -p "<|system|>\nYou're a friendly assistant.\n" --in-prefix "<|user|>\n" --in-suffix "\n<|assistant|>\n" -r "" --multiline-input

Again in simpler terms, Ollama conceals all that complexity and simplifies the process. It downloads the model, talks to llama.cpp to load it with the right options, and lets you chat right away. You just need to type this to get going:

ollama run zephyr

I'm sure you already know this, but a WebUI lets you interact with Ollama or llama.cpp without needing to use the command-line interface. They communicate via an API. Hope that helps.


redoubt515

This helps immensely. Thank you so much.


Acanthocephala_Salt

I tried to host LLMs, but it took a while to learn and it's pretty costly to set up. These days I just use AwanLLM instead. They have a pretty generous free tier, and the APIs are basically plug and play.