Made a RAG-centric, Open-Source UI based on llama.cpp - With Advanced Source Citations & Referencing: Pinpointing Page-Numbers, Incorporating Extracted Images, Text-highlighting & Document-Readers alongside Local LLM-generated Responses #7928
Replies: 2 comments 5 replies
-
Thanks for sharing - looks like a cool project!
Absolutely, please open a PR! Curious to know more about your experience using llama.cpp.
-
That's an excellent suggestion and I will dig into it. Thanks so much once again!
From: Georgi Gerganov
Sent: Friday, June 14, 2024, 2:43 AM
To: ggerganov/llama.cpp
Cc: Abheek Gulati
Subject: Re: [ggerganov/llama.cpp] Made a RAG-centric, Open-Source UI based on llama.cpp - With Advanced Source Citations & Referencing: Pinpointing Page-Numbers, Incorporating Extracted Images, Text-highlighting & Document-Readers alongside Local LLM-generated...
I also humbly request you for feedback and suggestions on LARS: if there's anything specific I should focus on, any llama.cpp feature I should integrate and look into, any of the myriad of other amazing projects that you feel will benefit LARS and I should look into, I'm all ears and very eager to hear!
From a user's perspective I cannot provide much feedback, as I am mainly interested in how third-party projects use llama.cpp at the developer level.
One thing that might be interesting to explore in that regard is using llama.cpp to compute the embeddings as well. Currently, if I understand the code correctly, LARS uses a Hugging Face engine to compute the embeddings:
https://github.com/abgulati/LARS/blob/c11dd07d9b8125efc3a4935716626434d5b94b92/web_app/app.py#L2060-L2064
Instead, it might be possible to utilize a llama.cpp-based implementation. From a quick look at the LangChain docs, there is this:
https://python.langchain.com/v0.1/docs/integrations/text_embedding/llamacpp/
I'm interested in this because I want to know the current limitations of computing embeddings with llama.cpp and how to improve them, both in terms of performance and model support.
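The swap suggested above could look something like the following minimal sketch, using LangChain's `LlamaCppEmbeddings` wrapper in place of a Hugging Face embedding engine. The model path and the ranking helper are illustrative assumptions, not LARS's actual code; a GGUF embedding model and `pip install langchain-community llama-cpp-python` would be required to run the embedding part.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Similarity between two embedding vectors (e.g. a query vs. a stored chunk)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def rank_chunks_llamacpp(texts, query, model_path="models/embedding-model.gguf"):
    """Sketch: embed document chunks and a query with llama.cpp via LangChain,
    then return chunk indices ranked by similarity, as a RAG retriever would.
    The model_path is a placeholder; any GGUF embedding model should work."""
    from langchain_community.embeddings import LlamaCppEmbeddings
    embedder = LlamaCppEmbeddings(model_path=model_path)
    doc_vectors = embedder.embed_documents(texts)  # one vector per chunk
    query_vector = embedder.embed_query(query)
    return sorted(
        range(len(texts)),
        key=lambda i: cosine_similarity(query_vector, doc_vectors[i]),
        reverse=True,
    )

# Toy check of the similarity helper alone (no model needed):
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 1.0
```

This keeps the rest of the RAG pipeline unchanged: only the embedding backend moves from Hugging Face to llama.cpp, so the same vector store and retrieval logic can stay in place.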
-
Recently Open-Sourced my Citation-Centric Local-LLM Application: RAG with your LLM of choice, with your documents, on your machine.
Please read on for more details, or check out my GitHub repo (200 stars in under a week!) for the complete scoop!
Introducing LARS: The LLM & Advanced Referencing Solution! There are many desktop applications for running LLMs locally, but LARS aims to be the ultimate open-source RAG-centric LLM application.
To this end, LARS takes the concept of RAG much further by adding detailed citations to every response: specific document names, page numbers, text highlighting, and images relevant to your question, and even a document reader right within the response window. While not every type of citation appears in every response, the idea is to surface at least some combination of citations for every RAG response, and that is generally the case.
I humbly request the addition of LARS to the UI list on the llama.cpp README documentation!
Here's a demonstration video going over core features:
LARS Feature-Demonstration Video
Here's a list detailing LARS's feature-set as it stands today:
I have been building this tool single-handedly since August 2023 and continue to add to it on a near-daily basis.
LARS could really benefit from users, testing, and contributions, so please check out the repository!
You can also check out my post sharing & discussing LARS on Reddit.