Xenoamorphous
Why is there so much buzz about RAG?

Isn’t it basically a traditional search (either keyword-based, vector-based, or a combination of both; embeddings have been around for years) where you take the top N results (usually not even full docs, but chunks, due to context-length limitations) and pass them to an LLM to regurgitate a response (hopefully without hallucinations), instead of simply listing the results right away? I think some implementations also ask the LLM to rewrite the user query to “capture the user intent”.
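
In other words, isn't the whole thing roughly this (a sketch of my mental model; `search` and `llm` are just stand-ins for whatever retrieval backend and model you plug in)?

    def rag_answer(query, search, llm, n=5):
        """Retrieve the top-N chunks, stuff them into a prompt, and let the
        LLM write the answer. `search` and `llm` are placeholders here."""
        chunks = search(query, top_n=n)
        prompt = (
            "Answer using only this context:\n"
            + "\n---\n".join(chunks)
            + "\n\nQuestion: " + query
        )
        return llm(prompt)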

What am I missing here? What makes it so useful?

7thpower
This is a great intro. I am amazed how many people don’t use the LLMs to analyze the questions themselves and apply filters to avoid pulling back irrelevant documents in the first place.
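
A rough sketch of what that can look like (the prompt and filter fields here are invented purely for illustration):

    import json

    def extract_filters(question, llm):
        """Ask the LLM for structured filters (doc type, date range, ...) so
        the retriever never pulls back obviously irrelevant documents."""
        prompt = (
            "Return JSON with optional keys 'doc_type', 'date_from' and 'date_to' "
            "describing filters for this question:\n" + question
        )
        return json.loads(llm(prompt))

    # e.g. extract_filters("2023 invoices from Acme", llm)
    #  -> {"doc_type": "invoice", "date_from": "2023-01-01", "date_to": "2023-12-31"}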

We run as many methods as practical in parallel (sql, vector, full text, other methods, etc.) and return the first one that meets our threshold. Vector search is almost never the winner relative to full text.

Instead, I see a lot of people in sister companies using the most robust models they can find and having agents do chain-of-thought, while their users are wondering when, if ever, they'll get a response back.
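
Very roughly, the parallel-with-threshold approach is something like this (simplified; the retrievers and threshold value are placeholders, not our actual stack):

    import concurrent.futures

    THRESHOLD = 0.8  # minimum retrieval-quality score we accept (placeholder)

    # Placeholder retrievers: each returns (results, score). In practice these
    # would hit full-text search, SQL, a vector index, etc.
    def full_text_search(query):
        return (["full-text hit for " + query], 0.9)

    def vector_search(query):
        return (["vector hit for " + query], 0.6)

    def retrieve(query, retrievers=(full_text_search, vector_search)):
        """Run every retriever in parallel and return the first result set
        whose score clears the threshold."""
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = [pool.submit(r, query) for r in retrievers]
            for future in concurrent.futures.as_completed(futures):
                results, score = future.result()
                if score >= THRESHOLD:
                    return results
        return []  # nothing met the bar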

danenania
This all seems pretty sensible. Another area that would be nice to see addressed is strategies for balancing latency/cost/performance when data is frequently updated. I'm building a terminal-based AI coding tool[1] and have been thinking about how to bring RAG into the picture, as it clearly could add value, but the tradeoffs are tricky to get right.

The options, as far as I can tell, are:

- Re-embed lazily as needed at prompt-time. This should be the cheapest as it minimizes the number of embedding calls, but it's the most expensive in terms of latency.

- Re-embed eagerly after updates (perhaps with some delay and throttling to avoid rapid-fire embed calls). Great for latency, but can get very expensive.

- Some combination of the above two options. This seems to be what many IDE-based AI tools like GH Copilot are doing. An issue with this approach is that it's hard to ever know for sure what's updated in the RAG index and what's stale, and what exactly is getting added to context at any given time.

I'm leaning toward the first option (lazy on-demand embedding) and letting the user decide whether the latency cost is worth it for their task vs. just manually selecting the exact context they want to load.
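
For what it's worth, the lazy version can be as simple as hashing file contents and only re-embedding on a mismatch at prompt time (a minimal sketch; `embed_text` and the in-memory index are stand-ins, not how any particular tool does it):

    import hashlib, pathlib

    index = {}  # path -> {"hash": ..., "embedding": ...}

    def embed_text(text):
        # Stand-in for a real embedding API call.
        return [float(len(text))]

    def get_embedding(path):
        """At prompt time, re-embed a file only if its content hash changed."""
        text = pathlib.Path(path).read_text()
        digest = hashlib.sha256(text.encode()).hexdigest()
        entry = index.get(path)
        if entry is None or entry["hash"] != digest:  # missing or stale
            index[path] = {"hash": digest, "embedding": embed_text(text)}
        return index[path]["embedding"]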

1 - https://github.com/plandex-ai/plandex

hdlothia
So the part of RAG that's tripping me up right now is vector search and similarity scores. Does anyone have a good resource to learn more about this?

I've been using this as a starter: https://developers.cloudflare.com/workers-ai/tutorials/build... I put in text, but I feel like my conception of what should get high relevancy scores doesn't match the percentages that come out.
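
As far as I can tell, the score is usually just cosine similarity between the query embedding and each chunk embedding, something like this (toy example with made-up vectors, not the Workers AI internals):

    import math

    def cosine_similarity(a, b):
        """Cosine similarity: 1.0 = same direction, 0 = unrelated, -1 = opposite."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    query_vec = [0.1, 0.9, 0.2]  # pretend 3-dimensional embeddings
    chunk_vec = [0.2, 0.8, 0.1]
    print(cosine_similarity(query_vec, chunk_vec))  # ~0.99, even though the vectors differ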

The article talks about full-text search and metadata, so maybe that's the path I should be taking instead of vector search? Where would I store the metadata in this case? A regular db?

I wish articles like this would go into more detail about the nitty-gritty. But I appreciate the high-level overview in the article as well.

psynister
Most of this can be done automatically using https://vectorize.io

It generates synthetic questions, tests different embedding models, chunking strategies, etc. You end up with clear data that shows you what will give you the optimal results for your RAG app: https://platform.vectorize.io/public/experiments/ca60ce85-26...
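
The underlying idea (independent of how Vectorize implements it) is to generate a question per chunk with an LLM and then check whether retrieval brings back the chunk the question came from:

    def evaluate_retrieval(chunks, make_question, retrieve, k=5):
        """Synthetic-question eval: for each chunk, generate a question from it
        and check whether the retriever returns that chunk in the top k.
        `make_question` and `retrieve` are placeholders for an LLM call and
        the pipeline under test (embedding model + chunking strategy)."""
        hits = 0
        for chunk in chunks:
            question = make_question(chunk)
            results = retrieve(question, k=k)
            if chunk in results:
                hits += 1
        return hits / len(chunks)  # recall@k for this pipeline configuration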

demilich
Try RAPTOR: https://arxiv.org/html/2401.18059v1

An implementation: github.com/infiniflow/ragflow

satisfice
Not a lot of content here.
ofermend
Building RAG can be easy for a simple example, but it's much more nuanced than you might think when you try to do it at larger scale.

With larger-scale, real-world enterprise RAG-based applications, you soon realize the enormous time and effort required to experiment with all these levers to optimize the RAG pipeline: which vector DB to use and how, which embedding model to use, pure vector search or hybrid search, chunking strategies, and on and on.
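
Hybrid search is a good example of one of those levers; a common way to combine keyword and vector results is reciprocal rank fusion (a generic sketch, not necessarily what any particular vendor does):

    def reciprocal_rank_fusion(result_lists, k=60):
        """Merge ranked result lists (e.g. one from keyword search, one from
        vector search) by summing 1 / (k + rank) for each document."""
        scores = {}
        for results in result_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Toy example: doc "b" ranks well in both lists, so it comes out on top.
    keyword_hits = ["a", "b", "c"]
    vector_hits = ["b", "d", "a"]
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # ['b', 'a', 'd', 'c']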

With Vectara's RAG-as-a-service (www.vectara.com) we try to address exactly this issue: you get an optimized, high-performance, secure, and scalable RAG pipeline, so you don't need to go through this massive hyper-parameter tuning exercise. Yes, there are still some very useful levers you can experiment with, but only where it really matters.