I think you need around 100 QA pairs or so.
In some cases, you may need many more than that if your RAG questions are diverse.
Our team is using SOTA models within acceptable security boundaries.
Currently, we are primarily using GPT-4o.
Instead of using the chunked text directly as the contents of the corpus, embed (title) + (summary or metadata) + (chunked text).
It's a good trick to try when improving retrieval performance :)
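To make the tip concrete, here is a minimal sketch (not AutoRAG's internal API; the field names and texts are just illustrative) of composing the string that actually gets embedded:

```python
# Hypothetical helper: prepend the document title and a short summary to each
# chunk before embedding, so the embedding carries document-level context.
def build_embedding_content(title: str, summary: str, chunk: str) -> str:
    """Combine (title) + (summary or metadata) + (chunked text) into one string."""
    return f"{title}\n{summary}\n{chunk}"

corpus = [
    {
        "title": "AutoRAG Guide",
        "summary": "How to evaluate RAG pipelines automatically.",
        "chunk": "AutoRAG runs experiments over retrieval and generation modules...",
    },
]

# Embed the combined string instead of the raw chunk.
texts_to_embed = [
    build_embedding_content(doc["title"], doc["summary"], doc["chunk"])
    for doc in corpus
]
```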
❗But remember, different data will perform differently. ❗
Some preprocessors may improve retrieval performance significantly on certain data, while others may only improve it marginally or even decrease it.
In the end, you'll need to experiment to find the best method for your data.
AutoRAG was created to make these experiments easy and fast, so we recommend using it to run some quick experiments 😁.
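For example, a quick trial only needs a QA/corpus parquet pair and a YAML config (a minimal sketch following the usage pattern from AutoRAG's README; the file paths are placeholders):

```python
from autorag.evaluator import Evaluator

# Point the evaluator at your evaluation QA dataset and your corpus,
# then run every pipeline described in the YAML config as one trial.
evaluator = Evaluator(
    qa_data_path="your/path/to/qa.parquet",
    corpus_data_path="your/path/to/corpus.parquet",
)
evaluator.start_trial("your/path/to/config.yaml")
```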
The more jargon-ridden a domain is, the more important it is to construct a realistic evaluation QA dataset.
Non-experts are less familiar with the jargon and tend to ask vague questions rather than precise ones, so retrieval using a VectorDB
with high semantic similarity may perform better.
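As a rough illustration of why semantic similarity helps here, the sketch below (not AutoRAG code; the model name and texts are illustrative) shows an embedding model matching a vaguely worded question to the relevant chunk despite little keyword overlap:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# A vague, non-expert phrasing with almost no jargon in common with the corpus.
vague_question = "How do I make my search thing find better answers?"
chunks = [
    "Hybrid retrieval combines BM25 keyword scores with dense vector scores.",
    "Improving retrieval performance often requires tuning chunk size and embeddings.",
    "GPT-4o is a multimodal model released by OpenAI.",
]

q_emb = model.encode(vague_question, convert_to_tensor=True)
c_embs = model.encode(chunks, convert_to_tensor=True)

# Cosine similarity still ranks the semantically related chunk highest,
# even though the question shares few exact keywords with it.
scores = util.cos_sim(q_emb, c_embs)[0]
for chunk, score in sorted(zip(chunks, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {chunk}")
```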