
I am tasked with building a production-level RAG application over CSV files. Possible approaches:

  • Embedding --> vector DB --> take user query --> similarity or hybrid search --> LLM --> result
  • CSV to pandas DataFrame --> ask LLM for Python code to answer the user prompt --> run the query on the df --> give the result to the LLM for analysis --> result
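A minimal sketch of the second pipeline, with both LLM calls stubbed out so it runs standalone. The function names, the hard-coded "generated" pandas code, and the `exec`-based execution are assumptions for illustration only; in production the two `llm_*` functions would call your real model and the generated code would need sandboxing.

```python
import pandas as pd

def llm_generate_pandas_code(question: str, columns: list) -> str:
    # Stand-in for an LLM call that turns the question + schema into a
    # pandas expression. A real prompt would also include dtypes, df.head(), etc.
    return "result = df.loc[df['region'] == 'EU', 'revenue'].sum()"

def llm_analyze(question: str, result) -> str:
    # Stand-in for the second LLM call that explains the raw query result.
    return f"Answer to {question!r}: {result}"

def answer(df: pd.DataFrame, question: str) -> str:
    code = llm_generate_pandas_code(question, list(df.columns))
    scope = {"df": df}
    exec(code, {}, scope)  # never exec untrusted model output unsandboxed!
    return llm_analyze(question, scope["result"])

df = pd.DataFrame({"region": ["EU", "US", "EU"], "revenue": [10, 20, 5]})
print(answer(df, "Total EU revenue?"))
```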

The first approach gives vague answers, since it applies an unstructured-data technique to structured data; the second is doing very well, but I have doubts about its scalability. I need suggestions.

1 Answer


It depends on how many files you have in production and how large they are. Can you describe your use case a bit more?

  • How many CSV files do you expect to have?
  • Are the files related to each other, like tables in a database?

One issue you are going to run into, even with approach #2, is: how do you know which CSV file to load into the dataframe?

One approach that worked well for me, though I haven't used it in a production environment, was:

  1. During indexing, instead of loading the whole CSV into the vector DB, use an LLM to summarize the CSV file. Index the summary, and make sure the file path is included in the document metadata.
  2. At retrieval time, you should get good hits thanks to the table summaries. Find the source CSV from the document's metadata and load that into the dataframe.
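A minimal sketch of these two steps, with a stub summarizer and a naive keyword-overlap retriever standing in for the LLM and the vector store. All function names are illustrative assumptions; in production `summarize_csv` would be an LLM call and `retrieve` a similarity search over embedded summaries.

```python
import re
from pathlib import Path

import pandas as pd

def summarize_csv(path: str) -> str:
    # Stand-in for an LLM-written summary; column names + row count are
    # often enough retrieval signal for small tables.
    df = pd.read_csv(path)
    return f"Table with columns {', '.join(df.columns)}; {len(df)} rows."

def build_index(csv_paths):
    # Index the *summary*, not the rows, and keep the file path in metadata.
    return [{"text": summarize_csv(p), "metadata": {"source": p}} for p in csv_paths]

def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(index, query: str) -> str:
    # Naive keyword-overlap score; in production this is your vector
    # similarity (or hybrid) search.
    best = max(index, key=lambda d: len(tokens(query) & tokens(d["text"])))
    return best["metadata"]["source"]

# Demo with two tiny files written to disk.
Path("sales.csv").write_text("region,revenue\nEU,10\nUS,20\n")
Path("staff.csv").write_text("name,department\nAna,engineering\n")
index = build_index(["sales.csv", "staff.csv"])
source = retrieve(index, "what is the revenue by region?")
df = pd.read_csv(source)  # load only the matching CSV into a dataframe
```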

Obviously this approach might get expensive if you have tons of CSV files.

There are a few other methods, like Chain-of-Table and the Mix Self-Consistency approach, that I have read about but not implemented.


3 Comments

That's great. The CSV files will have approximately 200 to 300 rows, and we may have around 10 to 20 of them, at least for now. We are getting the CSV files from an Oracle endpoint that is managed by other teams; I am tasked with building the RAG end. We also have Pinecone under our umbrella.
Oh, 200-300 rows. You can actually load the whole thing into the context window; most LLMs do a good job of answering questions if you give them a whole CSV. Have you tried just putting the entire CSV in the prompt and asking it questions? The question still remains how you find the right CSV file among the 20-30 you have... you might still have to store the summaries in your vector DB, but instead of loading the CSV file into a dataframe, maybe put the whole file into the prompt.
Since this could grow in production, I was thinking of putting the whole CSV into the LLM, and falling back to the pandas approach if we hit the token limit. Do you think this is doable?
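The fallback discussed in the comments could be sketched as a simple router. The 4-characters-per-token estimate and the budget value below are rough assumptions; in practice you would count tokens with your model's actual tokenizer.

```python
# Route a CSV either into the prompt directly (if it fits) or to the
# pandas code-generation pipeline (if it doesn't).
TOKEN_BUDGET = 8000  # illustrative context budget, not a real model limit

def route(csv_text: str) -> str:
    # ~4 characters per token is a common rough heuristic for English text.
    approx_tokens = len(csv_text) // 4
    return "full_csv_in_prompt" if approx_tokens <= TOKEN_BUDGET else "pandas_codegen"
```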
