I'm building a RAG-based document QA system using Python (no LangChain), LLaMA (50K context), PostgreSQL with pgvector, and Docling for parsing. Users can upload up to 10 large documents (300+ pages each), often containing numerous tables and charts.
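For context, initial retrieval is currently a plain pgvector KNN query. A minimal sketch of that path, assuming psycopg 3 and a `chunks` table with an `embedding vector` column (table and column names are illustrative, not my exact schema):

```python
import psycopg  # psycopg 3

def knn_retrieve(conn: psycopg.Connection, query_vec: list[float], k: int = 20) -> list[tuple[int, str]]:
    """Return the k chunks nearest to query_vec by cosine distance."""
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"  # pgvector text format
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content
            FROM chunks
            ORDER BY embedding <=> %s::vector  -- <=> is pgvector's cosine-distance operator
            LIMIT %s
            """,
            (vec_literal, k),
        )
        return cur.fetchall()
```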
I'm facing a few specific challenges:

- 30K+ total chunks across all docs → KNN retrieval gets noisy.
- Tried LLM-based reranking, but it's too slow and expensive to run over all 30K chunks (sketch of what I tried below).
- Tried summarizing each chunk to improve retrieval, but generating LLM summaries for all 30K sections is too expensive.
- Table chunks are especially difficult:
  - Embeddings perform poorly on structured/numeric data.
  - Summary-style embeddings (e.g. first 300 tokens, or just the heading/caption; see the second sketch below) aren't sufficient for value-level lookups.

Looking for ideas or proven strategies to:

- Improve precision in initial retrieval at scale
- Handle table-heavy content more effectively
- Reduce cost while preserving accuracy
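For reference, the reranking I tried is roughly the pattern below. `llama_complete` is a placeholder for whatever call hits the local LLaMA server; the point is that it's one generation per candidate, which is exactly what blows up at this scale:

```python
def llm_rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score each candidate chunk with the LLM and keep the best top_n."""
    scored = []
    for chunk in candidates:
        prompt = (
            "On a scale of 0-10, how well does this passage answer the question?\n"
            f"Question: {query}\n"
            f"Passage: {chunk}\n"
            "Answer with a single number:"
        )
        # llama_complete is hypothetical; one LLM call per candidate is the cost driver
        score = float(llama_complete(prompt).strip())
        scored.append((score, chunk))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]
```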
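And this is the gist of the summary-style text I embed for table chunks (field names illustrative; "first 300 tokens" is approximated with whitespace tokens here). Any cell value past the cutoff never reaches the embedding, which is why value-level lookups miss:

```python
def table_embed_text(heading: str, caption: str, table_markdown: str, max_tokens: int = 300) -> str:
    """Build the text that gets embedded for a table chunk."""
    # Whitespace split is a cheap stand-in for real tokenization
    tokens = table_markdown.split()
    truncated = " ".join(tokens[:max_tokens])
    return f"{heading}\n{caption}\n{truncated}"
```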
Any ideas, techniques, or tooling (besides LangChain) that worked for you?