Building a Semantic Document Search System

In today's data-driven world, organizations are drowning in unstructured information. PDF documents, reports, manuals, and other text-based resources contain valuable knowledge, but accessing this information efficiently remains challenging. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are gaining popularity, not every solution requires the generative AI component.

In this post, I'll walk through how I built a semantic search system for documents that keeps the "retrieval" part of RAG and drops the "generation" part: it returns references to the relevant source passages instead of synthesizing new content.

The Architecture

Our system consists of two primary pipelines:

Document Processing Pipeline

This pipeline handles the ingestion and processing of documents:

PDF Document Collection: The starting point is a repository of PDF documents containing the information we want to make searchable.
Supabase Storage Upload: Documents are uploaded to Supabase Storage, which gives us a single, central location for all source files.
File Parsing via LlamaIndex: We use LlamaIndex to parse the PDFs and extract their content as structured text.
Text Semantic Chunking: A Flask API built on LangChain (hosted on Vercel) divides the document content into semantic chunks, meaning logical sections that preserve context rather than arbitrary fixed-size splits.
Text Embedding Generation: Each chunk is sent to a Flask API serving the Nomic-Embed-Text model, which returns a vector embedding. An embedding is a list of numbers that represents the meaning of the text, so chunks with similar meaning end up close together in vector space.
Dual Storage Strategy:
- We store the text chunks in Supabase, indexed by unique embedding IDs.
- We upload the vector embeddings to Pinecone, a vector database optimized for similarity search.
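
To make the chunk, embed, and store steps concrete, here is a minimal, self-contained sketch. It is not the production code, and several pieces are stand-ins I'm assuming purely for illustration: `chunk_text` packs paragraphs greedily instead of calling LangChain, `embed` is a toy hashed bag-of-words vector instead of Nomic-Embed-Text, and two plain dictionaries play the roles of the Supabase table and the Pinecone index. All function and field names are hypothetical. The part that carries over to the real system is the shared ID that ties each text chunk to its vector.

```python
import hashlib
import math
import re
import uuid

DIM = 64  # toy dimension chosen for this sketch; a real model's dimension differs


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars.

    A simple stand-in for semantic chunking. A single paragraph longer
    than max_chars is kept whole rather than split mid-thought.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def embed(text: str) -> list[float]:
    """Toy hashed bag-of-words vector, L2-normalized. Not a semantic embedding."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def ingest(doc_name: str, text: str, chunk_store: dict, vector_store: dict,
           max_chars: int = 500) -> list[str]:
    """Store each chunk's text and its vector under one shared embedding ID."""
    ids = []
    for position, chunk in enumerate(chunk_text(text, max_chars)):
        embedding_id = str(uuid.uuid4())
        # Text side (Supabase in the real system), keyed by embedding ID.
        chunk_store[embedding_id] = {"source": doc_name, "position": position, "text": chunk}
        # Vector side (Pinecone in the real system), keyed by the same ID.
        vector_store[embedding_id] = embed(chunk)
        ids.append(embedding_id)
    return ids


chunk_store: dict = {}   # stand-in for the Supabase table of text chunks
vector_store: dict = {}  # stand-in for the Pinecone index
sample = "Refunds are issued within 30 days.\n\nShipping takes five business days."
ids = ingest("policy.pdf", sample, chunk_store, vector_store, max_chars=40)
print(len(ids), set(chunk_store) == set(vector_store))
```

Because both stores use the same key, any ID returned by a vector search can be resolved to its text and source document with a single lookup.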

Query Processing Pipeline

This pipeline handles user interactions:

User Query: The process begins when a user submits a text query seeking information.
Query Embedding: The user's query is converted into an embedding using the same Nomic-Embed-Text model, ensuring compatibility with our document embeddings.
Embedding Comparison: Pinecone's Query API compares the query embedding with the stored document embeddings and returns the 2 most semantically similar text chunks.
Reference Display: The system shows these chunks in the UI together with their source information, so users can see which document each result came from.
Results Display: Results are ranked by semantic similarity to the query rather than by keyword matching.
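
The query steps above can be sketched in a few lines. This is an illustration under assumptions, not how Pinecone works internally: I'm assuming cosine similarity as the metric, using hand-made 3-dimensional vectors in place of real embeddings, and using dictionaries in place of the Pinecone index and the Supabase table. The `search` function and its field names are hypothetical; its `top_k` argument mirrors the `top_k` setting of a Pinecone query, set to 2 to match the pipeline.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector: list[float], vector_store: dict, chunk_store: dict,
           top_k: int = 2) -> list[dict]:
    """Rank stored vectors by similarity, then resolve the top IDs to text and source."""
    scored = sorted(
        ((cosine(query_vector, vec), emb_id) for emb_id, vec in vector_store.items()),
        reverse=True,
    )
    results = []
    for score, emb_id in scored[:top_k]:
        record = chunk_store[emb_id]  # same ID in both stores
        results.append({"score": score, "source": record["source"], "text": record["text"]})
    return results


# Hand-made vectors standing in for stored document embeddings.
vector_store = {
    "id-1": [0.9, 0.1, 0.0],
    "id-2": [0.0, 1.0, 0.0],
    "id-3": [0.7, 0.7, 0.0],
}
chunk_store = {
    "id-1": {"source": "refunds.pdf", "text": "Refunds are issued within 30 days."},
    "id-2": {"source": "shipping.pdf", "text": "Shipping takes five business days."},
    "id-3": {"source": "faq.pdf", "text": "Returns and delivery questions."},
}

# Pretend this vector came from embedding the user's query with the same model.
query_vector = [1.0, 0.0, 0.0]
hits = search(query_vector, vector_store, chunk_store)
for hit in hits:
    print(f'{hit["score"]:.3f}  {hit["source"]}  {hit["text"]}')
```

The query must be embedded with the same model as the documents because similarity scores only mean something when both vectors live in the same space.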

Technical Implementation Details

For this implementation, I leveraged several key technologies:

Embedding Model: Nomic-Embed-Text provides high-quality embeddings for both document chunks and user queries.
Vector Database: Pinecone stores and efficiently searches through vector embeddings.
Storage Solution: Supabase stores both the original documents and the text chunks.
Processing Tools: LlamaIndex for document parsing and LangChain for semantic chunking.
Deployment: All API components are deployed on Vercel for reliable scaling.

The Benefits of This Approach

By implementing a "RAG without the AI" approach, we gain several advantages:

Reference Transparency: Users receive direct references to relevant documents rather than AI-generated summaries that might contain hallucinations.
Semantic Understanding: Unlike traditional keyword search, this system understands the meaning behind queries, returning contextually relevant results.
Source Verification: Each result links directly to its source document, enabling users to verify information.
Reduced Complexity: Without the generative component, the system is simpler to implement, debug, and maintain.
Lower Computational Requirements: Embedding a query and running a vector similarity search requires far less compute than generating text with a large language model.

Real-World Applications

This system is particularly valuable for:

Legal Firms: Searching through case law and precedents
Healthcare Organizations: Finding relevant medical documentation
Financial Institutions: Locating specific regulatory guidance
Research Organizations: Discovering relevant papers and findings
Educational Institutions: Connecting students with relevant learning materials

Conclusion

Building a semantic document search system using embedding-based retrieval provides organizations with a powerful tool to unlock the value hidden in their unstructured data. By focusing on the retrieval component without the generative AI aspect, we create a system that:

Delivers accurate, source-verified information
Understands the semantic meaning behind user queries
Scales efficiently with growing document collections
Maintains transparency in information retrieval

For organizations with large collections of documents that need to be searchable by meaning rather than just keywords, this approach offers significant value. It bridges the gap between traditional search and full RAG systems, providing a practical solution for making institutional knowledge accessible without the complexity and potential pitfalls of generative AI.

The next time you're considering implementing a document search solution, remember that sometimes you don't need the "G" in RAG to deliver transformative results.

P.S. Let's Build Something Cool Together!

Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:
- Makes ETL pipelines behave
- Turns data warehouse chaos into zen
- Gets ML models from laptop to production

If you found this post interesting, connect with me on LinkedIn and leave me a message!

Semantic Search Data Engineering Pipeline: RAG Without the AI
