Semantic Search Data Engineering Pipeline: RAG Without the AI
In this blog, we demonstrate how to build a semantic search pipeline using the Pinecone vector database, Python Flask endpoints, and Next.js.

Building a Semantic Document Search System
In today's data-driven world, organizations are drowning in unstructured information. PDF documents, reports, manuals, and other text-based resources contain valuable knowledge, but accessing this information efficiently remains challenging. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are gaining popularity, not every solution requires the generative AI component.
In this post, I'll walk through how I built a powerful semantic search system for documents that captures the "retrieval" part of RAG without the "generation" component - providing accurate document references without synthesizing new content.
The Architecture
Our system consists of two primary pipelines:
Document Processing Pipeline
This pipeline handles the ingestion and processing of documents:
PDF Document Collection: The starting point is a repository of PDF documents containing the information we want to make searchable.
Supabase Storage Upload: Documents are uploaded to Supabase storage, providing a centralized location for all our documents.
File Parsing via Llama Index: We utilize Llama Index to extract and structure the content from our PDFs. This tool effectively transforms unstructured documents into structured content.
Text Semantic Chunking: Using a Flask API built on LangChain (hosted on Vercel), we divide the document content into semantic chunks - logical sections that preserve context rather than arbitrary splits.
Text Embedding Generation: Each chunk is sent to a Flask API that serves Nomic-Embed-Text to generate vector embeddings. These embeddings represent the semantic meaning of the text as numeric vectors.
Dual Storage Strategy:
We store the text chunks in Supabase, indexed by unique embedding IDs, so that a match returned by Pinecone can be mapped back to its text.
We upload the vector embeddings to Pinecone, a vector database optimized for similarity search.
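To make the ingestion steps above concrete, here is a minimal Python sketch of chunking, embedding, and the dual storage step. It is not the pipeline's actual code: `chunk_text` packs paragraphs by length as a simple stand-in for LangChain's semantic chunking, `toy_embed` is a hashed bag-of-words vector rather than Nomic-Embed-Text, and the two dictionaries stand in for the Supabase table and the Pinecone index. All function and field names here are assumptions made for illustration. The part that carries over to a real setup is the shared embedding ID that ties each stored chunk to its vector.

```python
import hashlib
import math
import uuid

EMBED_DIM = 64  # toy dimension chosen for the sketch, not the real model's size


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly max_chars.

    Stand-in for semantic chunking. A single paragraph longer than
    max_chars is kept whole rather than split mid-sentence.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def toy_embed(text: str, dim: int = EMBED_DIM) -> list[float]:
    """Hashed bag-of-words vector, L2-normalised. NOT a semantic model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def ingest(document_name: str, text: str, chunk_store: dict, vector_store: dict) -> list[str]:
    """Chunk a document, embed each chunk, and write both stores under one ID."""
    ids = []
    for position, chunk in enumerate(chunk_text(text)):
        embedding_id = str(uuid.uuid4())
        # Text store (Supabase stand-in): chunk text plus source info for display.
        chunk_store[embedding_id] = {
            "text": chunk,
            "source": document_name,
            "position": position,
        }
        # Vector store (Pinecone stand-in): same ID, vector only.
        vector_store[embedding_id] = toy_embed(chunk)
        ids.append(embedding_id)
    return ids


chunk_store, vector_store = {}, {}
sample = "Refunds are issued within 14 days.\n\nReturned items must be unused."
new_ids = ingest("policy.pdf", sample, chunk_store, vector_store)
print(len(new_ids), "chunk(s) stored under shared IDs")
```

Keeping the text out of the vector store and joining on the ID at query time is one reasonable reading of the dual storage strategy; storing the chunk text as vector metadata instead would be an alternative design.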
Query Processing Pipeline
This pipeline handles user interactions:
User Query: The process begins when a user submits a text query seeking information.
Query Embedding: The user's query is converted into an embedding using the same Nomic-Embed-Text model, ensuring compatibility with our document embeddings.
Embedding Comparison: Pinecone's Query API compares the query embedding with stored document embeddings, returning the top 2 most semantically similar text chunks.
Reference Display: The system displays these references in the UI along with source information, helping users understand where the information originated.
Results Display: Finally, the system presents the retrieved chunks ranked by semantic relevance rather than keyword matching.
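The query steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the deployed code: it scans every stored vector and ranks by cosine similarity, whereas Pinecone runs the search server-side with the metric configured on the index. The vectors are hand-written three-dimensional values rather than real embeddings, and the IDs, texts, and file names are made up. What it shows is the shape of the flow: score the query vector against stored vectors, keep the top 2, and join the matches back to their text and source by ID.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, vector_store, chunk_store, top_k=2):
    """Return the top_k most similar chunks, each with its text and source."""
    scored = sorted(
        ((cosine_similarity(query_vector, vec), emb_id)
         for emb_id, vec in vector_store.items()),
        reverse=True,  # highest similarity first
    )
    return [
        {"id": emb_id, "score": score, **chunk_store[emb_id]}
        for score, emb_id in scored[:top_k]
    ]


# Illustrative data only: tiny hand-written vectors keyed by embedding ID.
vector_store = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [0.7, 0.3, 0.1],
}
chunk_store = {
    "chunk-a": {"text": "Refunds are issued within 14 days.", "source": "policy.pdf"},
    "chunk-b": {"text": "Offices close on public holidays.", "source": "handbook.pdf"},
    "chunk-c": {"text": "Returned items must be unused.", "source": "policy.pdf"},
}

results = search([1.0, 0.0, 0.0], vector_store, chunk_store)
for hit in results:
    print(f"{hit['score']:.3f}  {hit['source']}  {hit['text']}")
```

With this toy data the two chunks from policy.pdf rank first, and each result carries the source name the UI needs for the reference display.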
Technical Implementation Details
For this implementation, I leveraged several key technologies:
Embedding Model: Nomic-Embed-Text provides high-quality embeddings for both document chunks and user queries.
Vector Database: Pinecone stores and efficiently searches through vector embeddings.
Storage Solution: Supabase stores both the original documents and the text chunks.
Processing Tools: Llama Index for document parsing and LangChain for semantic chunking.
Deployment: All API components are deployed on Vercel for reliable scaling.
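As a rough idea of how these pieces could meet at the API layer, here is a framework-agnostic sketch of a search handler that a Flask route might wrap and a Next.js front end might call. Everything in it is hypothetical: the `handle_search` name, the `query` and `matches` fields, and the two stub functions that replace the embedding service and the Pinecone query. The scores in the stub are invented placeholders, not measurements. Passing the embedding and index calls in as arguments keeps the handler testable without network access.

```python
from typing import Callable

Vector = list[float]


def handle_search(
    body: dict,
    embed: Callable[[str], Vector],
    query_index: Callable[[Vector, int], list[dict]],
    top_k: int = 2,
) -> tuple[dict, int]:
    """Validate a search request; return (JSON-serialisable payload, HTTP status)."""
    query = body.get("query") if isinstance(body, dict) else None
    if not isinstance(query, str) or not query.strip():
        return {"error": "Field 'query' must be a non-empty string."}, 400
    query = query.strip()
    # Embed the query, then ask the vector index for the closest chunks.
    matches = query_index(embed(query), top_k)
    return {"query": query, "matches": matches}, 200


# Stubs so the sketch runs on its own; a real service would call the
# embedding API and the vector database here instead.
def fake_embed(text: str) -> Vector:
    return [float(len(text)), 0.0]


def fake_query_index(vector: Vector, top_k: int) -> list[dict]:
    hits = [  # placeholder scores, invented for the example
        {"id": "chunk-a", "score": 0.91, "source": "policy.pdf"},
        {"id": "chunk-c", "score": 0.84, "source": "policy.pdf"},
        {"id": "chunk-b", "score": 0.12, "source": "handbook.pdf"},
    ]
    return hits[:top_k]


payload, status = handle_search({"query": "refund window"}, fake_embed, fake_query_index)
print(status, payload)
```

Returning a payload and status pair keeps the handler independent of any web framework, so the same function can be exercised in unit tests and then exposed through whichever route decorator the deployment uses.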
The Benefits of This Approach
By implementing a "RAG without the AI" approach, we gain several advantages:
Reference Transparency: Users receive direct references to relevant documents rather than AI-generated summaries that might contain hallucinations.
Semantic Understanding: Unlike traditional keyword search, this system understands the meaning behind queries, returning contextually relevant results.
Source Verification: Each result links directly to its source document, enabling users to verify information.
Reduced Complexity: Without the generative component, the system is simpler to implement, debug, and maintain.
Lower Computational Requirements: Vector similarity search requires fewer resources than running large language models; the only model call at query time is embedding the query.
Real-World Applications
This system is particularly valuable for:
Legal Firms: Searching through case law and precedents
Healthcare Organizations: Finding relevant medical documentation
Financial Institutions: Locating specific regulatory guidance
Research Organizations: Discovering relevant papers and findings
Educational Institutions: Connecting students with relevant learning materials
Conclusion
Building a semantic document search system using embedding-based retrieval provides organizations with a powerful tool to unlock the value hidden in their unstructured data. By focusing on the retrieval component without the generative AI aspect, we create a system that:
Delivers accurate, source-verified information
Understands the semantic meaning behind user queries
Scales efficiently with growing document collections
Maintains transparency in information retrieval
For organizations with large collections of documents that need to be searchable by meaning rather than just keywords, this approach offers significant value. It bridges the gap between traditional search and full RAG systems, providing a practical solution for making institutional knowledge accessible without the complexity and potential pitfalls of generative AI.
The next time you're considering implementing a document search solution, remember that sometimes you don't need the "G" in RAG to deliver transformative results.
P.S. Let's Build Something Cool Together!
Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:
Makes ETL pipelines behave
Turns data warehouse chaos into zen
Gets ML models from laptop to production
If you found this blog interesting, connect with me on LinkedIn and make sure to leave a message!



