How We Built a RAG System That Actually Works: Lessons from the Trenches at Navatech

A 3-month sprint of building, breaking, and rebuilding an enterprise search system that doesn't hallucinate


The Problem That Started It All

LLMs hallucinate with terrifying confidence. Test one on crane operation limits, and it might tell you "operate normally in winds up to 35 mph" when your manual's actual limit is 20 mph. The LLM isn't lying; it's pattern-matching from whatever training data it saw, which could be from different equipment, different manufacturers, or pure statistical interpolation.

In construction, where exceeding wind limits has killed operators, these hallucinations aren't theoretical—they're lethal. And that's exactly why we at Navatech decided to build a Retrieval-Augmented Generation (RAG) system from scratch.

If you're not familiar with the term, think of RAG as giving an AI assistant a library card. Instead of making things up based on what it "thinks" it knows, it looks up information from trusted documents before answering. Simple concept, nightmare to implement at scale.

Why Not Just Use ChatGPT? (And Why Most RAG Systems Fail)

"Why not just upload docs to ChatGPT?" - Every PM ever

Here's why:

  1. Source Attribution: When ChatGPT says "install scaffolding at 45-degree angles," which manual is that from? Which version? Which page?
  2. Faithfulness: In construction safety, being 95% correct means 5% of your workers are at risk
  3. Document Control: Your safety manuals update quarterly. ChatGPT's knowledge doesn't
  4. Compliance: "An AI told me to" doesn't hold up in court. "Section 3.2 of the certified safety manual says" does

"Okay, then we'll just build a quick RAG system with Pinecone!" - Also every PM

Let me save you three months of pain by sharing what doesn't work:

The "Quick and Dirty" RAG Approach That Fails

What people do:

The "tutorial" approach
pdf_text = extract_text(pdf)
chunks = split_by_tokens(pdf_text, 1000)
embeddings = get_embeddings(chunks)
pinecone.upsert(embeddings)  # Ship it!

Why it fails spectacularly:

Real example from our first attempt: Query: "What's the max load for Type A scaffolds?"

What the system returned:

"Type A Type B Type C 225 450 675 kg/m² kg/m² kg/m² Light Medium Heavy Duty Duty Duty"

The table structure was completely destroyed. The AI had no idea which number belonged to which type. This isn't just wrong—it's dangerous.

Common Mistakes We See (And Made):

  1. Dumping Raw PDFs: Tables become word soup, images disappear, context vanishes
  2. Ignoring Structure: Retrieved section 4.2.1 without knowing it's under "Emergency Procedures"
  3. Generic Embeddings: System thinks "crane" is a bird, not construction equipment
  4. No Document Control: Mixes 2019 procedures with 2024 updates
  5. No Evaluation: "Ship it and see what happens" = lawsuits

The "Just Use Pinecone" Trap

Don't get me wrong—Pinecone is great. But it's a vector database, not a RAG system. It's like having a Ferrari engine but no car. The vector store is 10% of the solution. The other 90%:

  • Document preprocessing (40%)
  • Chunking strategy (20%)
  • Retrieval logic (15%)
  • Generation controls (15%)

We learned this the hard way. Week 1 looked like:

  • ✅ Set up Pinecone (2 hours)
  • ✅ Ingested 1000 PDFs (4 hours)
  • ❌ Tested with real queries (disaster)
  • 😱 "Maybe RAG doesn't work for construction?"

Then we built it right. Which is what this blog is about.


What We Were Up Against

When we started this project at Navatech, our construction clients were drowning in documents:

  • 100,000+ safety manuals, method statements, and RAMS (Risk Assessment Method Statements)
  • Building codes that changed with each jurisdiction
  • Mixed formats: PDFs with hand-drawn diagrams, Word docs with track changes from 15 different reviewers, Excel sheets containing critical load calculations

Our mandate was clear: Build a system that could answer questions about these documents with 99.9% accuracy, handle thousands of queries per hour, and never, ever make things up.


Chapter 1: The Document Preprocessing Nightmare

The Reality Check

Our first wake-up call came when we tried to process a seemingly simple PDF. It was a 200-page equipment manual with:

  • Tables that spanned multiple pages (with headers only on the first page)
  • Diagrams with text annotations scattered around them
  • Footnotes that referenced other footnotes
  • Scanned pages mixed with digital text

Traditional PDF parsers either crashed or produced gibberish. One memorable output turned a safety warning into a recipe for disaster by merging it with an unrelated table about temperature settings.

What Actually Worked

After testing 5+ document parsing libraries, we settled on a hybrid approach:

  1. Unstructured.io for the heavy lifting—it understood document structure better than anything else we tried
  2. BeautifulSoup for fine-grained control over the converted HTML/XML
  3. Custom parsers for specific document types.

But here's the key insight: We converted everything to Markdown.

Why Markdown? Let me show you:

## Scaffold Erection Procedures

### Table 3.1: Load Limits by Platform Type
| Platform Type | Max Load (kg/m²) | Safety Factor |
|---------------|------------------|---------------|
| Light Duty | 225 | 4:1 |
| Medium Duty | 450 | 4:1 |
| Heavy Duty | 675 | 4:1 |

**Critical**: Never exceed 75% of maximum rated load.

### Installation Requirements
- Competent person inspection required
- Base plates on firm foundation
- Cross bracing every 20 feet vertically

This format preserved the relationships between elements. A load table and its safety warnings stayed together. Cross-references to inspection requirements remained intact.
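To make the pipeline concrete, here's a stripped-down sketch of that conversion step using Unstructured's partition_pdf plus BeautifulSoup for the tables. It's illustrative rather than our production parser (which adds the custom per-document-type handling mentioned above):

from bs4 import BeautifulSoup
from unstructured.partition.pdf import partition_pdf

def html_table_to_markdown(html):
    # Unstructured returns reconstructed tables as HTML; convert them to a Markdown table
    rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in BeautifulSoup(html, "html.parser").find_all("tr")]
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    md = ["| " + " | ".join(header) + " |",
          "|" + "|".join(["---"] * len(header)) + "|"]
    md += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(md)

def pdf_to_markdown(path):
    # hi_res layout analysis + table inference keeps table structure instead of word soup
    elements = partition_pdf(filename=path, strategy="hi_res", infer_table_structure=True)
    lines = []
    for el in elements:
        if el.category == "Title":
            lines.append(f"## {el.text}")
        elif el.category == "Table" and el.metadata.text_as_html:
            lines.append(html_table_to_markdown(el.metadata.text_as_html))
        else:
            lines.append(el.text)
    return "\n\n".join(lines)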

The Learning Moment

One client had a critical procedure split across three pages, with a diagram in the middle. Our first parser treated each page as a separate document. An employee following the AI's advice would have skipped crucial safety steps. That near-miss taught us: Context preservation isn't optional—it's everything.


Chapter 2: The Art and Science of Chunking

What Chunking Means (In Human Terms)

Imagine you're creating a set of index cards from a textbook. Each card needs to:

  • Make sense on its own
  • Not be so long that it's unwieldy
  • Not be so short that it's useless

That's chunking for RAG systems.

Our Journey Through Chunking Strategies

Attempt 1: The Naive Approach

We started by splitting documents every 1000 characters. The results were... educational. We'd get chunks like:

"...and under no circumstances should you mix Chemical A with Chemical B as this will cause an expl"

The next chunk started with "osion." Not ideal for a safety manual.

Attempt 2: The Page-Based Method

"Let's just use page boundaries!" we thought. Then we discovered that one client's PDFs had been created by scanning double-sided documents incorrectly. Page 1 was the cover, page 2 was the back of page 50, page 3 was page 2... You get the picture.

Attempt 3: The Breakthrough

We realized documents have natural boundaries—headings, sections, paragraphs. Our tag-based semantic chunking with BeautifulSoup was born:

from bs4 import BeautifulSoup

MAX_CHUNK_SIZE = 2000  # characters; tune for your embedding model

def semantic_chunk(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    chunks = []
    current_chunk = {
        'content': '',
        'metadata': {}
    }

    for element in soup.find_all(['h1', 'h2', 'h3', 'p', 'table', 'ul', 'ol']):
        if element.name in ['h1', 'h2', 'h3']:
            # Save previous chunk if it exists
            if current_chunk['content']:
                chunks.append(current_chunk)
            # Start new chunk with heading
            current_chunk = {
                'content': element.get_text(),
                'metadata': {
                    'heading_level': element.name,
                    'section': element.get_text()
                }
            }
        elif element.name == 'table':
            # Tables stay together with their section
            table_md = convert_table_to_markdown(element)  # our helper: HTML table -> Markdown
            if len(current_chunk['content']) + len(table_md) > MAX_CHUNK_SIZE:
                chunks.append(current_chunk)
                current_chunk = {'content': table_md, 'metadata': {'has_table': True}}
            else:
                current_chunk['content'] += '\n' + table_md
        else:
            # Add to current chunk
            current_chunk['content'] += '\n' + element.get_text()

    # Don't drop the final chunk
    if current_chunk['content']:
        chunks.append(current_chunk)
    return chunks

The Table Problem (And Our Solution)

Tables were our nemesis. A 50-row table about chemical properties can't fit in a single chunk, but splitting it randomly loses meaning. Our solution:

  1. Keep small tables intact (under 20 rows)
  2. For large tables, split by logical groups (e.g., chemicals A-M, N-Z)
  3. Always include headers in every chunk
  4. Add context about what was omitted ("Rows 21-50 available in next chunk")
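Here's roughly what that looks like in code, a simplified sketch that splits by row count (in production we grouped rows logically, e.g. chemicals A-M and N-Z, but the header-repetition and omission-note logic is the same):

def split_markdown_table(header_row, separator_row, data_rows, max_rows=20):
    # Small tables stay intact
    if len(data_rows) <= max_rows:
        return ["\n".join([header_row, separator_row] + data_rows)]

    chunks = []
    for start in range(0, len(data_rows), max_rows):
        group = data_rows[start:start + max_rows]
        note = (f"(Rows {start + 1}-{start + len(group)} of {len(data_rows)}; "
                f"remaining rows continue in adjacent chunks)")
        # Every chunk repeats the header so values never lose their column names
        chunks.append("\n".join([note, header_row, separator_row] + group))
    return chunks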

The Clever Bit: Chunk Small, Retrieve Big

Here's where we got smart. We chunk small (500 tokens) for precise retrieval, but we return the entire page to the LLM. Why?

In construction safety, context is everything:

  • A procedure's warnings might be 3 paragraphs away
  • Diagrams often explain the text
  • Exception clauses hide in footnotes

So our approach:

  1. Index small chunks for accurate semantic search
  2. Store page boundaries in metadata
  3. Return full pages that contain the matching chunks

This way, when someone asks "How do I install guardrails?", they get the complete procedure, not just the paragraph mentioning guardrails.
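In code, the retrieval side of this is a small mapping step. A minimal sketch, assuming each indexed chunk carries source_document and page_numbers metadata (as ours does) and that page_store holds full page text keyed by document and page (an illustrative in-house structure, not a library API):

def retrieve_full_pages(query, search_index, page_store, top_k=5):
    # 1. Search against the small (~500-token) chunks for precision
    hits = search_index.search(query, top_k=top_k)

    # 2. Map each matching chunk back to its source pages via metadata
    page_keys = set()
    for hit in hits:
        doc = hit["metadata"]["source_document"]
        for page in hit["metadata"]["page_numbers"]:
            page_keys.add((doc, page))

    # 3. Return full pages so warnings, diagram captions, and footnotes come along for the ride
    return [page_store[key] for key in sorted(page_keys)]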


Chapter 3: Embeddings - Teaching Computers to Understand Meaning

The Non-Technical Explanation

Embeddings are like creating a "meaning fingerprint" for text. Similar meanings get similar fingerprints. It's how the system knows that "fire extinguisher location" and "where to find emergency fire suppression equipment" are asking about the same thing.

Our Embedding Model Journey

Months 1-2: OpenAI's text-embedding-ada-002

  • Pros: Reliable, well-documented
  • Cons: Struggled with technical jargon
  • Memorable failure: Thought "PCB disposal" (Polychlorinated Biphenyls) was about computer circuit boards

Month 3: Google's Embedding Models

  • Pros: Fantastic with technical content
  • Cons: Rate limits hit us every hour (not just peak times!)
  • The breaking point: "You have exceeded your quota" became our most common error message

Month 3, round two: OpenAI's text-embedding-3

  • The goldilocks solution: Good enough performance, rock-solid reliability
  • Only 1% worse than Google on our benchmarks, but actually available when we needed it

The Jargon Problem

Construction sites have their own language:

  • "RAMS" (Risk Assessment Method Statements)
  • "PTW" (Permit to Work)
  • "SWMS" (Safe Work Method Statements)
  • "SWL" (Safe Working Load) vs "WLL" (Working Load Limit)

Standard embedding models had never seen these terms used correctly. Our solution? We created a glossary preprocessing step:

User query: "Where's the RAMS for working at height?"
Expanded query: "Where's the RAMS (Risk Assessment Method Statement) for working at height elevated work?"

This simple trick improved retrieval accuracy by 15%.
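The expansion step itself is deliberately boring. A minimal sketch (the glossary below is a tiny excerpt; ours covers a few hundred construction terms, plus synonym expansion such as adding "elevated work" next to "working at height"):

GLOSSARY = {
    "RAMS": "Risk Assessment Method Statement",
    "PTW": "Permit to Work",
    "SWMS": "Safe Work Method Statement",
    "SWL": "Safe Working Load",
    "WLL": "Working Load Limit",
}

def expand_query(query):
    # Append the full form after each known acronym so both versions get embedded
    expanded = query
    for acronym, meaning in GLOSSARY.items():
        if acronym in query.replace("?", "").split():
            expanded = expanded.replace(acronym, f"{acronym} ({meaning})")
    return expanded

# expand_query("Where's the RAMS for working at height?")
# -> "Where's the RAMS (Risk Assessment Method Statement) for working at height?"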


Chapter 4: Building the Data Pipeline (Or: How I Learned to Stop Worrying and Love Parallel Processing)

The Scale Challenge

Processing 100,000 documents sounds abstract until you do the math:

  • Average processing time per document: 3.6 seconds
  • Sequential processing time: 100 hours
  • Client's patience: 4 hours

Parallel Processing Adventures

Our first attempt at parallelization was... enthusiastic. We spawned 1000 threads and promptly crashed our servers. The Azure API started returning 429 errors (too many requests), and our monitoring dashboard looked like a Christmas tree.

Here's what actually worked:

# The sweet spot we found
import time
from concurrent.futures import ProcessPoolExecutor

WORKER_PROCESSES = 16
BATCH_SIZE = 100
RATE_LIMIT_DELAY = 0.1  # seconds between batches

# Process documents in controlled batches
# (document_batches, process_doc, and upload_to_azure are our own pipeline helpers)
with ProcessPoolExecutor(max_workers=WORKER_PROCESSES) as executor:
    for batch in document_batches:
        futures = [executor.submit(process_doc, doc) for doc in batch]
        results = [f.result() for f in futures]

        # Respect the API's feelings
        time.sleep(RATE_LIMIT_DELAY)
        upload_to_azure(results)

The Metadata Bug That Changed Everything

Three months in, we noticed something odd. Our retrieval was working fine—we were getting the right chunks back. But our relevance scores were terrible. Why?

After 12 hours of debugging, we found it: LlamaIndex was embedding our metadata along with the content. Every chunk was being embedded with 7,000 tokens of invisible text:

chunk_id: doc_12345_chunk_67
source_document: scaffold_safety_manual_v2.pdf
page_numbers: 45-47
last_modified: 2024-03-15
document_type: safety_manual
...actual 500 tokens of content here...

The fix was surprisingly simple:

from llama_index.core.schema import TextNode  # llama_index >= 0.10; older versions import from llama_index.schema

def safe_get_content(self, metadata_mode=None) -> str:
    return self.text  # Return ONLY the text, ignore metadata

TextNode.get_content = safe_get_content  # Monkey patch

The impact was massive:

  • Embedding size: 7,000 tokens → 500 tokens
  • Embedding cost: Down 93%
  • Relevance scores: Up 10%
  • Confidence in results: Through the roof

By removing all that noise, our embeddings finally captured what actually mattered—the content.


Chapter 5: Retrieval - Finding Needles in a Digital Haystack

The Hybrid Approach

Think of retrieval like looking for a book in a library. You might:

  1. Remember exact words from the title (sparse retrieval)
  2. Remember what it was about (dense retrieval)

Best results? Use both.

Azure Search: Our Unexpected Hero

We initially wanted to build our own vector database. "How hard could it be?" (Narrator: It was very hard.)

Azure AI Search saved us months of work:

  • Handled both vector and keyword search
  • Scaled to millions of documents without breaking a sweat
  • Built-in security filters (crucial for enterprise use)

The Bug That Made Us Question Reality

Azure's Python SDK had a subtle bug in hybrid search. It would silently fail and return only keyword results. For weeks, we thought our embeddings were broken. The fix? Direct API calls with the full payload:

import requests

payload = {
    "search": query_text,  # Keyword search. "" for vector only
    "vectorQueries": [
        {
            "kind": "vector", 
            "vector": query_embeddings, 
            "fields": "embedding", 
            "k": top_k * 6,  # Get more candidates for reranking
            "exhaustive": True,
        }
    ],
    "filter": filter_expression,
    "queryType": "semantic",
    "semanticConfiguration": "default-config",
    "captions": "extractive",
    "answers": "extractive|count-" + str(top_k),
    "top": top_k 
}

response = requests.post(
    f"{endpoint}/indexes/{index}/docs/search?api-version=2023-11-01",
    headers={"api-key": api_key},
    json=payload
)

The exhaustive: True flag was crucial—it ensured we searched the entire index, not just a sample.

Document-Specific Queries

Users often wanted answers from specific documents:

  • "What does the crane manual say about wind limits?"
  • "Check the 2024 building code for foundation requirements"

We added semantic document filtering:

def handle_document_specific_query(query, documents_mentioned):
    # get_embeddings, semantic_search_documents, and search_with_filter are our own helpers
    # Use embeddings to find the most relevant document
    doc_embeddings = get_embeddings(documents_mentioned)

    # Semantic search over document names in the index
    relevant_docs = semantic_search_documents(doc_embeddings)

    # Add an OData filter so retrieval stays inside that document
    filter_expression = f"document_name eq '{relevant_docs[0]}'"

    return search_with_filter(query, filter_expression)

This let users naturally reference documents without knowing exact filenames.


Chapter 6: Generation - Making the AI Actually Helpful

The Context Window Challenge

LLMs have a context limit (think of it as their "working memory"). GPT-4o's window is technically large (128K tokens), but in practice we budgeted roughly 8,000 tokens (about 6,000 words) per request. Sounds like a lot until you're trying to include:

  • User's question
  • Conversation history
  • 10 retrieved document chunks
  • System instructions

Our solution was ruthless prioritization:

def prepare_context(query, retrieved_chunks, history):
    # system_prompt, MAX_TOKENS, and count_tokens are defined elsewhere in our pipeline;
    # retrieved_chunks is a list of {'text': ..., 'relevance': ...} dicts
    # Start with essentials
    context = system_prompt + query
    remaining_tokens = MAX_TOKENS - count_tokens(context)

    # Add chunks by relevance until we run out of space
    for chunk in sorted(retrieved_chunks, key=lambda c: c['relevance'], reverse=True):
        if count_tokens(chunk['text']) < remaining_tokens:
            context += chunk['text']
            remaining_tokens -= count_tokens(chunk['text'])

    # Add recent history if there's room
    # ...
    return context

Query Understanding and Rewriting

Users don't always ask clear questions. Real examples from our logs:

  • "That thing about the chemical spill"
  • "What John sent last week about safety"
  • "The new procedure (not the old one)"

Our query rewriting pipeline:

  1. Intent Detection: Is this a lookup, comparison, or clarification?
  2. Entity Extraction: What documents, topics, or time periods?
  3. Expansion: Add synonyms and related terms
  4. Context Integration: Use conversation history

Example transformation:

  • Original: "That thing about the chemical spill"
  • After processing: "chemical spill response procedure incident protocol hazmat"
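We won't reproduce the whole pipeline here, but the LLM-backed rewrite step boils down to something like this sketch (the prompt wording is illustrative; the call uses the standard OpenAI chat completions client):

from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the user's question as a standalone search query. "
    "Resolve pronouns and vague references using the conversation history, "
    "expand construction acronyms, and add close synonyms. "
    "Return only the rewritten query."
)

def rewrite_query(user_query, history):
    # Deterministic rewriting: temperature 0 so the same question always expands the same way
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": REWRITE_PROMPT},
            {"role": "user", "content": f"History:\n{history}\n\nQuestion: {user_query}"},
        ],
    )
    return response.choices[0].message.content.strip()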

Generation Tuning: The Endless Quest for Zero Hallucinations

Even with perfect retrieval, LLMs can still get creative. We tested endless combinations:

# Our final generation parameters (after 100+ experiments)
generation_config = {
    "temperature": 0.1,  # Low for factual accuracy
    "top_p": 0.9,       # Some diversity, but not too much
    "frequency_penalty": 0.3,  # Reduce repetition
    "presence_penalty": 0.0,   # Don't force novelty
    "max_tokens": 2000,
    "model": "gpt-4o" 
}

Key learnings:

  • Lower temperature = More consistent, less creative
  • Frequency penalty helped with repetitive safety warnings
  • Model choice matters: we compared GPT-4o, GPT-4.1, GPT-4.1-mini, and others, and settled on GPT-4o

Multi-Language Support (Because Construction is Global)

Our sites operate worldwide. The solution:

  1. Query Processing:
# Detect language
source_lang = detect_language(user_query)

# Translate to English for retrieval
english_query = translate_to_english(user_query)

# Search in English (all docs are in English)
results = search(english_query)

# Translate response back
response = generate_response(results)
translated_response = translate_to_language(response, source_lang)
  2. Document Cleaning: During ingestion, we also removed any non-English text that accidentally made it into safety manuals (surprising how often this happened)

Chapter 7: Evaluation - Measuring What Matters

Building a Test Set (AKA: The Most Painful Month of My Life)

We needed to know: Is this thing actually working? Our evaluation approach:

  1. Generated 1,000 synthetic questions using GPT-4o: "Based on this scaffold safety section, generate 5 questions a construction worker might realistically ask" (a minimal sketch of this step appears after the annotation example below)
  2. Manual annotation - Let me tell you about pain...

The manual annotation process:

  • Me and 10,000 mg of caffeine
  • Each query needed: correct document, correct section, acceptable answer
  • Took a week of mind-numbing work
  • Found errors in source documents (bonus outcome!)
  • Created gold standard that caught issues automated testing missed

Example annotation:

{
  "query": "Maximum wind speed for crane operation?",
  "correct_docs": ["crane_safety_manual_v3.pdf", "site_weather_policy.pdf"],
  "correct_sections": ["Section 4.3", "Appendix B"]
}
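And for the record, step 1 (the synthetic questions) was mostly prompt engineering. A minimal sketch of that generation loop, using the standard OpenAI client (prompt wording and response parsing are illustrative, not our production code):

from openai import OpenAI

client = OpenAI()

def generate_questions(section_text, n=5):
    # One call per document section; temperature > 0 gives more varied test questions
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.7,
        messages=[{
            "role": "user",
            "content": (f"Based on this safety section, generate {n} questions "
                        f"a construction worker might realistically ask:\n\n{section_text}"),
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.lstrip("-0123456789. ").strip() for line in lines if line.strip()]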

Key Metrics That Mattered

Recall@K: Of the top K retrieved chunks, how many contain the answer?

  • Recall@3: 91% (great for focused questions)
  • Recall@10: 97% (catches edge cases)
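Computing it is straightforward once the gold annotations exist. A minimal sketch (correct_docs follows the annotation format shown earlier; retrieved_docs is an assumed name for whatever your retriever returns):

def recall_at_k(system_runs, gold_annotations, k=3):
    # system_runs:      [{"query": ..., "retrieved_docs": [...]}, ...]  (retriever output)
    # gold_annotations: [{"query": ..., "correct_docs": [...]}, ...]    (gold annotations)
    hits = 0
    for run, gold in zip(system_runs, gold_annotations):
        if set(run["retrieved_docs"][:k]) & set(gold["correct_docs"]):
            hits += 1
    return hits / len(gold_annotations)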

Faithfulness: Is the generated answer supported by retrieved documents?

  • Measured using RAGAS framework
  • Our score: 94% (6% needed manual review)

Response Time:

  • P50: 2.1 seconds
  • P95: 3.8 seconds
  • P99: 5.2 seconds (usually complex multi-hop queries)

The Failure Analysis That Saved Us

Every wrong answer taught us something:

Failure Type 1: Temporal Confusion

  • Question: "What's the current procedure for waste disposal?"
  • System retrieved: Outdated 2019 procedure
  • Fix: Added "effective date" metadata and filtering (quick sketch after this list)

Failure Type 2: Partial Retrieval

  • Question: "Complete checklist for equipment startup"
  • System retrieved: Only items 1-5 of a 10-item list
  • Fix: Improved chunk boundary detection for lists

Failure Type 3: Acronym Confusion

  • Question: "PPE requirements for height work"
  • System confused: Personal Protective Equipment vs. some random engineering term
  • Fix: Context-aware acronym expansion with construction-specific dictionary
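The temporal fix from Failure Type 1 came down to one extra filter clause at query time. A minimal sketch, assuming effective_date and superseded_date fields of type Edm.DateTimeOffset in the index (the field names and helper are ours, not Azure built-ins):

from datetime import date

def effective_date_filter(as_of=None):
    # Keep only document versions in force on the given date:
    # already effective, and not yet superseded by a newer revision
    as_of = (as_of or date.today()).isoformat() + "T00:00:00Z"
    return (f"effective_date le {as_of} and "
            f"(superseded_date eq null or superseded_date gt {as_of})")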

Chapter 8: Lessons Learned and Battle Scars

Technical Lessons

  1. Start with data quality: Garbage in, garbage out. We spent 40% of our time on preprocessing.
  2. Monitoring is not optional: We track everything:
    • Query latency by component
    • Retrieval relevance scores
    • User feedback (thumbs up/down)
    • API costs (those embeddings add up!)
  3. Build for failure: Everything fails. APIs go down. Models return nonsense. Have fallbacks.
  4. Test with real data early: Our synthetic tests missed edge cases real documents exposed.

Business Lessons

  1. RAG is not a silver bullet: It solves hallucination but introduces complexity. Make sure the tradeoff is worth it.
  2. User training matters: Even the best system fails if users don't know how to query it.
  3. Incremental rollout saves lives: We started with one department, learned, adjusted, then expanded.
  4. Cost modeling is crucial:
    • Document conversion: ~₹5000/month for processing services
    • Embedding costs: $0.13/million tokens × millions of chunks = real money
    • Storage: ₹16,000/month for our Azure index
    • LLM inference: ~$0.03 per query (more with conversation history)
    • Query expansion and caching helped control costs
    • At scale, this adds up quickly (but still cheaper than lawsuits)

Human Lessons

  1. Document your decisions: Six months later, you won't remember why you chose that chunk size.
  2. Celebrate small wins: First successful retrieval. First day without crashes. These matter.
  3. Listen to users: Our best improvements came from user complaints, not our clever ideas.

Chapter 9: What's Next

The Immediate Roadmap

Performance Improvements (mostly shipped):

  • Implemented intelligent caching for common queries
  • Considering dedicated vector database for sub-second responses
  • Smart cache invalidation when documents update

Better Understanding:

  • Fine-tune embeddings on construction terminology
  • Implement query intent classification
  • Multi-language support already live (query in Arabic, get answers from English docs)

Advanced Features:

  • Multi-document reasoning ("Compare scaffold procedures across all our sites")
  • Temporal queries ("What changed in crane regulations since last year?")
  • Graph RAG for discovering relationships between safety procedures (experimental)

The Dream Features

What's actually in our pipeline:

  1. Voice Interface: "Hey NavBot, what's the lockout procedure for this equipment?" (Q2 2025)
  2. Predictive Safety: "Based on these incident reports, similar accidents likely on rainy days"

Final Thoughts: Was It Worth It?

Three months. Countless late nights. More Python stack traces than I care to remember. Was building a RAG system from scratch worth it?

When I see a crane operator quickly verify wind speed limits before a critical lift—yes.

When our system prevents someone from using outdated scaffold procedures—absolutely.

When we can show an OSHA inspector exactly which manual section backs up our safety practices—without question.

Building a production RAG system is hard. Really hard. But in construction, where a single wrong answer could cost lives, it's the only responsible path forward.

To anyone embarking on this journey: Document everything. Test ruthlessly. Listen to your users. And remember—every bug you fix is one less 3 AM phone call about a workplace accident.


Acknowledgments

This project wouldn't have been possible without:

  • The Navatech engineering team who debugged alongside me
  • Our beta clients who patiently reported "weird results"
  • Approximately 1,247 cups of coffee

Resources and Code Samples

While I can't share our proprietary code, here are the open-source tools that made this possible:

  • Unstructured.io (document parsing and structure extraction)
  • BeautifulSoup (HTML/XML handling and tag-based chunking)
  • LlamaIndex (indexing and node management)
  • RAGAS (faithfulness evaluation)

Have questions about building your own RAG system? Find me on LinkedIn, Email (lakshay.chhabra@navatech.ai) or drop a comment below. Always happy to help fellow engineers avoid the pitfalls we discovered the hard way.