<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Navatech Engineering]]></title><description><![CDATA[Innovation, Thoughts and ideas.]]></description><link>http://engineering.navatech.ai/</link><image><url>http://engineering.navatech.ai/favicon.png</url><title>Navatech Engineering</title><link>http://engineering.navatech.ai/</link></image><generator>Ghost 5.58</generator><lastBuildDate>Sat, 11 Apr 2026 15:30:20 GMT</lastBuildDate><atom:link href="http://engineering.navatech.ai/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[How We Built a RAG System That Actually Works: Lessons from the Trenches at Navatech]]></title><description><![CDATA[<p><em>A 3-month sprint of building, breaking, and rebuilding an enterprise search system that doesn&apos;t hallucinate</em></p>
<hr>
<h2 id="the-problem-that-started-it-all">The Problem That Started It All</h2>
<p>LLMs hallucinate with terrifying confidence. Test one on crane operation limits, and it might tell you &quot;operate normally in winds up to 35 mph&quot; when</p>]]></description><link>http://engineering.navatech.ai/how-we-built-a-rag-system-that-actually-works-lessons-from-the-trenches-at-navatech/</link><guid isPermaLink="false">68513c56cc9ce9bc467ea3ed</guid><dc:creator><![CDATA[Lakshay Chhabra]]></dc:creator><pubDate>Thu, 03 Jul 2025 05:50:21 GMT</pubDate><content:encoded><![CDATA[<p><em>A 3-month sprint of building, breaking, and rebuilding an enterprise search system that doesn&apos;t hallucinate</em></p>
<hr>
<h2 id="the-problem-that-started-it-all">The Problem That Started It All</h2>
<p>LLMs hallucinate with terrifying confidence. Test one on crane operation limits, and it might tell you &quot;operate normally in winds up to 35 mph&quot; when your manual&apos;s actual limit is 20 mph. The LLM isn&apos;t lying, it&apos;s pattern-matching from whatever training data it saw, which could be from different equipment, different manufacturers, or pure statistical interpolation.</p>
<p>In construction, where exceeding wind limits has killed operators, these hallucinations aren&apos;t theoretical&#x2014;they&apos;re lethal. And that&apos;s exactly why we at Navatech decided to build a Retrieval-Augmented Generation (RAG) system from scratch.</p>
<p>If you&apos;re not familiar with the term, think of RAG as giving an AI assistant a library card. Instead of making things up based on what it &quot;thinks&quot; it knows, it looks up information from trusted documents before answering. Simple concept, nightmare to implement at scale.</p>
<h3 id="why-not-just-use-chatgpt-and-why-most-rag-systems-fail">Why Not Just Use ChatGPT? (And Why Most RAG Systems Fail)</h3>
<p>&quot;Why not just upload docs to ChatGPT?&quot; - Every PM ever</p>
<p>Here&apos;s why:</p>
<ol><li><strong>Source Attribution</strong>: When ChatGPT says &quot;install scaffolding at 45-degree angles,&quot; which manual is that from? Which version? Which page?</li><li><strong>Faithfulness</strong>: In construction safety, being 95% correct means 5% of your workers are at risk</li><li><strong>Document Control</strong>: Your safety manuals update quarterly. ChatGPT&apos;s knowledge doesn&apos;t</li><li><strong>Compliance</strong>: &quot;An AI told me to&quot; doesn&apos;t hold up in court. &quot;Section 3.2 of the certified safety manual says&quot; does</li></ol>
<p>&quot;Okay, then we&apos;ll just build a quick RAG system with Pinecone!&quot; - Also every PM</p>
<p>Let me save you three months of pain by sharing what doesn&apos;t work:</p>
<h3 id="the-quick-and-dirty-rag-approach-that-fails">The &quot;Quick and Dirty&quot; RAG Approach That Fails</h3>
<p><strong>What people do:</strong></p>
<pre><code># The &quot;tutorial&quot; approach (placeholder helpers, for illustration)
pdf_text = extract_text(pdf)
chunks = split_by_tokens(pdf_text, 1000)
embeddings = get_embeddings(chunks)
pinecone.upsert(embeddings)  # Ship it!
</code></pre>
<p><strong>Why it fails spectacularly:</strong></p>
<p><strong>Real example from our first attempt:</strong> Query: &quot;What&apos;s the max load for Type A scaffolds?&quot;</p>
<p>What the system returned:</p>
<p><code>&quot;Type A Type B Type C 225 450 675 kg/m&#xB2; kg/m&#xB2; kg/m&#xB2; Light Medium Heavy Duty Duty Duty&quot;</code></p>
<p>The table structure was completely destroyed. The AI had no idea which number belonged to which type. This isn&apos;t just wrong&#x2014;it&apos;s dangerous.</p>
<h3 id="common-mistakes-we-see-and-made">Common Mistakes We See (And Made):</h3>
<ol><li><strong>Dumping Raw PDFs</strong>: Tables become word soup, images disappear, context vanishes</li><li><strong>Ignoring Structure</strong>: Retrieved section 4.2.1 without knowing it&apos;s under &quot;Emergency Procedures&quot;</li><li><strong>Generic Embeddings</strong>: System thinks &quot;crane&quot; is a bird, not construction equipment</li><li><strong>No Document Control</strong>: Mixes 2019 procedures with 2024 updates</li><li><strong>No Evaluation</strong>: &quot;Ship it and see what happens&quot; = lawsuits</li></ol>
<h3 id="the-just-use-pinecone-trap">The &quot;Just Use Pinecone&quot; Trap</h3>
<p>Don&apos;t get me wrong&#x2014;Pinecone is great. But it&apos;s a vector database, not a RAG system. It&apos;s like having a Ferrari engine but no car. The vector store is 10% of the solution. The other 90%:</p>
<ul><li>Document preprocessing (40%)</li><li>Chunking strategy (20%)</li><li>Retrieval logic (15%)</li><li>Generation controls (15%)</li></ul>
<p>We learned this the hard way. Week 1 looked like:</p>
<ul><li>&#x2705; Set up Pinecone (2 hours)</li><li>&#x2705; Ingested 1000 PDFs (4 hours)</li><li>&#x274C; Tested with real queries (disaster)</li><li>&#x1F631; &quot;Maybe RAG doesn&apos;t work for construction?&quot;</li></ul>
<p>Then we built it right. Which is what this blog is about.</p>
<hr>
<h2 id="what-we-were-up-against">What We Were Up Against</h2>
<p>When we started this project at Navatech, our construction clients were drowning in documents:</p>
<ul><li>100,000+ safety manuals, method statements, and RAMS (Risk Assessment Method Statements)</li><li>Building codes that changed with each jurisdiction</li><li>Mixed formats: PDFs with hand-drawn diagrams, Word docs with track changes from 15 different reviewers, Excel sheets containing critical load calculations</li></ul>
<p>Our mandate was clear: Build a system that could answer questions about these documents with 99.9% accuracy, handle thousands of queries per hour, and never, <em>never</em> make things up.</p>
<hr>
<h2 id="chapter-1-the-document-preprocessing-nightmare">Chapter 1: The Document Preprocessing Nightmare</h2>
<h3 id="the-reality-check">The Reality Check</h3>
<p>Our first wake-up call came when we tried to process a seemingly simple PDF. It was a 200-page equipment manual with:</p>
<ul><li>Tables that spanned multiple pages (with headers only on the first page)</li><li>Diagrams with text annotations scattered around them</li><li>Footnotes that referenced other footnotes</li><li>Scanned pages mixed with digital text</li></ul>
<p>Traditional PDF parsers either crashed or produced gibberish. One memorable output turned a safety warning into a recipe for disaster by merging it with an unrelated table about temperature settings.</p>
<h3 id="what-actually-worked">What Actually Worked</h3>
<p>After testing 5+ document parsing libraries, we settled on a hybrid approach:</p>
<ol><li><a href="http://unstructured.io/?ref=engineering.navatech.ai">Unstructured.io</a> for the heavy lifting&#x2014;it understood document structure better than anything else we tried</li><li><strong>BeautifulSoup</strong> for fine-grained control over the converted HTML/XML</li><li><strong>Custom parsers</strong> for specific document types</li></ol>
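<p>To make the first step concrete, here&apos;s a minimal sketch of pulling structured elements out of a PDF with Unstructured (the file name is illustrative; our production pipeline layers OCR fallbacks and custom post-processing on top):</p>
<pre><code>from unstructured.partition.auto import partition

# Partition a PDF into typed elements (Title, NarrativeText, Table, ...)
elements = partition(filename=&quot;scaffold_manual.pdf&quot;)

for el in elements:
    # Each element keeps its category and page number, which we carry into the Markdown conversion
    print(el.category, el.metadata.page_number)
    print(el.text[:80])
</code></pre>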
<p>But here&apos;s the key insight: <strong>We converted everything to Markdown</strong>.</p>
<p>Why Markdown? Let me show you:</p>
<blockquote>## Scaffold Erection Procedures<br><br>### Table 3.1: Load Limits by Platform Type<br>| Platform Type | Max Load (kg/m&#xB2;) | Safety Factor |<br>|---------------|-------------------|---------------|<br>| Light Duty   | 225             | 4:1           |<br>| Medium Duty  | 450             | 4:1           |<br>| Heavy Duty   | 675             | 4:1           |<br><br>**Critical**: Never exceed 75% of maximum rated load.<br><br>### Installation Requirements<br>- Competent person inspection required<br>- Base plates on firm foundation<br>- Cross bracing every 20 feet vertically<br></blockquote>
<p>This format preserved the relationships between elements. A load table and its safety warnings stayed together. Cross-references to inspection requirements remained intact.</p>
<h3 id="the-learning-moment">The Learning Moment</h3>
<p>One client had a critical procedure split across three pages, with a diagram in the middle. Our first parser treated each page as a separate document. An employee following the AI&apos;s advice would have skipped crucial safety steps. That near-miss taught us: <strong>Context preservation isn&apos;t optional&#x2014;it&apos;s everything</strong>.</p>
<hr>
<h2 id="chapter-2-the-art-and-science-of-chunking">Chapter 2: The Art and Science of Chunking</h2>
<h3 id="what-chunking-means-in-human-terms">What Chunking Means (In Human Terms)</h3>
<p>Imagine you&apos;re creating a set of index cards from a textbook. Each card needs to:</p>
<ul><li>Make sense on its own</li><li>Not be so long that it&apos;s unwieldy</li><li>Not be so short that it&apos;s useless</li></ul>
<p>That&apos;s chunking for RAG systems.</p>
<h3 id="our-journey-through-chunking-strategies">Our Journey Through Chunking Strategies</h3>
<p><strong>Attempt 1: The Naive Approach</strong> We started by splitting documents every 1000 characters. The results were... educational. We&apos;d get chunks like:</p>
<p>&quot;...and under no circumstances should you mix Chemical A with Chemical B as this will cause an expl&quot;<br></p>
<p>The next chunk started with &quot;osion.&quot; Not ideal for a safety manual.</p>
<p><strong>Attempt 2: The Page-Based Method</strong> &quot;Let&apos;s just use page boundaries!&quot; we thought. Then we discovered that one client&apos;s PDFs had been created by scanning double-sided documents incorrectly. Page 1 was the cover, page 2 was the back of page 50, page 3 was page 2... You get the picture.</p>
<p><strong>Attempt 3: The Breakthrough</strong> We realized documents have natural boundaries&#x2014;headings, sections, paragraphs. Our tag-based semantic chunking with BeautifulSoup was born:</p>
<pre><code>from bs4 import BeautifulSoup

MAX_CHUNK_SIZE = 2000  # max characters per chunk; tune for your embedding model

def semantic_chunk(html_content):
    soup = BeautifulSoup(html_content, &apos;html.parser&apos;)
    chunks = []
    current_chunk = {
        &apos;content&apos;: &apos;&apos;,
        &apos;metadata&apos;: {}
    }

    for element in soup.find_all([&apos;h1&apos;, &apos;h2&apos;, &apos;h3&apos;, &apos;p&apos;, &apos;table&apos;, &apos;ul&apos;, &apos;ol&apos;]):
        if element.name in [&apos;h1&apos;, &apos;h2&apos;, &apos;h3&apos;]:
            # Save previous chunk if it exists
            if current_chunk[&apos;content&apos;]:
                chunks.append(current_chunk)
            # Start new chunk with heading
            current_chunk = {
                &apos;content&apos;: element.get_text(),
                &apos;metadata&apos;: {
                    &apos;heading_level&apos;: element.name,
                    &apos;section&apos;: element.get_text()
                }
            }
        elif element.name == &apos;table&apos;:
            # Tables stay together with their section
            table_md = convert_table_to_markdown(element)  # our Markdown table helper
            if len(current_chunk[&apos;content&apos;]) + len(table_md) &gt; MAX_CHUNK_SIZE:
                chunks.append(current_chunk)
                current_chunk = {&apos;content&apos;: table_md, &apos;metadata&apos;: {&apos;has_table&apos;: True}}
            else:
                current_chunk[&apos;content&apos;] += &apos;\n&apos; + table_md
        else:
            # Paragraphs and lists accumulate into the current chunk
            current_chunk[&apos;content&apos;] += &apos;\n&apos; + element.get_text()

    # Don&apos;t drop the final chunk
    if current_chunk[&apos;content&apos;]:
        chunks.append(current_chunk)
    return chunks
</code></pre>

<h3 id="the-table-problem-and-our-solution">The Table Problem (And Our Solution)</h3>
<p>Tables were our nemesis. A 50-row table about chemical properties can&apos;t fit in a single chunk, but splitting it randomly loses meaning. Our solution:</p>
<ol><li><strong>Keep small tables intact</strong> (under 20 rows)</li><li><strong>For large tables</strong>, split by logical groups (e.g., chemicals A-M, N-Z)</li><li><strong>Always include headers</strong> in every chunk</li><li><strong>Add context</strong> about what was omitted (&quot;Rows 21-50 available in next chunk&quot;)</li></ol>
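<p>A rough sketch of that policy, assuming the table has already been parsed into a header row and data rows (the 20-row threshold and 15-row group size are the knobs we tuned):</p>
<pre><code>def format_markdown_table(header, rows):
    # Render a header plus data rows as a Markdown table
    lines = [&quot;| &quot; + &quot; | &quot;.join(header) + &quot; |&quot;,
             &quot;|&quot; + &quot;|&quot;.join([&quot;---&quot;] * len(header)) + &quot;|&quot;]
    lines += [&quot;| &quot; + &quot; | &quot;.join(row) + &quot; |&quot; for row in rows]
    return &quot;\n&quot;.join(lines)

MAX_TABLE_ROWS = 20  # small tables stay intact

def chunk_table(header, rows, group_size=15):
    # Small tables go out as a single chunk
    if len(rows) &lt;= MAX_TABLE_ROWS:
        return [format_markdown_table(header, rows)]
    # Large tables: split into logical groups, repeating the header in every chunk
    chunks = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        note = f&quot;(Rows {start + 1}-{start + len(group)} of {len(rows)}; remaining rows in adjacent chunks)&quot;
        chunks.append(format_markdown_table(header, group) + &quot;\n&quot; + note)
    return chunks
</code></pre>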
<h3 id="the-clever-bit-chunk-small-retrieve-big">The Clever Bit: Chunk Small, Retrieve Big</h3>
<p>Here&apos;s where we got smart. We chunk small (500 tokens) for precise retrieval, but we return the entire page to the LLM. Why?</p>
<p>In construction safety, context is everything:</p>
<ul><li>A procedure&apos;s warnings might be 3 paragraphs away</li><li>Diagrams often explain the text</li><li>Exception clauses hide in footnotes</li></ul>
<p>So our approach:</p>
<ol><li><strong>Index small chunks</strong> for accurate semantic search</li><li><strong>Store page boundaries</strong> in metadata</li><li><strong>Return full pages</strong> that contain the matching chunks</li></ol>
<p>This way, when someone asks &quot;How do I install guardrails?&quot;, they get the complete procedure, not just the paragraph mentioning guardrails.</p>
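<p>A simplified version of that expansion step looks like this (the <code>page_id</code> field and <code>page_store</code> lookup are illustrative; the real index stores richer metadata):</p>
<pre><code>def expand_to_pages(matched_chunks, page_store):
    # Collect the distinct pages the matching chunks came from, preserving rank order
    page_ids = []
    for chunk in matched_chunks:
        pid = chunk[&apos;metadata&apos;][&apos;page_id&apos;]
        if pid not in page_ids:
            page_ids.append(pid)
    # Hand the LLM full pages, not isolated paragraphs
    return [page_store[pid] for pid in page_ids]
</code></pre>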
<hr>
<h2 id="chapter-3-embeddingsteaching-computers-to-understand-meaning">Chapter 3: Embeddings - Teaching Computers to Understand Meaning</h2>
<h3 id="the-non-technical-explanation">The Non-Technical Explanation</h3>
<p>Embeddings are like creating a &quot;meaning fingerprint&quot; for text. Similar meanings get similar fingerprints. It&apos;s how the system knows that &quot;fire extinguisher location&quot; and &quot;where to find emergency fire suppression equipment&quot; are asking about the same thing.</p>
<h3 id="our-embedding-model-journey">Our Embedding Model Journey</h3>
<p><strong>Month 1-2: OpenAI&apos;s text-embedding-ada-002</strong></p>
<ul><li>Pros: Reliable, well-documented</li><li>Cons: Struggled with technical jargon</li><li>Memorable failure: Thought &quot;PCB disposal&quot; (Polychlorinated Biphenyls) was about computer circuit boards</li></ul>
<p><strong>Month 3: Google&apos;s Embedding Models</strong></p>
<ul><li>Pros: Fantastic with technical content</li><li>Cons: Rate limits hit us every hour (not just peak times!)</li><li>The breaking point: &quot;You have exceeded your quota&quot; became our most common error message</li></ul>
<p><strong>Month 2-3: OpenAI&apos;s text-embedding-3</strong></p>
<ul><li>The goldilocks solution: Good enough performance, rock-solid reliability</li><li>Only 1% worse than Google on our benchmarks, but actually available when we needed it</li></ul>
<h3 id="the-jargon-problem">The Jargon Problem</h3>
<p>Construction sites have their own language:</p>
<ul><li>&quot;RAMS&quot; (Risk Assessment Method Statements)</li><li>&quot;PTW&quot; (Permit to Work)</li><li>&quot;SWMS&quot; (Safe Work Method Statements)</li><li>&quot;SWL&quot; (Safe Working Load) vs &quot;WLL&quot; (Working Load Limit)</li></ul>
<p>Standard embedding models had never seen these terms used correctly. Our solution? We created a glossary preprocessing step:</p>
<p>User query: &quot;Where&apos;s the RAMS for working at height?&quot;<br>Expanded query: &quot;Where&apos;s the RAMS (Risk Assessment Method Statement) for working at height elevated work?&quot;<br></p>
<p>This simple trick improved retrieval accuracy by 15%.</p>
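<p>The preprocessing step itself is tiny; here&apos;s a minimal sketch (our real glossary has a few hundred entries and also handles punctuation and casing):</p>
<pre><code>GLOSSARY = {
    &apos;RAMS&apos;: &apos;Risk Assessment Method Statement&apos;,
    &apos;PTW&apos;: &apos;Permit to Work&apos;,
    &apos;SWMS&apos;: &apos;Safe Work Method Statement&apos;,
    &apos;SWL&apos;: &apos;Safe Working Load&apos;,
}

def expand_query(query):
    # Append the expansion next to each known acronym so both forms get embedded
    expanded = query
    for acronym, meaning in GLOSSARY.items():
        if acronym in query.split():
            expanded = expanded.replace(acronym, f&apos;{acronym} ({meaning})&apos;)
    return expanded
</code></pre>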
<hr>
<h2 id="chapter-4-building-the-data-pipeline-or-how-i-learned-to-stop-worrying-and-love-parallel-processing">Chapter 4: Building the Data Pipeline (Or: How I Learned to Stop Worrying and Love Parallel Processing)</h2>
<h3 id="the-scale-challenge">The Scale Challenge</h3>
<p>Processing 100,000 documents sounds abstract until you do the math:</p>
<ul><li>Average processing time per document: 3.6 seconds</li><li>Sequential processing time: 100 hours</li><li>Client&apos;s patience: 4 hours</li></ul>
<h3 id="parallel-processing-adventures">Parallel Processing Adventures</h3>
<p>Our first attempt at parallelization was... enthusiastic. We spawned 1000 threads and promptly crashed our servers. The Azure API started returning 429 errors (too many requests), and our monitoring dashboard looked like a Christmas tree.</p>
<p>Here&apos;s what actually worked:</p>
<pre><code>import time
from concurrent.futures import ProcessPoolExecutor

# The sweet spot we found
WORKER_PROCESSES = 16
BATCH_SIZE = 100           # documents per batch (used when building document_batches)
RATE_LIMIT_DELAY = 0.1     # seconds between batches

# Process documents in controlled batches
# (document_batches, process_doc and upload_to_azure come from our pipeline code)
with ProcessPoolExecutor(max_workers=WORKER_PROCESSES) as executor:
    for batch in document_batches:
        futures = [executor.submit(process_doc, doc) for doc in batch]
        results = [f.result() for f in futures]

        # Respect the API&apos;s feelings
        time.sleep(RATE_LIMIT_DELAY)
        upload_to_azure(results)
</code></pre>

<h3 id="the-metadata-bug-that-changed-everything">The Metadata Bug That Changed Everything</h3>
<p>Three months in, we noticed something odd. Our retrieval was working fine&#x2014;we were getting the right chunks back. But our relevance scores were terrible. Why?</p>
<p>After 12 hours of debugging, we found it: LlamaIndex was embedding our metadata <em>along with</em> the content. Every chunk was being embedded with 7,000 tokens of invisible text:</p>
<pre><code>chunk_id: doc_12345_chunk_67
source_document: scaffold_safety_manual_v2.pdf
page_numbers: 45-47
last_modified: 2024-03-15
document_type: safety_manual
...actual 500 tokens of content here...
</code></pre>

<p>The fix was surprisingly simple:</p>
<pre><code># Import path depends on your LlamaIndex version
from llama_index.core.schema import TextNode

def safe_get_content(self, metadata_mode=None) -&gt; str:
    return self.text  # Return ONLY the text, ignore metadata

TextNode.get_content = safe_get_content  # Monkey patch
</code></pre>

<p>The impact was massive:</p>
<ul><li>Embedding size: 7,000 tokens &#x2192; 500 tokens</li><li>Embedding cost: Down 93%</li><li>Relevance scores: Up 10%</li><li>Confidence in results: Through the roof</li></ul>
<p>By removing all that noise, our embeddings finally captured what actually mattered&#x2014;the content.</p>
<hr>
<h2 id="chapter-5-retrievalfinding-needles-in-a-digital-haystack">Chapter 5: Retrieval - Finding Needles in a Digital Haystack</h2>
<h3 id="the-hybrid-approach">The Hybrid Approach</h3>
<p>Think of retrieval like looking for a book in a library. You might:</p>
<ol><li>Remember exact words from the title (sparse retrieval)</li><li>Remember what it was about (dense retrieval)</li></ol>
<p>Best results? Use both.</p>
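<p>Conceptually, the fusion step is simple. Here&apos;s a minimal Reciprocal Rank Fusion (RRF) sketch; RRF is also the scheme Azure AI Search (next section) uses under the hood for hybrid queries:</p>
<pre><code>def reciprocal_rank_fusion(keyword_results, vector_results, k=60):
    # Each argument is a list of document ids, best match first
    scores = {}
    for results in (keyword_results, vector_results):
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest combined score wins
    return sorted(scores, key=scores.get, reverse=True)
</code></pre>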
<h3 id="azure-search-our-unexpected-hero">Azure Search: Our Unexpected Hero</h3>
<p>We initially wanted to build our own vector database. &quot;How hard could it be?&quot; (Narrator: It was very hard.)</p>
<p>Azure AI Search saved us months of work:</p>
<ul><li>Handled both vector and keyword search</li><li>Scaled to millions of documents without breaking a sweat</li><li>Built-in security filters (crucial for enterprise use)</li></ul>
<h3 id="the-bug-that-made-us-question-reality">The Bug That Made Us Question Reality</h3>
<p>Azure&apos;s Python SDK had a subtle bug in hybrid search. It would silently fail and return only keyword results. For weeks, we thought our embeddings were broken. The fix? Direct API calls with the full payload:</p>
<pre><code class="language-python">import requests

# query_text, query_embeddings, filter_expression and top_k come from the steps above
payload = {
    &quot;search&quot;: query_text,  # Keyword search. &quot;&quot; for vector only
    &quot;vectorQueries&quot;: [
        {
            &quot;kind&quot;: &quot;vector&quot;, 
            &quot;vector&quot;: query_embeddings, 
            &quot;fields&quot;: &quot;embedding&quot;, 
            &quot;k&quot;: top_k * 6,  # Get more candidates for reranking
            &quot;exhaustive&quot;: True,
        }
    ],
    &quot;filter&quot;: filter_expression,
    &quot;queryType&quot;: &quot;semantic&quot;,
    &quot;semanticConfiguration&quot;: &quot;default-config&quot;,
    &quot;captions&quot;: &quot;extractive&quot;,
    &quot;answers&quot;: &quot;extractive|count-&quot; + str(top_k),
    &quot;top&quot;: top_k 
}

response = requests.post(
    f&quot;{endpoint}/indexes/{index}/docs/search?api-version=2023-11-01&quot;,
    headers={&quot;api-key&quot;: api_key},
    json=payload
)</code></pre>
<p>The <code>exhaustive: True</code> flag was crucial&#x2014;it ensured we searched the entire index, not just a sample.</p>
<h3 id="document-specific-queries">Document-Specific Queries</h3>
<p>Users often wanted answers from specific documents:</p>
<ul><li>&quot;What does the crane manual say about wind limits?&quot;</li><li>&quot;Check the 2024 building code for foundation requirements&quot;</li></ul>
<p>We added semantic document filtering:</p>
<pre><code>def handle_document_specific_query(query, documents_mentioned):
    # Use embeddings to find the most relevant document
    doc_embeddings = get_embeddings(documents_mentioned)
    
    # Semantic search for document names
    relevant_docs = semantic_search_documents(doc_embeddings)
    
    # Add filter to search
    filter_expression = f&quot;document_name eq &apos;{relevant_docs[0]}&apos;&quot;
    
    return search_with_filter(query, filter_expression)</code></pre>
<p>This let users naturally reference documents without knowing exact filenames.</p>
<hr>
<h2 id="chapter-6-generationmaking-the-ai-actually-helpful">Chapter 6: Generation - Making the AI Actually Helpful</h2>
<h3 id="the-context-window-challenge">The Context Window Challenge</h3>
<p>LLMs have a context limit&#x2014;think of it as their &quot;working memory.&quot; GPT-4o&apos;s window is technically much larger (128K tokens), but our per-request prompt budget is about 8,000 tokens (roughly 6,000 words). Sounds like a lot until you&apos;re trying to include:</p>
<ul><li>User&apos;s question</li><li>Conversation history</li><li>10 retrieved document chunks</li><li>System instructions</li></ul>
<p>Our solution was ruthless prioritization:</p>
<pre><code>def prepare_context(query, retrieved_chunks, history):
    # Start with essentials (count_tokens is our tokenizer helper, MAX_TOKENS our prompt budget)
    context = system_prompt + query
    remaining_tokens = MAX_TOKENS - count_tokens(context)

    # Add chunks by relevance (highest first) until we run out of space
    # Each chunk is a dict with &apos;text&apos; and &apos;relevance&apos; keys
    for chunk in sorted(retrieved_chunks, key=lambda c: c[&apos;relevance&apos;], reverse=True):
        if count_tokens(chunk[&apos;text&apos;]) &lt; remaining_tokens:
            context += chunk[&apos;text&apos;]
            remaining_tokens -= count_tokens(chunk[&apos;text&apos;])

    # Add recent history if there&apos;s room
    # ...</code></pre>
<h3 id="query-understanding-and-rewriting">Query Understanding and Rewriting</h3>
<p>Users don&apos;t always ask clear questions. Real examples from our logs:</p>
<ul><li>&quot;That thing about the chemical spill&quot;</li><li>&quot;What John sent last week about safety&quot;</li><li>&quot;The new procedure (not the old one)&quot;</li></ul>
<p>Our query rewriting pipeline:</p>
<ol><li><strong>Intent Detection</strong>: Is this a lookup, comparison, or clarification?</li><li><strong>Entity Extraction</strong>: What documents, topics, or time periods?</li><li><strong>Expansion</strong>: Add synonyms and related terms</li><li><strong>Context Integration</strong>: Use conversation history</li></ol>
<p>Example transformation:</p>
<ul><li>Original: &quot;That thing about the chemical spill&quot;</li><li>After processing: &quot;chemical spill response procedure incident protocol hazmat&quot;</li></ul>
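<p>Under the hood this is a small LLM call plus the glossary step from earlier; a hedged sketch (the prompt wording and the <code>llm_complete</code> wrapper are illustrative):</p>
<pre><code>REWRITE_PROMPT = (
    &quot;Rewrite the user&apos;s question as a standalone search query. &quot;
    &quot;Resolve pronouns using the conversation history, expand acronyms, &quot;
    &quot;and add close synonyms. Return only the rewritten query.\n\n&quot;
    &quot;History: {history}\nQuestion: {question}&quot;
)

def rewrite_query(question, history):
    prompt = REWRITE_PROMPT.format(history=history, question=question)
    rewritten = llm_complete(prompt, temperature=0.0)  # thin wrapper around the chat API
    return expand_query(rewritten)  # glossary expansion from Chapter 3
</code></pre>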
<h3 id="generation-tuning-the-endless-quest-for-zero-hallucinations">Generation Tuning: The Endless Quest for Zero Hallucinations</h3>
<p>Even with perfect retrieval, LLMs can still get creative. We tested endless combinations:</p>
<pre><code># Our final generation parameters (after 100+ experiments)
generation_config = {
    &quot;temperature&quot;: 0.1,  # Low for factual accuracy
    &quot;top_p&quot;: 0.9,       # Some diversity, but not too much
    &quot;frequency_penalty&quot;: 0.3,  # Reduce repetition
    &quot;presence_penalty&quot;: 0.0,   # Don&apos;t force novelty
    &quot;max_tokens&quot;: 2000,
    &quot;model&quot;: &quot;gpt-4o&quot; 
}</code></pre>
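<p>These parameters plug straight into the chat completions call; a minimal sketch with the OpenAI Python client, where <code>system_prompt</code> and <code>context</code> come from the context-preparation step above (the AzureOpenAI client accepts the same parameters):</p>
<pre><code>from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model=generation_config[&quot;model&quot;],
    temperature=generation_config[&quot;temperature&quot;],
    top_p=generation_config[&quot;top_p&quot;],
    frequency_penalty=generation_config[&quot;frequency_penalty&quot;],
    presence_penalty=generation_config[&quot;presence_penalty&quot;],
    max_tokens=generation_config[&quot;max_tokens&quot;],
    messages=[
        {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: system_prompt},
        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: context},
    ],
)
answer = response.choices[0].message.content
</code></pre>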
<p>Key learnings:</p>
<ul><li>Lower temperature = More consistent, less creative</li><li>Frequency penalty helped with repetitive safety warnings</li><li>Model choice matters: we benchmarked GPT-4o against GPT-4.1, GPT-4.1-mini, and others before settling on GPT-4o</li></ul>
<h3 id="multi-language-support-because-construction-is-global">Multi-Language Support (Because Construction is Global)</h3>
<p>Our sites operate worldwide. The solution:</p>
<ol><li><strong>Query Processing</strong>:</li></ol>
<pre><code># detect_language / translate_to_english / translate_to_language wrap our translation service

# Detect language
source_lang = detect_language(user_query)

# Translate to English for retrieval
english_query = translate_to_english(user_query)

# Search in English (all docs are in English)
results = search(english_query)

# Generate in English, then translate the answer back
response = generate_response(english_query, results)
translated_response = translate_to_language(response, source_lang)</code></pre>
<ol start="2"><li><strong>Document Cleaning</strong>: During ingestion, we also removed any non-English text that accidentally made it into safety manuals (surprising how often this happened)</li></ol>
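<p>A stripped-down sketch of that cleaning pass using the langdetect package (line-level here; thresholds and sentence handling are simplified):</p>
<pre><code>from langdetect import detect

def keep_english_lines(text):
    # Drop lines that are confidently detected as non-English
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            kept.append(line)
            continue
        try:
            if detect(stripped) == &quot;en&quot;:
                kept.append(line)
        except Exception:
            # Too short or ambiguous to classify - keep it
            kept.append(line)
    return &quot;\n&quot;.join(kept)
</code></pre>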
<hr>
<h2 id="chapter-7-evaluationmeasuring-what-matters">Chapter 7: Evaluation - Measuring What Matters</h2>
<h3 id="building-a-test-set-aka-the-most-painful-month-of-my-life">Building a Test Set (AKA: The Most Painful Month of My Life)</h3>
<p>We needed to know: Is this thing actually working? Our evaluation approach:</p>
<ol><li><strong>Generated 1,000 synthetic questions</strong> using GPT-4o: &quot;Based on this scaffold safety section, generate 5 questions a construction worker might realistically ask&quot;<br></li><li><strong>Manual annotation</strong> - Let me tell you about pain...</li></ol>
<p>The manual annotation process:</p>
<ul><li>Me and 10,000 mg of caffeine</li><li>Each query needed: correct document, correct section, acceptable answer</li><li>Took a week of mind-numbing work</li><li>Found errors in source documents (bonus outcome!)</li><li>Created gold standard that caught issues automated testing missed</li></ul>
<p>Example annotation:</p>
<pre><code>{
  &quot;query&quot;: &quot;Maximum wind speed for crane operation?&quot;,
  &quot;correct_docs&quot;: [&quot;crane_safety_manual_v3.pdf&quot;, &quot;site_weather_policy.pdf&quot;],
  &quot;correct_sections&quot;: [&quot;Section 4.3&quot;, &quot;Appendix B&quot;]
}
</code></pre>
<h3 id="key-metrics-that-mattered">Key Metrics That Mattered</h3>
<p><strong>Recall@K</strong>: Of the top K retrieved chunks, how many contain the answer?</p>
<ul><li>Recall@3: 91% (great for focused questions)</li><li>Recall@10: 97% (catches edge cases)</li></ul>
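<p>Measuring this against the annotated gold set takes only a few lines (assuming each test case stores its gold document names as in the annotation example above, and <code>retrieve</code> wraps our search call):</p>
<pre><code>def recall_at_k(test_cases, retrieve, k):
    # test_cases: list of {&quot;query&quot;: ..., &quot;correct_docs&quot;: [...]}
    # retrieve(query, k): returns the top-k retrieved document names
    hits = 0
    for case in test_cases:
        retrieved = set(retrieve(case[&quot;query&quot;], k))
        if retrieved &amp; set(case[&quot;correct_docs&quot;]):
            hits += 1
    return hits / len(test_cases)
</code></pre>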
<p><strong>Faithfulness</strong>: Is the generated answer supported by retrieved documents?</p>
<ul><li>Measured using RAGAS framework</li><li>Our score: 94% (6% needed manual review)</li></ul>
<p><strong>Response Time</strong>:</p>
<ul><li>P50: 2.1 seconds</li><li>P95: 3.8 seconds</li><li>P99: 5.2 seconds (usually complex multi-hop queries)</li></ul>
<h3 id="the-failure-analysis-that-saved-us">The Failure Analysis That Saved Us</h3>
<p>Every wrong answer taught us something:</p>
<p><strong>Failure Type 1: Temporal Confusion</strong></p>
<ul><li>Question: &quot;What&apos;s the current procedure for waste disposal?&quot;</li><li>System retrieved: Outdated 2019 procedure</li><li>Fix: Added &quot;effective date&quot; metadata and filtering</li></ul>
<p><strong>Failure Type 2: Partial Retrieval</strong></p>
<ul><li>Question: &quot;Complete checklist for equipment startup&quot;</li><li>System retrieved: Only items 1-5 of a 10-item list</li><li>Fix: Improved chunk boundary detection for lists</li></ul>
<p><strong>Failure Type 3: Acronym Confusion</strong></p>
<ul><li>Question: &quot;PPE requirements for height work&quot;</li><li>System confused: Personal Protective Equipment vs. some random engineering term</li><li>Fix: Context-aware acronym expansion with construction-specific dictionary</li></ul>
<hr>
<h2 id="chapter-8-lessons-learned-and-battle-scars">Chapter 8: Lessons Learned and Battle Scars</h2>
<h3 id="technical-lessons">Technical Lessons</h3>
<ol><li><strong>Start with data quality</strong>: Garbage in, garbage out. We spent 40% of our time on preprocessing.</li><li><strong>Monitoring is not optional</strong>: We track everything:<ul><li>Query latency by component</li><li>Retrieval relevance scores</li><li>User feedback (thumbs up/down)</li><li>API costs (those embeddings add up!)</li></ul></li><li><strong>Build for failure</strong>: Everything fails. APIs go down. Models return nonsense. Have fallbacks.</li><li><strong>Test with real data early</strong>: Our synthetic tests missed edge cases real documents exposed.</li></ol>
<h3 id="business-lessons">Business Lessons</h3>
<ol><li><strong>RAG is not a silver bullet</strong>: It solves hallucination but introduces complexity. Make sure the tradeoff is worth it.</li><li><strong>User training matters</strong>: Even the best system fails if users don&apos;t know how to query it.</li><li><strong>Incremental rollout saves lives</strong>: We started with one department, learned, adjusted, then expanded.</li><li><strong>Cost modeling is crucial</strong>:<ul><li>Document conversion: ~&#x20B9;5000/month for processing services</li><li>Embedding costs: $0.13/million tokens &#xD7; millions of chunks = real money</li><li>Storage: &#x20B9;16,000/month for our Azure index</li><li>LLM inference: ~$0.03 per query (more with conversation history)</li><li>Query expansion and caching helped control costs</li><li>At scale, this adds up quickly (but still cheaper than lawsuits)</li></ul></li></ol>
<h3 id="human-lessons">Human Lessons</h3>
<ol><li><strong>Document your decisions</strong>: Six months later, you won&apos;t remember why you chose that chunk size.</li><li><strong>Celebrate small wins</strong>: First successful retrieval. First day without crashes. These matter.</li><li><strong>Listen to users</strong>: Our best improvements came from user complaints, not our clever ideas.</li></ol>
<hr>
<h2 id="chapter-9-whats-next">Chapter 9: What&apos;s Next</h2>
<h3 id="the-immediate-roadmap">The Immediate Roadmap</h3>
<p><strong>Performance Improvements</strong> (Already Done!):</p>
<ul><li>Implemented intelligent caching for common queries</li><li>Considering dedicated vector database for sub-second responses</li><li>Smart cache invalidation when documents update</li></ul>
<p><strong>Better Understanding</strong>:</p>
<ul><li>Fine-tune embeddings on construction terminology</li><li>Implement query intent classification</li><li>Multi-language support already live (query in Arabic, get answers from English docs)</li></ul>
<p><strong>Advanced Features</strong>:</p>
<ul><li>Multi-document reasoning (&quot;Compare scaffold procedures across all our sites&quot;)</li><li>Temporal queries (&quot;What changed in crane regulations since last year?&quot;)</li><li><strong>Graph RAG</strong> for discovering relationships between safety procedures (experimental)</li></ul>
<h3 id="the-dream-features">The Dream Features</h3>
<p>What&apos;s actually in our pipeline:</p>
<ol><li><strong>Voice Interface</strong>: &quot;Hey NavBot, what&apos;s the lockout procedure for this equipment?&quot; (Q2 2025)</li><li><strong>Predictive Safety</strong>: &quot;Based on these incident reports, similar accidents likely on rainy days&quot;</li></ol>
<hr>
<h2 id="final-thoughts-was-it-worth-it">Final Thoughts: Was It Worth It?</h2>
<p>Three months. Countless late nights. More Python stack traces than I care to remember. Was building a RAG system from scratch worth it?</p>
<p>When I see a crane operator quickly verify wind speed limits before a critical lift&#x2014;yes.</p>
<p>When our system prevents someone from using outdated scaffold procedures&#x2014;absolutely.</p>
<p>When we can show an OSHA inspector exactly which manual section backs up our safety practices&#x2014;without question.</p>
<p>Building a production RAG system is hard. Really hard. But in construction, where a single wrong answer could cost lives, it&apos;s the only responsible path forward.</p>
<p>To anyone embarking on this journey: Document everything. Test ruthlessly. Listen to your users. And remember&#x2014;every bug you fix is one less 3 AM phone call about a workplace accident.</p>
<hr>
<h2 id="acknowledgments">Acknowledgments</h2>
<p>This project wouldn&apos;t have been possible without:</p>
<ul><li>The Navatech engineering team who debugged alongside me</li><li>Our beta clients who patiently reported &quot;weird results&quot;</li><li>Approximately 1,247 cups of coffee</li></ul>
<hr>
<h2 id="resources-and-code-samples">Resources and Code Samples</h2>
<p>While I can&apos;t share our proprietary code, here are the open-source tools that made this possible:</p>
<ul><li>Document Processing: <a href="https://github.com/Unstructured-IO/unstructured?ref=engineering.navatech.ai">Unstructured.io</a></li><li>Orchestration: <a href="https://github.com/jerryjliu/llama_index?ref=engineering.navatech.ai">LlamaIndex</a> (with our patches)</li><li>Evaluation: <a href="https://github.com/explodinggradients/ragas?ref=engineering.navatech.ai">RAGAS</a></li><li>Vector Store: <a href="https://azure.microsoft.com/en-us/services/search/?ref=engineering.navatech.ai">Azure AI Search</a> (or try FAISS/Chroma for open-source)</li></ul>
<p><em>Have questions about building your own RAG system? Find me on <a href="https://www.linkedin.com/in/lakshaychhabra123/?ref=engineering.navatech.ai"><em>LinkedIn</em></a>, Email (lakshay.chhabra@navatech.ai) or drop a comment below. Always happy to help fellow engineers avoid the pitfalls we discovered the hard way.</em></p>]]></content:encoded></item><item><title><![CDATA[Breaking Boundaries: Leveraging Web Assembly and Mediapipe for running SLMs offline on the Edge]]></title><description><![CDATA[Leveraging the power of WebAssembly (WASM) and Mediapipe, we are reimagining health and safety and revolutionizing the way SLMs are deployed, enabling them to run offline on mobile devices.]]></description><link>http://engineering.navatech.ai/slms-on-the-edge-web-assembly-and-mediapipe-in-action/</link><guid isPermaLink="false">661df96a49e469a789311205</guid><category><![CDATA[conversational agent]]></category><category><![CDATA[llm]]></category><category><![CDATA[ml]]></category><category><![CDATA[offline]]></category><dc:creator><![CDATA[Joinal Ahmed]]></dc:creator><pubDate>Thu, 25 Apr 2024 05:28:18 GMT</pubDate><content:encoded><![CDATA[<p>In a world where access to crucial health and safety information can mean the difference between life and death, the need for universal accessibility knows no bounds. At NavaTech, we are driven by a singular mission: to ensure that everyone, regardless of their location or internet connectivity, has access to vital health and safety resources. Our innovative approach combines cutting-edge technology with a commitment to global safety. Leveraging the power of <a href="https://webassembly.org/?ref=engineering.navatech.ai"><strong>WebAssembly (WASM)</strong></a> and <a href="https://developers.google.com/mediapipe?ref=engineering.navatech.ai"><strong>Mediapipe</strong></a>, we are reimagining health and safety and revolutionizing the way SLMs are deployed, enabling them to run offline on mobile devices. At Navatech Group, our startup is dedicated to making AI more accessible by breaking down communication barriers. Our mission centers on bridging the digital divide between urban centers and remote areas, ensuring that AI technology is available and beneficial to all, regardless of location or connectivity. By focusing on accessible AI, we are setting new global standards for ensuring safety and ease of access to advanced technologies. This commitment to accessibility not only fosters inclusivity but also enhances connectivity across diverse communities worldwide.</p>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2024/04/Designer.jpeg" class="kg-image" alt loading="lazy" width="1024" height="1024" srcset="http://engineering.navatech.ai/content/images/size/w600/2024/04/Designer.jpeg 600w, http://engineering.navatech.ai/content/images/size/w1000/2024/04/Designer.jpeg 1000w, http://engineering.navatech.ai/content/images/2024/04/Designer.jpeg 1024w" sizes="(min-width: 720px) 720px"></figure>
<p>Google released the MediaPipe LLM Inference API on March 7th, 2024, and we have been fast-following that innovation to make SLMs available offline for our users. In this blog, we will delve into the intricacies of our approach, exploring the challenges of ensuring health and safety information reaches every corner of the globe. We&apos;ll discuss the role of WebAssembly and Mediapipe in enabling offline SLM deployment on mobile devices and examine the technical aspects of how these technologies work together to achieve our mission. Furthermore, we&apos;ll highlight real-world scenarios where our offline edge solutions have made a tangible impact, from bustling city centers to remote sites with limited connectivity. Join us on this journey as we unlock the potential of technology to empower safety everywhere.</p>
<h3 id="how-llms-are-deployed-today">How are LLMs deployed today?</h3>
<p>In a typical Large Language Model (LLM) deployment scenario, the LLM is hosted on public cloud infrastructure like AWS or GCP and exposed as an API endpoint. This API serves as the interface through which external applications, such as mobile apps on Android and iOS devices or web services, interact with the LLM to perform natural language processing tasks. When a user initiates a request through the mobile app, the app sends a request to the API endpoint with the relevant data, specifying the desired task, such as text generation or sentiment analysis.</p>
<figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:700/1*gCG3AUvvr35s4etkVQQaGA.jpeg" class="kg-image" alt="Comparing LLM serving frameworks &#x2014; LLMOps | by Thiyagarajan Palaniyappan |  Medium" loading="lazy"></figure>
<p>The API processes the request, utilizing the LLM to perform the required task, and returns the result to the mobile app. This architecture enables seamless integration of LLM capabilities into mobile applications, allowing users to leverage advanced language processing functionalities directly from their devices while offloading the computational burden to the cloud infrastructure.</p>
<p>To overcome the limitations of relying on internet connectivity and ensure users have the flexibility and ease to interact with their safety copilot even in remote locations or locations where internet isn&#x2019;t available like basements or underground facilities while safeguarding privacy, the optimal solution is to run Large Language Models (LLMs) on-device, offline. By deploying LLMs directly on users&apos; devices, such as mobile phones and tablets, we eliminate the need for continuous internet access and the associated back-and-forth communication with remote servers. This approach empowers users to access their safety copilot anytime, anywhere, without dependency on network connectivity.</p>
<p><strong>What are Small Language Models (SLMs)?</strong></p>
<p>Small Language Models (SLMs) represent a focused subset of artificial intelligence tailored for specific enterprise needs within Natural Language Processing (NLP). Unlike their larger counterparts like GPT-4, SLMs prioritize efficiency and precision over sheer computational power. They are trained on domain-specific datasets, enabling them to navigate industry-specific terminologies and nuances with accuracy. In contrast to Large Language Models (LLMs), which may lack customization for enterprise contexts, SLMs offer targeted, actionable insights while minimizing inaccuracies and the risk of generating irrelevant information. SLMs are characterized by their compact architecture, lower computational demands, and enhanced security features, making them cost-effective and adaptable for real-time applications like chatbots. Overall, SLMs provide tailored efficiency, enhanced security, and lower latency, addressing specific business needs effectively while offering a promising alternative to the broader capabilities of LLMs.</p>
<p><strong>Why is running SLMs offline at the edge a challenge?</strong></p>
<p>Running small language models (SLMs) offline on mobile phones enhances privacy, reduces latency, and promotes access. Users can interact with LLM-based applications, receive critical information, and perform tasks even in offline environments, ensuring accessibility and control over personal data. Real-time performance and independence from centralized infrastructure unlock new opportunities for innovation in mobile computing, offering a seamless and responsive user experience. However, running SLMs offline on mobile phones presents several challenges due to the constraints of mobile hardware and the complexities of running LLM tasks. Here are some key challenges:</p>
<ol><li><strong>Limited Processing Power:</strong> Mobile devices, especially smartphones, have limited computational resources compared to desktop computers or servers. SLMs often require significant processing power to execute tasks such as text generation or sentiment analysis, which can strain the capabilities of mobile CPUs and GPUs.</li><li><strong>Memory Constraints:</strong> SLMs typically require a significant amount of memory to store model parameters and intermediate computations. Mobile devices have limited RAM compared to desktops or servers, making it challenging to load and run large language models efficiently.</li><li><strong>Battery Life Concerns:</strong> Running resource-intensive tasks like NLP on mobile devices can drain battery life quickly. Optimizing SLMs for energy efficiency is crucial to ensure that offline usage remains practical without significantly impacting battery performance.</li><li><strong>Storage Limitations:</strong> Storing large language models on mobile devices can be problematic due to limited storage space. Balancing the size of the model with the available storage capacity while maintaining performance is a significant challenge.</li><li><strong>Update and Maintenance:</strong> Keeping SLMs up to date with the latest improvements and security patches presents challenges for offline deployment on mobile devices. Ensuring seamless updates while minimizing data usage and user inconvenience requires careful planning and implementation.</li><li><strong>Real-Time Performance:</strong> Users expect responsive performance from mobile applications, even when running complex NLP tasks offline. Optimizing SLMs for real-time inference on mobile devices is crucial to provide a smooth user experience.</li></ol>
<p><strong>Enabling on-device LLMs with MediaPipe and WebAssembly</strong></p>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2024/04/Untitled-design.png" class="kg-image" alt loading="lazy" width="1920" height="1080" srcset="http://engineering.navatech.ai/content/images/size/w600/2024/04/Untitled-design.png 600w, http://engineering.navatech.ai/content/images/size/w1000/2024/04/Untitled-design.png 1000w, http://engineering.navatech.ai/content/images/size/w1600/2024/04/Untitled-design.png 1600w, http://engineering.navatech.ai/content/images/2024/04/Untitled-design.png 1920w" sizes="(min-width: 720px) 720px"></figure>
<p><a href="https://www.tensorflow.org/lite?ref=engineering.navatech.ai">TensorFlow Lite</a> revolutionized on-device ML in 2017, and <a href="https://developers.google.com/mediapipe?ref=engineering.navatech.ai">MediaPipe</a> expanded its capabilities further in 2019. Now, with the release of the <a href="https://developers.googleblog.com/2024/03/running-large-language-models-on-device-with-mediapipe-andtensorflow-lite.html?ref=engineering.navatech.ai">experimental MediaPipe LLM Inference API</a>, developers can <strong>run Large Language Models (LLMs) entirely on-device</strong>. This breakthrough, supporting Web, Android, and iOS, facilitates integration and testing of popular LLMs like <strong>Gemma, Phi 2, Falcon, and Stable LM</strong>. This latest release marks a paradigm shift, empowering developers to deploy Large Language Models fully on-device across various platforms. This capability is particularly groundbreaking considering the substantial memory and compute requirements of LLMs, which can be over a hundred times larger than traditional on-device models. The achievement is made possible through a series of <strong>optimizations across the on-device stack</strong>, including the integration of <strong>new operations, quantization techniques, caching mechanisms, and weight sharing strategies</strong>. These optimizations, ranging from weight sharing and XNNPack to a GPU-accelerated runtime and custom operators for efficient on-device LLM inference, were crucial in balancing computational demands and memory constraints, ultimately enhancing performance across platforms.</p>
<p><strong>But what is WebAssembly?</strong></p>
<p><strong>WebAssembly</strong> (Wasm) emerged as a game-changer, originally designed for web browsers to <strong>enable the execution of non-JavaScript code seamlessly</strong>. Its binary format, compatible with multiple programming languages, offers significant advantages: it runs on any operating system and processor architecture and operates within a highly secure sandbox environment. Beyond browsers, Wasm finds utility in cloud computing, enabling providers like AWS to lease Wasm runtimes to customers for serverless-style workloads. In the realm of AI, Wasm offers a compelling solution for resource-intensive tasks like generative AI. By <strong>efficiently time-slicing GPU access and ensuring platform neutrality, Wasm optimizes GPU usage and facilitates seamless deployment across diverse hardware environments</strong>. Advances such as the <a href="https://github.com/WebAssembly/wasi-nn?ref=engineering.navatech.ai">WebAssembly Systems Interface &#x2013; Neural Networks (WASI-NN)</a> standard further enhance its capabilities, promising a future where Wasm plays a pivotal role in <a href="https://deislabs.io/posts/wasi-nn-onnx/?ref=engineering.navatech.ai">democratizing access to AI-grade compute power and optimizing AI workloads</a>.</p>
<figure class="kg-card kg-image-card"><img src="https://tvm.apache.org/images/webgpu/tvm-wasm-stack.png" class="kg-image" alt="image" loading="lazy"></figure>
<p>WebAssembly offers a suite of advantages that significantly augment web development, distinguishing it as a prime solution for optimizing web applications. Its foremost attribute lies in speed, facilitated by its compact binary size, which expedites downloads, particularly on slower networks. Additionally, its statically typed nature and pre-compiled optimizations markedly accelerate decoding and execution processes compared to JavaScript. Moreover, its inherent portability ensures consistent performance across diverse platforms, bolstering overall user experience. Lastly, WebAssembly&apos;s flexibility empowers developers to compile code from multiple programming languages into a unified binary format, thereby capitalizing on existing expertise and codebases while reaping the performance benefits of WebAssembly, thus streamlining and enhancing web development endeavors.</p>
<h2 id="putting-it-all-togetheroffline-llms-on-mobile-devices-and-browser"><strong>Putting it all together - Offline LLMs on mobile devices and the browser!</strong></h2>
<p>Mediapipe and WebAssembly (WASM) collaborate seamlessly to enable Large Language Models (LLMs) on both mobile devices and web browsers, revolutionizing accessibility to vital resources. </p>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2024/04/Screenshot-2024-04-21-at-10.45.17-AM.png" class="kg-image" alt loading="lazy" width="2000" height="964" srcset="http://engineering.navatech.ai/content/images/size/w600/2024/04/Screenshot-2024-04-21-at-10.45.17-AM.png 600w, http://engineering.navatech.ai/content/images/size/w1000/2024/04/Screenshot-2024-04-21-at-10.45.17-AM.png 1000w, http://engineering.navatech.ai/content/images/size/w1600/2024/04/Screenshot-2024-04-21-at-10.45.17-AM.png 1600w, http://engineering.navatech.ai/content/images/size/w2400/2024/04/Screenshot-2024-04-21-at-10.45.17-AM.png 2400w" sizes="(min-width: 720px) 720px"></figure>
<p>Leveraging Mediapipe&apos;s versatile ML pipeline support and WASM&apos;s platform-neutral execution environment, we&apos;ve developed a robust solution that empowers users with offline access to LLMs for natural language processing tasks on nAI. Mediapipe&apos;s integration further enhances our capability by providing a streamlined framework for deploying and managing ML models, optimizing performance on resource-constrained mobile devices. Together, Mediapipe and WASM set a new standard for on-device ML inference, democratizing access to advanced language processing capabilities on a global scale, regardless of internet connectivity or device specifications.</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://engineering.navatech.ai/content/images/2024/04/Working_Aeroplane_Mode_nAI_3.gif" class="kg-image" alt loading="lazy" width="622" height="350" srcset="http://engineering.navatech.ai/content/images/size/w600/2024/04/Working_Aeroplane_Mode_nAI_3.gif 600w, http://engineering.navatech.ai/content/images/2024/04/Working_Aeroplane_Mode_nAI_3.gif 622w"><figcaption><span>Animated GIF showcasing Offline / Edge Capability</span></figcaption></figure>
<p>We are thrilled with the breakthrough capability, optimizations, and performance in today&#x2019;s experimental release of offline LLM inference on nAI. This is just the start. Over 2024, we will expand to more custom models, offer broader conversion capabilities, on-device QnA, high-level workflows, and much, much more. Furthermore, our mobile team has pushed the boundaries of innovation by leveraging Flutter to bring this groundbreaking technology to our users and make sure it is available in as many hands as possible. Their dedication and expertise have played a pivotal role in bringing this cutting-edge feature to our customers, ensuring a seamless and intuitive experience across mobile platforms.</p>
<hr>
<p><strong>Join Our Team of Innovators!</strong></p>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png" class="kg-image" alt loading="lazy" width="1792" height="672" srcset="http://engineering.navatech.ai/content/images/size/w600/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 600w, http://engineering.navatech.ai/content/images/size/w1000/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1000w, http://engineering.navatech.ai/content/images/size/w1600/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1600w, http://engineering.navatech.ai/content/images/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1792w" sizes="(min-width: 720px) 720px"></figure>
<p>Are you a passionate developer seeking exciting opportunities to shape the future of technology? We&apos;re looking for talented individuals to join our dynamic engineering team at Navatech Group. If you&apos;re eager to be part of groundbreaking projects and make a real impact, we want to hear from you!</p>
<p>Send your resume to <a href="mailto:careers@navatechgroup.com">careers@navatechgroup.com</a> and take the first step toward a rewarding career with us. Join Navatech Group today and be at the forefront of innovation!</p>
<p></p>]]></content:encoded></item><item><title><![CDATA[Driving Performance: Using JMeter for Scalability and Resilience Testing]]></title><description><![CDATA[<p>In the dynamic world of construction, where safety and efficiency are paramount, Navatech emerges as a groundbreaking solution. Our innovative conversational health and safety platform empowers construction workers with instant access to vital information&#x2014;all at the palm of their hands. Through a seamlessly integrated mobile app and cloud-based</p>]]></description><link>http://engineering.navatech.ai/jmeter-a-preferred-tool-for-performance-testing/</link><guid isPermaLink="false">661e00e149e469a789311228</guid><category><![CDATA[testing]]></category><category><![CDATA[scaling]]></category><category><![CDATA[jmeter]]></category><dc:creator><![CDATA[Bhargav Kadiya]]></dc:creator><pubDate>Thu, 18 Apr 2024 05:19:22 GMT</pubDate><content:encoded><![CDATA[<p>In the dynamic world of construction, where safety and efficiency are paramount, Navatech emerges as a groundbreaking solution. Our innovative conversational health and safety platform empowers construction workers with instant access to vital information&#x2014;all at the palm of their hands. Through a seamlessly integrated mobile app and cloud-based backend systems, Navatech revolutionizes the way construction professionals engage with critical safety protocols and procedural knowledge.</p>
<p>At Navatech, ensuring optimal performance and reliability of our platform is non-negotiable. That&apos;s where JMeter steps in as our trusted ally. By employing JMeter for rigorous load testing, we validate our products&apos; scalability and resilience under real-world conditions. This strategic approach not only guarantees uninterrupted access to essential information but also underscores our commitment to safeguarding the well-being and productivity of construction workers worldwide. Join us as we delve deeper into how Navatech leverages JMeter to deliver unparalleled safety and efficiency in the construction industry.</p>
<p><strong>How do we use JMeter for performance testing?</strong></p>
<p>Performance testing is a crucial aspect of our development process, and JMeter plays a central role in this endeavor. Among various types of performance testing, load testing stands out as essential. This method allows us to evaluate a system&apos;s performance under real-world conditions, assessing its ability to handle varying levels of user activity. Load testing serves a clear purpose: to determine the system&apos;s capacity under different loads. It provides valuable insights into how much stress a device or software can withstand while maintaining functionality as per user expectations. Additionally, load testing helps us identify the maximum operating capacity of our applications, ensuring that the current infrastructure is sufficient to support their intended usage. Moreover, it aids in determining the optimal number of concurrent users that our applications can effectively accommodate, contributing to an enhanced user experience and overall system efficiency.</p>
<p><strong>Let&#x2019;s understand JMeter and how it helps:</strong></p>
<p>Apache JMeter stands tall as a Java-based open-source tool designed to meticulously analyze and gauge the performance of web applications. The essence of JMeter&apos;s functionality can be encapsulated in the following diagram.</p>
<p>Consider a live application accessed by numerous users simultaneously, each potentially generating multiple requests. This scenario underscores the critical importance of thoroughly testing applications under such conditions. With JMeter, we&apos;re empowered to craft intricate test scenarios, comprising multiple users, multiple requests, or a combination of both.</p>
<p>Let&apos;s delve into some practical examples to elucidate this concept further.</p>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2024/04/A.png" class="kg-image" alt loading="lazy" width="936" height="613" srcset="http://engineering.navatech.ai/content/images/size/w600/2024/04/A.png 600w, http://engineering.navatech.ai/content/images/2024/04/A.png 936w" sizes="(min-width: 720px) 720px"></figure>
<p>Suppose we need to load test the API /api/v1/config/language-list, which serves GET requests to fetch the list of supported languages, with 100 concurrent users. Here&apos;s a step-by-step guide to doing this in Apache JMeter:</p>
<ul><li>Go to the Apache JMeter folder &gt;&gt; bin folder, open the JMeter batch file, and select the Test Plan.</li><li>Right-click on the Test Plan and add a Thread Group.</li><li>Provide the number of threads and the ramp-up time. In this example we use 100 threads (users) with a ramp-up time of 1 second.</li></ul>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2024/04/C.png" class="kg-image" alt loading="lazy" width="549" height="168"></figure>
<ul><li>Right-click on the Thread Group, choose Add &gt;&gt; Sampler &gt;&gt; HTTP Request, and provide the server IP, request type, and path.</li></ul>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2024/04/D.png" class="kg-image" alt loading="lazy" width="552" height="164"></figure>
<ul><li>Add Listeners to view the load report. In this example we use three listeners: View Results in Table, Summary Report, and View Results Tree.</li><li>Execute the test plan using the Run button in the toolbar.</li><li>We also repeated the tests across configurations: varying the number of database connections, scaling pods up and down, and moving from smaller to larger AWS instance types.</li></ul>
<p><strong><u>Results</u></strong></p>
<ul><li><strong>Summary Report</strong></li></ul>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2024/04/E.png" class="kg-image" alt loading="lazy" width="486" height="258"></figure>
<ul><li><strong>View Results Tree</strong></li></ul>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2024/04/F.png" class="kg-image" alt loading="lazy" width="487" height="258"></figure>
<ul><li><strong>View Results in Table</strong></li></ul>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2024/04/G.png" class="kg-image" alt loading="lazy" width="487" height="259"></figure>
<p><strong>Analyzing the Results</strong></p>
<p>After the test is complete, you can analyze the results to identify any performance issues. JMeter provides various performance metrics such as throughput, latency, and error rate, which can be used to identify performance bottlenecks. You can also generate reports and graphs to visualize the test results.</p>
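<p>Beyond the GUI listeners, the raw results can also be post-processed programmatically. The snippet below is a hedged sketch that summarises a results file saved in JMeter&apos;s default CSV (.jtl) format; the column names follow JMeter defaults, so verify them against your own output before relying on the numbers:</p>
<pre><code class="language-python"># Hedged sketch: summarise a JMeter CSV results file (.jtl).
# Assumes the default CSV columns, including "timeStamp", "elapsed" and "success".
import csv

elapsed, stamps, errors = [], [], 0
with open("results.jtl", newline="") as f:
    for row in csv.DictReader(f):
        elapsed.append(int(row["elapsed"]))    # response time in ms
        stamps.append(int(row["timeStamp"]))   # request start time in ms
        if row["success"].lower() != "true":
            errors += 1

elapsed.sort()
duration_s = max((max(stamps) - min(stamps)) / 1000.0, 1.0)
print("samples     :", len(elapsed))
print("throughput  : %.1f req/s" % (len(elapsed) / duration_s))
print("avg latency : %.0f ms" % (sum(elapsed) / len(elapsed)))
print("p95 latency : %d ms" % elapsed[int(0.95 * len(elapsed)) - 1])
print("error rate  : %.2f %%" % (100.0 * errors / len(elapsed)))
</code></pre>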
<p><strong>Why Do We Use JMeter for Load Testing?</strong></p>
<p>JMeter is a preferred choice for load testing due to several key advantages:</p>
<ol><li><strong>Cost-effective:</strong> Being an open-source tool, JMeter eliminates the need for licensing fees, making it an economical option for load testing.</li><li><strong>Versatility:</strong> JMeter is not limited to just web applications; it supports performance testing across various application types, including web services, databases, LDAP, and shell scripts.</li><li><strong>Platform independence:</strong> Its Java-based architecture ensures compatibility across different operating systems, providing flexibility in deployment.</li><li><strong>Wide-ranging support:</strong> Beyond performance testing, JMeter caters to other non-functional testing requirements such as stress testing, web service testing, and distributed testing.</li><li><strong>Efficient recording and playback:</strong> JMeter offers intuitive features for recording and playback, equipped with drag-and-drop functionality, streamlining the testing process and enhancing efficiency.</li><li><strong>Customization:</strong> As an open-source tool, JMeter allows developers to customize its functionalities to suit specific testing needs, ensuring adaptability to diverse testing scenarios.</li><li><strong>Robust community support:</strong> With an extensive library of tutorials and a vibrant community, JMeter users benefit from readily available resources and free plugins that augment analysis capabilities, fostering continuous improvement and innovation.</li></ol>
<p>These compelling features collectively make JMeter a favored solution for load testing, empowering teams to conduct thorough performance assessments with ease and precision.</p>
<hr>
<h3 id="conclusion">Conclusion </h3>
<p>In this blog, we have explored the indispensable role of Apache JMeter in the realm of performance testing, highlighting its myriad benefits and functionalities. As an open-source tool, JMeter offers a cost-effective solution for evaluating the performance of various applications, including web services and databases. With its user-friendly interface and interactive features, JMeter simplifies the testing process, making it accessible to both novice and experienced testers alike. Its versatility extends to simulating heavy loads and providing comprehensive performance metrics, enabling organizations to identify and address potential bottlenecks effectively. Moreover, JMeter&apos;s reliability ensures that applications can handle large volumes of traffic without compromising performance, thereby enhancing user satisfaction and trust. In essence, JMeter emerges as a powerful ally for ensuring the optimal performance of applications in today&apos;s dynamic digital landscape.</p>
<hr>
<p><strong>Join Our Team of Innovators!</strong></p>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png" class="kg-image" alt loading="lazy" width="1792" height="672" srcset="http://engineering.navatech.ai/content/images/size/w600/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 600w, http://engineering.navatech.ai/content/images/size/w1000/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1000w, http://engineering.navatech.ai/content/images/size/w1600/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1600w, http://engineering.navatech.ai/content/images/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1792w" sizes="(min-width: 720px) 720px"></figure>
<p>Are you a passionate developer seeking exciting opportunities to shape the future of technology? We&apos;re looking for talented individuals to join our dynamic engineering team at Navatech Group. If you&apos;re eager to be part of groundbreaking projects and make a real impact, we want to hear from you!</p>
<p>Send your resume to <a href="http://engineering.navatech.ai/building-conversational-bots-for-health-safety-with-pinecone-and-tii-falcon/">careers@navatechgroup.com</a> and take the first step toward a rewarding career with us. Join Navatech Group today and be at the forefront of innovation!</p>]]></content:encoded></item><item><title><![CDATA[Data Migration Across Pinecone Indexes: A Stepwise Guide]]></title><description><![CDATA[<p>Pinecone, renowned for its vector database solutions, recently unveiled a groundbreaking serverless feature that has revolutionized workflows for many developers and data scientists. This innovative addition offers heightened flexibility and scalability, particularly when handling extensive vector datasets. However, alongside these benefits come new challenges, such as the migration of data</p>]]></description><link>http://engineering.navatech.ai/migrating-to-pinecone-serverless/</link><guid isPermaLink="false">65da3d5649e469a789311185</guid><dc:creator><![CDATA[Joinal Ahmed]]></dc:creator><pubDate>Sat, 24 Feb 2024 19:23:15 GMT</pubDate><media:content url="http://engineering.navatech.ai/content/images/2024/02/dm08162023-data-migration-best-practices.webp" medium="image"/><content:encoded><![CDATA[<img src="http://engineering.navatech.ai/content/images/2024/02/dm08162023-data-migration-best-practices.webp" alt="Data Migration Across Pinecone Indexes: A Stepwise Guide"><p>Pinecone, renowned for its vector database solutions, recently unveiled a groundbreaking serverless feature that has revolutionized workflows for many developers and data scientists. This innovative addition offers heightened flexibility and scalability, particularly when handling extensive vector datasets. However, alongside these benefits come new challenges, such as the migration of data between different Pinecone indexes. In this article, we&apos;ll delve into a Python script designed to streamline this process, ensuring efficiency and simplicity.</p>
<p>The introduction of Pinecone&apos;s serverless feature marks a significant advancement in vector data management, offering superior resource utilization and cost-effectiveness. Migrating data to a serverless index can streamline operations, particularly for projects dealing with large datasets. Pinecone serverless represents the next evolution of Pinecone&apos;s vector database, boasting up to 50 times lower costs, intuitive usage (without requiring pod configuration), and enhanced vector-search performance across any scale. These advancements empower developers to deploy GenAI applications more seamlessly and rapidly.</p>
<p>The benefits of Pinecone serverless over pod-based indexes include:</p>
<ul><li><strong>Up to 50x Lower Costs:</strong> Separated pricing for reads and storage, usage-based billing, and more efficient indexing and searching contribute to significant cost savings.</li><li><strong>Effortless Setup and Scalability:</strong> No complex configurations or storage limits to contend with; simply name your index, load data, and start querying.</li><li><strong>Fast and Relevant Search Results: </strong>Pinecone serverless maintains functionality and performance comparable to pod-based indexes, supporting live updates, metadata filtering, hybrid search, and namespaces.</li></ul>
<p>Pinecone offers a focused set of functionalities through its Data Plane, primarily centered around managing and querying vector data efficiently. These functions cater to various operations involved in handling vector data within Pinecone indexes. </p>
<p>The core operations provided by Pinecone&apos;s Data Plane include:</p>
<table>
<thead>
<tr>
<th>Operation</th>
<th>Method(s)</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Upsert Vectors</td>
<td>POST</td>
<td>Add or update vectors in the index.</td>
</tr>
<tr>
<td>Query Vectors</td>
<td>POST</td>
<td>Search for vectors similar to a given query.</td>
</tr>
<tr>
<td>Fetch Vectors</td>
<td>GET</td>
<td>Retrieve vectors from the index based on their IDs.</td>
</tr>
<tr>
<td>Update a Vector</td>
<td>POST</td>
<td>Modify an existing vector in the index.</td>
</tr>
<tr>
<td>Delete Vectors</td>
<td>POST, DELETE</td>
<td>Remove vectors from the index.</td>
</tr>
<tr>
<td>List Vector IDs</td>
<td>GET</td>
<td>Retrieve a list of vector IDs present in the index.</td>
</tr>
<tr>
<td>Get Index Stats</td>
<td>POST, GET</td>
<td>Fetch statistics related to the index.</td>
</tr>
</tbody>
</table>
<p>While adopting serverless, you&#x2019;ll probably need to migrate your existing data to new indexes. Pinecone offers robust functionalities for managing vector data; however, it does not provide migration capabilities out of the box.</p>
<p>To fill this gap, developers can combine Pinecone&apos;s existing operations: extract vector IDs with the &quot;List Vector IDs&quot; operation, fetch the corresponding vectors with the &quot;Fetch Vectors&quot; operation, and then upsert those vectors into the target index with the &quot;Upsert Vectors&quot; operation. Used in tandem, these operations migrate vector data between Pinecone indexes, albeit with a manually orchestrated process.</p>
<p>This approach capitalizes on Pinecone&apos;s versatile API capabilities to achieve seamless migration while maintaining data integrity and efficiency. Pinecone&apos;s Data Plane API encompasses essential functionalities tailored for seamless manipulation and management of vector data. The following operations are integral to this API:</p>
<p><strong>List vector IDs (GET):</strong><br>Retrieve vector IDs from a serverless index namespace via a GET request to https://{index_host}/vectors/list. Optionally filter with a prefix parameter. Default returns 100 IDs sorted; adjust limit parameter for custom pagination. Pagination tokens for fetching subsequent batches provided in responses.</p>
<p><strong>Fetch vectors (GET):</strong><br>GET request to https://{index_host}/vectors/fetch retrieves vectors by their IDs from a specified namespace. Response includes vector data and metadata. Critical for accessing stored vector content.</p>
<p><strong>Upsert vectors (POST):</strong><br>POST request to https://{index_host}/vectors/upsert writes vectors into a designated namespace. Previous values are overwritten for existing IDs. Request body should contain an array of vector objects, with a batch limit of 100 vectors per request. Namespace parameter specifies target namespace.</p>
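<p>To make those three calls concrete, here is a hedged sketch over plain HTTP; the header and parameter names follow our reading of Pinecone&apos;s public API and should be verified against the current documentation:</p>
<pre><code class="language-python"># Hedged sketch of the three Data Plane calls over HTTP.
# index_host, target_host, the API key and the namespace are placeholders.
import requests

index_host = "source-index-abc123.svc.region.pinecone.io"    # placeholder
target_host = "target-index-def456.svc.region.pinecone.io"   # placeholder
headers = {"Api-Key": "YOUR_API_KEY"}
ns = "default"

# 1. List vector IDs (paginated, 100 per page by default)
listing = requests.get(f"https://{index_host}/vectors/list",
                       headers=headers, params={"namespace": ns, "limit": 100}).json()
ids = [v["id"] for v in listing.get("vectors", [])]

# 2. Fetch the vectors behind those IDs
fetched = requests.get(f"https://{index_host}/vectors/fetch",
                       headers=headers, params={"ids": ids, "namespace": ns}).json()

# 3. Upsert them into the target index (at most 100 vectors per request)
payload = {"namespace": ns, "vectors": list(fetched["vectors"].values())}
requests.post(f"https://{target_host}/vectors/upsert", headers=headers, json=payload)
</code></pre>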
<h2 id="putting-it-all-together">Putting it all together</h2>
<p>This Python script facilitates the migration of vector data from a source Pinecone index to a target Pinecone index. It utilizes the Pinecone library for managing the Pinecone indexes and performing operations such as querying and upserting vectors. The migration process is carried out in batches for efficiency.</p>
<script src="https://gist.github.com/joinalahmed/a4fd2b32281d40b42e89af08f48b22e9.js"></script>
<h3 id="key-components">Key Components:</h3>
<ol><li><strong>Pinecone Initialization:</strong> The script initializes the Pinecone client with the provided API key and sets up configurations for both the source and target Pinecone indexes.</li><li><strong>Function to Retrieve IDs from Index:</strong> The <code>get_all_ids_from_index</code> function fetches all vector IDs from the source Pinecone index. It iterates through each namespace in the index, querying for vector IDs until all vectors are collected.</li><li><strong>Function to Query IDs:</strong> The <code>get_ids_from_query</code> function queries the source index for vector IDs using an input vector and namespace.</li><li><strong>Vector Migration Function:</strong> The <code>migrate_vectors</code> function orchestrates the migration process. It first fetches all vector IDs from the source index using <code>get_all_ids_from_index</code>. Then, it iterates through each namespace and migrates vectors in batches. For each batch, it fetches vector data from the source index, prepares the data for upserting, and upserts it into the target index</li></ol>
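<p>The embedded gist contains the full implementation; a condensed, hedged sketch of the core loop, written against a recent Pinecone Python client (method names may differ between client versions), looks roughly like this:</p>
<pre><code class="language-python"># Hedged sketch of the migration loop: list IDs, fetch vectors, upsert in batches.
# Index names, namespace and batch size are placeholders.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
source = pc.Index("source-index")
target = pc.Index("target-index")
namespace, batch_size = "default", 100

for id_batch in source.list(namespace=namespace):   # yields lists of vector IDs
    fetched = source.fetch(ids=list(id_batch), namespace=namespace)
    vectors = [
        {"id": v.id, "values": v.values, "metadata": v.metadata or {}}
        for v in fetched.vectors.values()
    ]
    for i in range(0, len(vectors), batch_size):
        target.upsert(vectors=vectors[i:i + batch_size], namespace=namespace)
</code></pre>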
<h3 id="potential-use-cases-for-the-script">Potential Use-Cases for the Script:</h3>
<ol><li><strong>Migration to Serverless Indexes</strong>: Organizations transitioning to Pinecone&apos;s serverless indexes can utilize this script to seamlessly migrate their vector data from traditional indexes to serverless ones. By doing so, they can take advantage of improved scalability and cost-effectiveness offered by serverless infrastructure.</li><li><strong>Index Optimization</strong>: Over time, as data evolves and usage patterns change, it may become necessary to optimize Pinecone indexes for better performance. This script can aid in the process of restructuring indexes, redistributing data, and optimizing storage to enhance query performance and resource utilization.</li><li><strong>Backup and Redundancy</strong>: Maintaining backups and redundant copies of vector data is crucial for ensuring data resilience and disaster recovery preparedness. With this script, organizations can automate the process of creating backups by regularly migrating data to secondary Pinecone indexes located in different regions or environments.</li><li><strong>Data Archiving</strong>: For regulatory compliance or historical analysis purposes, organizations may need to archive vector data while retaining the ability to access it when necessary. This script can facilitate the archival process by transferring data from active indexes to dedicated archival indexes, where it can be stored securely for long-term retention.</li><li><strong>Performance Testing</strong>: Prior to deploying changes or updates to production environments, it&apos;s essential to conduct performance testing using realistic data scenarios. This script enables the creation of test environments by migrating subsets of production data to dedicated testing indexes, allowing for comprehensive performance evaluation without impacting production systems.</li></ol>
<p>By leveraging this script in various scenarios, organizations can streamline their data management processes, optimize resource utilization, and ensure the reliability and availability of their vector data within the Pinecone ecosystem.</p>
<p><em>Note: Always ensure that you have the necessary permissions and backups before performing data migrations. It&#x2019;s also recommended to test the script in a development environment before using it in production.</em></p>
<p>Explore more about Pinecone and its features at <a href="https://www.pinecone.io/blog/?ref=engineering.navatech.ai">Pinecone&#x2019;s Blog</a> and our <a href="http://engineering.navatech.ai/building-conversational-bots-for-health-safety-with-pinecone-and-tii-falcon/">blog</a> on how we leverage pinecone to build next gen agents for construction health and safety.</p>
<hr>
<p><strong>Join Our Team of Innovators!</strong></p>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png" class="kg-image" alt="Data Migration Across Pinecone Indexes: A Stepwise Guide" loading="lazy" width="1792" height="672" srcset="http://engineering.navatech.ai/content/images/size/w600/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 600w, http://engineering.navatech.ai/content/images/size/w1000/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1000w, http://engineering.navatech.ai/content/images/size/w1600/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1600w, http://engineering.navatech.ai/content/images/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1792w" sizes="(min-width: 720px) 720px"></figure>
<p>Are you a passionate developer seeking exciting opportunities to shape the future of technology? We&apos;re looking for talented individuals to join our dynamic team at Navatech Group. If you&apos;re eager to be part of groundbreaking projects and make a real impact, we want to hear from you!</p>
<p>Send your resume to <a href="http://engineering.navatech.ai/building-conversational-bots-for-health-safety-with-pinecone-and-tii-falcon/">careers@navatechgroup.com</a> and take the first step toward a rewarding career with us. Join Navatech Group today and be at the forefront of innovation!</p>
<p></p>]]></content:encoded></item><item><title><![CDATA[Scaling LLM inference with Ray and vLLM]]></title><description><![CDATA[<p>Large Language Models (LLM) are becoming increasingly popular in many AI applications. These powerful language models are widely used to automate a series of tasks, improve customer service, and generate domain-specific content among many other usecases. At Navatech, LLM&apos;s are core of our conversational health and safety platform,</p>]]></description><link>http://engineering.navatech.ai/scaling-llm-inference-with-kubernetes-and-ray/</link><guid isPermaLink="false">658a63ed49e469a789311090</guid><category><![CDATA[ml]]></category><category><![CDATA[conversational agent]]></category><category><![CDATA[lambda]]></category><category><![CDATA[langchain]]></category><category><![CDATA[llm]]></category><category><![CDATA[tii]]></category><category><![CDATA[falcon]]></category><dc:creator><![CDATA[Joinal Ahmed]]></dc:creator><pubDate>Mon, 15 Jan 2024 09:37:18 GMT</pubDate><content:encoded><![CDATA[<p>Large Language Models (LLM) are becoming increasingly popular in many AI applications. These powerful language models are widely used to automate a series of tasks, improve customer service, and generate domain-specific content among many other usecases. At Navatech, LLM&apos;s are core of our conversational health and safety platform, powering various agents providing in-context health and safety information to our users and delivering the right content.</p>
<p>However, serving these fine-tuned LLMs at scale comes with challenges. These models are computationally demanding, and they are much larger than traditional microservices, making it hard to achieve high-throughput serving and fast cold-start scaling.</p>
<figure class="kg-card kg-image-card"><img src="https://miro.medium.com/v2/resize:fit:1400/1*d7OxX8vQ5XvfToMohqQNJw.png" class="kg-image" alt loading="lazy"></figure>
<h3 id="continuous-batching-to-rescue">Continuous Batching to Rescue</h3>
<p>Due to the large GPU memory footprint and compute cost of serving LLMs, ML engineers often treat LLMs like &quot;black boxes&quot; that can only be optimized with internal changes such as quantization and custom CUDA kernels. However, this is not entirely the case. Because LLMs iteratively generate their output, and because LLM inference is often memory bound rather than compute bound, there are <strong><em>system-level</em></strong> batching optimizations that can make an 8-10x or greater difference in real-world workloads.</p>
<p>One recent such proposed and widely used optimization technique is <a href="https://www.usenix.org/conference/osdi22/presentation/yu?ref=engineering.navatech.ai"><strong>continuous batching</strong></a>, also known as <strong>dynamic batching</strong>, or batching with <strong>iteration-level scheduling</strong>. We experimented to see the performance optimization it brings in at a production workload. We will get into details below, including how we simulate a production workload, but to summarize our findings:</p>
<ul><li>Up to 23x throughput improvement using continuous batching and continuous batching-specific memory optimizations (using <a href="https://twitter.com/zhuohan123/status/1671234707206590464?s=20&amp;ref=engineering.navatech.ai"><u>vLLM</u></a>).</li><li>8x throughput over naive batching by using continuous batching (both on <a href="https://docs.ray.io/en/latest/serve/index.html?ref=engineering.navatech.ai"><u>Ray Serve</u></a> and <a href="https://github.com/huggingface/text-generation-inference?ref=engineering.navatech.ai"><u>Hugging Face&#x2019;s text-generation-inference</u></a>).</li><li>4x throughput over naive batching by using an optimized model implementation (<a href="https://github.com/NVIDIA/FasterTransformer?ref=engineering.navatech.ai"><u>NVIDIA&#x2019;s FasterTransformer</u></a>).</li></ul>
<p>GPUs are massively-parallel compute architectures, with compute rates (measured in floating-point operations per second, or flops) in the teraflop (<a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet.pdf?ref=engineering.navatech.ai"><u>A100</u></a>) or even petaflop (<a href="https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet?ref=engineering.navatech.ai"><u>H100</u></a>) range. Despite these staggering amounts of compute, LLMs struggle to achieve saturation because so much of the chip&#x2019;s memory bandwidth is spent loading model parameters. Batching is one way to improve the situation; instead of loading new model parameters each time you have an input sequence, you can load the model parameters once and then use them to process many input sequences. This more efficiently uses the chip&#x2019;s memory bandwidth, leading to higher compute utilization, higher throughput, and cheaper LLM inference.</p>
<p>The industry recognized the inefficiency and came up with a better approach. <a href="https://www.usenix.org/conference/osdi22/presentation/yu?ref=engineering.navatech.ai"><em><u>Orca: A Distributed Serving System for Transformer-Based Generative Models</u></em></a>, a paper presented at OSDI &#x2018;22, tackles this problem. Instead of waiting until every sequence in a batch has completed generation, Orca implements <em>iteration-level</em> scheduling where the batch size is determined per iteration. The result is that once a sequence in a batch has completed generation, a new sequence can be inserted in its place, yielding <strong>higher GPU utilization than static batching</strong>.</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.ctfassets.net/xjan103pcp94/744TAv4dJIQqeHcEaz5lko/b823cc2d92bbb0d82eb252901e1dce6d/cb_03_diagram-continuous-batching.png" class="kg-image" alt="cb 03 diagram-continuous-batching" loading="lazy" title="cb 03 diagram-continuous-batching" width="3842" height="874"><figcaption><span>Completing seven sequences using continuous batching. Left shows the batch after a single iteration, right shows the batch after several iterations. Once a sequence emits an end-of-sequence token, we insert a new sequence in its place (i.e. sequences S5, S6, and S7). This achieves higher GPU utilization since the GPU does not wait for all sequences to complete before starting a new one.</span></figcaption></figure>
<p>Reality is a bit more complicated than this simplified model: since the prefill phase takes compute and has a different computational pattern than generation, it cannot be easily batched with the generation of tokens. Continuous batching frameworks currently manage this via hyperparameter: <a href="https://github.com/huggingface/text-generation-inference/blob/f59fb8b630844c2ad2cd80e689202de89d45c37e/launcher/src/main.rs?ref=engineering.navatech.ai#L124-L135"><u>waiting_served_ratio</u></a>, or the ratio of requests waiting for prefill to those waiting end-of-sequence tokens.</p>
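<p>To build intuition for why iteration-level scheduling helps, the toy simulation below (a sketch only, not a serving implementation; the batch size and generation lengths are made up) counts how many forward passes are needed under static versus continuous batching:</p>
<pre><code class="language-python"># Toy comparison of static vs. iteration-level (continuous) batching.
# Generation lengths and batch size are hypothetical.
import random

random.seed(0)
gen_lengths = [random.randint(32, 512) for _ in range(64)]  # tokens to generate per request
BATCH_SIZE = 8

def static_iterations(lengths, batch_size):
    # Each batch runs until its longest sequence finishes.
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))

def continuous_iterations(lengths, batch_size):
    # A finished slot is refilled immediately from the waiting queue.
    waiting = list(lengths)
    slots = [waiting.pop() for _ in range(min(batch_size, len(waiting)))]
    iterations = 0
    while slots:
        iterations += 1
        slots = [s - 1 for s in slots if s &gt; 1]   # drop sequences that just finished
        while waiting and len(slots) &lt; batch_size:
            slots.append(waiting.pop())
    return iterations

print("static batching iterations:    ", static_iterations(gen_lengths, BATCH_SIZE))
print("continuous batching iterations:", continuous_iterations(gen_lengths, BATCH_SIZE))
</code></pre>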
<h2 id="pagedattention-and-vllm">PagedAttention and vLLM</h2>
<p>PagedAttention is a new attention mechanism implemented in <a href="https://vllm.ai/?ref=engineering.navatech.ai"><u>vLLM</u></a> (<a href="https://github.com/vllm-project/vllm/tree/main?ref=engineering.navatech.ai#easy-fast-and-cheap-llm-serving-for-everyone"><u>GitHub</u></a>). It takes inspiration from traditional OS concepts such as <a href="https://en.wikipedia.org/wiki/Memory_paging?ref=engineering.navatech.ai"><u>paging</u></a> and <a href="https://en.wikipedia.org/wiki/Virtual_memory?ref=engineering.navatech.ai"><u>virtual memory</u></a>. They allow the KV cache (what is computed in the &#x201C;prefill&#x201D; phase, discussed above) to be non-contiguous by allocating memory in fixed-size &#x201C;pages&#x201D;, or blocks. The attention mechanism can then be rewritten to operate on block-aligned inputs, allowing attention to be performed on non-contiguous memory ranges.</p>
<p>This means that buffer allocation can happen just-in-time instead of ahead-of-time: when starting a new generation, the framework does not need to allocate a contiguous buffer of size <em>maximum_context_length</em>. Each iteration, the scheduler can decide if it needs more room for a particular generation, and allocate on the fly without any degradation to PagedAttention&#x2019;s performance. This doesn&#x2019;t guarantee perfect utilization of memory (waste is limited to under 4%, and only in the last block), but it significantly improves upon the wastage of the ahead-of-time allocation schemes widely used by the industry today.</p>
<p>Altogether, <strong>PagedAttention + vLLM</strong> enable massive memory savings as most sequences will not consume the entire context window. These memory savings translate directly into a higher batch size, which means higher throughput and cheaper serving.</p>
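<p>As a brief, hedged illustration of how these optimizations are consumed in practice, the offline-inference sketch below uses the vLLM Python API; the model name and sampling parameters are illustrative, and the exact API surface may differ across vLLM versions:</p>
<pre><code class="language-python"># Minimal vLLM offline-batching sketch; continuous batching and PagedAttention
# are applied internally. Model name and parameters are illustrative.
from vllm import LLM, SamplingParams

prompts = [
    "What is the maximum safe wind speed for crane operation?",
    "List the required PPE for working at height.",
]
sampling_params = SamplingParams(temperature=0.2, max_tokens=256)

llm = LLM(model="tiiuae/falcon-7b-instruct")       # loads weights onto the GPU once
outputs = llm.generate(prompts, sampling_params)   # requests are batched for us

for out in outputs:
    print(out.prompt, "-&gt;", out.outputs[0].text)
</code></pre>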
<hr>
<h3 id="production-environment">Production Environment - </h3>
<p>We scaled the production setup described in our previous <a href="http://engineering.navatech.ai/building-conversational-bots-for-health-safety-with-pinecone-and-tii-falcon/">blog</a> and deployed the Falcon LLM in an EKS cluster running Ray Serve and vLLM, moving away from a managed SageMaker endpoint.</p>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2024/01/image.png" class="kg-image" alt loading="lazy" width="2000" height="963" srcset="http://engineering.navatech.ai/content/images/size/w600/2024/01/image.png 600w, http://engineering.navatech.ai/content/images/size/w1000/2024/01/image.png 1000w, http://engineering.navatech.ai/content/images/size/w1600/2024/01/image.png 1600w, http://engineering.navatech.ai/content/images/size/w2400/2024/01/image.png 2400w" sizes="(min-width: 720px) 720px"></figure>
<h3 id="benchmarking-results-throughput">Benchmarking results: Throughput</h3>
<p>Based on our understanding of static batching, we expect continuous batching to perform significantly better when there is higher <em>variance</em> in sequence lengths within each batch. To test this, we run the throughput benchmark four times for both static and continuous batching, configuring the model to always emit a fixed per-sequence generation length by ignoring the end-of-sequence token and setting <em>max_tokens</em>. We then use a simple asyncio Python benchmarking script to submit HTTP requests to our model server, firing all requests in burst fashion so that the compute stays saturated.</p>
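<p>A trimmed-down, hedged sketch of that burst-submission benchmarking script is shown below; the endpoint URL and payload shape are placeholders rather than our internal service:</p>
<pre><code class="language-python"># Hedged sketch of the burst benchmark: all requests are submitted at once so
# that the serving system stays saturated. Endpoint and payload are placeholders.
import asyncio
import time

import aiohttp

ENDPOINT = "http://llm-service.internal:8000/generate"   # placeholder URL
PROMPTS = ["benchmark prompt %d" % i for i in range(256)]

async def one_request(session, prompt):
    async with session.post(ENDPOINT, json={"prompt": prompt, "max_tokens": 128}) as resp:
        data = await resp.json()
        return len(data.get("completion", ""))

async def main():
    start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(one_request(session, p) for p in PROMPTS))
    elapsed = time.perf_counter() - start
    print("%d requests in %.1fs, %.0f chars/s" % (len(PROMPTS), elapsed, sum(results) / elapsed))

asyncio.run(main())
</code></pre>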
<p>The results are as follows:</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.ctfassets.net/xjan103pcp94/1Os82uuLDUkqP90Nlhp3vh/1e783aff1edb97cd25b5139d26083c1c/cb_07_throughput_table.png" class="kg-image" alt="cb 07 throughput table" loading="lazy" title="cb 07 throughput table" width="2655" height="791"><figcaption><span>Throughput in tokens per second of each framework as variance in sequence length increases.</span></figcaption></figure>
<p>What is most impressive here is vLLM. For each dataset, vLLM more than doubles performance compared to naive continuous batching. We have not analyzed which optimization contributes most to vLLM&#x2019;s performance, but we suspect its ability to reserve space dynamically instead of ahead of time allows it to dramatically increase the batch size.</p>
<p>We plot these performance results relative to naive static batching:</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.ctfassets.net/xjan103pcp94/46OIG2WmA2j0SBfcG5fdq7/3bcdebf8014730a1a592a18f023cfdcc/cb_08_throughput_graph.png" class="kg-image" alt="cb 08 throughput graph" loading="lazy" title="cb 08 throughput graph" width="2868" height="1418"><figcaption><span>Our throughput benchmark results presented as improvement multiples over naive static batching, log scale.</span></figcaption></figure>
<h3 id="benchmarking-results-latency">Benchmarking results: Latency</h3>
<p>Live-inference endpoints often face latency-throughput tradeoffs that must be optimized based on user needs. We benchmark latency on a realistic workload and measure how the CDF of latencies changes with each framework.</p>
<p>Similar to the throughput benchmark, we configure the model to always emit a specified number of tokens per request. We measure latencies at both QPS=1 and QPS=4 to see how the latency distribution changes with load.</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.ctfassets.net/xjan103pcp94/4ElanYNZRv3sUBL0459zWV/ce78b3daf7e05f1bb84dad61906f1663/cb_09_latency_table.png" class="kg-image" alt="cb 09 latency table" loading="lazy" title="cb 09 latency table" width="2821" height="793"><figcaption><span>Median generation request latency for each framework, under average load of 1 QPS and 4 QPS. Continuous batching systems improve median latency.</span></figcaption></figure>
<p>We see that while improving throughput, continuous batching systems also <em>improve</em> median latency. This is because continuous batching systems allow new requests to be added to an existing batch at each iteration, whenever there is room. But what about other percentiles? In fact, we find that they improve latency across all percentiles:</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.ctfassets.net/xjan103pcp94/6zynLiX4AJVO23tRfQ1rnV/763589eb4a6418157f21a51e6e36abaf/cb_10_latency_cdf_qps_1.png" class="kg-image" alt="cb 10 latency cdf qps=1" loading="lazy" title="cb 10 latency cdf qps=1" width="4780" height="2287"><figcaption><span>Cumulative distribution function of generation request latencies for each framework with QPS=1. Static batchers and continuous batchers have distinct curve shapes caused by the presence of iteration-level batch scheduling in continuous batchers. All continuous batchers perform approximately equally under this load; FasterTransformers performs noticeably better than static batching on a naive model implementation.</span></figcaption></figure>
<p>The reason why continuous batching improves latency at all percentiles is the same as why it improves latency at p50: new requests can be added regardless of how far into generation other sequences in the batch are. However, like static batching, continuous batching is still limited by how much space is available on the GPU. As your serving system becomes saturated with requests, meaning a higher average batch size, there are fewer opportunities to inject new requests immediately when they are received. We can see this as we increase the average QPS to 4:</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://images.ctfassets.net/xjan103pcp94/2az2DSpj3IujUOOu2i5WPp/5f7457205acae98fcd7fb3170e93b773/cb_11_latency_cdf_qps_4.png" class="kg-image" alt="cb 11 latency cdf qps=4" loading="lazy" title="cb 11 latency cdf qps=4" width="4785" height="2287"><figcaption><span>Cumulative distribution function of generation request latencies for each framework with QPS=4. Compared to QPS=1, FasterTransformer&#x2019;s distribution of latencies becomes more similar to static batching on a naive model. Both Ray Serve and text-generation-inference&#x2019;s continuous batching implementations perform similarly, but noticeably worse than vLLM.</span></figcaption></figure>
<p>Anecdotally, we observe that vLLM becomes saturated around QPS=8 with a throughput near 1900 tokens/s. Comparing these numbers apples-to-apples with the other serving systems requires more experimentation; however, we have shown that continuous batching significantly improves over static batching by 1) reducing latency by injecting new requests immediately when possible, and 2) enabling advanced memory optimizations (in vLLM&#x2019;s case) that increase the QPS a serving system can handle before becoming saturated.</p>
<h2 id="conclusion">Conclusion</h2>
<p>LLMs present some amazing capabilities, and we believe their impact is still mostly undiscovered. We have shared how a new serving technique, continuous batching, works and how it outperforms static batching. It improves throughput by wasting fewer opportunities to schedule new requests, and improves latency by being capable of immediately injecting new requests into the compute stream. We are excited to see what people can do with continuous batching, and where the industry goes from here.</p>
<hr>
<p><strong>Join Our Team of Innovators!</strong></p>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png" class="kg-image" alt loading="lazy" width="1792" height="672" srcset="http://engineering.navatech.ai/content/images/size/w600/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 600w, http://engineering.navatech.ai/content/images/size/w1000/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1000w, http://engineering.navatech.ai/content/images/size/w1600/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1600w, http://engineering.navatech.ai/content/images/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1792w" sizes="(min-width: 720px) 720px"></figure>
<p>Are you a passionate developer seeking exciting opportunities to shape the future of technology? We&apos;re looking for talented individuals to join our dynamic ML/DS team at Navatech Group. If you&apos;re eager to be part of groundbreaking projects and make a real impact, we want to hear from you!</p>
<p>Send your resume to <a href="http://engineering.navatech.ai/building-conversational-bots-for-health-safety-with-pinecone-and-tii-falcon/">careers@navatechgroup.com</a> and take the first step toward a rewarding career with us. Join Navatech Group today and be at the forefront of innovation!</p>]]></content:encoded></item><item><title><![CDATA[Enhancing Health & Safety with TII Falcon and Pinecone powered conversational agents]]></title><description><![CDATA[Navatech uses Falcon LLM & Pinecone DB for real-time information, monitoring, training, and incident reporting across multiple industries and transforming health and safety. ]]></description><link>http://engineering.navatech.ai/building-conversational-bots-for-health-safety-with-pinecone-and-tii-falcon/</link><guid isPermaLink="false">64fee1f02dc67c0bbd91b62b</guid><category><![CDATA[ml]]></category><category><![CDATA[llm]]></category><category><![CDATA[falcon]]></category><category><![CDATA[tii]]></category><category><![CDATA[langchain]]></category><category><![CDATA[lambda]]></category><category><![CDATA[pinecone]]></category><category><![CDATA[conversational agent]]></category><dc:creator><![CDATA[Joinal Ahmed]]></dc:creator><pubDate>Mon, 11 Sep 2023 11:35:26 GMT</pubDate><media:content url="http://engineering.navatech.ai/content/images/2023/09/joy_banner_image_Building_Conversational_bots_for_Health__Safet_17e10224-797e-47e8-ab91-2a74bd35f6e2.png" medium="image"/><content:encoded><![CDATA[<img src="http://engineering.navatech.ai/content/images/2023/09/joy_banner_image_Building_Conversational_bots_for_Health__Safet_17e10224-797e-47e8-ab91-2a74bd35f6e2.png" alt="Enhancing Health &amp; Safety with TII Falcon and Pinecone powered conversational agents"><p>In our fast-paced world, technology continues to play an increasingly vital role in elevating health and safety standards across diverse industries. One of the most promising advancements in this arena is the emergence of conversational agents driven by artificial intelligence (AI). These agents offer real-time information, guidance, and support, making them indispensable tools for safeguarding health and safety in a wide array of environments. This blog post will delve into our approach to harnessing the power of TII&apos;s Falcon Large Language Model (LLM) and Pinecone Vector Database  to craft tailored conversational agents for health and safety applications.</p>
<p>Conversational agents, often known as chatbots or virtual assistants, have evolved significantly from their humble text-matching origins to sophisticated AI-driven solutions capable of comprehending natural language, context, and semantics. This evolution has unlocked a plethora of possibilities for enhancing health and safety measures in various sectors, including manufacturing, healthcare, and construction.</p>
<figure class="kg-card kg-image-card"><img src="http://engineering.navatech.ai/content/images/2023/09/joy_workers_using_ai_driven_applications_in_day_to_day_works_en_0b9d346d-6feb-42d3-8dd2-ede026c66d29.png" class="kg-image" alt="Enhancing Health &amp; Safety with TII Falcon and Pinecone powered conversational agents" loading="lazy" width="1024" height="1024" srcset="http://engineering.navatech.ai/content/images/size/w600/2023/09/joy_workers_using_ai_driven_applications_in_day_to_day_works_en_0b9d346d-6feb-42d3-8dd2-ede026c66d29.png 600w, http://engineering.navatech.ai/content/images/size/w1000/2023/09/joy_workers_using_ai_driven_applications_in_day_to_day_works_en_0b9d346d-6feb-42d3-8dd2-ede026c66d29.png 1000w, http://engineering.navatech.ai/content/images/2023/09/joy_workers_using_ai_driven_applications_in_day_to_day_works_en_0b9d346d-6feb-42d3-8dd2-ede026c66d29.png 1024w" sizes="(min-width: 720px) 720px"></figure>
<p>Let&apos;s explore the capabilities of TII Falcon and Pinecone before diving into the conversational agent development process.</p>
<p>The <a href="https://falconllm.tii.ae/?ref=engineering.navatech.ai">Technology Innovation Institute in Abu Dhabi</a> has introduced Falcon, a groundbreaking series of language models. The Falcon family comprises three base models: <strong>Falcon-180B, Falcon-40B and Falcon-7B</strong>. <a href="https://huggingface.co/blog/falcon-180b?ref=engineering.navatech.ai">Falcon-180B</a> sets a new state-of-the-art for open models. It is the largest openly available language model, with 180 billion parameters, and was trained on a massive 3.5 trillion tokens using TII&apos;s <a href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb?ref=engineering.navatech.ai">RefinedWeb</a> dataset. <strong>The 180b parameter model currently leads the Open LLM Leaderboard</strong>, while the 7B model excels in its weight class. </p>
<p>Notably, Falcon-180B is an exceptional open source model, surpassing many closed-source counterparts in capabilities. This development presents significant opportunities for professionals, enthusiasts, and industries alike, paving the way for exciting applications. At Navatech, we&apos;ve chosen to harness this locally developed Large Language Model (LLM) to construct conversational agents that deliver health and safety information to our users via mobile devices. We&apos;ve deployed the 7B model on our AWS tenant as an operational SageMaker Endpoint, serving as the backbone for our conversational agents.</p>
<p>Considering that the Falcon model isn&apos;t initially tailored to the Health and Safety domain, we&apos;ve taken the approach of creating an external knowledge base using <a href="https://www.pinecone.io/?ref=engineering.navatech.ai">Pinecone Vector DB</a>. Vector databases are uniquely designed to manage the intricate structure of vector embeddings&#x2014;dense numerical vectors representing text. These embeddings capture word meanings and semantic relationships, which are pivotal for machine learning applications. Pinecone indexes these vectors for efficient search and retrieval, making it an ideal tool for natural language processing and AI-driven applications.</p>
<p><strong>Why use Pinecone with Falcon LLM?</strong> Pinecone can quickly search for similar data points in a database by representing data as vectors. This makes it ideal for a range of use cases, including semantic search, similarity search for images and audio, recommendation systems, record matching, anomaly detection, and more. We use Pinecone to build NLP systems that understand the meaning of words and retrieve similar text (documents) based on semantic similarity; the Falcon model then uses these documents to give the relevant information to the user, rewriting the answer to suit the query where needed.</p>
<hr>
<p><strong>Building a Conversational Agent with LLMs and a Vector Database:</strong></p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://engineering.navatech.ai/content/images/2023/09/Untitled-presentation.png" class="kg-image" alt="Enhancing Health &amp; Safety with TII Falcon and Pinecone powered conversational agents" loading="lazy" width="960" height="540" srcset="http://engineering.navatech.ai/content/images/size/w600/2023/09/Untitled-presentation.png 600w, http://engineering.navatech.ai/content/images/2023/09/Untitled-presentation.png 960w" sizes="(min-width: 720px) 720px"><figcaption><span>LLM+Vector DB for building conversational agents</span></figcaption></figure>
<p><strong>In Stage 1</strong> of our process, we begin by <strong>ingesting knowledge base sources into the vector store</strong>. This involves a series of steps, starting with the meticulous reading of documents, which may be in the form of PDF files within the notebook. Subsequently, we break down these documents into smaller, more manageable chunks, ensuring that we include relevant sections that provide context to the prompts. The next crucial step is the generation of embeddings for each of these chunked documents. These embeddings, which are vector representations, capture the semantic meaning of the text and are essential for our system&apos;s understanding. Finally, we add these document embeddings to the vector store, ensuring accessibility for similarity searches and retrievals.</p>
<p><strong>In Stage 2,</strong> we pivot to <strong>user interaction with our model</strong>. This stage begins with the user providing a prompt or asking a question. We then generate a user prompt embedding, which helps in capturing the semantic essence of the user&apos;s input. Subsequently, our system searches the vector store, seeking the nearest embeddings representing documents closely related to the user&apos;s prompt. Upon retrieval, we extract the actual text from these embeddings, which serve as valuable contextual information and seamlessly integrate it with the user&apos;s prompt, enriching it. This enhanced user prompt is then dispatched to our Large Language Model (LLM). Lastly, the <strong>LLM processes the augmented prompt and furnishes a summarized response to the user, complete with references to the sources from our knowledge base, facilitating a comprehensive and informative interaction.</strong></p>
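<p>Hedged pseudocode for the two stages, written against a recent Pinecone Python client (the <code>embed</code> function and the index name are placeholders; the production pipeline additionally tracks chunk metadata and prompt templates):</p>
<pre><code class="language-python"># Hedged sketch of Stage 1 (ingest) and Stage 2 (retrieve and augment).
# embed() is a placeholder for whichever embedding model is used.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("hse-knowledge-base")   # placeholder index name

def embed(text):
    raise NotImplementedError("placeholder: call your embedding model here")

# Stage 1: chunk documents, embed each chunk, add the embeddings to the vector store
def ingest(chunks):
    index.upsert(vectors=[
        {"id": c["id"], "values": embed(c["text"]),
         "metadata": {"text": c["text"], "source": c["source"]}}
        for c in chunks
    ])

# Stage 2: embed the user prompt, retrieve nearest chunks, build an augmented prompt
def build_prompt(question, top_k=4):
    hits = index.query(vector=embed(question), top_k=top_k, include_metadata=True)
    context = "\n\n".join(m.metadata["text"] for m in hits.matches)
    return ("Answer using only the context below.\n\n"
            "Context:\n%s\n\nQuestion: %s" % (context, question))
</code></pre>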
<p>Now that we&apos;ve got the hang of the two main pieces of our conversational agent, let&apos;s dive into how we put them together in the agent&apos;s architecture and see what they can really do in tandem.</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://engineering.navatech.ai/content/images/2023/09/Untitled-Diagram.drawio.png" class="kg-image" alt="Enhancing Health &amp; Safety with TII Falcon and Pinecone powered conversational agents" loading="lazy" width="1602" height="764" srcset="http://engineering.navatech.ai/content/images/size/w600/2023/09/Untitled-Diagram.drawio.png 600w, http://engineering.navatech.ai/content/images/size/w1000/2023/09/Untitled-Diagram.drawio.png 1000w, http://engineering.navatech.ai/content/images/size/w1600/2023/09/Untitled-Diagram.drawio.png 1600w, http://engineering.navatech.ai/content/images/2023/09/Untitled-Diagram.drawio.png 1602w" sizes="(min-width: 720px) 720px"><figcaption><span>HSE Conversational Agent architecture</span></figcaption></figure>
<p>In the diagram above, we&apos;ve encapsulated the conversational agent, constructed using <a href="https://www.langchain.com/?ref=engineering.navatech.ai">Langchain</a>, within a Lambda function. When a user submits a query, the Langchain component initiates a call to the Pinecone vector index to retrieve relevant documents. These documents are passed to the Falcon model hosted as a SageMaker endpoint to obtain an accurate response. Subsequently, the response is formatted according to the user&apos;s query and returned to the user. To facilitate seamless integration with our web and mobile applications, we&apos;ve exposed this Lambda function through an API Gateway.</p>
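<p>A hedged skeleton of that Lambda handler is shown below; the endpoint name and payload shapes are placeholders, and the real function also handles conversation state, response formatting and error cases. The <code>build_prompt</code> helper is the retrieval step sketched earlier:</p>
<pre><code class="language-python"># Hedged skeleton of the Lambda entry point that wires Pinecone retrieval
# to the Falcon SageMaker endpoint. Names and payload shapes are placeholders.
import json

import boto3

sagemaker = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = "falcon-7b-endpoint"   # placeholder endpoint name

def lambda_handler(event, context):
    question = json.loads(event["body"])["query"]
    prompt = build_prompt(question)   # retrieval step from the earlier sketch
    response = sagemaker.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 256}}),
    )
    answer = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
</code></pre>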
<h2 id="whats-next"><strong>Whats next ?</strong></h2>
<p>Our  roadmap includes a significant effort to refine and adapt Falcon, our conversational AI model, to excel in the domain of health and safety (HSE). This involves a comprehensive process of fine-tuning, where we meticulously train Falcon to provide not just information but highly accurate and contextually relevant guidance to our users. To achieve this, we&apos;re committed to encompassing a broad spectrum of Health, Safety, and Environment (HSE) guidelines sourced from regulatory and compliance agencies worldwide. These agencies represent diverse regions and industries with unique HSE standards and regulations. Our ultimate aim is to transform Falcon into a global repository of HSE knowledge, making it an invaluable resource for users regardless of their geographical location or industry focus. By doing so, we&apos;re not only ensuring precision but also promoting compliance with HSE regulations on a global scale. Our commitment is driven by a passion for enhancing safety and well-being across industries and borders.</p>
<hr>
<p><strong>Join Our Team of Innovators!</strong></p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://engineering.navatech.ai/content/images/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png" class="kg-image" alt="Enhancing Health &amp; Safety with TII Falcon and Pinecone powered conversational agents" loading="lazy" width="1792" height="672" srcset="http://engineering.navatech.ai/content/images/size/w600/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 600w, http://engineering.navatech.ai/content/images/size/w1000/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1000w, http://engineering.navatech.ai/content/images/size/w1600/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1600w, http://engineering.navatech.ai/content/images/2023/09/joy_developers_working_on_exciting_ai_projects_making_ground_br_ac23b477-e2ab-4b1c-87a7-9dc6c12d55f2.png 1792w" sizes="(min-width: 720px) 720px"><figcaption><b><strong>Join Our Team of Innovators!</strong></b></figcaption></figure>
<p>Are you a passionate developer seeking exciting opportunities to shape the future of technology? We&apos;re looking for talented individuals to join our dynamic team at Navatech Group. If you&apos;re eager to be part of groundbreaking projects and make a real impact, we want to hear from you!</p>
<p>Send your resume to <a href>careers@navatechgroup.com</a> and take the first step toward a rewarding career with us. Join Navatech Group today and be at the forefront of innovation!</p>]]></content:encoded></item></channel></rss>