Engineering

From 0 to 2 Million Calls: Building Voice AI That Actually Scales

Simpragma Team
January 25, 2026
15 min read

Everyone in voice AI talks about scale. Very few have actually done it.

We have. Breeze by Simpragma powers one of the largest AI voice agent deployments in India — 2 million calls per month for a leading microfinance company's collections operation, with 60M+ calls processed in total. We didn't start there. We started at zero, like everyone else.

The journey from proof-of-concept to 2 million calls taught us things that no whitepaper, conference talk, or vendor pitch will tell you. This article shares those lessons — the architecture decisions, the scaling challenges, and the non-obvious failures that almost killed the project.

If you're building or buying voice AI at scale, this is the article you need.

Phase 1: The Proof of Concept (0 to 10,000 Calls)

Every voice AI project starts the same way: someone builds a demo, it sounds impressive in a meeting room, and leadership says "let's pilot this."

The pilot is the easy part. And also the most dangerous — because what works at 10,000 calls will catastrophically fail at 1 million.

What We Built

A basic voice agent that could:

  • Call a borrower from a list
  • Deliver a payment reminder in Hindi
  • Understand simple responses ("yes," "no," "I'll pay tomorrow")
  • Log the outcome to a spreadsheet

What We Got Right

  • Started narrow. One use case (payment reminders), one language (Hindi), one borrower segment (1-15 days past due). Complexity is the enemy of pilots.
  • Used real borrower data. Not test scripts. Real loan amounts, real names, real due dates. This immediately exposed edge cases that synthetic data would have hidden.
  • Measured what mattered. Not "call completion rate" — that's a vanity metric. We measured promise-to-pay rate and compared it to human agents on the same segment.

What We Got Wrong

  • Used Twilio for telephony. Easy to set up. Expensive to scale. We'd pay for this decision later.
  • Ran everything on a single server. Fine for 100 concurrent calls. A disaster waiting to happen.
  • Underestimated Indian accent diversity. Our ASR model was trained primarily on "standard" Hindi. Borrowers in Bihar, UP, and Rajasthan speak Hindi very differently. Recognition accuracy dropped 15-20% for regional accents.

Result

The pilot showed promise. Promise-to-pay rates were 82% of human agent levels. Cost was 40% lower. Leadership greenlit the next phase.

Phase 2: Production Readiness (10,000 to 200,000 Calls)

This is where most voice AI projects die. The pilot worked. Now make it work reliably, at 20x the volume, with real production requirements.

The Architecture Shift

We threw away the pilot architecture and rebuilt for production:

Telephony: Migrated from Twilio to our own Asterisk-based SIP stack. This was the single most important decision we made. At 200K calls/month:

  • Twilio cost: ~$12,000/month in telephony alone
  • Own Asterisk stack: ~$2,000/month
  • Savings: $10,000/month — and growing linearly with volume

The tradeoff? We had to manage our own SIP trunks, handle failover, build monitoring, and deal with Indian telco quirks. It's not plug-and-play. But at scale, it's non-negotiable.

ASR (Speech Recognition): We moved from a generic model to fine-tuned models for Indian languages. Key decisions:

  • Separate models for Hindi, Tamil, Telugu, Kannada (not one multilingual model — accuracy is better with specialised models)
  • Custom acoustic models trained on real collection call recordings (with consent)
  • Streaming ASR for real-time processing (no waiting for the borrower to finish speaking)

LLM (Conversation Engine): Moved from a simple decision-tree to an LLM-powered conversation engine:

  • Intent classification with fallback logic
  • Context management (the agent remembers what the borrower said 30 seconds ago)
  • Objection handling flows trained on transcripts from top-performing human agents
  • Guardrails to prevent hallucination (critical in collections — you can't make up loan amounts)

TTS (Speech Synthesis): Upgraded from basic TTS to neural voices:

  • Natural prosody (the voice sounds human, not robotic)
  • Language-appropriate voices (a Hindi voice that sounds like an actual Hindi speaker, not a translated English voice)
  • Low latency (<200ms from text to audio) — critical for conversational flow

The Integration Layer

At production scale, the voice agent needs to talk to everything:

  • Loan Management System: Real-time borrower data (loan amount, EMI, overdue days, payment history)
  • Payment Gateway: Send payment links via SMS during or after the call
  • CRM: Log dispositions, schedule follow-ups, trigger escalations
  • Compliance Engine: Check time-of-day restrictions, DND lists, call frequency limits
  • Analytics Pipeline: Stream call data for real-time dashboards and reporting

Each integration is a potential failure point. We built circuit breakers, fallback modes, and degraded-operation paths for every external dependency. If the LMS is down, the agent can still make reminder calls with cached data. If the payment gateway is slow, the agent sends the link after the call instead of during.
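The circuit-breaker-plus-fallback pattern described above can be sketched in a few lines. This is a minimal illustration, not Breeze's actual implementation; the `fetch_borrower` helper, thresholds, and cache interface are hypothetical:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips open after repeated failures,
    then allows a trial request once a cooldown elapses."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None => circuit closed (dependency healthy)

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: let one trial request through after the cooldown.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


def fetch_borrower(lms_call, cache, breaker, borrower_id):
    """Try the live LMS; on an open circuit or an error, degrade to cached data."""
    if breaker.allow_request():
        try:
            data = lms_call(borrower_id)
            breaker.record_success()
            return data
        except Exception:
            breaker.record_failure()
    return cache.get(borrower_id)  # degraded mode: reminder call with cached data
```

The key property is that a dead LMS costs one failed probe per cooldown period rather than a timeout on every single call.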

The Reliability Problem

At 200K calls/month, you're making ~7,000 calls per day. Things that fail 0.1% of the time now fail 7 times daily. Things that fail 1% of the time fail 70 times daily.

We built:

  • Health monitoring for every component (ASR, LLM, TTS, telephony, integrations)
  • Automatic failover between SIP trunks (if one telco has issues, calls route to another)
  • Graceful degradation (if TTS is slow, use pre-recorded fallback audio for common phrases)
  • Call-level retry logic (if a call fails due to infrastructure issues, it re-queues automatically)
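The call-level retry logic can be sketched as capped exponential backoff with jitter. The constants below are illustrative, not our production values; the jitter matters because a trunk outage fails thousands of calls at once, and you don't want them all re-queuing in the same second:

```python
import random

MAX_ATTEMPTS = 3      # infrastructure retries only; "no answer" is campaign logic
BASE_DELAY_S = 60     # first retry after ~1 minute
MAX_DELAY_S = 1800    # cap at 30 minutes

def next_retry_delay(attempt, rng=random.random):
    """Seconds to wait before re-queuing a failed call, or None to give up.

    attempt is 1 for the first failure. Exponential backoff with full
    jitter spreads re-queued calls out instead of creating a thundering herd.
    """
    if attempt >= MAX_ATTEMPTS:
        return None  # hand the record back to the next campaign run
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** (attempt - 1)))
    return ceiling * rng()
```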

Result

200K calls/month running reliably. Cost per call dropped 55% vs human agents. Recovery rates within 8% of human baseline. Ready to scale.

Phase 3: Scale (200,000 to 2,000,000 Calls)

Scaling 10x is not doing the same thing 10 times bigger. It's a fundamentally different engineering challenge.

Concurrency

200K calls/month = ~300 concurrent calls at peak.
2M calls/month = ~3,000 concurrent calls at peak.
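These concurrency figures fall out of Little's law (concurrency = arrival rate × holding time). A back-of-envelope sketch, with the peak-hour share and average call length as illustrative assumptions rather than measured figures:

```python
def peak_concurrency(calls_per_month, peak_hour_share, avg_call_seconds):
    """Little's law: concurrent calls = arrivals per second x seconds per call.

    peak_hour_share is the fraction of the month's calls placed in the
    single busiest hour (assumption for illustration).
    """
    peak_hour_calls = calls_per_month * peak_hour_share
    arrivals_per_second = peak_hour_calls / 3600
    return arrivals_per_second * avg_call_seconds

# Assuming ~3% of monthly volume lands in the busiest hour, ~3 min per call:
print(round(peak_concurrency(200_000, 0.03, 180)))    # 300
print(round(peak_concurrency(2_000_000, 0.03, 180)))  # 3000
```

Capacity planning then adds burst headroom on top of this steady-state estimate.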

Every component had to be re-architected for 10x concurrency:

Telephony: Multiple SIP trunk providers, load-balanced across regions. Geographic routing (calls to Tamil Nadu borrowers route through Chennai SIP trunks for lower latency and cost). Capacity planning with burst headroom.

ASR/TTS: Horizontally scaled processing clusters. Pre-warming for anticipated peak hours (10 AM and 6 PM are collection call peaks in India). GPU allocation for neural TTS.

LLM: Model serving optimised for throughput, not just latency. Batched inference where possible. Model quantisation for cost efficiency without quality loss. Fallback to simpler models under extreme load.

Database: Time-series data for call analytics grew from manageable to massive. Moved to columnar storage for analytics queries. Separate operational and analytical databases.

The Campaign Engine

At 2M calls, you're not just "making calls." You're running campaigns:

  • Segmentation: Different conversation flows for different delinquency buckets (1-15 DPD gets a gentle reminder; 30-60 DPD gets a firmer tone with payment plan options)
  • Scheduling: Optimal call times vary by segment. Daily-wage workers: early morning. Salaried: evening. Self-employed: variable.
  • Retry logic: If a call doesn't connect, when do you retry? How many times? At what intervals? This is a combinatorial optimisation problem at scale.
  • Throttling: Telcos have rate limits. Regulators have frequency limits. You can't call the same borrower 5 times a day. The campaign engine manages all constraints.
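At its core, the campaign engine is a gate that every candidate call must pass before it is queued. A simplified sketch; the caps, window hours, and field names are hypothetical stand-ins for whatever your regulator and telco contracts actually specify:

```python
from datetime import datetime, time

MAX_ATTEMPTS_PER_DAY = 2  # illustrative frequency cap, not a real regulatory value
CALL_WINDOW = (time(8, 0), time(19, 0))  # illustrative permitted window

def may_dial(borrower, now, attempts_today, dnd_list, trunk_load, trunk_capacity):
    """Every constraint must pass; any single failure blocks the call."""
    if borrower["phone"] in dnd_list:            # regulatory: DND registry
        return False
    if attempts_today >= MAX_ATTEMPTS_PER_DAY:   # regulatory: frequency limit
        return False
    start, end = CALL_WINDOW
    if not (start <= now.time() < end):          # regulatory: time-of-day window
        return False
    if trunk_load >= trunk_capacity:             # telco: rate limit / channel cap
        return False
    return True
```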

Language Expansion

At pilot, we supported Hindi. At 2M calls, we needed:

  • Hindi (multiple dialects)
  • Tamil
  • Telugu
  • Kannada
  • Marathi
  • Bengali
  • Gujarati
  • Malayalam
  • Odia
  • Punjabi
  • Assamese
  • English (Indian)

Each language requires its own ASR model, TTS voice, and conversation flow. Some languages have limited training data. Some have complex morphology that challenges ASR. Building truly multilingual voice AI for India is one of the hardest problems in the space.

Cost at Scale

Here's what 2M calls/month costs with own infrastructure vs. Twilio:

Component          Own Stack    Twilio-Based
Telephony          $10,000      $60,000-80,000
ASR/TTS compute    $15,000      $15,000
LLM inference      $8,000       $8,000
Infrastructure     $5,000       $5,000
Total              $38,000      $88,000-108,000

Own stack saves $50,000-70,000 per month at this scale. Over a year, that's $600K-840K. This is why telephony architecture is the most important decision in voice AI at scale.

What Almost Killed Us

The 3 AM incident: A SIP trunk provider had an unannounced maintenance window. 40% of calls started failing. Our monitoring caught it in 8 minutes. Automatic failover kicked in within 12 minutes. But for those 12 minutes, hundreds of calls dropped. Lesson: always have N+1 redundancy on telephony, and never trust a single provider.

The dialect problem: We deployed Telugu in Telangana and Andhra Pradesh simultaneously. The dialects are different enough that our ASR model (trained primarily on Hyderabad Telugu) had 25% error rate in rural AP. We had to split the model and retrain. Lesson: "one language" is never one language in India.

The compliance scare: A borrower filed a complaint that the AI called them during restricted hours. Investigation showed the call was at 8:01 PM — one minute past the permitted window — due to a timezone handling bug. One minute. Lesson: compliance edge cases will find you. Build margins into every restriction.
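"Build margins into every restriction" looks like this in practice: evaluate the window in the borrower's timezone and shrink it by a safety margin, so a conversion or clock-skew bug cannot push a call past the cutoff. The window hours and margin below are illustrative, not the actual regulatory values:

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

# Hypothetical permitted window; exact hours depend on the applicable regulation.
WINDOW_START = time(8, 0)
WINDOW_END = time(20, 0)
SAFETY_MARGIN = timedelta(minutes=15)  # stop dialling well before the hard cutoff

def within_call_window(utc_now, borrower_tz="Asia/Kolkata"):
    """True only if the borrower's local time is inside the shrunken window."""
    local = utc_now.astimezone(ZoneInfo(borrower_tz))
    earliest = datetime.combine(local.date(), WINDOW_START, local.tzinfo) + SAFETY_MARGIN
    latest = datetime.combine(local.date(), WINDOW_END, local.tzinfo) - SAFETY_MARGIN
    return earliest <= local <= latest
```

With this margin, the 8:01 PM call would have been blocked at 7:45 PM, fifteen minutes before the window actually closed.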

Lessons for Technical Buyers

If you're evaluating voice AI platforms for scale deployment:

1. Ask About Telephony Architecture

If they're on Twilio, calculate your telephony cost at your target volume. Then add 50% for safety. If the number scares you, look for platforms with their own stack.

2. Demand Proof of Scale

"We can handle millions of calls" is easy to say. Ask for a reference customer at your target volume. Ask about their uptime. Ask about their worst incident.

3. Plan for Languages From Day One

Adding languages later is 10x harder than building for multilingual from the start. If you need Indian languages, make sure the platform has real (not demo-quality) support.

4. Build the Hybrid Model

AI handles volume. Humans handle complexity. The interface between them — the escalation logic, the handoff experience, the feedback loop — is where the real engineering challenge lives.

5. Invest in Monitoring

At scale, you need real-time visibility into every component. Call success rates, ASR accuracy, LLM latency, TTS quality, integration health. If you can't see it, you can't fix it.

6. Own Your Data

Every call generates data that makes the system better. ASR accuracy improves with more training data. Conversation flows improve with more examples. The platform that owns your data owns your competitive advantage. Make sure you own it.

Where We're Going

2 million calls/month is our current reality. But voice AI at scale is still early. Here's what's coming:

  • Real-time emotion detection to adjust conversation tone dynamically
  • Predictive dialling that optimises call timing based on historical connection patterns
  • Multi-modal interactions — starting a call, continuing on WhatsApp, completing with a payment link
  • Self-improving agents that learn from every call without manual retraining

The companies that build this infrastructure now will have an insurmountable data and cost advantage in 2-3 years. The ones that wait will be buying commodity tools at commodity prices.


Building voice AI at scale? Let's talk architecture.

→ Talk to our engineering team

Breeze by Simpragma: 60M+ calls processed. 2M+/month in production. Own telephony stack. 12+ languages. Solutions that launch in minutes.

→ See pricing | → Read: The Real Cost of AI Voice Agents in 2026
