How We Built VocalDesk: An Honest Look at Building AI Phone Agents on Bland.ai + Twilio

By Sharath Challa · 9 min read
#AI Agents · #Voice AI · #Bland.ai · #Twilio · #Architecture · #Phone Agents

There's a lot of marketing copy out there about "AI phone agents." Almost none of it is engineering. This post is the engineering version — what we actually built when we built VocalDesk, what stack we chose and why, what the third-party services actually do (and don't do), and the production layer you have to build yourself if you want something a real business will trust on its real phone numbers.

If you're evaluating Bland.ai, comparing it to Retell or Vapi, deciding whether to roll your own with Twilio + Deepgram + ElevenLabs, or just trying to understand what the inside of a real voice-agent product looks like — this is for you.

The decision: build the voice stack ourselves, or buy it?

When we started VocalDesk, the first real architectural choice was "do we own the voice infrastructure or do we rent it?"

Owning it looks like: Twilio Programmable Voice for telephony, Deepgram or AssemblyAI for streaming STT, an LLM (Claude, GPT-4) for the conversation logic, ElevenLabs or Cartesia for TTS, and a Node service stitching it all together with WebRTC or media streams. You write the turn-taking logic. You handle interruptions. You manage the streaming pipelines. You optimize for latency end-to-end.

Renting it looks like: pick a vertically integrated provider — Bland.ai, Retell, Vapi — that ships the entire pipeline and exposes a single API. You give up control of the voice stack to get a 30-day path to production.

We chose to rent. Specifically, Bland.ai for the voice runtime, Twilio for telephony (BYOT), and a custom application layer on top. Here's the math behind that choice and the reality behind the trade-offs.

What Bland.ai actually does

Bland is a vertically integrated voice stack. They run their own STT, their own LLM orchestration layer, and their own TTS, on dedicated GPU infrastructure they market as latency-optimized. Their developer surface is API-first:

  • A REST API to dispatch outbound calls and configure inbound numbers
  • Conversational Pathways — a visual node graph where you define the call flow (Default nodes for the LLM, Webhook nodes that hit your backend mid-call, Knowledge Base nodes, Transfer nodes, End Call nodes), with {{variable}} extraction and conditional transitions
  • Real-time webhooks (mid-call, your service responds and the response gets injected into the dialogue)
  • Post-call webhooks (transcript, recording URL, extracted variables fire to your endpoint when the call ends)
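
The real-time webhook is the piece worth internalizing: Bland POSTs mid-call state to your service, and whatever you return gets injected into the dialogue. A minimal sketch of that handler, where the payload shape and field names are our illustrative assumptions, not Bland's documented schema:

```typescript
// Sketch of a mid-call webhook handler. Bland POSTs conversation state;
// the returned text is what the agent speaks next. Payload fields here
// (call_id, variables) are assumptions for illustration.
interface MidCallRequest {
  call_id: string;
  variables: Record<string, string>;
}

function handleMidCall(
  req: MidCallRequest,
  lookupStatus: (accountId: string) => string | undefined
): { response: string } {
  const status = lookupStatus(req.variables["account_id"] ?? "");
  return {
    response: status
      ? `Your account is currently ${status}.`
      : "I couldn't find that account. Let me take your details and have someone follow up.",
  };
}
```

The important property: this handler sits in the caller's perceived latency budget, which is why (as covered below) it should read from a cache, not a live CRM.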

Pathways is the core authoring abstraction. If you've built chatbots, think of it as a state machine where each state is an LLM-driven node and the edges are conditions on extracted variables. It works for structured calls — appointment booking, lead qualification, simple support — and it falls down on open-ended conversations that need reasoning beyond a single node prompt.
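
The state-machine framing can be made concrete. A toy model, with names that are ours rather than Bland's:

```typescript
// Toy model of a Pathway: each state is a prompt-driven node, each edge
// is a condition on extracted variables. Node and type names are ours.
type Vars = Record<string, string>;

interface PathwayNode {
  id: string;
  prompt: string; // what the LLM is told to accomplish in this state
  edges: { when: (vars: Vars) => boolean; to: string }[];
}

function nextNode(nodes: Map<string, PathwayNode>, current: string, vars: Vars): string {
  const node = nodes.get(current);
  if (!node) throw new Error(`unknown node ${current}`);
  const edge = node.edges.find((e) => e.when(vars));
  return edge ? edge.to : current; // no condition matched: stay and re-prompt
}
```

The failure mode described above falls out of this model: a single node's prompt has to carry any reasoning the flow can't express as edges, which is exactly where open-ended conversations strain it.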

What Bland gives you, all-in: voice in, voice out, a flow engine, telephony bundled or BYOT. What you avoid building: streaming STT, LLM orchestration with interruption handling, low-latency TTS, codec negotiation, dual-stream audio, voice activity detection, the entire mess.

Why Twilio (BYOT) instead of Bland's own numbers

Bland sells phone numbers natively. We didn't use them. We used Twilio numbers and connected them to Bland through their Bring Your Own Twilio (BYOT) flow:

  1. Generate Twilio Account SID + Auth Token in the Twilio Console
  2. POST them to Bland to get an encrypted_key (shown once)
  3. Import existing Twilio numbers into Bland via the Add-ons surface
  4. For outbound calls, include encrypted_key and a Twilio-owned from number in the call request
  5. For inbound, configure the imported number with the Authorization + encrypted_key headers
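
Steps 4 and 5 reduce to a small amount of request plumbing. A sketch of the outbound call request, where `encrypted_key` and the `from` number come from the flow above but the endpoint URL and remaining field names are assumptions to check against Bland's current docs:

```typescript
// Sketch: outbound BYOT call request. encrypted_key is the one-time key
// from step 2; `from` must be a number your Twilio account owns.
// Endpoint and field names beyond those are illustrative assumptions.
function byotOutboundCall(to: string, twilioFrom: string, encryptedKey: string) {
  return {
    url: "https://api.bland.ai/v1/calls", // assumed endpoint; verify against docs
    headers: {
      authorization: "<your Bland API key>",
      encrypted_key: encryptedKey,
      "content-type": "application/json",
    },
    body: JSON.stringify({
      phone_number: to,
      from: twilioFrom,
    }),
  };
}
```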

The reason for the extra step: Twilio is the most universally trusted carrier substrate in the developer world. Customer compliance teams know Twilio. Number portability through Twilio is well-trodden. Twilio's deliverability, A2P 10DLC registration, STIR/SHAKEN attestation, and DNC list integrations are mature in a way that's hard to verify on a young provider. If a customer wants their phone numbers to live somewhere they can move, Twilio is the answer.

The cost is one extra hop in the request flow and a small bit of credential plumbing. Worth it.

The latency reality

Bland's marketing claims the "lowest latency on the planet." They don't publish a concrete millisecond figure. Third-party reviews — including from competitors and from neutral developers — consistently put real-world turn-taking around 800ms average, with tail latency reaching ~2.5s.

This is fine for many use cases. It's not fine for all.

Conversational interruption handling — where the user starts talking over the agent — works but isn't crisp. Long-running webhook calls to your backend will be felt by the caller. If you push a real-time webhook node that hits a slow CRM, the silence on the line is real.

Practically, what we did:

  • Aggressively cache anything we'd otherwise look up in real time (customer profile, recent ticket history, account status). Cache on call start; let the in-call webhook nodes hit the cache, not the source of truth.
  • Anything that takes more than ~300ms moves to a post-call workflow. The agent says "I'm getting that for you," logs the request, and the lookup happens after the call with a follow-up SMS or email.
  • Use Bland's wait_for_response semantics with realistic timeouts. A 1.5s timeout sounds fine on paper and feels broken on a live call; 3-4s feels like a real conversation.
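
The first two mitigations can be sketched together: a cache warmed at call start, and a deadline that shunts anything slow into post-call work. A minimal version, assuming our own function names:

```typescript
// In-call cache, warmed when the call starts. Webhook nodes read this
// instead of hitting the CRM inside the caller's latency budget.
const callCache = new Map<string, Record<string, unknown>>();

async function onCallStart(
  callId: string,
  loadProfile: () => Promise<Record<string, unknown>>
) {
  callCache.set(callId, await loadProfile());
}

// Race real work against a deadline. Past it, return the deferred answer
// ("I'm getting that for you") and let a post-call job finish the lookup.
function withDeadline<T>(work: Promise<T>, ms: number, deferred: () => T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(deferred()), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer) clearTimeout(timer);
  });
}
```

In practice the deadline sits around the ~300ms threshold mentioned above; the deferred branch logs a follow-up job rather than leaving dead air on the line.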

If your use case genuinely needs sub-300ms turn-taking — say you're building a real-time AI co-pilot on top of human calls — Bland is the wrong choice. Build it yourself.

What Bland doesn't do (and what VocalDesk had to build)

This is the part most evaluations skip. Pathways stops at the dial tone. Everything before and after a call is yours. Specifically:

Post-call workflows. When a call ends, Bland POSTs a payload — transcript, recording URL, extracted variables, call ID — to your webhook. From there, you own everything: queueing the work, classifying the outcome, syncing to a CRM, sending follow-ups, escalating. We use a job queue (BullMQ on Redis) and Postgres for everything that lives beyond the call.
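
The shape of that handler matters: acknowledge fast, enqueue everything. A sketch, with the payload fields mirroring the ones listed above and the queue abstracted so BullMQ (or any durable queue) can sit behind it:

```typescript
// Post-call webhook handler: do nothing heavy inline. Classification,
// CRM sync, and follow-ups all run as queued jobs. The JobQueue
// interface is ours; BullMQ would implement it in production.
interface PostCallPayload {
  call_id: string;
  transcript: string;
  recording_url: string;
  variables: Record<string, string>;
}

interface JobQueue {
  enqueue(name: string, data: unknown): Promise<void>;
}

async function onPostCall(payload: PostCallPayload, queue: JobQueue): Promise<number> {
  await queue.enqueue("classify-outcome", {
    callId: payload.call_id,
    transcript: payload.transcript,
  });
  await queue.enqueue("mirror-recording", {
    callId: payload.call_id,
    url: payload.recording_url,
  });
  return 202; // ack immediately so the provider doesn't retry
}
```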

Transcript and recording storage. Bland holds these transiently. For compliance (recording retention policies), search ("show me every call where the customer mentioned X"), and analytics, we mirror everything to S3 (recordings) and Postgres + pgvector (transcripts with embeddings).

Real RAG over a knowledge base. Bland's KB node accepts pasted text. Functional, not great. For real document grounding — PDFs, web docs, evolving knowledge bases with citations — we built our own retrieval layer. Chunking, embeddings, reranking, citation extraction. The agent calls a webhook node, which hits our retrieval API, which returns the answer with sources, which the agent reads back.
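
The first stage of that pipeline, chunking, is simple enough to show. A minimal fixed-size chunker with overlap; real chunking also respects sentence and section boundaries, which this sketch omits:

```typescript
// Sliding-window chunker: fixed chunk size with overlap so facts that
// straddle a boundary appear in at least one chunk whole.
function chunk(text: string, size = 800, overlap = 200): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than chunk size");
  const out: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    out.push(text.slice(start, start + size));
    if (start + size >= text.length) break;
  }
  return out;
}
```

Each chunk then gets embedded and stored (pgvector, in our case); reranking and citation extraction happen at query time.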

CRM and calendar integration. No native HubSpot, Salesforce, Google Calendar. The pattern: webhook node mid-call hits /api/customer-lookup or /api/book-appointment, which proxies to the actual integration and returns JSON we consume as variables. Building this generically — so customers can configure their own CRM without code — is most of the application work.
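
The generic version of that pattern is a config-driven dispatch: the webhook node always hits the same endpoint, and per-customer config decides which adapter runs. A sketch with stubbed adapters; the adapter bodies and config shape are ours, not any vendor's API:

```typescript
// Config-driven integration proxy. Each adapter wraps one real CRM API;
// the stubs below stand in for those calls.
type Adapter = (payload: Record<string, string>) => Promise<Record<string, string>>;

const adapters: Record<string, Adapter> = {
  hubspot: async (p) => ({ source: "hubspot", contact: p.phone ?? "" }),       // stub
  salesforce: async (p) => ({ source: "salesforce", contact: p.phone ?? "" }), // stub
};

async function customerLookup(
  customerConfig: { crm: string },
  payload: Record<string, string>
) {
  const adapter = adapters[customerConfig.crm];
  if (!adapter) throw new Error(`no adapter configured for ${customerConfig.crm}`);
  return adapter(payload); // the result flows back to the call as {{variables}}
}
```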

Analytics and QA. Bland exposes raw call data. Call success classification, sentiment scoring, agent grading, leaderboards, drill-downs by reason — all yours. We built this on top of the post-call webhook payloads with a separate analytics LLM pass.

Compliance. TCPA call windows, DNC scrubbing, consent logging, A2P registration if you do SMS follow-ups. All your responsibility. None of this is glamorous and all of it is the kind of thing that quietly kills a product if you don't get it right.
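
The call-window piece at least is mechanical. TCPA restricts telemarketing calls to 8:00–21:00 in the callee's local time; the gate itself is one comparison, though resolving the callee's timezone from their number is its own (assumed, separate) problem:

```typescript
// TCPA call-window gate: outbound dials are allowed 8am-9pm in the
// callee's local time. localHour must already be in the callee's zone.
function withinCallWindow(localHour: number, openHour = 8, closeHour = 21): boolean {
  return localHour >= openHour && localHour < closeHour;
}
```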

Authentication / identity verification. DTMF capture is supported, but the lookup ("verify this is who they say they are by checking the SSN-last-4 against our system") is a webhook node calling your service.

A/B testing. No native experimentation. We built a thin layer that routes calls to different pathway IDs based on a configurable split.
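
That thin layer is mostly a deterministic hash: the same caller should always land on the same pathway within an experiment. A sketch under that assumption:

```typescript
// Deterministic A/B routing: hash the caller's number so assignment is
// sticky per caller, with configurable weights per pathway variant.
function pickPathway(
  phone: string,
  variants: { pathwayId: string; weight: number }[]
): string {
  let hash = 0;
  for (const ch of phone) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  const total = variants.reduce((s, v) => s + v.weight, 0);
  let point = hash % total;
  for (const v of variants) {
    if (point < v.weight) return v.pathwayId;
    point -= v.weight;
  }
  return variants[variants.length - 1].pathwayId;
}
```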

If you stack all of this up, the application layer ends up being roughly 70% of the engineering effort. Bland is the fast 30% that would otherwise be six months of voice infra.

What we'd do differently next time

Two things.

One: build a provider abstraction earlier. We started with Bland-specific webhook payloads flowing directly into our handlers. When we evaluated adding Retell as a fallback (different latency profile, different pricing model), we found Bland-shaped assumptions all over the application layer. We've since refactored to a VoiceProvider interface that normalizes call lifecycle events, transcript payloads, and webhook semantics across providers. Doing this on day one would have saved a chunk of work.
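
A sketch of what that interface looks like, where the normalized event schema is our internal shape and the Bland field mapping is illustrative rather than a documented contract:

```typescript
// Provider abstraction: one normalized event schema, one adapter per
// voice provider. Field names on the raw side are assumptions.
interface NormalizedCallEnded {
  provider: "bland" | "retell";
  callId: string;
  transcript: string;
  recordingUrl: string;
  variables: Record<string, string>;
}

interface VoiceProvider {
  normalizePostCall(raw: unknown): NormalizedCallEnded;
}

const blandProvider: VoiceProvider = {
  normalizePostCall(raw: any): NormalizedCallEnded {
    return {
      provider: "bland",
      callId: raw.call_id,
      transcript: raw.transcript,
      recordingUrl: raw.recording_url,
      variables: raw.variables ?? {},
    };
  },
};
```

Everything downstream (queue jobs, analytics, CRM sync) consumes `NormalizedCallEnded` only, so adding Retell means writing one more adapter, not touching the application layer.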

Two: treat pricing volatility as a real risk. Bland raised the Start plan from $0.09/min to $0.14/min in Dec 2025 (~55%). They charge per-minute on transfers, voicemails, and even no-answer outbound calls. None of this is unique to Bland — voice AI providers are still sorting out their pricing — but it means your unit economics need a buffer. We model gross margin assuming ~$0.20/min all-in, not the headline number.
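
The arithmetic behind those numbers, as a sanity check on the figures quoted above (no vendor API involved, just the rates from the text):

```typescript
// $0.09/min -> $0.14/min is a ~55.6% increase; margin is modeled against
// a padded ~$0.20/min all-in cost, not the headline rate.
function pctIncrease(oldRate: number, newRate: number): number {
  return (newRate - oldRate) / oldRate;
}

function grossMarginPerMinute(pricePerMin: number, allInCostPerMin = 0.2): number {
  return pricePerMin - allInCostPerMin;
}
```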

The five-line summary

  1. Use a vertically integrated provider (Bland, Retell, Vapi) unless you have a specific reason to own the voice stack. The 30%/70% split between voice infra and application layer is real, and renting the 30% is almost always correct.

  2. Use Twilio for telephony even if your voice provider sells numbers. Customers trust Twilio. Number portability and compliance maturity matter.

  3. Build the application layer for portability. Provider abstraction on day one, not day 600.

  4. Treat the post-call layer as the product. That's where the business value lives — analytics, CRM sync, follow-ups, compliance, search. The call itself is a thin slice.

  5. Latency is higher than the marketing implies. Plan for 800ms average. If you need lower, build it yourself or pick a different architecture.

We help startups ship voice agents that actually work

If you're building an AI phone agent product, V12 Labs has done this. We built VocalDesk on the architecture above, and we build custom voice agent systems for clients on the same foundations — Bland or Retell for the voice runtime, Twilio for telephony, a custom application layer that handles the 70% nobody talks about.

If you're trying to figure out whether to build it yourself, hire an agency, or buy a SaaS — book a call and we'll walk through the architecture before any code gets written. The right choice depends on your latency needs, your compliance constraints, your team's depth, and your willingness to own infrastructure.

The architecture above is the same regardless. Get it right and the implementation is just typing.