Most AI MVPs use OpenAI or Anthropic as the core model, with either direct API calls or LangChain as the orchestration layer.
Getting this choice right saves you weeks. Getting it wrong means rebuilding your AI layer after you've already launched.
After integrating LLMs into 40+ products, here's what we've learned about when to use what — and the patterns that actually work in production.
Table of Contents
- Direct API Calls vs LangChain: The Real Difference
- When to Use Direct OpenAI or Anthropic API Calls
- When LangChain Is Worth the Overhead
- OpenAI vs Anthropic Claude: How We Choose
- The Prompt Engineering Patterns That Actually Work
- RAG Architecture for MVPs: When and How
- Streaming Responses: Why It Matters for UX
- Error Handling and Fallbacks
- Cost Management From Day One
- What We'd Do Differently
Direct API Calls vs LangChain: The Real Difference
The debate between direct API calls and LangChain is often framed as a complexity question. It's actually an abstraction question.
Direct API calls give you explicit control. Every token in, every token out. You know exactly what you're sending to the model and exactly what you're getting back. The code is simple, debuggable, and has no hidden behavior.
LangChain provides abstractions over common patterns: chains (sequences of LLM calls), agents (LLMs that can use tools), retrievers (document search and RAG), memory (conversation history management). These abstractions reduce boilerplate for complex patterns — but add opacity and overhead for simple ones.
The question isn't "which is better." It's "which abstraction level is right for this use case."
When to Use Direct OpenAI or Anthropic API Calls
Use direct API calls when:
Your AI workflow is a single LLM call. One input goes in, one output comes out, you're done. LangChain adds nothing here except dependencies and debugging complexity.
You need full observability. Direct API calls let you log exactly what's sent and received. With LangChain, intermediate calls can be harder to trace.
You're building something time-sensitive. Fewer abstractions = fewer things to debug at 2am before a launch.
The model choice might change. Direct API calls make it trivial to swap between OpenAI and Anthropic. Some LangChain abstractions couple more tightly to specific model providers.
Example use cases for direct API calls:
- Text summarization
- Email or copy generation
- Document classification
- Question answering on a single document
- Structured data extraction from text
For these, write a well-engineered prompt, call the API, parse the response. Done.
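As a concrete sketch of that loop — prompt in, response out, nothing hidden — here is a direct call against OpenAI's REST endpoint. The function names are ours, and the API key is passed in explicitly; this is illustrative, not a production client.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Pure function: assemble the request body for a single summarization call
function buildRequest(document: string) {
  const messages: ChatMessage[] = [
    { role: "system", content: "You are a concise summarizer. Reply with one sentence." },
    { role: "user", content: document },
  ];
  return { model: "gpt-4o", messages };
}

// One network call, one parsed result — nothing hidden in between
async function summarize(document: string, apiKey: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildRequest(document)),
  });
  if (!res.ok) throw new Error(`OpenAI API error: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Because the request body is built in a pure function, you can log it, test it, and swap the model string without touching the network code.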
When LangChain Is Worth the Overhead
Use LangChain when:
You're building a RAG pipeline and need document loading, splitting, embedding, vector storage, and retrieval wired together. LangChain has mature abstractions for all of these.
You're building an agent that needs to use tools (web search, code execution, database queries) and you don't want to write the tool-calling loop from scratch.
You need conversation memory across multiple turns with automatic summarization of long conversations. LangChain's memory classes handle this.
You're chaining multiple LLM calls where the output of one is the input to the next, with conditional logic between them.
Example use cases for LangChain:
- RAG chatbot over a knowledge base
- Agent that can search the web and summarize findings
- Multi-step document processing pipeline
- Conversational AI with long-term memory
For these, LangChain's abstractions reduce meaningful complexity. The overhead is justified.
OpenAI vs Anthropic Claude: How We Choose
Both are excellent. Here's how we make the call:
We default to OpenAI GPT-4o when:
- The task is general purpose (writing, analysis, classification)
- We need function calling (OpenAI's function calling is mature and well-documented)
- Ecosystem tooling matters (most third-party integrations target OpenAI first)
- Image inputs are part of the workflow (GPT-4o Vision is excellent)
We choose Anthropic Claude when:
- The task involves long documents (Claude 3.5 Sonnet handles 200k tokens natively)
- We need nuanced instruction following (Claude follows complex system prompts more precisely in our experience)
- The task involves reasoning through ambiguity (Claude's reasoning quality on edge cases is strong)
- We want to reduce risk of model changes affecting production (Anthropic's API versioning is stable)
For voice AI and latency-sensitive applications: Neither. We use specialized models — Deepgram for speech-to-text, ElevenLabs or OpenAI TTS for speech synthesis — and wrap them with our own orchestration. Routing a voice call through a general-purpose LLM adds latency that makes the experience feel unnatural.
The Prompt Engineering Patterns That Actually Work
The quality of your AI output is almost entirely determined by the quality of your prompts. Here are the patterns we've found consistently work:
Pattern 1: Explicit output format specification
Don't let the model decide how to format its response. Specify it exactly.
```
Return your analysis as a JSON object with the following structure:

{
  "summary": "One sentence summary",
  "key_points": ["point 1", "point 2", "point 3"],
  "confidence": "high|medium|low",
  "reasoning": "Brief explanation of your analysis"
}

Return only the JSON object, no additional text.
```
Explicit output formats make parsing reliable and eliminate the need for complex post-processing.
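On the parsing side, a defensive sketch is still worth having: models occasionally wrap JSON in markdown fences despite instructions. The field names below match the example prompt above; the function name and validation rules are our own convention.

```typescript
type Analysis = {
  summary: string;
  key_points: string[];
  confidence: "high" | "medium" | "low";
  reasoning: string;
};

// Strip stray markdown fences, parse, then validate the fields the
// prompt demanded before trusting the result.
function parseAnalysis(raw: string): Analysis {
  const cleaned = raw
    .trim()
    .replace(/^```(?:json)?\s*/, "")
    .replace(/\s*```$/, "");
  const obj = JSON.parse(cleaned);
  if (typeof obj.summary !== "string" || !Array.isArray(obj.key_points)) {
    throw new Error("Model output did not match the requested structure");
  }
  if (!["high", "medium", "low"].includes(obj.confidence)) {
    throw new Error(`Unexpected confidence value: ${obj.confidence}`);
  }
  return obj as Analysis;
}
```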
Pattern 2: Role + context + task separation
Structure your system prompt in three sections:
- Role: Who the model is playing (domain expert, analyst, assistant)
- Context: What it knows about your specific situation
- Task: What it should do with the input it receives
```
You are a senior financial analyst specializing in early-stage startup evaluation.

Context: You're reviewing pitch decks for a pre-seed fund that focuses on B2B SaaS companies.

Task: For each pitch deck summary provided, identify the three strongest aspects of the business and the three most significant risks, focusing on market size, team, and product differentiation.
```
Separating these three elements improves output quality consistently.
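In code, keeping the three sections separate is a one-line template. The labels and layout below are our convention, not a model requirement:

```typescript
// Compose a system prompt from the three sections: role, context, task.
function buildSystemPrompt(role: string, context: string, task: string): string {
  return `${role}\n\nContext: ${context}\n\nTask: ${task}`;
}

// Example usage with the financial-analyst prompt above
const systemPrompt = buildSystemPrompt(
  "You are a senior financial analyst specializing in early-stage startup evaluation.",
  "You're reviewing pitch decks for a pre-seed fund that focuses on B2B SaaS companies.",
  "For each pitch deck summary provided, identify the three strongest aspects and the three most significant risks."
);
```

Storing the three parts separately also means you can A/B test the task wording without touching the role or context.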
Pattern 3: Few-shot examples for complex outputs
When the output format is complex or the task involves judgment, include 1-2 examples of ideal input/output pairs in the prompt.
```
Here are two examples of the analysis format:

Input: [example input 1]
Output: [example output 1]

Input: [example input 2]
Output: [example output 2]

Now analyze the following:
[actual input]
```
Few-shot examples dramatically improve output quality for tasks that require consistent judgment.
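Assembling the few-shot prompt is mechanical enough to factor into a helper, so the example pairs live in data rather than string literals. The shape below is our sketch:

```typescript
type Example = { input: string; output: string };

// Examples first, then the real input — matching the pattern above.
function buildFewShotPrompt(examples: Example[], input: string): string {
  const shots = examples
    .map((e) => `Input: ${e.input}\nOutput: ${e.output}`)
    .join("\n\n");
  return `Here are examples of the analysis format:\n\n${shots}\n\nNow analyze the following:\n${input}`;
}
```

Keeping examples as data makes it easy to rotate in better ones as you learn which inputs trip the model up.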
Pattern 4: Chain of thought for reasoning tasks
For tasks that require multi-step reasoning, ask the model to reason through the problem before giving the final answer.
```
Before providing your recommendation, work through the following:
1. What are the key factors relevant to this decision?
2. What does each factor indicate?
3. Are there any conflicts between factors?
4. Given your analysis, what is your recommendation and why?
```
Chain-of-thought prompting improves accuracy on reasoning tasks significantly.
RAG Architecture for MVPs: When and How
RAG (Retrieval-Augmented Generation) is the pattern for products where an AI needs to answer questions or generate content based on a specific document corpus.
When you need RAG:
- Your AI needs to answer questions about documents you provide
- The knowledge base is larger than what fits in a single context window
- You need the AI's answers to be grounded in specific source material
- The knowledge base will update over time
When you don't need RAG:
- The AI is generating content from scratch (no document grounding needed)
- Your documents are short enough to fit in a single prompt
- You need the AI to synthesize general knowledge, not specific documents
The minimal RAG architecture for an MVP:
- Document ingestion: Parse documents, split into chunks (500–1000 tokens each), and generate embeddings using OpenAI's `text-embedding-3-small`
- Vector storage: Store embeddings in Supabase pgvector (for under 100k chunks) or Pinecone (for larger corpora)
- Retrieval: On each query, generate an embedding for the query and find the most similar document chunks using cosine similarity
- Generation: Include the retrieved chunks in the context window and generate a response grounded in the retrieved content
For MVP scale, Supabase pgvector handles this without additional infrastructure. Add Pinecone when you're scaling past 100k document chunks.
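The retrieval step reduces to a nearest-neighbor search over embeddings. Here is a minimal in-memory sketch of cosine-similarity ranking — in production, pgvector or Pinecone replaces this linear scan, but the math is the same:

```typescript
type Chunk = { text: string; embedding: number[] };

// Cosine similarity: dot product normalized by vector magnitudes
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored chunks by similarity to the query embedding, keep the top k
function retrieve(queryEmbedding: number[], chunks: Chunk[], k = 3): Chunk[] {
  return [...chunks]
    .sort(
      (x, y) =>
        cosineSimilarity(queryEmbedding, y.embedding) -
        cosineSimilarity(queryEmbedding, x.embedding)
    )
    .slice(0, k);
}
```

The top-k chunks then get concatenated into the generation prompt as grounding context.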
Streaming Responses: Why It Matters for UX
If your AI takes 5-10 seconds to generate a complete response, users will think something is broken.
Streaming solves this. Instead of waiting for the complete response before displaying anything, you stream tokens to the frontend as the model generates them. The user sees text appearing in real-time — which makes a 10-second generation feel fast rather than slow.
Both OpenAI and Anthropic support streaming responses. Next.js's streaming support makes this straightforward to implement.
For any AI workflow where the user is waiting for a response, streaming is not optional — it's a UX requirement.
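The consumer side of streaming looks like this sketch, with the model stream stood in for by an async generator. In a real app, the generator would be the SDK's stream object, and `onToken` (our name) would write to an SSE response or React state:

```typescript
// Stand-in for an SDK stream: yields tokens as the model "generates" them
async function* fakeModelStream(tokens: string[]): AsyncGenerator<string> {
  for (const t of tokens) yield t;
}

// Forward each token to the UI as it arrives instead of waiting for the end
async function streamToUser(
  stream: AsyncIterable<string>,
  onToken: (t: string) => void
): Promise<string> {
  let full = "";
  for await (const token of stream) {
    full += token;
    onToken(token); // e.g. flush to the client immediately
  }
  return full; // complete text, for logging or caching
}
```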
Error Handling and Fallbacks
AI APIs fail. Rate limits hit. Models return unexpected outputs. Your product needs to handle all of this gracefully.
The minimum error handling setup for an AI MVP:
```typescript
async function callAIWithFallback(prompt: string, retries = 1): Promise<string | null> {
  try {
    const response = await openai.chat.completions.create(
      {
        model: "gpt-4o",
        messages: [{ role: "user", content: prompt }],
      },
      { timeout: 30_000 } // 30-second timeout (a request option, not a body field)
    );
    return response.choices[0].message.content;
  } catch (error: any) {
    if (error.status === 429 && retries > 0) {
      // Rate limit — wait and retry, bounded so we can't recurse forever
      await sleep(2000);
      return callAIWithFallback(prompt, retries - 1);
    }
    if (error.status >= 500) {
      // Server error — try fallback model
      return callFallbackModel(prompt);
    }
    // Log and surface gracefully
    logger.error("AI call failed", { error, prompt });
    throw new Error("AI processing temporarily unavailable");
  }
}
```
The patterns that matter:
- Set explicit timeouts on every API call
- Retry on rate limits (429) with exponential backoff
- Have a fallback (cheaper model or cached response) for server errors
- Never show raw API error messages to users
- Log everything for debugging
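The retry-with-backoff pattern generalizes beyond one provider. A sketch of a reusable helper — delays double on each attempt, and only retryable statuses (429 and 5xx) trigger a retry:

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Generic exponential backoff: waits base, 2x base, 4x base... between attempts
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const retryable = err?.status === 429 || err?.status >= 500;
      if (!retryable || attempt >= maxRetries) throw err; // give up
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

Wrapping every provider call in a helper like this keeps the retry policy in one place instead of scattered across handlers.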
Cost Management From Day One
AI API costs compound. A product that processes 100 documents per day at $0.01 per document runs $30/month — manageable. The same product at 10,000 documents per day is $3,000/month — a significant cost of goods sold.
Cost controls to implement from Day 1:
Input token limits: Cap the size of inputs you send to the model. If a user uploads a 500-page document and you're charging $49/month, you can't afford to process all 500 pages on every query.
Caching: Cache AI responses for identical or near-identical inputs. A documentation chatbot will see the same questions repeatedly — cache the responses.
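A minimal in-memory sketch of that cache — prompts are normalized so trivially different phrasings hit the same entry, and entries expire after a TTL. In production you'd likely back this with Redis and a proper hash; the class and names here are ours:

```typescript
type CacheEntry = { value: string; expiresAt: number };

class ResponseCache {
  private store = new Map<string, CacheEntry>();
  constructor(private ttlMs: number) {}

  // Normalize whitespace and case so near-identical prompts share a key
  private key(prompt: string): string {
    return prompt.trim().toLowerCase().replace(/\s+/g, " ");
  }

  get(prompt: string): string | undefined {
    const entry = this.store.get(this.key(prompt));
    if (!entry || entry.expiresAt < Date.now()) return undefined; // miss or stale
    return entry.value;
  }

  set(prompt: string, value: string): void {
    this.store.set(this.key(prompt), {
      value,
      expiresAt: Date.now() + this.ttlMs,
    });
  }
}
```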
Model selection by task: Use GPT-4o-mini or Claude Haiku for simple classification and extraction tasks. Reserve GPT-4o and Claude Sonnet for complex reasoning. The cost difference is 10-20x.
Usage monitoring: Track tokens per user and per request from Day 1. Know your cost per user before you set your pricing.
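Per-user tracking can start as simply as this sketch. The per-million-token prices below are hypothetical placeholders — always check the provider's current pricing page:

```typescript
// Hypothetical prices in USD per 1M tokens — replace with current rates
const PRICE_PER_1M_TOKENS = { input: 2.5, output: 10 };

type Usage = { inputTokens: number; outputTokens: number };

const usageByUser = new Map<string, Usage>();

// Call after every completion, using the token counts the API returns
function recordUsage(userId: string, inputTokens: number, outputTokens: number): void {
  const u = usageByUser.get(userId) ?? { inputTokens: 0, outputTokens: 0 };
  u.inputTokens += inputTokens;
  u.outputTokens += outputTokens;
  usageByUser.set(userId, u);
}

// Estimated spend for one user, to compare against their subscription price
function costForUser(userId: string): number {
  const u = usageByUser.get(userId);
  if (!u) return 0;
  return (
    (u.inputTokens / 1_000_000) * PRICE_PER_1M_TOKENS.input +
    (u.outputTokens / 1_000_000) * PRICE_PER_1M_TOKENS.output
  );
}
```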
What We'd Do Differently
After 40+ builds, here's what we've learned the hard way:
We'd invest more in prompt engineering upfront. The best time to tune your prompts is before you've built the rest of the product around them. We've seen teams build entire UIs around a specific output format, then discover the format needs to change after user testing. Prompt first, build around it second.
We'd add streaming earlier. It's always a feature request after the first demo. Build it from the beginning.
We'd implement cost monitoring on Day 1. It's easy to add and cheap to run. It saves you from discovering you're losing money on every user two months after launch.
We'd spend more time on error states. AI errors are more varied and less predictable than typical software errors. The graceful degradation patterns matter more than in traditional products.
If you're building an AI product and want to make sure the AI layer is designed correctly from the start, that's exactly what the Discovery Call is for.