Embedding Models: Choosing the Right One
OpenAI, Cohere, and open-source embedding models — quality vs cost trade-offs
What Is an Embedding?
Before comparing models, you need a solid mental model of what an embedding actually is.
An embedding is a dense vector of floating-point numbers. When you embed the sentence “The cat sat on the mat”, you get something like:
[0.0231, -0.1847, 0.0934, 0.2201, ..., -0.0423] # 1536 numbers
These numbers encode the meaning of the text. More precisely: texts with similar meanings will produce vectors that are close together in the 1536-dimensional space, as measured by cosine similarity.
The geometric intuition: Imagine each text as a point in a very high-dimensional space. “Dog” and “puppy” are placed near each other. “Dog” and “automobile” are far apart. The offset from “Paris” to “France” points in roughly the same direction as the offset from “Berlin” to “Germany”, so the relationship between related concepts is approximately consistent across the space.
This is why semantic search works: when you embed “How do I reset my password?”, the resulting vector is close to the vector for “Steps to recover account access” even though those phrases share no words. The geometry captures meaning.
The MTEB Benchmark: How Embedding Models Are Compared
The Massive Text Embedding Benchmark (MTEB) is the standard evaluation for embedding models. The original benchmark covers 8 task types, including retrieval, classification, clustering, and semantic similarity, across 58 datasets and 112 languages; the English leaderboard uses 56 of those datasets.
The retrieval score is most relevant for RAG. Here are the numbers for models you’ll actually use (as of early 2026):
| Model | MTEB Retrieval Score | Dimensions | Max Tokens | Cost per 1M tokens | Runs Locally |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 54.9 | 3072 | 8191 | $0.13 | No |
| OpenAI text-embedding-3-small | 51.7 | 1536 | 8191 | $0.02 | No |
| Cohere embed-v3-english | 54.5 | 1024 | 512 | $0.10 | No |
| Cohere embed-v3-multilingual | 52.2 | 1024 | 512 | $0.10 | No |
| BGE-M3 (BAAI) | 54.0 | 1024 | 8192 | Free | Yes |
| all-MiniLM-L6-v2 | 41.0 | 384 | 256 | Free | Yes |
The key insight: BGE-M3 is competitive with OpenAI text-embedding-3-large (54.0 vs 54.9) with no per-token fee; you pay only for the hardware you run it on. For any cost-sensitive or privacy-sensitive deployment, this is the most important number in that table.
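The Dimensions column also determines how much storage the vectors need. The sketch below estimates raw vector size, assuming each component is stored as a 4-byte float32 and ignoring any index overhead your vector database adds; the `index_size_mb` helper and the 100,000-chunk corpus are illustrative choices, not part of any library.

```python
BYTES_PER_FLOAT32 = 4

# Dimensions taken from the comparison table above.
DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "BGE-M3": 1024,
    "all-MiniLM-L6-v2": 384,
}

def index_size_mb(num_vectors: int, dims: int) -> float:
    """Raw size in MB (10^6 bytes) of float32 vectors, excluding index overhead."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1_000_000

for model, dims in DIMENSIONS.items():
    print(f"{model}: {index_size_mb(100_000, dims):.1f} MB per 100k chunks")
# text-embedding-3-large: 1228.8 MB per 100k chunks
# text-embedding-3-small: 614.4 MB per 100k chunks
# BGE-M3: 409.6 MB per 100k chunks
# all-MiniLM-L6-v2: 153.6 MB per 100k chunks
```

Storage scales linearly with dimensions, so a 3072-dimensional model needs eight times the space of a 384-dimensional one for the same corpus.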
Model 1: OpenAI text-embedding-3-small
The pragmatic default for most RAG projects.
Why it’s the standard starting point:
- Fast: ~100ms per batch of 100 chunks
- Cheap: $0.02/1M tokens. Assuming roughly 500 tokens per page, a 500-page corpus is about 250K tokens and costs about half a cent to embed
- High quality: a 51.7 MTEB retrieval score is genuinely strong for most domains
- Dimensionality: 1536 dims balances quality and storage
# pip install langchain-openai
from langchain_openai import OpenAIEmbeddings
import os
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small",
openai_api_key=os.environ["OPENAI_API_KEY"]
)
# Embed a single query
query_vector = embeddings.embed_query("How do I configure SSO?")
print(f"Query vector: {len(query_vector)} dimensions")
# Embed a batch of documents (more efficient than one at a time)
texts = ["First document.", "Second document.", "Third document."]
doc_vectors = embeddings.embed_documents(texts)
print(f"Embedded {len(doc_vectors)} documents")
The dimensionality reduction feature: One clever feature of the text-embedding-3 family is that you can request fewer dimensions while retaining most of the quality. This saves storage and speeds up similarity search:
# Request 512 dimensions instead of 1536 — 3x smaller, ~95% of quality
embeddings_small = OpenAIEmbeddings(
model="text-embedding-3-small",
dimensions=512
)
OpenAI achieves this via Matryoshka Representation Learning (MRL): the first N dimensions of a full embedding are themselves a high-quality lower-dimensional embedding, so truncation costs a small, predictable amount of quality instead of discarding information at random.
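If you already have full-length vectors stored, the same idea can be applied client-side. This is a minimal sketch, assuming the vectors come from an MRL-trained model: keep the leading components, then rescale to unit length so cosine similarity still behaves. The `truncate_embedding` helper is hypothetical, and the random vector is only a stand-in for a real embedding (random data has no MRL structure, so this shows the mechanics, not the quality).

```python
import numpy as np

def truncate_embedding(vector, dims: int) -> np.ndarray:
    """Keep the first `dims` components and rescale to unit length."""
    head = np.asarray(vector, dtype=np.float32)[:dims]
    norm = np.linalg.norm(head)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return head / norm

# Stand-in for a real 1536-dimensional embedding (random, for illustration only).
rng = np.random.default_rng(0)
full = rng.normal(size=1536).astype(np.float32)
full /= np.linalg.norm(full)

short = truncate_embedding(full, 512)
print(short.shape)                  # (512,)
print(full.nbytes // short.nbytes)  # 3
```

The re-normalization step matters: a truncated vector is no longer unit length, and stores that rank by raw dot product would otherwise compare vectors of different magnitudes.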
Model 2: OpenAI text-embedding-3-large
The quality upgrade when text-embedding-3-small isn’t enough.
When to upgrade:
- Your RAGAS context precision score is below 0.65 with the small model
- Your domain is highly technical and jargon-heavy
- You’re building a system where retrieval quality is more important than cost
embeddings_large = OpenAIEmbeddings(
model="text-embedding-3-large",
dimensions=1024 # Can reduce from default 3072
)
Cost calculation for a realistic corpus:
A medium-sized enterprise knowledge base: 2,000 documents × 500 tokens average = 1M tokens to embed.
- text-embedding-3-small: 1M tokens × $0.02/1M = $0.02
- text-embedding-3-large: 1M tokens × $0.13/1M = $0.13
Ingestion cost is not the concern — you pay it once. The ongoing cost is per-query embedding: embedding the user’s question with each search. At 10,000 queries/month:
- small: 10,000 × 15 tokens average = 150K tokens → $0.003/month
- large: same → $0.02/month
Neither is expensive. For most projects, the cost difference between small and large is irrelevant. Choose based on quality needs.
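The arithmetic above can be reproduced in a few lines, which makes it easy to plug in your own corpus size and query volume. The `embedding_cost` helper is illustrative; the prices and token counts are the ones used in this lesson.

```python
def embedding_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of embedding `tokens` tokens at a per-1M-token price."""
    return tokens / 1_000_000 * price_per_million

PRICES = {"text-embedding-3-small": 0.02, "text-embedding-3-large": 0.13}

corpus_tokens = 2_000 * 500          # 2,000 documents x 500 tokens each
monthly_query_tokens = 10_000 * 15   # 10,000 queries x 15 tokens each

for model, price in PRICES.items():
    ingest = embedding_cost(corpus_tokens, price)
    queries = embedding_cost(monthly_query_tokens, price)
    print(f"{model}: ingest ${ingest:.2f}, queries ${queries:.4f}/month")
# text-embedding-3-small: ingest $0.02, queries $0.0030/month
# text-embedding-3-large: ingest $0.13, queries $0.0195/month
```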
Model 3: Cohere embed-v3
Cohere’s embedding model has two notable advantages.
Advantage 1: Multilingual support
The embed-v3-multilingual model is specifically designed for multilingual retrieval. If your documents are in French, Spanish, Japanese, or German — or if users will ask questions in multiple languages — Cohere’s multilingual model significantly outperforms OpenAI’s models for cross-lingual retrieval.
Advantage 2: Input type specification
Cohere’s API lets you specify whether you’re embedding a query or a document, and it optimizes the embedding accordingly. This “asymmetric embedding” approach slightly improves retrieval quality:
# pip install langchain-cohere
from langchain_cohere import CohereEmbeddings
# One instance covers both cases: the LangChain wrapper sets input_type for you.
# (If you call the Cohere API directly, you pass input_type yourself.)
embeddings = CohereEmbeddings(model="embed-english-v3.0")  # reads COHERE_API_KEY from the environment
chunks = ["SSO is configured under Settings > Security.", "Billing runs monthly."]
# embed_documents sends input_type="search_document" (optimized for stored content)
doc_vectors = embeddings.embed_documents(chunks)
# embed_query sends input_type="search_query" (optimized for questions)
query_vector = embeddings.embed_query("How do I configure SSO?")
Note the 512-token context limit: if your chunks are larger than ~380 words, Cohere silently truncates them. Check your chunk sizes against this limit.
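A quick pre-flight check can flag chunks that are likely to be truncated. This is a rough sketch, assuming English text at about 0.75 words per token (the same ratio behind the ~380-word figure); `estimate_tokens` and `oversized_chunks` are hypothetical helpers, and a real tokenizer would give exact counts.

```python
MAX_TOKENS = 512          # Cohere embed-v3 limit from the comparison table
WORDS_PER_TOKEN = 0.75    # rough assumption for English text

def estimate_tokens(text: str) -> int:
    """Crude token estimate from a whitespace word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)

def oversized_chunks(chunks: list[str], max_tokens: int = MAX_TOKENS) -> list[int]:
    """Return the indexes of chunks whose estimated length exceeds the limit."""
    return [i for i, chunk in enumerate(chunks) if estimate_tokens(chunk) > max_tokens]

sample_chunks = ["short chunk about SSO setup", "word " * 600]
print(oversized_chunks(sample_chunks))  # [1]
```

Because the estimate is approximate, it is worth leaving some headroom: treat anything close to the limit as a candidate for re-chunking.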
Model 4: BGE-M3 (Open Source, Runs Locally)
BGE-M3 from the Beijing Academy of Artificial Intelligence (BAAI) is arguably the most important open-source embedding model available.
Why it matters:
- Free: No API cost, no per-token billing
- Privacy: Your documents never leave your infrastructure
- Competitive quality: 54.0 MTEB — comparable to OpenAI’s large model
- Long context: 8192 token maximum context (much longer than Cohere’s 512)
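- Dense + Sparse: BGE-M3 can produce both dense (semantic) and sparse (learned lexical-weight, BM25-like) vectors from the same model, which is useful for hybrid search (covered in Lesson 7). The LangChain wrapper used in this lesson returns only the dense vectors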
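# pip install langchain-community sentence-transformers
from langchain_community.embeddings import HuggingFaceEmbeddings
import torch
# Detect available hardware
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
# BGE-M3 needs no query instruction prefix, so the generic wrapper is used
# instead of HuggingFaceBgeEmbeddings (which prepends one to every query)
embeddings = HuggingFaceEmbeddings(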
model_name="BAAI/bge-m3",
model_kwargs={"device": device},
encode_kwargs={
"normalize_embeddings": True, # required for cosine similarity
"batch_size": 32 # process 32 texts at once
}
)
# Works identically to OpenAI embeddings in LangChain
query_vector = embeddings.embed_query("How do I configure SSO?")
print(f"Vector dimensions: {len(query_vector)}") # 1024
Performance benchmark on Apple M2 Pro:
- First run: ~8 seconds to load the model, after a one-time download of the weights (BGE-M3 has about 568M parameters, roughly 2.3GB in float32)
- Subsequent batches of 32 chunks: ~0.5 seconds
- Compare: OpenAI API for 32 chunks: ~0.3 seconds
For local development, the performance difference is negligible. For production with GPU acceleration, BGE-M3 can be faster than the OpenAI API because there is no network round trip, though this depends on your hardware and batch sizes.
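Numbers like these depend heavily on hardware, so it is worth measuring on your own machine. The sketch below times each batch with `time.perf_counter`; `time_batches` and `fake_embed` are hypothetical names, and the fake embedder exists only so the example runs without downloading a model.

```python
import time
from typing import Callable

def time_batches(
    embed: Callable[[list[str]], list[list[float]]],
    texts: list[str],
    batch_size: int = 32,
) -> list[float]:
    """Return the wall-clock seconds taken to embed each batch."""
    timings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        t0 = time.perf_counter()
        embed(batch)
        timings.append(time.perf_counter() - t0)
    return timings

# Placeholder embedder so the sketch runs without a model;
# swap in embeddings.embed_documents to measure a real one.
def fake_embed(batch: list[str]) -> list[list[float]]:
    return [[0.0] * 4 for _ in batch]

timings = time_batches(fake_embed, [f"chunk {i}" for i in range(100)])
print(len(timings))  # 4 batches: 32 + 32 + 32 + 4
```

When measuring a real model, discard the first batch or report it separately, since it usually includes model loading and warm-up.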
Model 5: all-MiniLM-L6-v2 (Lightweight Baseline)
The fastest option when you’re on very limited hardware or need ultra-low latency:
from langchain_community.embeddings import HuggingFaceEmbeddings
# Generic sentence-transformers wrapper; the BGE-specific class would
# prepend a BGE query instruction that this model was not trained with
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)
Trade-offs: 384-dimensional embeddings and an MTEB retrieval score of 41.0, noticeably worse than the others on domain-specific or complex queries. Use it only if you need the absolute minimum memory footprint (about 22M parameters, roughly 90MB of weights, vs BGE-M3’s 568M parameters and roughly 2.3GB).
Understanding Cosine Similarity
Most embedding-based retrieval uses cosine similarity to measure how close two vectors are (for unit-length vectors, the dot product gives the same ranking):
import numpy as np
from langchain_openai import OpenAIEmbeddings

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    # Dot product divided by the product of the two vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Test semantic relationships
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
pairs = [
    ("How do I reset my password?", "Steps to recover account access"),
    ("How do I reset my password?", "What are the billing options?"),
    ("cat", "kitten"),
    ("cat", "automobile"),
]
for text1, text2 in pairs:
    v1 = embeddings.embed_query(text1)
    v2 = embeddings.embed_query(text2)
    sim = cosine_similarity(v1, v2)
    print(f"{sim:.3f}: '{text1}' vs '{text2}'")
# Illustrative output (exact values vary by model and version):
# 0.847: 'How do I reset my password?' vs 'Steps to recover account access'
# 0.312: 'How do I reset my password?' vs 'What are the billing options?'
# 0.891: 'cat' vs 'kitten'
# 0.234: 'cat' vs 'automobile'
Values range from -1 to 1. As a rough guide (the bands shift from model to model, so calibrate them on your own data):
- > 0.85: highly related (likely the same concept)
- 0.70–0.85: related (similar domain or topic)
- 0.50–0.70: loosely related
- < 0.50: unrelated
Your vector store retrieves the k chunks with the highest cosine similarity to the query vector.
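What the vector store does at query time can be sketched as a brute-force search: normalize everything, take dot products, and keep the k best. This is an illustration only, with toy 3-dimensional vectors standing in for real embeddings; production stores use approximate indexes instead of scanning every row, and `top_k` is a hypothetical helper.

```python
import numpy as np

def top_k(query: np.ndarray, docs: np.ndarray, k: int = 2) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) for the k most similar rows of `docs`."""
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ query                 # cosine similarity, since all vectors are unit length
    best = np.argsort(scores)[::-1][:k]   # highest scores first
    return [(int(i), float(scores[i])) for i in best]

# Toy vectors: chunks 0 and 1 point roughly the same way as the query, chunk 2 does not.
docs = np.array([
    [1.0, 0.0, 0.0],   # chunk 0
    [0.9, 0.1, 0.0],   # chunk 1
    [0.0, 0.0, 1.0],   # chunk 2
])
query = np.array([1.0, 0.05, 0.0])

for index, score in top_k(query, docs):
    print(index, round(score, 3))  # chunk 0 first, then chunk 1
```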
The Consistency Rule: One Model for Everything
This cannot be overstated: you must use the exact same embedding model for both document indexing and query embedding.
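Different embedding models are trained differently and produce vectors in incompatible spaces. Mixing models is like using a compass calibrated for magnetic north while your map uses true north, except worse: the coordinates do not merely drift, they mean different things. If the two models have different dimensions (3072 vs 1536 for the OpenAI large and small models), most vector stores reject the query outright; if the dimensions happen to match, the search runs but the similarity scores are meaningless.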
# WRONG: indexing with one model, querying with another
indexing_embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
# ... index all documents ...
query_embeddings = OpenAIEmbeddings(model="text-embedding-3-small") # WRONG!
results = vectorstore.similarity_search("query", embedding=query_embeddings)
When you switch embedding models (e.g., upgrading from 3-small to BGE-M3), you must re-embed your entire corpus. There is no shortcut. Build your system with this in mind: make re-indexing a runnable script, not a one-off manual process.
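One way to enforce the rule is to record which model built the index and check it before every search. The sketch below is a hypothetical guard, not a feature of LangChain or any vector store: `IndexInfo` and `check_query_model` are illustrative names, and in practice the metadata would be saved next to the index on disk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexInfo:
    """Metadata saved alongside a vector index (hypothetical structure)."""
    model_name: str
    dimensions: int

def check_query_model(index: IndexInfo, model_name: str, query_vector: list[float]) -> None:
    """Raise if the query was embedded with a different model than the index."""
    if model_name != index.model_name:
        raise ValueError(
            f"index built with {index.model_name!r}, query embedded with {model_name!r}"
        )
    if len(query_vector) != index.dimensions:
        raise ValueError(
            f"expected {index.dimensions} dimensions, got {len(query_vector)}"
        )

index = IndexInfo(model_name="text-embedding-3-large", dimensions=3072)
try:
    check_query_model(index, "text-embedding-3-small", [0.0] * 1536)
except ValueError as error:
    print(f"Rejected: {error}")
```

Failing loudly here is the point: a mismatched query that silently returns poor results is much harder to diagnose than an exception.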
The Decision Guide
Start here: OpenAI text-embedding-3-small
- Pros: Great quality, cheap, fast API, easy setup
- Use for: Most projects, quick prototyping to production

Need multilingual support?
- Yes → Cohere embed-v3-multilingual
- Note: Watch the 512-token limit

Need privacy / no API calls / want to reduce ongoing cost?
- Yes → BGE-M3 (local)
- Note: Large model download (about 568M parameters, roughly 2.3GB), needs GPU for best performance at scale

Found quality insufficient after evaluation (RAGAS context precision < 0.65)?
- Try → OpenAI text-embedding-3-large
- Or → BGE-M3 (often matches quality, no per-token cost)

Need minimum footprint for edge deployment?
- Try → all-MiniLM-L6-v2 (but accept quality degradation)

For the capstone project in this course, we use text-embedding-3-small. It produces excellent results for the LangChain documentation corpus, costs a few cents to index, and is the fastest to get running. If you’re building on a budget or need offline operation, substitute BGE-M3, which exposes the same LangChain interface.
In the next lesson, we put these embeddings to work in actual vector databases.
