Embedding Models: Choosing the Right One

What Is an Embedding?

Before comparing models, you need a solid mental model of what an embedding actually is.

An embedding is a dense vector of floating-point numbers. When you embed the sentence “The cat sat on the mat”, you get something like:

[0.0231, -0.1847, 0.0934, 0.2201, ..., -0.0423]  # 1536 numbers

These numbers encode the meaning of the text. More precisely: texts with similar meanings will produce vectors that are close together in the 1536-dimensional space, as measured by cosine similarity.

The geometric intuition: Imagine each text as a point in a very high-dimensional space. “Dog” and “puppy” are placed near each other. “Dog” and “automobile” are far apart. The step from “France” to “Paris” points in roughly the same direction as the step from “Germany” to “Berlin”, so the relationship between related concepts is roughly consistent across the space.

This is why semantic search works: when you embed “How do I reset my password?”, the resulting vector is close to the vector for “Steps to recover account access” even though those phrases share no words. The geometry captures meaning.
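To make the geometry concrete, here is a minimal sketch with hand-made 3-dimensional vectors. The vectors, the `toy` dictionary, and the `nearest` helper are invented for illustration; a real model produces hundreds or thousands of dimensions with no human-readable axes.

```python
import numpy as np

# Hypothetical 3-d vectors, hand-made for illustration (not model output).
# Axes, loosely: [animal-ness, vehicle-ness, youth]
toy = {
    "dog": np.array([0.9, 0.0, 0.1]),
    "puppy": np.array([0.9, 0.0, 0.8]),
    "automobile": np.array([0.0, 1.0, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two 1-d arrays."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(word):
    """Return the other word whose vector points in the most similar direction."""
    return max((w for w in toy if w != word), key=lambda w: cosine(toy[word], toy[w]))

print(nearest("dog"))  # puppy
```

Semantic search is the same operation at scale: the query plays the role of "dog" and the stored chunks are the candidates.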

The MTEB Benchmark: How Embedding Models Are Compared

The Massive Text Embedding Benchmark (MTEB) is the standard evaluation for embedding models. Its English leaderboard averages over 56 datasets spanning eight task types, including retrieval, classification, clustering, and semantic similarity; the full benchmark covers 112 languages.

The retrieval score is most relevant for RAG. Here are the numbers for models you’ll actually use (as of early 2026):

Model	MTEB Retrieval Score	Dimensions	Max Tokens	Cost per 1M tokens	Runs Locally
OpenAI text-embedding-3-large	54.9	3072	8191	$0.13	No
OpenAI text-embedding-3-small	51.7	1536	8191	$0.02	No
Cohere embed-v3-english	54.5	1024	512	$0.10	No
Cohere embed-v3-multilingual	52.2	1024	512	$0.10	No
BGE-M3 (BAAI)	54.0	1024	8192	Free	Yes
all-MiniLM-L6-v2	41.0	384	256	Free	Yes

The key insight: BGE-M3 is competitive with OpenAI text-embedding-3-large at zero marginal cost. For any cost-sensitive or privacy-sensitive deployment, this is the most important number in that table.

Model 1: OpenAI text-embedding-3-small

The pragmatic default for most RAG projects.

Why it’s the standard starting point:

Fast: ~100ms per batch of 100 chunks
Cheap: $0.02/1M tokens, so a 500-page document corpus costs under a cent to embed (assuming roughly 500 tokens per page)
High quality: 51.7 MTEB is genuinely strong for most domains
Dimensionality: 1536 dims balances quality and storage

# pip install langchain-openai
from langchain_openai import OpenAIEmbeddings
import os

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.environ["OPENAI_API_KEY"]
)

# Embed a single query
query_vector = embeddings.embed_query("How do I configure SSO?")
print(f"Query vector: {len(query_vector)} dimensions")

# Embed a batch of documents (more efficient than one at a time)
texts = ["First document.", "Second document.", "Third document."]
doc_vectors = embeddings.embed_documents(texts)
print(f"Embedded {len(doc_vectors)} documents")

The dimensionality reduction feature: One clever feature of the text-embedding-3 family is that you can request fewer dimensions while retaining most of the quality. This saves storage and speeds up similarity search:

# Request 512 dimensions instead of 1536 — 3x smaller, ~95% of quality
embeddings_small = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=512
)
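The storage saving is easy to estimate. The sketch below assumes vectors are stored as 4-byte float32 values and ignores index overhead and metadata, which vary by vector store; `index_size_mb` is a hypothetical helper, not part of any library.

```python
def index_size_mb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw size of the vectors alone, in megabytes (float32 by default)."""
    return num_vectors * dims * bytes_per_value / 1_000_000

# Compare full-size and reduced vectors for a 100,000-chunk corpus.
for dims in (1536, 512):
    print(f"{dims} dims: {index_size_mb(100_000, dims):.1f} MB")
```

For 100,000 chunks this works out to roughly 614 MB at 1536 dimensions versus roughly 205 MB at 512.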

OpenAI achieves this via Matryoshka Representation Learning (MRL): the model is trained so that the first N dimensions of a full embedding are themselves a high-quality lower-dimensional embedding. Truncation therefore degrades quality gradually and predictably rather than arbitrarily.
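If you already have full-size vectors stored and want to shorten them yourself rather than re-requesting them with the dimensions parameter, the mechanics are a slice followed by re-normalization. This is a minimal sketch: `truncate_embedding` is a hypothetical helper, and the random vector is only a stand-in, so it demonstrates the mechanics rather than the quality retention, which depends on the model having been trained for this.

```python
import numpy as np

def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    head = np.asarray(vector, dtype=np.float64)[:dims]
    norm = np.linalg.norm(head)
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return head / norm

# Stand-in for a real 1536-d embedding (random values, for illustration only).
full = np.random.default_rng(0).normal(size=1536)
full /= np.linalg.norm(full)

short = truncate_embedding(full, 512)
print(short.shape)
```

Re-normalizing matters because a slice of a unit-length vector is generally shorter than unit length, which would distort dot-product scores.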

Model 2: OpenAI text-embedding-3-large

The quality upgrade when text-embedding-3-small isn’t enough.

When to upgrade:

Your RAGAS context precision score is below 0.65 with the small model
Your domain is highly technical and jargon-heavy
You’re building a system where retrieval quality is more important than cost

embeddings_large = OpenAIEmbeddings(
    model="text-embedding-3-large",
    dimensions=1024  # Can reduce from default 3072
)

Cost calculation for a realistic corpus:

A medium-sized enterprise knowledge base: 2,000 documents × 500 tokens average = 1M tokens to embed.

text-embedding-3-small: 1M tokens × $0.02/1M = $0.02
text-embedding-3-large: 1M tokens × $0.13/1M = $0.13

Ingestion cost is not the concern — you pay it once. The ongoing cost is per-query embedding: embedding the user’s question with each search. At 10,000 queries/month:

small: 10,000 × 15 tokens average = 150K tokens → $0.003/month
large: same → $0.02/month

Neither is expensive. For most projects, the cost difference between small and large is irrelevant. Choose based on quality needs.
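The arithmetic above can be reproduced with a few lines, which is handy for plugging in your own corpus size and query volume. The prices come from the table earlier in this lesson; `embedding_cost` is a hypothetical helper.

```python
# USD per 1M tokens, as listed in the comparison table.
PRICE_PER_MILLION = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def embedding_cost(tokens: int, model: str) -> float:
    """Cost in USD to embed `tokens` tokens with the given model."""
    return tokens / 1_000_000 * PRICE_PER_MILLION[model]

ingest_tokens = 2_000 * 500          # 2,000 documents x 500 tokens
monthly_query_tokens = 10_000 * 15   # 10,000 queries x 15 tokens

for model in PRICE_PER_MILLION:
    ingest = embedding_cost(ingest_tokens, model)
    monthly = embedding_cost(monthly_query_tokens, model)
    print(f"{model}: ingest ${ingest:.2f}, queries ${monthly:.4f}/month")
```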

Model 3: Cohere embed-v3

Cohere’s embedding model has two notable advantages.

Advantage 1: Multilingual support

The embed-v3-multilingual model is specifically designed for multilingual retrieval. If your documents are in French, Spanish, Japanese, or German — or if users will ask questions in multiple languages — Cohere’s multilingual model significantly outperforms OpenAI’s models for cross-lingual retrieval.

Advantage 2: Input type specification

Cohere’s API lets you specify whether you’re embedding a query or a document, and it optimizes the embedding accordingly. This “asymmetric embedding” approach slightly improves retrieval quality:

# pip install langchain-cohere
from langchain_cohere import CohereEmbeddings

# One client serves both roles (reads COHERE_API_KEY from the environment).
# Assumption to verify against your installed version: the LangChain wrapper
# sets input_type itself, per method, rather than taking it as a constructor argument.
embeddings = CohereEmbeddings(model="embed-english-v3.0")

chunks = ["First chunk of a document.", "Second chunk of a document."]

# embed_documents -> input_type="search_document" (optimized for stored content)
doc_vectors = embeddings.embed_documents(chunks)

# embed_query -> input_type="search_query" (optimized for questions)
query_vector = embeddings.embed_query("How do I configure SSO?")

Note the 512-token context limit — if your chunks are larger than ~380 words, Cohere silently truncates them. Check your chunk sizes against this limit.
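A quick pre-flight check can flag chunks that are likely to be cut off. The sketch below uses a rough word-count heuristic (about 4/3 tokens per English word, consistent with the ~380-word figure above) instead of Cohere's actual tokenizer, so treat its output as an estimate; both helper names are hypothetical.

```python
def estimate_tokens(text: str, tokens_per_word: float = 4 / 3) -> int:
    """Rough token estimate from a whitespace word count (heuristic only)."""
    return round(len(text.split()) * tokens_per_word)

def oversized_chunks(chunks, max_tokens=512):
    """Return (index, estimated_tokens) for chunks likely to be truncated."""
    flagged = []
    for i, chunk in enumerate(chunks):
        n = estimate_tokens(chunk)
        if n > max_tokens:
            flagged.append((i, n))
    return flagged

chunks = ["short chunk", "word " * 500]
print(oversized_chunks(chunks))  # [(1, 667)]
```

Any flagged chunk is a candidate for re-splitting before indexing.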

Model 4: BGE-M3 (Open Source, Runs Locally)

BGE-M3 from the Beijing Academy of Artificial Intelligence (BAAI) is arguably the most important open-source embedding model available.

Why it matters:

Free: No API cost, no per-token billing
Privacy: Your documents never leave your infrastructure
Competitive quality: 54.0 MTEB — comparable to OpenAI’s large model
Long context: 8192 token maximum context (much longer than Cohere’s 512)
Dense + Sparse: BGE-M3 can produce both dense (semantic) and sparse (BM25-style) vectors simultaneously — useful for hybrid search (covered in Lesson 7)

# pip install sentence-transformers langchain-community
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
import torch

# Detect available hardware
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": device},
    encode_kwargs={
        "normalize_embeddings": True,  # required for cosine similarity
        "batch_size": 32               # process 32 texts at once
    }
)

# Works identically to OpenAI embeddings in LangChain
query_vector = embeddings.embed_query("How do I configure SSO?")
print(f"Vector dimensions: {len(query_vector)}")  # 1024

Performance benchmark on Apple M2 Pro:

First run: ~8 seconds to load, plus a one-time download of the model weights (roughly 2.2 GB for its ~570M parameters)
Subsequent batches of 32 chunks: ~0.5 seconds
Compare: OpenAI API for 32 chunks: ~0.3 seconds

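For local development, the performance difference is negligible. In production with GPU acceleration, BGE-M3 can be faster than the OpenAI API because it avoids the network round trip, though throughput depends on your hardware and batch size.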
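Numbers like these are hardware-specific, so it is worth measuring on your own machine. Here is a minimal timing sketch that works with any `embed_documents`-style callable; `time_embedding` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in so the example runs without a model.

```python
import time

def time_embedding(embed_documents, texts, batch_size=32):
    """Return average seconds per batch for an embed_documents-style callable."""
    durations = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        t0 = time.perf_counter()
        embed_documents(batch)
        durations.append(time.perf_counter() - t0)
    return sum(durations) / len(durations)

def fake_embed(batch):
    """Stand-in embedder: returns a dummy 4-d vector per text."""
    return [[0.0] * 4 for _ in batch]

avg = time_embedding(fake_embed, ["some text"] * 100)
print(f"{avg:.6f} s per batch")
```

To benchmark a real model, pass `embeddings.embed_documents` in place of `fake_embed`, and discard the first batch if you want to exclude model warm-up.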

Model 5: all-MiniLM-L6-v2 (Lightweight Baseline)

The fastest option when you’re on very limited hardware or need ultra-low latency:

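# pip install sentence-transformers langchain-community
# Use the generic HuggingFaceEmbeddings wrapper here: the BGE-specific class
# prepends a BGE query instruction that this model was not trained with.
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)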

Trade-offs: 384-dimensional embeddings and an MTEB score of 41.0, noticeably worse than the others on domain-specific or complex queries. Use it only if you need the absolute minimum memory footprint (the model has ~22M parameters, roughly 90 MB on disk, versus ~570M parameters and roughly 2.2 GB for BGE-M3).

Understanding Cosine Similarity

Most embedding-based retrieval uses cosine similarity to measure how close two vectors are (for unit-length vectors it is equivalent to the dot product):

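import numpy as np
from langchain_openai import OpenAIEmbeddings

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector lengths
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))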

# Test semantic relationships
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

pairs = [
    ("How do I reset my password?", "Steps to recover account access"),
    ("How do I reset my password?", "What are the billing options?"),
    ("cat", "kitten"),
    ("cat", "automobile"),
]

for text1, text2 in pairs:
    v1 = embeddings.embed_query(text1)
    v2 = embeddings.embed_query(text2)
    sim = cosine_similarity(v1, v2)
    print(f"{sim:.3f}: '{text1}' vs '{text2}'")

# Illustrative output (exact values depend on the model):
# 0.847: 'How do I reset my password?' vs 'Steps to recover account access'
# 0.312: 'How do I reset my password?' vs 'What are the billing options?'
# 0.891: 'cat' vs 'kitten'
# 0.234: 'cat' vs 'automobile'

Values range from -1 to 1. As a rough guide (exact thresholds vary from model to model, so calibrate on your own data):

> 0.85: highly related (likely the same concept)
0.70–0.85: related (similar domain or topic)
0.50–0.70: loosely related
< 0.50: unrelated

Your vector store retrieves the k chunks with the highest cosine similarity to the query vector.
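Top-k retrieval can be sketched as a brute-force scan in NumPy. This is an illustration of the idea only: real vector stores typically use approximate nearest-neighbor indexes rather than scoring every vector, and `top_k` is a hypothetical helper.

```python
import numpy as np

def top_k(query, doc_matrix, k=2):
    """Return (index, score) for the k rows most cosine-similar to the query."""
    q = np.asarray(query, dtype=float)
    d = np.asarray(doc_matrix, dtype=float)
    q = q / np.linalg.norm(q)                          # unit-length query
    d = d / np.linalg.norm(d, axis=1, keepdims=True)   # unit-length rows
    scores = d @ q                                     # cosine similarity per row
    order = np.argsort(-scores)[:k]                    # highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-d "document" vectors, invented for illustration.
docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
results = top_k([1.0, 0.1], docs, k=2)
print([i for i, _ in results])  # [0, 1]
```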

The Consistency Rule: One Model for Everything

This cannot be overstated: you must use the exact same embedding model for both document indexing and query embedding.

Different embedding models are trained differently and produce vectors in incompatible spaces. Mixing models is like using a compass calibrated for magnetic north while your map uses true north — the directions look similar but everything is slightly off, and your navigation gets progressively worse.

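# WRONG: indexing with one model, querying with another
indexing_embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
# ... index all documents into `vectorstore` with indexing_embeddings ...

query_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # WRONG!
query_vector = query_embeddings.embed_query("query")
# The query vector lives in a different space (and here has a different
# dimensionality), so the search fails or returns meaningless matches.
results = vectorstore.similarity_search_by_vector(query_vector)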

When you switch embedding models (e.g., upgrading from 3-small to BGE-M3), you must re-embed your entire corpus. There is no shortcut. Build your system with this in mind: make re-indexing a runnable script, not a one-off manual process.
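One way to enforce the rule is to record the model name and dimensionality alongside the index and check them on every write and query. The `GuardedIndex` class below is a hypothetical in-memory toy, not a LangChain or vector-store API; in a real system the same check would sit in front of your store, with the metadata persisted next to the index.

```python
from dataclasses import dataclass, field

@dataclass
class GuardedIndex:
    """Toy in-memory index that remembers which model produced its vectors."""
    model_name: str
    dims: int
    vectors: list = field(default_factory=list)

    def _check(self, model_name, vector):
        # Reject vectors from a different model or with the wrong size.
        if model_name != self.model_name:
            raise ValueError(f"index built with {self.model_name!r}, got {model_name!r}")
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dims, got {len(vector)}")

    def add(self, model_name, vector):
        self._check(model_name, vector)
        self.vectors.append(list(vector))

    def validate_query(self, model_name, vector):
        self._check(model_name, vector)
        return True

index = GuardedIndex(model_name="text-embedding-3-small", dims=3)
index.add("text-embedding-3-small", [0.1, 0.2, 0.3])
try:
    index.validate_query("text-embedding-3-large", [0.1, 0.2, 0.3])
except ValueError as err:
    print(f"rejected: {err}")
```

Failing loudly at query time is far cheaper than debugging silently degraded retrieval after a model swap.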

The Decision Guide

Start here: OpenAI text-embedding-3-small
  Pros: Great quality, cheap, fast API, easy setup
  Use for: Most projects, quick prototyping to production

Need multilingual support?
  Yes → Cohere embed-v3-multilingual
  Note: Watch the 512 token limit

Need privacy / no API calls / want to reduce ongoing cost?
  Yes → BGE-M3 (local)
  Note: ~2.2 GB model download, needs GPU for best performance at scale

Found quality insufficient after evaluation (RAGAS < 0.65)?
  Try → OpenAI text-embedding-3-large
  Or → BGE-M3 (often matches quality, zero cost)

Need minimum footprint for edge deployment?
  Try → all-MiniLM-L6-v2 (but accept quality degradation)
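
The guide can also be written as a small function, which is convenient if you want the choice recorded in configuration code. The function name, its flags, and the order in which the checks are applied are assumptions made for this sketch; the guide itself does not rank the questions.

```python
def choose_embedding_model(
    multilingual: bool = False,
    needs_local: bool = False,
    quality_insufficient: bool = False,
    edge_deployment: bool = False,
) -> str:
    """Map the decision-guide questions to a model name (priority order assumed)."""
    if edge_deployment:
        return "all-MiniLM-L6-v2"
    if needs_local:
        return "BGE-M3"
    if multilingual:
        return "Cohere embed-v3-multilingual"
    if quality_insufficient:
        return "OpenAI text-embedding-3-large"
    return "OpenAI text-embedding-3-small"

print(choose_embedding_model())                  # OpenAI text-embedding-3-small
print(choose_embedding_model(needs_local=True))  # BGE-M3
```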

For the capstone project in this course, we use text-embedding-3-small. It produces excellent results for the LangChain documentation corpus, costs a few cents to index, and is the fastest to get running. If you’re building on a budget or need offline operation, substitute BGE-M3, which exposes the same LangChain interface.

In the next lesson, we put these embeddings to work in actual vector databases.
