📖 Lesson ⏱️ 90 minutes

Document Ingestion and Parsing

Load PDFs, HTML, Word docs, and databases — handle messy real-world documents

Why Ingestion Is the Step That Breaks Most RAG Systems

Every RAG tutorial starts with a clean text file. Real life gives you something very different.

Real documents are a mess: PDFs with two-column layouts, embedded images, and tables that span multiple pages. Word documents with tracked changes and revision marks. HTML pages with navigation menus, cookie banners, and ads mixed in with the actual content. Scanned PDFs with no text layer at all, or with one produced by OCR of dubious quality.

If your ingestion pipeline produces garbage, every step downstream — chunking, embedding, retrieval — amplifies that garbage. A well-engineered RAG system can be completely destroyed by bad parsing. This lesson shows you how to do it right.


The Metadata Principle: Never Lose Context

Before we look at specific formats, here is the single most important principle in document ingestion:

Always preserve and enrich metadata as you load.

Every chunk you create needs to remember where it came from: which file, which page, which section. You will need this information for two reasons:

  1. Citations: Users trust answers that say “Source: Q4 2025 Report, page 7” far more than answers that just appear.
  2. Filtering: Later you may want to retrieve only from documents updated in the last 30 days, or only from the “Legal” folder.

Good metadata looks like this:

{
    "source": "q4_2025_report.pdf",
    "page": 7,
    "section": "Revenue Breakdown",
    "file_type": "pdf",
    "created_at": "2025-11-15",
    "department": "Finance"
}

Add all the metadata you can at load time. It costs almost nothing, and once the text has been split into chunks it is usually difficult or impossible to reconstruct.


Loading PDFs: PyPDF vs PyMuPDF

PDFs are the most common format in enterprise RAG systems — and the most treacherous. There are two serious Python options.

PyPDF (the common choice, often the wrong one)

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("contract.pdf")
pages = loader.load()

# Each page is a Document with page_content and metadata
print(pages[0].page_content[:300])
print(pages[0].metadata)
# {'source': 'contract.pdf', 'page': 0}

PyPDF is fast and easy. For simple text-heavy PDFs, it works well. But it has a significant weakness: its plain text extraction does not preserve table structure.

When PyPDF encounters a table like this:

| Plan     | Monthly | Annual |
|----------|---------|--------|
| Starter  | $29     | $290   |
| Pro      | $99     | $990   |
| Enterprise | Custom | Custom |

…it extracts the text in reading order left-to-right, top-to-bottom, producing something like:

Plan Monthly Annual Starter $29 $290 Pro $99 $990 Enterprise Custom Custom

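That single flattened string is now one of your chunks. The column headers are no longer attached to their values, so when a user asks “What is the Pro plan monthly price?”, nothing in the chunk says that $99 is the monthly price and $990 the annual one. Retrieval and answer quality both suffer.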
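
One common way to keep a table retrievable, assuming you can get the cells as rows from a table-aware extractor (plain PyPDF text extraction does not give you this), is to turn each row into a self-describing line that repeats the column names. The helper name `rows_to_records` is hypothetical; this is a sketch of the idea, not a library API.

```python
def rows_to_records(header: list[str], rows: list[list[str]]) -> list[str]:
    """Turn each table row into a self-describing line that repeats the column names."""
    return [
        "; ".join(f"{col}: {val}" for col, val in zip(header, row))
        for row in rows
    ]

header = ["Plan", "Monthly", "Annual"]
rows = [
    ["Starter", "$29", "$290"],
    ["Pro", "$99", "$990"],
    ["Enterprise", "Custom", "Custom"],
]

for line in rows_to_records(header, rows):
    print(line)
# Plan: Starter; Monthly: $29; Annual: $290
# Plan: Pro; Monthly: $99; Annual: $990
# Plan: Enterprise; Monthly: Custom; Annual: Custom
```

Each line now carries its own column labels, so a chunk containing only the Pro row still says which number is the monthly price.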

PyMuPDF (the better default)

# pip install pymupdf
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("contract.pdf")
pages = loader.load()

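# PyMuPDF also surfaces the PDF's document properties as metadata
print(pages[0].metadata)
# e.g. {'source': 'contract.pdf', 'page': 0, 'total_pages': 45,
#       'author': 'Legal Team', 'creator': 'Word', 'producer': 'Adobe PDF', ...}
# (exact keys depend on the loader version and on what the PDF records)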

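PyMuPDF (imported as fitz in older releases and as pymupdf in newer ones) is built on the MuPDF engine. In practice its text extraction tends to preserve text block order and table layout better than PyPDF, though neither library guarantees correct reading order on complex layouts. It also extracts richer metadata from the PDF’s internal properties.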

Rule: Default to PyMuPDF for PDFs. Use PyPDF only if you specifically need its simpler interface for text-only documents.

The Scanned PDF Problem

Both libraries fail completely on scanned PDFs — documents that are images of text, not actual text. A scanned PDF looks like a PDF but has no text layer; it’s just a picture.

For these, you need OCR (Optical Character Recognition):

# pip install pytesseract pillow pdf2image
# Also requires: brew install tesseract (macOS) or apt-get install tesseract-ocr (Linux)
import pytesseract
from pdf2image import convert_from_path
from langchain_core.documents import Document

def load_scanned_pdf(filepath: str) -> list[Document]:
    """Load a scanned PDF using OCR."""
    images = convert_from_path(filepath, dpi=300)
    documents = []
    
    for page_num, image in enumerate(images):
        # OCR the image
        text = pytesseract.image_to_string(image, lang='eng')
        
        # Clean up common OCR artifacts
        text = text.replace('\x0c', '')  # form feed characters
        text = ' '.join(text.split())    # normalize whitespace
        
        if text.strip():  # skip blank pages
            documents.append(Document(
                page_content=text,
                metadata={
                    "source": filepath,
                    "page": page_num,
                    "extraction_method": "ocr"
                }
            ))
    
    return documents

pages = load_scanned_pdf("scanned_legal_brief.pdf")

OCR quality varies significantly based on scan quality. Always check metadata["extraction_method"] in your evaluation pipeline — OCR-derived chunks often need different handling.
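
Routing files to OCR requires knowing which ones lack a usable text layer. One simple heuristic, sketched below, is to run a normal loader first and check how many pages came back essentially empty. The function name `looks_scanned` and both thresholds are assumptions made for illustration and should be tuned on your own files.

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 20,
                  max_empty_ratio: float = 0.8) -> bool:
    """Guess whether a PDF is scanned from the text a normal loader extracted.

    page_texts holds one extracted string per page. A page counts as "empty"
    when it has fewer than min_chars_per_page non-whitespace characters.
    Both thresholds are illustrative assumptions, not established defaults.
    """
    if not page_texts:
        return True  # nothing extracted at all: treat as needing OCR
    empty_pages = sum(
        1 for text in page_texts
        if len("".join(text.split())) < min_chars_per_page
    )
    return empty_pages / len(page_texts) >= max_empty_ratio

print(looks_scanned(["", "  \n", ""]))
print(looks_scanned(["This page has plenty of real extracted text."] * 3))
```

In a pipeline you could pass `[p.page_content for p in pages]` from the PDF loader and fall back to `load_scanned_pdf` when the check returns True.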


Loading HTML: Cleaning Web Pages

HTML documents contain a lot of noise: navigation menus, footers, cookie notices, social sharing buttons, and ads. You need to extract just the main content.

# pip install beautifulsoup4 lxml
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Target only the main content areas
loader = WebBaseLoader(
    web_paths=["https://docs.example.com/api-reference"],
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(
            class_=("content", "main-content", "article-body")
        )
    }
)

docs = loader.load()

For bulk loading of documentation sites, iterate over a sitemap:

from langchain_community.document_loaders import SitemapLoader

loader = SitemapLoader(
    web_path="https://docs.example.com/sitemap.xml",
    filter_urls=["https://docs.example.com/"],  # only docs, not blog
)

docs = loader.load()
print(f"Loaded {len(docs)} pages")

A critical post-processing step for HTML: strip residual HTML tags that slip through, and normalize whitespace:

import re

def clean_html_document(doc):
    text = doc.page_content
    # Remove any leftover HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    doc.page_content = text
    return doc

cleaned_docs = [clean_html_document(d) for d in docs]
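
When a page has no convenient content class to target, another option is to drop known boilerplate elements by tag name. The sketch below uses only the standard library's `html.parser`; the `SKIP_TAGS` set, the `MainTextExtractor` class, and the sample page are assumptions for illustration, and real sites often need a site-specific list.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an assumption; adjust per site).
SKIP_TAGS = {"nav", "footer", "header", "aside", "script", "style"}

class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # greater than 0 while inside a boilerplate element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_main_text(html: str) -> str:
    """Return the page text with boilerplate elements removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = """
<html><body>
  <nav><a href="/">Home</a> <a href="/docs">Docs</a></nav>
  <main><h1>API Reference</h1><p>Use the <code>/v1/items</code> endpoint.</p></main>
  <footer>Cookie settings</footer>
  <script>trackVisitor();</script>
</body></html>
"""
print(extract_main_text(page))
# API Reference Use the /v1/items endpoint.
```

Because this works on the parsed structure rather than on leftover text, it avoids the main risk of regex tag stripping, which can also delete legitimate content containing angle brackets.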

Loading Word Documents

Microsoft Word (.docx) files are common in enterprise settings. LangChain’s Docx2txtLoader, which relies on the docx2txt package, extracts the plain text:

# pip install docx2txt
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("employee_handbook.docx")
docs = loader.load()

Word documents present a different challenge: they often use heading styles to organize content. Extracting headings as metadata is extremely valuable for chunking context:

# pip install python-docx
from docx import Document as DocxDocument
from langchain_core.documents import Document

def load_docx_with_headings(filepath: str) -> list[Document]:
    """Load DOCX preserving heading structure in metadata."""
    docx = DocxDocument(filepath)
    documents = []
    current_heading = "Introduction"  # label for text before the first heading
    current_level = ""                # style name of current_heading, e.g. "Heading 2"
    current_text = []

    def flush_section():
        # Emit accumulated text under the heading it belongs to
        if current_text:
            documents.append(Document(
                page_content='\n'.join(current_text),
                metadata={
                    "source": filepath,
                    "section": current_heading,
                    "heading_level": current_level
                }
            ))

    for para in docx.paragraphs:
        style = para.style.name

        if style.startswith('Heading'):
            # Save accumulated text under the previous heading before switching
            flush_section()
            current_heading = para.text
            current_level = style
            current_text = []
        elif para.text.strip():
            current_text.append(para.text)

    # Don't forget the last section
    flush_section()
    return documents

sections = load_docx_with_headings("employee_handbook.docx")
for s in sections[:3]:
    print(f"Section: {s.metadata['section']}")
    print(f"Content preview: {s.page_content[:100]}")
    print()

This produces sections that are semantically coherent units. Even before chunking, you’ve created meaningful boundaries.
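
A single section title can be ambiguous on its own ("Eligibility" under which benefit?). One possible enrichment is to record the full heading path. The sketch below assumes you have already reduced each heading to a `(level, title)` pair, for example by parsing the number out of a style name such as "Heading 2"; `heading_paths` and the sample outline are hypothetical.

```python
def heading_paths(headings: list[tuple[int, str]]) -> list[str]:
    """Map a flat list of (level, title) headings to breadcrumb-style paths."""
    stack: list[tuple[int, str]] = []  # currently open headings, outermost first
    paths = []
    for level, title in headings:
        # Close any headings at the same or a deeper level
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append(" > ".join(t for _, t in stack))
    return paths

# Hypothetical outline; levels would come from style names such as "Heading 2"
outline = [(1, "Benefits"), (2, "Health Insurance"), (2, "Retirement"), (1, "Conduct")]
for path in heading_paths(outline):
    print(path)
# Benefits
# Benefits > Health Insurance
# Benefits > Retirement
# Conduct
```

Storing the path in a metadata field such as `section_path` gives later chunks the context of their parent headings.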


Loading Plain Text and CSV

For plain text files, keep it simple:

from langchain_community.document_loaders import TextLoader

loader = TextLoader("release_notes.txt", encoding="utf-8")
docs = loader.load()

For CSV files with structured data (FAQ databases, product catalogs), each row often makes a good document:

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(
    file_path="faq.csv",
    source_column="question",          # which column to use as source metadata
    metadata_columns=["category", "last_updated"]  # columns to keep as metadata
)

docs = loader.load()
# Each row becomes a Document
# page_content = "column: value" lines for the columns not listed in metadata_columns
# metadata = {"source": question_text, "row": ..., "category": ..., "last_updated": ...}
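
To make the row-to-document mapping concrete, here is a standard-library approximation of it. It is a sketch of the idea, not LangChain's implementation; `csv_rows_to_documents` and the sample CSV are made up for illustration.

```python
import csv
import io

def csv_rows_to_documents(csv_text: str, source_column: str,
                          metadata_columns: list[str]) -> list[dict]:
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    for row_num, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # Content: one "column: value" line per non-metadata column
        content = "\n".join(
            f"{col}: {val}" for col, val in row.items()
            if col not in metadata_columns
        )
        metadata = {"source": row[source_column], "row": row_num}
        metadata.update({col: row[col] for col in metadata_columns})
        documents.append({"page_content": content, "metadata": metadata})
    return documents

faq_csv = """question,answer,category,last_updated
How do I reset my password?,Use the reset link on the login page.,Account,2025-10-01
"""
faq_docs = csv_rows_to_documents(faq_csv, "question", ["category", "last_updated"])
print(faq_docs[0]["page_content"])
print(faq_docs[0]["metadata"])
```

Keeping the column names in the content means each FAQ row embeds as a labeled question and answer rather than a bare string of values.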

A Production Ingestion Pipeline

Here is a full ingestion pipeline that handles multiple file types, adds metadata, and handles errors gracefully:

from pathlib import Path
from langchain_core.documents import Document
from langchain_community.document_loaders import (
    PyMuPDFLoader, Docx2txtLoader, TextLoader, CSVLoader
)

def load_document(filepath: str) -> list[Document]:
    """Load any supported document type with error handling."""
    path = Path(filepath)
    extension = path.suffix.lower()
    
    try:
        if extension == '.pdf':
            loader = PyMuPDFLoader(filepath)
        elif extension == '.docx':  # legacy binary .doc files are not supported by Docx2txtLoader
            loader = Docx2txtLoader(filepath)
        elif extension in ('.txt', '.md'):
            loader = TextLoader(filepath, encoding='utf-8')
        elif extension == '.csv':
            loader = CSVLoader(filepath)
        else:
            print(f"Unsupported file type: {extension}")
            return []
        
        docs = loader.load()
        
        # Enrich metadata for all docs from this file
        for doc in docs:
            doc.metadata['filename'] = path.name
            doc.metadata['file_type'] = extension.lstrip('.')  # 'pdf', not '.pdf', matching the metadata example
            doc.metadata['file_size_kb'] = round(path.stat().st_size / 1024, 1)
        
        return docs
        
    except Exception as e:
        print(f"Error loading {filepath}: {e}")
        return []

def ingest_directory(directory: str) -> list[Document]:
    """Recursively ingest all supported files in a directory."""
    all_docs = []
    supported = {'.pdf', '.docx', '.txt', '.md', '.csv'}
    
    for filepath in Path(directory).rglob('*'):
        if filepath.suffix.lower() in supported:
            docs = load_document(str(filepath))
            all_docs.extend(docs)
            print(f"Loaded {len(docs)} pages from {filepath.name}")
    
    print(f"\nTotal: {len(all_docs)} documents loaded from {directory}")
    return all_docs

# Usage
documents = ingest_directory("./company_docs/")
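
Printing errors is fine for a first run, but on a large corpus the messages scroll past and failed files are easy to miss. A possible refinement, sketched here with a stand-in loader, is to return the failures alongside the documents so they can be reviewed or retried. `ingest_with_report` and `fake_loader` are hypothetical names used only for this example.

```python
from pathlib import Path
from typing import Callable

def ingest_with_report(paths: list[str],
                       load_fn: Callable[[str], list]) -> tuple[list, dict[str, str]]:
    """Load every path with load_fn, collecting failures instead of printing them."""
    loaded = []
    failures: dict[str, str] = {}  # path -> reason
    for path in paths:
        try:
            docs = load_fn(path)
        except Exception as exc:  # broad on purpose: one bad file must not stop the run
            failures[path] = f"{type(exc).__name__}: {exc}"
            continue
        if not docs:
            failures[path] = "loader returned no documents"
        loaded.extend(docs)
    return loaded, failures

# Hypothetical loader used only to demonstrate the report
def fake_loader(path: str) -> list[str]:
    if Path(path).suffix == ".pdf":
        raise ValueError("corrupt xref table")
    return [f"contents of {path}"]

loaded_docs, failures = ingest_with_report(["a.txt", "b.pdf"], fake_loader)
print(len(loaded_docs), failures)
```

Because `load_document` above catches its own exceptions and returns an empty list, passing it as `load_fn` would surface failed files through the "loader returned no documents" entry.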

Quality Checks After Loading

Before you proceed to chunking, run these sanity checks:

import random

def audit_documents(docs: list[Document]) -> None:
    """Print a quality report on loaded documents."""
    total = len(docs)
    if total == 0:
        # Avoid dividing by zero and sampling from an empty list
        print("No documents loaded. Check the input path and loaders.")
        return

    empty = sum(1 for d in docs if not d.page_content.strip())
    very_short = sum(1 for d in docs if 0 < len(d.page_content) < 50)
    
    print(f"Total documents: {total}")
    print(f"Empty documents: {empty} ({100*empty/total:.1f}%)")
    print(f"Very short (<50 chars): {very_short} ({100*very_short/total:.1f}%)")
    
    # Check metadata completeness
    missing_source = sum(1 for d in docs if 'source' not in d.metadata)
    print(f"Missing 'source' metadata: {missing_source}")
    
    # Show a random sample
    sample = random.choice(docs)
    print(f"\nSample document:")
    print(f"  Metadata: {sample.metadata}")
    print(f"  Content preview: {sample.page_content[:200]}")

audit_documents(documents)

If empty is high, your loader is failing silently on some files — investigate those files individually. If very_short is high, you may have a lot of page headers, footers, or blank pages being loaded as separate documents. Filter them out before chunking.
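
The filtering step mentioned above could look something like the following. It works on plain strings to stay self-contained; with LangChain Documents you would apply the same checks to `page_content`. The function name `filter_for_chunking` and both thresholds are assumptions, and the repeat check is only a rough proxy for headers and footers.

```python
from collections import Counter

def filter_for_chunking(texts: list[str], min_chars: int = 50,
                        max_repeats: int = 3) -> list[str]:
    """Drop empty, very short, and heavily repeated texts before chunking."""
    counts = Counter(t.strip() for t in texts)
    kept = []
    for text in texts:
        stripped = text.strip()
        if len(stripped) < min_chars:
            continue  # empty or very short: likely a blank page or stray header
        if counts[stripped] > max_repeats:
            continue  # identical text on many pages: likely a repeated notice
        kept.append(text)
    return kept

body = "The Pro plan costs $99 per month and includes priority support."
notice = "Confidential. Do not distribute outside the company without approval."
texts = ["", "Page 3", body, notice, notice, notice, notice]

print(filter_for_chunking(texts))
```

Review a sample of what gets dropped before trusting the thresholds, since a short document is not always a useless one.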


The Key Takeaways

  1. Use PyMuPDF over PyPDF for most PDFs. It handles layout and tables better.
  2. Scanned PDFs require OCR — detect them early (a scanned PDF has no text layer) and route them through pytesseract.
  3. HTML needs aggressive cleaning — strip navigation, footers, and boilerplate before indexing.
  4. Preserve all metadata at load time: source, page, section, date. You cannot reconstruct it later.
  5. Audit after loading — check for empty documents, OCR failures, and truncated content before moving to chunking.

In the next lesson, you will learn how to split these documents into chunks — and why this decision has more impact on RAG quality than almost anything else.