Hey there! 👋
Week 1 was all about infrastructure first, AI second. We built the foundation: PostgreSQL and OpenSearch ready, the Airflow dashboard looking great, Docker orchestration working like magic.
Missed Week 1? Catch up:
But here's the thing: our database had exactly zero rows.
Beautiful infrastructure. Perfect setup. Completely empty.
Sound familiar? This is the moment every AI engineer faces when they move beyond tutorials into the real world. You've got the pipes, but no water flowing through them.
This week, we changed that!
The Uncomfortable Truth About Data Pipelines
Let us be brutally honest: Most RAG tutorials are basically fraud.
They give you a nice, clean CSV/PDF file and say "now build RAG!" But in production? You're staring at:
APIs with rate limits that will ban you faster than you can think
PDFs that seem designed to break every parser known to humanity
Network timeouts, corrupt downloads, and the soul-crushing realization that data engineering is where dreams go to die
Data pipelines aren't sexy. They don't make for impressive demos. You can't screenshot a rate-limiting algorithm or put async error handling in your portfolio.
This isn't a bug - it's a feature. Data engineering IS AI engineering. The algorithms are just the cherry on top of a massive infrastructure sundae.
The Foundation Everything Else Depends On
Every successful AI system you've ever used - search engines, recommendation systems, language models - has industrial-strength data pipelines at its core.
Google's magic? Crawling and processing the entire web, reliably, every day
Netflix recommendations? Ingesting viewing data from millions of users in real-time
ChatGPT's intelligence? Processing terabytes of text with relentless attention to data quality
The algorithms get the glory. The data pipelines do the work!
What We Built: A Data Pipeline That Actually Works
After a week of pain, debugging, and learning, we transformed our beautiful-but-empty infrastructure into something magical: a self-sustaining research paper ingestion machine.
The Transformation
Before Week 2:
Beautiful infrastructure
Zero papers in database
Manual everything
Dreams of automation
After Week 2:
Automated arXiv Integration → Daily CS.AI paper fetching with bulletproof rate limiting
Scientific PDF Processing → Docling-powered extraction that actually understands academic papers
Production Airflow Workflows → Robust pipelines with retry logic and comprehensive error handling
Living Database → Rich, structured paper content growing automatically every day
The result?
Our system now wakes up every morning at 6 AM UTC, fetches the latest AI research papers, processes them, and adds structured content to our database.
Without any human intervention.
Zero to 100+ papers daily. Completely automated. Actually production grade!
This is what separates real systems from demo code.
The Architecture That Makes It Work
The Service Layer That Changes Everything
Look at what we added to our codebase:
src/services/
├── arxiv/ # Smart arXiv API integration
│ ├── client.py # Rate-limited, retry-aware client
│ └── factory.py # Clean service instantiation
├── pdf_parser/ # Scientific PDF processing
│ ├── docling.py # Docling integration for academic papers
│ ├── parser.py # Abstract interface
│ └── factory.py # Parser factory pattern
└── metadata_fetcher.py # The orchestrator that ties everything together
This isn't just "call an API and save to database." This is production-grade service architecture (sketched in code below) with:
Factory patterns for clean dependency injection
Abstract interfaces for swappable implementations
Async processing for concurrent operations
Comprehensive error handling for real-world reliability
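To make the pattern concrete, here's a minimal sketch of the abstract-interface-plus-factory idea. The class and function names are illustrative stand-ins, not the actual code in parser.py and factory.py:

from abc import ABC, abstractmethod

class PDFParser(ABC):
    """Abstract interface: every parser implementation promises a parse() method."""

    @abstractmethod
    def parse(self, pdf_path: str) -> dict:
        """Return structured content (sections, tables, references) for one PDF."""

class DoclingPDFParser(PDFParser):
    def parse(self, pdf_path: str) -> dict:
        # Docling-specific extraction would live here
        return {"path": pdf_path, "sections": [], "tables": []}

def create_parser(kind: str = "docling") -> PDFParser:
    # Factory: callers ask for "a parser", never for a concrete class
    if kind == "docling":
        return DoclingPDFParser()
    raise ValueError(f"Unknown parser type: {kind}")

Swapping Docling for another engine later means adding one class and one branch in the factory - nothing upstream changes.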
Why This Architecture Matters
Testable: Each service has clear interfaces and responsibilities
Scalable: Async processing handles concurrent operations elegantly
Maintainable: Factory patterns make swapping implementations trivial
Observable: Comprehensive logging and error handling throughout
Lesson 1: The arXiv API Rate Limit
Picture your IP getting banned from arXiv halfway through a run. (We're joking - that didn't happen to us. But this exact scenario happens to most engineers building their first data pipeline.)
The pattern is always the same: Script works fine for a few requests → confidence builds → scale it up → 💥 BANNED 💥
arXiv isn't some random startup API you can hammer. It's a global scholarly repository serving researchers worldwide. They have rules: 3-second delays between requests, period.
Break them, and you're not just rate-limited, you're potentially disrupting access for scientists everywhere.
Lesson learned: Respect the services you depend on.
The arXiv Rules Are Non-Negotiable
The guidelines are crystal clear:
At least 3 seconds between requests
Not 2.9 seconds
Not "usually 3 seconds"
A full 3 seconds, every single time
This isn't politeness - it's how you build systems that run for months without breaking. Our ArxivClient became a masterclass in respectful engineering (a simplified sketch follows this list):
✅ Request timestamp tracking (we know exactly when we last hit the API)
✅ Intelligent delays (if it's been 2.1 seconds, we wait 0.9 more)
✅ Exponential backoff (graceful handling when things go wrong)
✅ Smart caching (never download the same PDF twice)
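Here's a simplified sketch of the timestamp-tracking idea - illustrative only, not the actual ArxivClient code:

import asyncio
import time

class ArxivRateLimiter:
    """Enforce a minimum gap between consecutive API requests."""

    def __init__(self, min_delay: float = 3.0):
        self.min_delay = min_delay
        self._last_request = 0.0

    async def wait_for_slot(self) -> None:
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_delay:
            # If it's been 2.1 seconds, sleep the remaining 0.9
            await asyncio.sleep(self.min_delay - elapsed)
        self._last_request = time.monotonic()

Every request awaits wait_for_slot() first, so the 3-second rule holds no matter how many papers we're fetching.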
Lesson 2: Scientific PDF Processing 📄💀
The reality: Scientific PDFs break every standard parser you've ever used.
Academic papers aren't just "text in PDFs", they're structured documents with mathematical notation, multi-column layouts, complex tables, and reference formatting that carries semantic meaning.
We tried PyPDF2, pdfplumber, and others. The results? Mathematical equations became gibberish, tables exploded into word clouds, and searching for "attention mechanisms" returned "ttention mchnsm" because characters randomly vanished.
Bad data quality doesn't just slow you down - it completely breaks your system.
Why We Chose Docling
After testing multiple PDF libraries, Docling stood out for production use:
✅ Consistent output → Same PDF, same parse, no surprises
✅ Scientific document understanding → Preserves structure, not just text
✅ Structured metadata → Separate section headers, tables, figures, references
✅ Production-ready → Easy Python package, clean tabular output
✅ Advanced features → OCR support, VLM integration for figures/charts
✅ Extendable → Works with other formats (HTML, DOCX), supports chunking
Future potential: Docling's VLM capabilities for figure understanding and chart recognition open up exciting possibilities for multimodal RAG systems.
Result? Our search can now find papers by table contents, figure descriptions, or specific equations. That's not just better extraction - that's intelligent document understanding.
Docling offers many more features, including OCR and VLM support.
Always evaluate tools with a long-term horizon: will this one adapt as your use case evolves?
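For a feel of what using it looks like, here's a minimal sketch of converting a single paper with Docling (based on the library's documented quickstart; exact API details may vary by version, and the file path is just an example):

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("papers/example_paper.pdf")  # example local path

# The parsed document keeps its structure: sections, tables, references
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])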
Production Airflow: Beyond Hello World
Week 1 gave us Airflow running. Week 2 gave us real production workflows.
The Production DAG Architecture
Our arxiv_paper_ingestion DAG isn't a toy - it's enterprise-grade automation:
# Daily at 6 AM UTC - when arXiv publishes new papers
schedule="0 6 * * *"
# Production error handling
retries=2
retry_delay=timedelta(minutes=30)
# Prevent concurrent runs
max_active_runs=1
The Four-Stage Pipeline
1. Setup Environment: Verify all services and initialize caching
2. Fetch Daily Papers: Get yesterday's papers with proper rate limiting
3. Process Failed PDFs: Retry any parsing failures with different strategies
4. Generate Daily Report: Comprehensive statistics and monitoring (the full chain is sketched below)
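Wired together, the DAG looks roughly like this - a simplified sketch rather than the repo's full DAG, with stand-in callables for the real task logic:

from datetime import timedelta
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

def setup_environment(): ...        # stand-ins for the real task functions
def fetch_daily_papers(): ...
def process_failed_pdfs(): ...
def generate_daily_report(): ...

with DAG(
    dag_id="arxiv_paper_ingestion",
    schedule="0 6 * * *",                          # daily at 6 AM UTC
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    max_active_runs=1,                             # prevent concurrent runs
    default_args={"retries": 2, "retry_delay": timedelta(minutes=30)},
):
    setup = PythonOperator(task_id="setup_environment", python_callable=setup_environment)
    fetch = PythonOperator(task_id="fetch_daily_papers", python_callable=fetch_daily_papers)
    retry_failed = PythonOperator(task_id="process_failed_pdfs", python_callable=process_failed_pdfs)
    report = PythonOperator(task_id="generate_daily_report", python_callable=generate_daily_report)

    setup >> fetch >> retry_failed >> report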
What makes this production-ready?
Error isolation: Individual paper failures don't break the entire pipeline
Retry logic: Intelligent handling of transient failures
Monitoring: Detailed logs and success metrics
Resource management: Controlled concurrency to avoid overwhelming systems
The Results: From Zero to Rich Content
After running our Week 2 pipeline, here's what our database looks like:
Rich, Structured Academic Content
Complete Metadata: Title, authors, abstract, categories, publication date
Full-Text Content: Parsed PDF content ready for search and retrieval
Structured Storage: Normalized database schema optimized for querying
Daily Updates: Fresh papers automatically ingested every morning
Production Performance Metrics
10-100+ papers per day: Configurable based on your needs
90%+ success rate: Robust error handling means high reliability
2-3 minute processing time: Docling is slow but very good (and high concurrency drives CPU usage up - see the throttling sketch below)
Cross-platform reliability: Works on macOS, Linux, and Windows via WSL
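The simplest way to keep that CPU usage in check is to cap how many PDFs get parsed at once. Here's a sketch using an asyncio semaphore - parse_pdf is a hypothetical coroutine standing in for the real parsing call:

import asyncio

MAX_CONCURRENT_PARSES = 4              # tune per machine; Docling is CPU-hungry
parse_slots = asyncio.Semaphore(MAX_CONCURRENT_PARSES)

async def parse_with_limit(pdf_path: str):
    async with parse_slots:            # at most 4 parses in flight at a time
        return await parse_pdf(pdf_path)   # hypothetical async parsing coroutine

async def parse_all(paths: list[str]):
    return await asyncio.gather(*(parse_with_limit(p) for p in paths))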
The Hard-Won Lessons: What Two Weeks of Reality Taught Us
Building this system taught us lessons you can't learn from tutorials. Here are the insights that will save you weeks of debugging:
1. Data Engineering Is 80% of AI Engineering
You might think data engineering is the boring stuff you get through before the real AI work. It isn't:
In our RAG system:
80% of the complexity: Getting reliable data into the system
15% of the complexity: Making that data searchable and retrievable
5% of the complexity: The actual LLM generation everyone obsesses over
The ratio might actually be even more lopsided. Good data engineering is force multiplication for everything else.
2. Error Handling Is Half Your Code
Here's something they don't teach in bootcamps: in production systems, error handling often comprises more lines of code than the happy path.
Our arXiv client has:
20 lines for the basic API call
80 lines for rate limiting, retries, timeout handling, and graceful degradation
That's not bloat - that's the difference between a script that works once and a system that works for months.
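To show where those extra lines go, here's a minimal sketch of exponential backoff around a flaky async call - illustrative only, not the actual client code:

import asyncio
import random

async def fetch_with_backoff(fetch, max_retries: int = 3):
    """Retry a flaky async call, doubling the wait each time."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch()
        except Exception:
            if attempt == max_retries:
                raise                                  # out of retries: surface the error
            delay = (2 ** attempt) + random.random()   # 1s, 2s, 4s... plus jitter
            await asyncio.sleep(delay)

Add rate limiting, timeouts, and logging around that, and the 20/80 split stops looking like bloat.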
3. Choose Domain-Specific Tools
We learned there's a hierarchy to tool selection:
Level 1: "This tool exists and does the thing we need"
Level 2: "This tool is reliable and well-maintained"
Level 3: "This tool is designed for our specific domain"
Most people stop at Level 1. We learned to prioritize Level 3, and it made all the difference. Docling isn't just a PDF parser - it's a genuinely good scientific document understanding engine. That specificity matters.
4. Configuration Management Is Critical
There's a moment in every production system where you realize you need configuration management. Not for deployment, but for survival.
Our system has configurable:
Rate limiting delays
Retry counts and backoff strategies
Concurrency limits
Timeout thresholds
PDF cache locations
Database connection pools
This isn't over-engineering - it's what lets you tune the system for different environments without changing code.
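As a sketch of what that can look like (using pydantic-settings here, which may not be what our repo actually uses - the field names are illustrative), every knob gets a sane default that environment variables can override without touching code:

from pydantic_settings import BaseSettings

class PipelineSettings(BaseSettings):
    arxiv_rate_limit_delay: float = 3.0     # seconds between arXiv requests
    max_retries: int = 2
    max_concurrent_parses: int = 4
    request_timeout_seconds: int = 30
    pdf_cache_dir: str = "./data/pdf_cache"
    db_pool_size: int = 5

settings = PipelineSettings()  # e.g. MAX_RETRIES=5 in the environment overrides the default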
5. Documentation Helps You Debug
We started writing documentation to help others use the system. We ended up using it to debug our own code.
When you're forced to explain how your rate limiting works, you discover the edge cases you missed. When you document your error handling strategy, you realize where the gaps are.
Good documentation isn't just communication - it's thinking made visible.
The Professional Difference: Engineering Discipline
Here's what separates weekend projects from production systems: engineering discipline.
What We Built vs. What Most People Build
Most RAG tutorials:
Single Python script that breaks if anything goes wrong
Manual data loading from CSVs
No error handling or monitoring
Works on developer's laptop, fails everywhere else
What we built:
Modular service architecture with clear interfaces
Automated data pipelines with comprehensive error handling
Production workflows with monitoring and retry logic
Cross-platform reliability that works in any environment
The Compound Effect
Every production decision we made in Week 2 will pay dividends in Week 3+. Clean architecture, robust error handling, and proper service boundaries make everything else easier.
Try It Yourself: Week 2 Implementation
Want to build your own production data pipeline? Here's how:
Quick Start
# Start the complete system
docker compose up --build -d
# Run the Week 2 interactive notebook
uv run jupyter notebook notebooks/week2/week2_arxiv_integration.ipynb
# Trigger the production pipeline
# Visit http://localhost:8080 and enable arxiv_paper_ingestion DAG
What You'll Master
Production API integration with rate limiting and error handling
Scientific document processing using specialized tools
Async Python patterns for concurrent processing
Airflow orchestration for reliable automation
Service architecture for maintainable systems
Resources and Next Steps
Week 2 Materials
Interactive Guide: Week 2 Notebook
Week 2 code: GitHub Code
Learn More
Week 1 Foundation: The Infrastructure That Powers RAG Systems
arXiv API: Official Documentation
Docling: Scientific PDF Processing
What's Coming in Week 3: Making Data Discoverable
Week 2 gave us rich content. Week 3 will make it searchable and discoverable:
The Week 3 Goal: Hybrid Search
OpenSearch Integration: Real BM25 search replacing placeholder endpoints
Document Indexing: Structured content indexing for fast retrieval
Search API: REST endpoints for paper discovery and filtering
Query Processing: Advanced search with category and date filtering
The RAG Foundation is Taking Shape
✅ Infrastructure (Week 1): Rock-solid service foundation
✅ Data Pipeline (Week 2): Fresh academic content daily
🔄 Search Engine (Week 3): Making content discoverable
🔄 RAG Pipeline (Week 4+): Complete retrieval-augmented generation
Our foundation is now rock solid. Every day, 10-100 research papers flow into our system - parsed, structured, ready for intelligent querying. That's the compound interest of good engineering.
This is how you build AI systems that matter: one bulletproof component at a time, with respect for complexity and humility about what you don't know yet.
Follow Along: This is Week 2 of 6 in our Zero to RAG series.
Every Thursday, we release new content, code, and notebooks.
The data is flowing. The foundation is set.
Next week, we make it searchable. 💪
Subscribe so you don't miss the journey from infrastructure to a production RAG system.