How We Built This
A technical deep-dive into the data pipeline behind Endless Pastabilities.
Endless Pastabilities is an encyclopedia of 123 pasta shapes, built by extracting and synthesizing knowledge from 8 of the world's best pasta books. Every description, shaping instruction, and historical note on this site comes directly from the source texts — nothing has been fabricated.
The project is an intentionally overengineered data science pipeline: we used NLP entity extraction, vector embeddings, semantic search, and multi-source knowledge assembly to turn 4.5 million characters of unstructured pasta literature into a structured, browsable encyclopedia.
Architecture Overview
+-------------------------------------------------------+
| 8 SOURCE BOOKS (3 PDFs, 5 EPUBs) |
| 4.5 million characters of pasta knowledge |
+---------------------------+---------------------------+
|
v
+-------------------------------------------------------+
| (1) TEXT EXTRACTION |
| PyMuPDF + pdfplumber (PDF) |
| ebooklib + BeautifulSoup (EPUB) |
| Output: structured JSON per book |
+---------------------------+---------------------------+
|
v
+-------------------------------------------------------+
| (2) CHUNKING & EMBEDDING |
| ~800 char chunks with 150 char overlap |
| sentence-transformers (all-MiniLM-L6-v2) |
| --> 4,602 vectors in ChromaDB |
+------------+-----------------------------+------------+
| |
v v
+-------------------------+ +-------------------------+
| (3) ENTITY EXTRACT | | (4) ENTITY EXTRACT |
| Pasta Shapes | | Dough Recipes |
| Pattern + heuristic | | Semantic search + |
| NLP extraction | | recipe parsing |
| --> 319 candidates | | --> 10 canonical |
| --> 123 curated | | dough types |
+------------+------------+ +------------+------------+
| |
v v
+-------------------------------------------------------+
| (5) KNOWLEDGE ASSEMBLY |
| Semantic search --> gather evidence |
| --> synthesize description, instructions |
| All content grounded in source material |
+---------------------------+---------------------------+
|
v
+-------------------------------------------------------+
| (6) STATIC SITE GENERATION |
| Astro --> 136 pages |
| Minimal design, serif typography |
+-------------------------------------------------------+
Phase 1: Text Extraction
The pipeline begins with raw book files — 3 PDFs and 5 EPUBs covering everything from Oretta Zanini De Vita's encyclopedic reference (the definitive scholarly work on Italian pasta shapes) to Vicky Bennison's Pasta Grannies with its regional home-cooking traditions.
PDF extraction uses a dual-engine approach: PyMuPDF handles primary text extraction (superior for general text layout), while pdfplumber runs in parallel to capture tabular data and structured layouts that PyMuPDF might flatten. The results are merged — PyMuPDF text with pdfplumber tables — giving us the best of both.
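The merge step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names (`table_to_markdown`, `merge_page`, `extract_pdf`) and the choice to render tables as pipe rows are assumptions, and pages are processed one after another for simplicity.

```python
def table_to_markdown(table):
    """Render one table (a list of rows, as pdfplumber returns them) as pipe rows."""
    rows = [[(cell or "").strip().replace("\n", " ") for cell in row] for row in table]
    if not rows:
        return ""
    lines = ["| " + " | ".join(rows[0]) + " |"]
    lines.append("|" + "|".join(" --- " for _ in rows[0]) + "|")
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


def merge_page(text, tables):
    """Append the rendered tables to the text produced by the primary engine."""
    parts = [text.strip()] + [table_to_markdown(t) for t in tables if t]
    return "\n\n".join(p for p in parts if p)


def extract_pdf(path):
    """Run both engines over one PDF and merge their output page by page."""
    import fitz  # PyMuPDF
    import pdfplumber

    pages = []
    with fitz.open(path) as doc, pdfplumber.open(path) as plumber:
        for number, (fitz_page, plumber_page) in enumerate(zip(doc, plumber.pages), start=1):
            text = fitz_page.get_text()              # primary text layer
            tables = plumber_page.extract_tables()   # list of tables, each a list of rows
            pages.append({"page": number, "text": merge_page(text, tables)})
    return pages
```

The third-party imports live inside `extract_pdf` so the two pure helpers can be tried without either library installed.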
EPUB extraction uses ebooklib to unpack the EPUB container and iterate over document items, then BeautifulSoup to parse each chapter's XHTML into clean text. The HTML cleaner preserves structural hierarchy — headings become markdown-style markers, lists are maintained, and non-content elements (scripts, styles, navigation) are stripped.
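To illustrate the cleaning rules, here is a small sketch that uses Python's standard-library `html.parser` in place of BeautifulSoup so that it runs without dependencies. The tag sets, the class name `ChapterCleaner`, and the one-block-per-line output are assumptions made for the example.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav"}                      # non-content elements
HEADINGS = {"h1": "#", "h2": "##", "h3": "###", "h4": "####"}
BLOCKS = {"p", "div", "li", "ul", "ol", "br"} | set(HEADINGS)


class ChapterCleaner(HTMLParser):
    """Collect visible text, marking headings and list items."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            if tag in BLOCKS:
                self.parts.append("\n")
            if tag in HEADINGS:
                self.parts.append(HEADINGS[tag] + " ")
            elif tag == "li":
                self.parts.append("- ")

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif self.skip_depth == 0 and tag in BLOCKS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def clean_chapter(xhtml):
    """Return the chapter as plain text, one block per line."""
    cleaner = ChapterCleaner()
    cleaner.feed(xhtml)
    cleaner.close()
    lines = [" ".join(line.split()) for line in "".join(cleaner.parts).splitlines()]
    return "\n".join(line for line in lines if line)
```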
Each book's output is a structured JSON file containing metadata (title, author, year, format) and an array of sections with their text content, page numbers (PDFs) or chapter IDs (EPUBs), and character counts.
Source File                 Extraction Engine          Output
-----------------------     ----------------------     --------------------------
Encyclopedia.pdf        --> PyMuPDF + pdfplumber   --> zanini_encyclopedia.json
Pasta By Hand.pdf       --> PyMuPDF + pdfplumber   --> louis_byhand.json
A-Z of Pasta.epub       --> ebooklib + BS4         --> roddy_atoz.json
Mastering Pasta.epub    --> ebooklib + BS4         --> vetri_mastering.json
...                         ...                        ...
Phase 2: Chunking & Embedding
Raw book text isn't useful for retrieval — a single book might be 800,000+ characters. We need to break it into semantically meaningful chunks small enough for embedding models to handle, but large enough to preserve context.
Chunking strategy: Text is split at paragraph boundaries with a target size of ~800 characters and 150-character overlap between consecutive chunks. The overlap ensures that information spanning a chunk boundary isn't lost. When a single paragraph exceeds the chunk size (common in encyclopedic entries), we fall back to sentence-boundary splitting.
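A minimal sketch of such a splitter is shown below. The function names and the greedy packing are assumptions. Because the overlap tail is prepended to the next chunk, a chunk can run somewhat over the 800-character target, which matches the "~800" wording but may differ from the real script.

```python
import re


def split_paragraph(para, max_chars):
    """Fallback: regroup an oversized paragraph into sentence runs under max_chars."""
    pieces, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", para):
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text, max_chars=800, overlap=150):
    """Pack paragraphs into chunks, carrying the tail of each chunk into the next."""
    units = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if para:
            units.extend([para] if len(para) <= max_chars else split_paragraph(para, max_chars))

    chunks, current = [], ""
    for unit in units:
        candidate = f"{current}\n\n{unit}" if current else unit
        if current and len(candidate) > max_chars:
            chunks.append(current)
            # The overlap is a raw character tail, so it may start mid-word.
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail}\n\n{unit}" if tail else unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```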
Embedding: Each chunk is embedded using sentence-transformers with the all-MiniLM-L6-v2 model, a compact (22M parameter) model that produces 384-dimensional vectors optimized for semantic similarity. Despite its small size, it returned relevant passages for our pasta-specific test queries.
Storage: Chunks and their embeddings are stored in ChromaDB, a lightweight, file-based vector database. ChromaDB runs the embedding model through the ONNX runtime and provides efficient approximate nearest-neighbor search. The entire database is ~50 MB on disk.
We validated the index with test queries:
Query: "orecchiette from Puglia"
----------------------------------------------------------
[0.764] An A-Z of Pasta / Section 12
"My friend never wants to go back to her home
town in Puglia, but when I went..."
[0.856] Pasta / Orecchiette with Duck
"Orecchiette with Duck..."
[0.880] An A-Z of Pasta / Section 12
"Orecchiette -- Impara l'arte e mettila da
parte, learn the art, and store it..."
Lower distance = higher relevance. The search correctly finds orecchiette content across multiple books, ranking the most relevant passages first.
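Storing chunks and running a test query like the one above might look roughly like this. It is a sketch under assumptions: the collection name, the database path, the id scheme, and the metadata keys are hypothetical, and it relies on ChromaDB's default embedding function rather than an explicitly configured one.

```python
def build_records(book_id, chunks):
    """Give every chunk a stable id and minimal metadata."""
    return {
        "ids": [f"{book_id}-{i:05d}" for i in range(len(chunks))],
        "documents": list(chunks),
        "metadatas": [{"book": book_id, "chunk_index": i} for i in range(len(chunks))],
    }


def store_chunks(records, db_path="chroma_db", collection_name="pasta_chunks"):
    """Persist the chunks; ChromaDB embeds the documents when they are added."""
    import chromadb

    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(name=collection_name)
    collection.add(**records)
    return collection


def search(collection, query, n_results=3):
    """Return (distance, document, metadata) tuples for one query."""
    result = collection.query(query_texts=[query], n_results=n_results)
    return list(zip(result["distances"][0], result["documents"][0], result["metadatas"][0]))
```

Calling `search(collection, "orecchiette from Puglia")` would then return tuples ordered from the smallest distance upward.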
Phase 3: Entity Extraction — Pasta Shapes
The goal here is to answer: what pasta shapes exist in our corpus? We used a hybrid approach combining source-specific extractors with semantic enhancement.
Source-specific extractors were written for each book's structure. Zanini De Vita's Encyclopedia has a consistent entry format (shape name → also known as → description), so we pattern-matched headings. Roddy's A-Z is organized alphabetically with markdown-style headings. The Coastal Kitchen's encyclopedia has recipe-titled sections. Jenn Louis's Pasta By Hand has chapter-level shape names. Each extractor was tuned to its source.
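As an illustration of one such extractor, the sketch below pulls candidates from markdown-style headings of the kind the Phase 1 cleaner emits. The heading level, the regular expression, and the record shape are assumptions; each real extractor was tuned to its book.

```python
import re

# A markdown-style heading: one to six '#' characters, then the title.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_heading_candidates(text, book_id, level=2):
    """Return one candidate record per heading at the given level."""
    return [
        {"name": match.group(2), "source": book_id}
        for match in HEADING.finditer(text)
        if len(match.group(1)) == level
    ]
```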
Deduplication normalized names (lowercasing, accent handling, parenthetical removal) and merged entries from different sources. "Orecchiette" from five different books becomes one canonical entry with five source references.
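The normalization and merge can be sketched as follows. This assumes that "accent handling" means stripping diacritics; the function names and the shape of the merged entry are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_name(name):
    """Drop parentheticals, strip accents, lowercase, and collapse whitespace."""
    name = re.sub(r"\([^)]*\)", " ", name)
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


def merge_candidates(candidates):
    """Group (name, book) pairs by normalized name, keeping every source."""
    merged = defaultdict(lambda: {"names": set(), "sources": set()})
    for name, book in candidates:
        entry = merged[normalize_name(name)]
        entry["names"].add(name)
        entry["sources"].add(book)
    return dict(merged)
```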
Semantic enhancement used the ChromaDB vector store to find additional
mentions of each shape across all books. For each candidate shape, we queried
"{shape name} pasta shape" and tracked which books returned relevant results
within a distance threshold of 1.2.
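In code, this cross-referencing step might look like the sketch below. It assumes each chunk's metadata carries a `book` key and that "within" the threshold means a distance of at most 1.2; both function names and `n_results=20` are hypothetical.

```python
def books_mentioning(result, max_distance=1.2):
    """Collect the books whose chunks fall within the distance threshold.

    `result` has the shape returned by a ChromaDB query for a single query text.
    """
    books = set()
    for distance, meta in zip(result["distances"][0], result["metadatas"][0]):
        if distance <= max_distance:
            books.add(meta["book"])
    return books


def enhance(shape_name, collection, n_results=20, max_distance=1.2):
    """Query the vector store for one candidate shape and list the matching books."""
    result = collection.query(
        query_texts=[f"{shape_name} pasta shape"],
        n_results=n_results,
    )
    return books_mentioning(result, max_distance)
```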
Curation was the critical final step. The raw 319 entities included noise — recipe names ("Lobster Fettuccine"), ingredient measurements, region names, and book metadata. We built a curated registry of 123 real pasta shapes with proper category classifications (hand-shaped, hand-cut, filled, extruded, sheet, dumpling, small).
Extraction Funnel
-----------------------------------------
983 candidates (raw extraction from all sources)
 |
 +-- Deduplication & normalization
 v
382 unique names
 |
 +-- Semantic enhancement (cross-reference with vector DB)
 |
 +-- Noise filtering (recipes, ingredients, metadata)
 |
 +-- Category classification
 v
123 curated pasta shapes with categories
Phase 4: Entity Extraction — Dough Recipes
Dough extraction used a semantic-search-first approach. We queried the vector store
with 17 targeted queries like "basic egg pasta dough recipe flour eggs",
"semolina water dough recipe no eggs", and "squid ink black pasta dough",
collecting 213 relevant chunks.
Each chunk was classified into one of 10 canonical dough types using keyword scoring — a text mentioning "semolina" and "water" and "no egg" scores highest for the semolina-water category. The 10 types represent the fundamental dough traditions of Italian pasta making:
| Standard Egg Dough | The northern Italian foundation — flour, eggs, olive oil |
| Rich Egg Yolk Dough | All yolks for golden, silky pasta like tajarin |
| Semolina & Water | The eggless southern tradition — firm, rustic texture |
| Flour & Water | Simplest dough — for pici, umbricelli, and peasant pastas |
| Buckwheat | Alpine tradition — for pizzoccheri |
| Saffron | Sardinian tradition — for malloreddus and lorighittas |
| Spinach | Vibrant green, for lasagne verdi and tagliatelle |
| Squid Ink | Dramatic black — Venetian and coastal tradition |
| Chestnut Flour | Sweet, delicate — Ligurian mountain tradition |
| Whole Wheat | Nutty, hearty — rustic variation |
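The keyword scoring described before the table can be sketched as below. The keyword lists are invented for illustration and cover only three of the ten types; the real lists are not given here. Matching is by plain substring, so a term like "egg" also matches "eggplant".

```python
# Hypothetical keyword lists; the project's actual terms are not published here.
DOUGH_KEYWORDS = {
    "standard-egg": ["egg", "flour", "olive oil"],
    "semolina-water": ["semolina", "water", "no egg"],
    "squid-ink": ["squid ink", "black"],
}


def classify_chunk(text, keywords=DOUGH_KEYWORDS):
    """Score each dough type by how many of its keywords appear in the text.

    Returns (best_type_or_None, scores). Ties go to the type listed first.
    """
    lowered = text.lower()
    scores = {
        dough: sum(term in lowered for term in terms)
        for dough, terms in keywords.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores
```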
Each dough type was then linked to the pasta shapes that traditionally use it, creating a bidirectional relationship: shapes reference their dough, doughs list their shapes.
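One simple way to keep both directions in sync is to store the shape-to-dough mapping and derive the reverse index from it. The sketch below assumes each shape has a single dough; the identifiers are illustrative.

```python
def link_doughs(shape_to_dough):
    """Invert a shape -> dough mapping into dough -> sorted list of shapes."""
    dough_to_shapes = {}
    for shape, dough in shape_to_dough.items():
        dough_to_shapes.setdefault(dough, []).append(shape)
    return {dough: sorted(shapes) for dough, shapes in dough_to_shapes.items()}


shape_to_dough = {
    "malloreddus": "saffron",
    "lorighittas": "saffron",
    "pici": "flour-water",
    "pizzoccheri": "buckwheat",
}
dough_to_shapes = link_doughs(shape_to_dough)
```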
Phase 5: Knowledge Assembly
This is where everything comes together. For each of the 123 curated pasta shapes, we ran a multi-query semantic search against the vector store, gathering every relevant passage from every book.
Query strategy: Four queries per shape, each targeting different aspects:
- "{shape name} pasta": general information
- "{shape name} how to make shape dough": shaping technique
- "{shape name} history origin region tradition": cultural context
- "{shape name} ingredients recipe": recipe details
With a distance threshold of 1.1, this gathered an average of 43 relevant chunks per shape — cross-referencing information from multiple books to build the richest possible picture.
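The gathering step might be sketched like this. The `search` callable is a stand-in for a vector-store query that yields `(chunk_id, distance, text)` tuples. Deduplicating by chunk id and keeping the smallest distance is an assumption about how overlapping results were combined.

```python
QUERY_TEMPLATES = [
    "{name} pasta",
    "{name} how to make shape dough",
    "{name} history origin region tradition",
    "{name} ingredients recipe",
]


def gather_evidence(name, search, max_distance=1.1):
    """Run all four queries and keep each chunk once, at its best distance.

    Returns (distance, chunk_id, text) tuples sorted from most to least relevant.
    """
    best = {}
    for template in QUERY_TEMPLATES:
        for chunk_id, distance, text in search(template.format(name=name)):
            if distance > max_distance:
                continue  # outside the relevance threshold
            if chunk_id not in best or distance < best[chunk_id][0]:
                best[chunk_id] = (distance, text)
    return sorted((d, cid, t) for cid, (d, t) in best.items())
```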
Content synthesis then processed these gathered passages into three structured fields per shape:
- Description — What the shape looks like and its character (1-2 sentences)
- Instructions — Step-by-step shaping technique, synthesized from the most detailed source accounts (typically 5-8 steps)
- History — Origins, etymology, regional traditions, and cultural significance, woven from details across multiple sources into a coherent narrative (2-4 sentences)
The result: all 123 shapes have a description, instructions, and history (100% coverage). Every fact is grounded in the source material.
Example: Orecchiette
-----------------------------------------------------
Source evidence from 7 books:
Encyclopedia of Pasta --> Entry #28, etymology, regional spread
An A-Z of Pasta --> Angevin Provence connection, Arco Basso
Pasta By Hand --> Detailed shaping technique, semolina dough
Mastering Pasta --> Finger technique, texture notes
Pasta Grannies --> Street-side making tradition in Bari
Pasta (Theo Randall) --> Recipe with duck sauce
Coastal Encyclopedia --> Additional recipe variations
Synthesized output:
Description --> "Small, ear-shaped pasta with a thin center
and thicker rim..."
Instructions --> 7 steps from dough prep to final shape
History --> Angevin origins, Giambattista del Tufo (1500s),
women of Arco Basso in Old Bari
Phase 6: Static Site Generation
The assembled knowledge base (a single JSON file) is consumed by Astro, a static site generator that produces plain HTML with zero client-side JavaScript by default. The entire site builds in under a second.
Design philosophy: Minimal, elevated, classic, refined. The site uses
a warm off-white background (#fafaf8), a restrained palette of ink
and muted tones, and serif typography (Iowan Old Style / Palatino) for body text
with a clean sans-serif (Gill Sans) for labels and navigation. Generous white space
and thin horizontal rules create visual breathing room.
Page types:
- Homepage — All 123 shapes organized by category with a grid layout
- Shape pages (×123) — Name, category tag, region, dough link, step-by-step instructions with numbered counters, and history with a drop-cap first letter
- Dough pages (×10) — Full recipes with ingredient lists, method steps, tips, and links to shapes that use the dough
- Dough index — Overview of all 10 dough types
- This page — Technical deep-dive
By the Numbers
| Source Books | 8 (3 PDFs, 5 EPUBs) |
| Raw Text Extracted | 4,498,875 characters |
| Semantic Chunks | 4,602 |
| Vector Dimensions | 384 (all-MiniLM-L6-v2) |
| Candidate Entities | 983 raw candidates before deduplication |
| Curated Shapes | 123 with full content |
| Dough Recipes | 10 canonical types with full recipes |
| Content Coverage | 100% (description + instructions + history) |
| Generated Pages | 136 |
| Build Time | 512ms |
| External APIs Used | 0 — entirely local pipeline |
Tech Stack
Data Pipeline (Python)
| PDF Extraction | PyMuPDF (primary text) + pdfplumber (tables) |
| EPUB Extraction | ebooklib + BeautifulSoup4 + lxml |
| Text Chunking | Custom paragraph-boundary splitter with overlap |
| Embeddings | sentence-transformers / all-MiniLM-L6-v2 (via ONNX) |
| Vector Database | ChromaDB (persistent, file-based) |
| Entity Extraction | Custom pattern matchers + heuristic classifiers |
| Knowledge Assembly | Multi-query semantic retrieval + content synthesis |
Website
| Framework | Astro 6 (static site generation) |
| Styling | Scoped CSS with CSS custom properties |
| Typography | System serif stack (Iowan Old Style / Palatino / Georgia) |
| JavaScript | Zero (static HTML only) |
| Data Format | Single JSON file consumed at build time |
Pipeline Scripts
| 01_extract.py | Text extraction from PDFs and EPUBs |
| 02_chunk_embed.py | Chunking, embedding, and ChromaDB storage |
| 03_extract_shapes.py | Pasta shape entity extraction and dedup |
| 04_extract_doughs.py | Dough recipe extraction and classification |
| 05_assemble.py | Per-shape knowledge assembly via semantic search |
| 06_gather_sources.py | Source text gathering for content generation |
| 07_merge_generated.py | Final database assembly and site deployment |
Sources
All content on this site is derived from these 8 books. We are deeply grateful to these authors for their scholarship and passion for the craft of pasta.
- Encyclopedia of Pasta
Oretta Zanini De Vita, translated by Maureen B. Fant (2009)
The definitive scholarly reference: 300+ regional pasta shapes documented with historical sources.
- An A-Z of Pasta
Rachel Roddy (2021)
Alphabetical journey through pasta shapes with personal narrative and Roman kitchen wisdom.
- The Encyclopedia of Pasta: Over 350 Recipes
The Coastal Kitchen / Cider Mill Press (2023)
Comprehensive recipe collection with dough formulas and detailed technique instructions.
- Mastering Pasta
Marc Vetri & David Joachim (2015)
A chef's deep-dive into flour science, dough technique, and the craft of handmade pasta.
- Pasta Grannies
Vicky Bennison (2019)
Regional traditions from Italian grandmothers, the living memory of pasta making.
- Pasta By Hand
Jenn Louis (2015)
Focused entirely on hand-shaped regional pasta with precise technique instructions.
- Pasta and Noodles: A Global History
Kantha Shelke (2016)
Historical and cultural context for pasta's evolution across civilizations.
- Pasta
Theo Randall (2013)
A London chef's love letter to Italian pasta with recipes organized by shape.