Endless Pastabilities is an encyclopedia of 123 pasta shapes, built by extracting and synthesizing knowledge from 8 of the world's best pasta books. Every description, shaping instruction, and historical note on this site comes directly from the source texts — nothing has been fabricated.

The project is an intentionally overengineered data science pipeline: we used NLP entity extraction, vector embeddings, semantic search, and multi-source knowledge assembly to turn 4.5 million characters of unstructured pasta literature into a structured, browsable encyclopedia.


Architecture Overview

+-------------------------------------------------------+
|     8 SOURCE BOOKS  (3 PDFs, 5 EPUBs)                 |
| 4.5 million characters of pasta knowledge             |
+---------------------------+---------------------------+
                            |
                            v
+-------------------------------------------------------+
| (1) TEXT EXTRACTION                                   |
| PyMuPDF + pdfplumber (PDF)                            |
| ebooklib + BeautifulSoup (EPUB)                       |
| Output: structured JSON per book                      |
+---------------------------+---------------------------+
                            |
                            v
+-------------------------------------------------------+
| (2) CHUNKING & EMBEDDING                              |
| ~800 char chunks with 150 char overlap                |
| sentence-transformers (all-MiniLM-L6-v2)              |
| --> 4,602 vectors in ChromaDB                         |
+------------+-----------------------------+------------+
             |                             |
             v                             v
+-------------------------+   +-------------------------+
| (3) ENTITY EXTRACT      |   | (4) ENTITY EXTRACT      |
| Pasta Shapes            |   | Dough Recipes           |
| Pattern + heuristic     |   | Semantic search +       |
| NLP extraction          |   | recipe parsing          |
| --> 319 candidates      |   | --> 10 canonical        |
| --> 123 curated         |   |     dough types         |
+------------+------------+   +------------+------------+
             |                             |
             v                             v
+-------------------------------------------------------+
| (5) KNOWLEDGE ASSEMBLY                                |
| Semantic search --> gather evidence                   |
| --> synthesize description, instructions              |
| All content grounded in source material               |
+---------------------------+---------------------------+
                            |
                            v
+-------------------------------------------------------+
| (6) STATIC SITE GENERATION                            |
| Astro --> 136 pages                                   |
| Minimal design, serif typography                      |
+-------------------------------------------------------+

Phase 1: Text Extraction

8 Books processed
4.5M Characters extracted
2 Extraction engines

The pipeline begins with raw book files — 3 PDFs and 5 EPUBs covering everything from Oretta Zanini De Vita's encyclopedic reference (the definitive scholarly work on Italian pasta shapes) to Vicky Bennison's Pasta Grannies with its regional home-cooking traditions.

PDF extraction uses a dual-engine approach: PyMuPDF handles primary text extraction (superior for general text layout), while pdfplumber runs in parallel to capture tabular data and structured layouts that PyMuPDF might flatten. The results are merged — PyMuPDF text with pdfplumber tables — giving us the best of both.
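In outline, the dual-engine merge looks something like the sketch below. The function names (`merge_page`, `extract_pdf`) are illustrative, not the pipeline's actual code; it assumes PyMuPDF's `Page.get_text()` and pdfplumber's `Page.extract_tables()`.

```python
def merge_page(text: str, tables: list) -> str:
    """Append pdfplumber table rows (tab-separated) to the PyMuPDF page text."""
    rows = ["\t".join(cell or "" for cell in row)
            for table in tables for row in table]
    return text + "\n" + "\n".join(rows) if rows else text

def extract_pdf(path: str) -> list[str]:
    # Third-party imports are deferred so merge_page stays importable on its own.
    import fitz        # PyMuPDF
    import pdfplumber
    pages = []
    with fitz.open(path) as doc, pdfplumber.open(path) as plumb:
        for mu_page, pl_page in zip(doc, plumb.pages):
            pages.append(merge_page(mu_page.get_text(),
                                    pl_page.extract_tables() or []))
    return pages
```

Running both engines over the same page and merging afterward keeps the two extractors independent: a table pdfplumber recovers is appended even when PyMuPDF flattened it.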

EPUB extraction uses ebooklib to unpack the EPUB container and iterate over document items, then BeautifulSoup to parse each chapter's XHTML into clean text. The HTML cleaner preserves structural hierarchy — headings become markdown-style markers, lists are maintained, and non-content elements (scripts, styles, navigation) are stripped.
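A minimal sketch of that cleaning pass, with `clean_chapter` and `extract_epub` as hypothetical names; it assumes ebooklib's `read_epub`/`get_items_of_type` and BeautifulSoup's standard parsing API.

```python
from bs4 import BeautifulSoup

def clean_chapter(xhtml: str) -> str:
    """Parse one chapter's XHTML: strip non-content tags, keep structure."""
    soup = BeautifulSoup(xhtml, "html.parser")
    for tag in soup(["script", "style", "nav"]):     # non-content elements
        tag.decompose()
    lines = []
    for el in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        text = el.get_text(" ", strip=True)
        if not text:
            continue
        if el.name in ("h1", "h2", "h3"):
            text = "#" * int(el.name[1]) + " " + text   # markdown-style heading marker
        elif el.name == "li":
            text = "- " + text                           # keep list structure
        lines.append(text)
    return "\n\n".join(lines)

def extract_epub(path: str) -> list[str]:
    # Deferred import: only this function needs ebooklib installed.
    import ebooklib
    from ebooklib import epub
    book = epub.read_epub(path)
    return [clean_chapter(item.get_content().decode("utf-8", errors="ignore"))
            for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT)]
```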

Each book's output is a structured JSON file containing metadata (title, author, year, format) and an array of sections with their text content, page numbers (PDFs) or chapter IDs (EPUBs), and character counts.

Source File                Extraction Engine        Output
-----------------------    ----------------------   --------------------------
Encyclopedia.pdf      -->  PyMuPDF + pdfplumber  --> zanini_encyclopedia.json
Pasta By Hand.pdf     -->  PyMuPDF + pdfplumber  --> louis_byhand.json
A-Z of Pasta.epub     -->  ebooklib + BS4        --> roddy_atoz.json
Mastering Pasta.epub  -->  ebooklib + BS4        --> vetri_mastering.json
...                        ...                       ...

Phase 2: Chunking & Embedding

4,602 Semantic chunks
384 Embedding dimensions
~800 Chars per chunk

Raw book text isn't useful for retrieval — a single book might be 800,000+ characters. We need to break it into semantically meaningful chunks small enough for embedding models to handle, but large enough to preserve context.

Chunking strategy: Text is split at paragraph boundaries with a target size of ~800 characters and 150-character overlap between consecutive chunks. The overlap ensures that information spanning a chunk boundary isn't lost. When a single paragraph exceeds the chunk size (common in encyclopedic entries), we fall back to sentence-boundary splitting.
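The paragraph-packing logic can be sketched as below. This is a simplified version under the stated parameters; the sentence-boundary fallback for oversized paragraphs is omitted, and `chunk_text` is an illustrative name.

```python
def chunk_text(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    """Pack paragraphs greedily up to `size` chars; start each new chunk
    with the last `overlap` chars of the previous one."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > size:
            chunks.append(current)
            current = current[-overlap:]      # overlap carried across the boundary
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

The overlap tail means a sentence that straddles a boundary appears in both neighboring chunks, so it remains findable from either side.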

Embedding: Each chunk is embedded using sentence-transformers with the all-MiniLM-L6-v2 model — a compact (22M parameter) model that produces 384-dimensional vectors optimized for semantic similarity. Despite its small size, it performs remarkably well for this domain.

Storage: Chunks and their embeddings are stored in ChromaDB, a lightweight, file-based vector database. ChromaDB handles the ONNX runtime for the embedding model and provides efficient approximate nearest-neighbor search. The entire database is ~50MB on disk.
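Indexing a book's chunks is then a short ChromaDB call, sketched here with hypothetical helpers (`make_ids`, `index_chunks`); it relies on ChromaDB's bundled default embedder, which is all-MiniLM-L6-v2 via ONNX.

```python
def make_ids(book_slug: str, count: int) -> list[str]:
    """Stable, sortable chunk IDs, e.g. 'roddy_atoz-0007'."""
    return [f"{book_slug}-{i:04d}" for i in range(count)]

def index_chunks(book_slug: str, chunks: list[str], db_path: str = "chroma_db"):
    # Deferred import: only indexing needs chromadb installed.
    import chromadb
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection("pasta_chunks")
    collection.add(
        ids=make_ids(book_slug, len(chunks)),
        documents=chunks,          # ChromaDB embeds these with its default model
        metadatas=[{"book": book_slug, "chunk": i} for i in range(len(chunks))],
    )
    return collection
```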

We validated the index with test queries:

Query: "orecchiette from Puglia"
----------------------------------------------------------
[0.764]  An A-Z of Pasta / Section 12
         "My friend never wants to go back to her home
          town in Puglia, but when I went..."

[0.856]  Pasta / Orecchiette with Duck
         "Orecchiette with Duck..."

[0.880]  An A-Z of Pasta / Section 12
         "Orecchiette -- Impara l'arte e mettila da
          parte, learn the art, and store it..."

Lower distance = higher relevance. The search correctly finds orecchiette content across multiple books, ranking the most relevant passages first.


Phase 3: Entity Extraction — Pasta Shapes

983 Candidate entities
319 After dedup
123 Curated shapes

The goal here is to answer: what pasta shapes exist in our corpus? We used a hybrid approach combining source-specific extractors with semantic enhancement.

Source-specific extractors were written for each book's structure. Zanini De Vita's Encyclopedia has a consistent entry format (shape name → also known as → description), so we pattern-matched headings. Roddy's A-Z is organized alphabetically with markdown-style headings. The Coastal Kitchen's encyclopedia has recipe-titled sections. Jenn Louis's Pasta By Hand has chapter-level shape names. Each extractor was tuned to its source.

Deduplication normalized names (lowercasing, accent handling, parenthetical removal) and merged entries from different sources. "Orecchiette" from five different books becomes one canonical entry with five source references.
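The normalization and merge steps can be sketched with stdlib tools; `normalize_name` and `merge_candidates` are illustrative names, not the pipeline's actual functions.

```python
import re
import unicodedata

def normalize_name(name: str) -> str:
    """Lowercase, strip accents, drop parentheticals, collapse whitespace."""
    name = re.sub(r"\([^)]*\)", "", name)          # drop "(little ears)" etc.
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", name).strip().lower()

def merge_candidates(entries: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Fold (raw name, source book) pairs into one canonical entry per shape."""
    merged: dict[str, set[str]] = {}
    for raw, book in entries:
        merged.setdefault(normalize_name(raw), set()).add(book)
    return merged
```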

Semantic enhancement used the ChromaDB vector store to find additional mentions of each shape across all books. For each candidate shape, we queried "{shape name} pasta shape" and tracked which books returned relevant results within a distance threshold of 1.2.
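Filtering one shape's query hits down to the books that clear the cutoff is a small pure function. The sketch assumes the result shape returned by `collection.query()` (parallel lists-of-lists, one inner list per query text); `books_within_threshold` is a hypothetical name.

```python
def books_within_threshold(result: dict, threshold: float = 1.2) -> set[str]:
    """Books with at least one hit under the distance cutoff for one query.

    `result` mirrors collection.query() output: result["metadatas"][0] and
    result["distances"][0] are parallel lists for the first query text."""
    hits = zip(result["metadatas"][0], result["distances"][0])
    return {meta["book"] for meta, dist in hits if dist <= threshold}
```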

Curation was the critical final step. The raw 319 entities included noise — recipe names ("Lobster Fettuccine"), ingredient measurements, region names, and book metadata. We built a curated registry of 123 real pasta shapes with proper category classifications (hand-shaped, hand-cut, filled, extruded, sheet, dumpling, small).

Extraction Funnel
-----------------------------------------
983 candidates  (raw extraction from all sources)
 |
 +-- Deduplication & normalization
 v
319 unique names
 |
 +-- Semantic enhancement (cross-reference with vector DB)
 |
 +-- Noise filtering (recipes, ingredients, metadata)
 |
 +-- Category classification
 v
123 curated pasta shapes with categories

Phase 4: Entity Extraction — Dough Recipes

213 Dough-related chunks
10 Canonical dough types
8 Source books

Dough extraction used a semantic-search-first approach. We queried the vector store with 17 targeted queries like "basic egg pasta dough recipe flour eggs", "semolina water dough recipe no eggs", and "squid ink black pasta dough", collecting 213 relevant chunks.

Each chunk was classified into one of 10 canonical dough types using keyword scoring — a text mentioning "semolina" and "water" and "no egg" scores highest for the semolina-water category. The 10 types represent the fundamental dough traditions of Italian pasta making:

Standard Egg Dough     The northern Italian foundation — flour, eggs, olive oil
Rich Egg Yolk Dough    All yolks for golden, silky pasta like tajarin
Semolina & Water       The eggless southern tradition — firm, rustic texture
Flour & Water          Simplest dough — for pici, umbricelli, and peasant pastas
Buckwheat              Alpine tradition — for pizzoccheri
Saffron                Sardinian tradition — for malloreddus and lorighittas
Spinach                Vibrant green — for lasagne verde and tagliatelle
Squid Ink              Dramatic black — Venetian and coastal tradition
Chestnut Flour         Sweet, delicate — Ligurian mountain tradition
Whole Wheat            Nutty, hearty — rustic variation
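The keyword scoring can be sketched as below. The keyword lists here are illustrative examples for three of the ten categories, not the pipeline's real tables, and `classify_dough` is a hypothetical name.

```python
# Illustrative keyword lists for three of the ten categories (not the real tables).
DOUGH_KEYWORDS = {
    "semolina_water": ["semolina", "water", "no egg"],
    "standard_egg":   ["flour", "eggs", "olive oil"],
    "squid_ink":      ["squid ink", "black", "cuttlefish"],
}

def classify_dough(chunk: str) -> str:
    """Score a chunk against each category's keywords; highest total count wins."""
    text = chunk.lower()
    scores = {cat: sum(text.count(kw) for kw in kws)
              for cat, kws in DOUGH_KEYWORDS.items()}
    return max(scores, key=scores.get)
```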

Each dough type was then linked to the pasta shapes that traditionally use it, creating a bidirectional relationship: shapes reference their dough, doughs list their shapes.
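Building the reverse side of that relationship is a simple index inversion, sketched here with hypothetical names and example dough-type keys.

```python
def invert_links(shape_to_dough: dict[str, str]) -> dict[str, list[str]]:
    """Build the reverse index: each dough type lists the shapes that use it."""
    dough_to_shapes: dict[str, list[str]] = {}
    for shape, dough in shape_to_dough.items():
        dough_to_shapes.setdefault(dough, []).append(shape)
    return dough_to_shapes
```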


Phase 5: Knowledge Assembly

5,272 Source chunks gathered
~43 Chunks per shape (avg)
100% Content coverage

This is where everything comes together. For each of the 123 curated pasta shapes, we ran a multi-query semantic search against the vector store, gathering every relevant passage from every book.

Query strategy: Four queries per shape, each targeting different aspects:

  1. "{shape name} pasta" — general information
  2. "{shape name} how to make shape dough" — shaping technique
  3. "{shape name} history origin region tradition" — cultural context
  4. "{shape name} ingredients recipe" — recipe details

With a distance threshold of 1.1, this gathered an average of 43 relevant chunks per shape — cross-referencing information from multiple books to build the richest possible picture.
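The gather step can be sketched as below. `gather_evidence` is an illustrative name, and `search` stands in for the actual vector-store call, here assumed to yield (chunk id, text, distance) tuples.

```python
SHAPE_QUERIES = [
    "{name} pasta",
    "{name} how to make shape dough",
    "{name} history origin region tradition",
    "{name} ingredients recipe",
]

def gather_evidence(name: str, search, threshold: float = 1.1) -> list[str]:
    """Run the four per-shape queries; keep unique chunks under the cutoff."""
    seen, evidence = set(), []
    for template in SHAPE_QUERIES:
        for chunk_id, text, distance in search(template.format(name=name)):
            if distance <= threshold and chunk_id not in seen:
                seen.add(chunk_id)
                evidence.append(text)
    return evidence
```

Deduplicating on chunk IDs matters because the four queries overlap heavily: a good shaping passage usually ranks well for the general query too.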

Content synthesis then processed these gathered passages into three structured fields per shape:

  • Description — What the shape looks like and its character (1-2 sentences)
  • Instructions — Step-by-step shaping technique, synthesized from the most detailed source accounts (typically 5-8 steps)
  • History — Origins, etymology, regional traditions, and cultural significance, woven from details across multiple sources into a coherent narrative (2-4 sentences)

The result: 123 shapes, each with 100% coverage across description, instructions, and history. Every fact is grounded in the source material.

Example: Orecchiette
-----------------------------------------------------

Source evidence from 7 books:
  Encyclopedia of Pasta  -->  Entry #28, etymology, regional spread
  An A-Z of Pasta        -->  Angevin Provence connection, Arco Basso
  Pasta By Hand          -->  Detailed shaping technique, semolina dough
  Mastering Pasta        -->  Finger technique, texture notes
  Pasta Grannies         -->  Street-side making tradition in Bari
  Pasta (Theo Randall)   -->  Recipe with duck sauce
  Coastal Encyclopedia   -->  Additional recipe variations

Synthesized output:
  Description  -->  "Small, ear-shaped pasta with a thin center
                     and thicker rim..."
  Instructions -->  7 steps from dough prep to final shape
  History      -->  Angevin origins, Giambattista del Tufo (1500s),
                     women of Arco Basso in Old Bari

Phase 6: Static Site Generation

136 Pages generated
512ms Build time
0 JavaScript frameworks

The assembled knowledge base (a single JSON file) is consumed by Astro, a static site generator that produces plain HTML with zero client-side JavaScript by default. The entire site builds in under a second.

Design philosophy: Minimal, elevated, classic, refined. The site uses a warm off-white background (#fafaf8), a restrained palette of ink and muted tones, and serif typography (Iowan Old Style / Palatino) for body text with a clean sans-serif (Gill Sans) for labels and navigation. Generous white space and thin horizontal rules create visual breathing room.

Page types:

  • Homepage — All 123 shapes organized by category with a grid layout
  • Shape pages (×123) — Name, category tag, region, dough link, step-by-step instructions with numbered counters, and history with a drop-cap first letter
  • Dough pages (×10) — Full recipes with ingredient lists, method steps, tips, and links to shapes that use the dough
  • Dough index — Overview of all 10 dough types
  • This page — Technical deep-dive

By the Numbers

Source Books            8 (3 PDFs, 5 EPUBs)
Raw Text Extracted      4,498,875 characters
Semantic Chunks         4,602
Vector Dimensions       384 (all-MiniLM-L6-v2)
Candidate Entities      983 pasta shapes identified
Curated Shapes          123 with full content
Dough Recipes           10 canonical types with full recipes
Content Coverage        100% (description + instructions + history)
Generated Pages         136
Build Time              512ms
External APIs Used      0 — entirely local pipeline

Tech Stack

Data Pipeline (Python)

PDF Extraction        PyMuPDF (primary text) + pdfplumber (tables)
EPUB Extraction       ebooklib + BeautifulSoup4 + lxml
Text Chunking         Custom paragraph-boundary splitter with overlap
Embeddings            sentence-transformers / all-MiniLM-L6-v2 (via ONNX)
Vector Database       ChromaDB (persistent, file-based)
Entity Extraction     Custom pattern matchers + heuristic classifiers
Knowledge Assembly    Multi-query semantic retrieval + content synthesis

Website

Framework       Astro 6 (static site generation)
Styling         Scoped CSS with CSS custom properties
Typography      System serif stack (Iowan Old Style / Palatino / Georgia)
JavaScript      Zero (static HTML only)
Data Format     Single JSON file consumed at build time

Pipeline Scripts

01_extract.py           Text extraction from PDFs and EPUBs
02_chunk_embed.py       Chunking, embedding, and ChromaDB storage
03_extract_shapes.py    Pasta shape entity extraction and dedup
04_extract_doughs.py    Dough recipe extraction and classification
05_assemble.py          Per-shape knowledge assembly via semantic search
06_gather_sources.py    Source text gathering for content generation
07_merge_generated.py   Final database assembly and site deployment

Sources

All content on this site is derived from these 8 books. We are deeply grateful to these authors for their scholarship and passion for the craft of pasta.

  • Encyclopedia of Pasta
    Oretta Zanini De Vita, translated by Maureen B. Fant (2009)
    The definitive scholarly reference — 300+ regional pasta shapes documented with historical sources.
  • An A-Z of Pasta
    Rachel Roddy (2021)
    Alphabetical journey through pasta shapes with personal narrative and Roman kitchen wisdom.
  • The Encyclopedia of Pasta: Over 350 Recipes
    The Coastal Kitchen / Cider Mill Press (2023)
    Comprehensive recipe collection with dough formulas and detailed technique instructions.
  • Mastering Pasta
    Marc Vetri & David Joachim (2015)
    A chef's deep-dive into flour science, dough technique, and the craft of handmade pasta.
  • Pasta Grannies
    Vicky Bennison (2019)
    Regional traditions from Italian grandmothers — the living memory of pasta making.
  • Pasta By Hand
    Jenn Louis (2015)
    Focused entirely on hand-shaped regional pasta with precise technique instructions.
  • Pasta and Noodles: A Global History
    Kantha Shelke (2016)
    Historical and cultural context for pasta's evolution across civilizations.
  • Pasta
    Theo Randall (2013)
    A London chef's love letter to Italian pasta with recipes organized by shape.