🧠 Food Remedy Database Documentation

DB015 Documentation: For the full schema, data flow, cart/recommendation dependencies, and deployment checklist, see [Documents/Database/2026 Trimester 1/DB015-Schema-DataFlow-Documentation.md](<Documents/Database/2026 Trimester 1/DB015-Schema-DataFlow-Documentation.md>).

This document is the single reference for how the database/ folder is organised, how data is processed (scrape → clean → enrich → seed), and where to find scripts and docs. It changes no functionality; it is documentation only.

📄 Future docs: Save new database documentation in Documents/Database/[Year-Trimester].

Firebase / Firestore: Firebase Access

What the database folder is for

The database/ folder holds everything that prepares product data for the Food Remedy app:

1. Getting raw food product data (scraping).
2. Cleaning it so it is consistent and usable.
3. Enriching it with tags, scores, and categories.
4. Uploading it to Firestore (seeding).

In short: raw data comes in, the scripts in these folders turn it into clean, structured data, and that data is sent to Firestore for the mobile app.

How data flows

Scraping  →  Clean  →  Enrich  →  Seed
   ↓           ↓          ↓         ↓
scraping/  clean_data/  pipeline/  seeding/

Scraping: Get Australian products from Open Food Facts.
Clean: Fix duplicates, names, units, and structure.
Enrich: Add nutrition scores, tags, categories (done in pipeline).
Seed: Upload the final data to Firestore.

The pipeline/ folder runs clean → enrich → seed in one go using pipeline.config.json. The optional investigation step (e.g. data_investigation/) is for exploring and validating data outside the main pipeline.
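As a rough illustration of a config-driven run, the sketch below chains three stage functions in the order listed in a config dictionary. The `stages` key, the stage function bodies, and the `run_pipeline` helper are all assumptions made for this example; they are not taken from `run_pipeline.py` or `pipeline.config.json`.

```python
import json

# Hypothetical stage functions; the real ones live under pipeline/stages/.
def clean(products):
    # Drop records that have no barcode.
    return [p for p in products if p.get("barcode")]

def enrich(products):
    # Attach an empty tag list to every product that has none.
    return [{**p, "tags": p.get("tags", [])} for p in products]

def seed(products):
    # Stand-in for the Firestore upload: only report what would be written.
    print(f"would seed {len(products)} products")
    return products

STAGES = {"clean": clean, "enrich": enrich, "seed": seed}

def run_pipeline(config, products):
    """Run the stages named in config["stages"], in order."""
    for name in config["stages"]:
        products = STAGES[name](products)
    return products

# Assumed config shape; the actual pipeline.config.json may differ.
config = json.loads('{"stages": ["clean", "enrich", "seed"]}')
raw = [{"barcode": "9300000000001", "name": "Oats"}, {"name": "No barcode"}]
result = run_pipeline(config, raw)
```

Keeping the stage order in the config rather than in code means a run can skip or reorder stages without editing the runner, which may be one reason for driving the pipeline from a JSON file.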

Folder-by-folder

| Folder | What it does | Key files |
| --- | --- | --- |
| scraping/ | Gets raw Australian products from Open Food Facts. | `OpenFoodFacts-DataScrape.py` |
| clean_data/ | Cleans and normalises product data (one canonical cleaning folder). | `cleanProductData.py`, `constants.py`, `normalization/`, `IOExamples/` |
| pipeline/ | Runs clean → enrich → seed from config. | `run_pipeline.py`, `pipeline.config.json`, `stages/`, `modules/` |
| seeding/ | Uploads product JSON to Firestore in batches. | `seed_firestore.py`, `seed_engine.py`, `seed_products.py`, `schema_definition.json`, product chunk files |
| Allergens/ | Allergen reference data and detection. | `allergens_config.json`, `load_allergens.py`, `seed_allergens_to_db.py`, `test_allergens.py` |
| QA/ | Quality assurance for cleaned data. | `DB006_QA_cleaning.py`, `summary_report.txt`, `errors.json` |
| Validation/ | Validates product schema/rules before use and runs DB012 pre-seed checks. | `db021_validator.py`, `db012_validator.py`, `DB012-Validation-Integration-Testing.md` |
| Reports/ | Generates validation/pipeline reports. | `db021_report_generator.py` |
| data_investigation/ | Exploratory analysis and samples (not part of the production pipeline). | `exampleProductRaw.json`, `exampleProductCleaned.json`, `data_investigation.py` |
| logging_system/ | Shared logging for pipeline/scripts. | `logger.py`, `pipeline_logger_demo.py` |
| local_backend/ | Local scan/persistence helpers (Node/JS). | `scanPipeline.js`, `persistenceLayer.js`, `testScan.js`, `testPersistence.js` |
| output/ | Output chunks from pipeline runs. | `chunk_0_raw.json`, `chunk_0_clean.json`, `chunk_0_enriched.json` |

🥄 Scraping

File: database/scraping/OpenFoodFacts-DataScrape.py

- Streams the .jsonl.gz export from Open Food Facts rather than downloading the full file first.
- Keeps only products where countries_tags includes australia.
- Saves the result as openfoodfacts-australia.jsonl.

Do not commit the full .jsonl file to the repo. Seed Firestore in 10k-product chunks to stay within the limit of 20k writes per day.
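The streaming filter described above can be sketched roughly as follows. This is a minimal example, not the contents of `OpenFoodFacts-DataScrape.py`: the function name `iter_australian_products` is hypothetical, and it assumes tags are stored in the prefixed form `en:australia`. A small in-memory sample stands in for the real export so the block runs offline.

```python
import gzip
import io
import json

def iter_australian_products(stream):
    """Yield products whose countries_tags mention Australia.

    `stream` is any binary file-like object holding gzip-compressed JSONL,
    so records are decoded one line at a time rather than loaded at once.
    """
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue
            product = json.loads(line)
            tags = product.get("countries_tags") or []
            # Assumption: tags look like "en:australia"; compare the part
            # after the language prefix.
            if any(tag.split(":")[-1] == "australia" for tag in tags):
                yield product

# Tiny in-memory sample standing in for the real Open Food Facts export.
sample = "\n".join([
    json.dumps({"code": "1", "countries_tags": ["en:australia"]}),
    json.dumps({"code": "2", "countries_tags": ["en:france"]}),
    json.dumps({"code": "3"}),
]).encode("utf-8")
compressed = io.BytesIO(gzip.compress(sample))

kept = list(iter_australian_products(compressed))
```

Because the function is a generator, matching products can be written to the output .jsonl one at a time, so memory use stays flat regardless of the size of the export.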

🧹 Cleaning

File: database/clean_data/cleanProductData.py

Prepares scraped data for ingestion: standardises, deduplicates, renames, and structures.

1. Load and deduplicate: remove duplicate product entries by barcode.
2. Text and field normalisation: clean names and brands; validate barcodes.
3. Numeric standardisation: convert values to consistent units (e.g. grams).
4. Nutrient filtering: keep energy, fats, carbs, protein, salt/sodium, etc.
5. Tag cleaning: remove language prefixes (e.g. en:) from tags.
6. Image handling: generate image URLs from barcodes.
7. Schema refinement: drop unwanted columns, rename code → barcode and brands → brand, and convert field names to camelCase.
8. Save: export cleaned JSON for Firestore/pipeline.
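A few of these steps (deduplication by barcode, tag prefix removal, the two renames, and camelCase conversion) can be sketched as below. This is an illustration only, not the code in `cleanProductData.py`; the helper names, the two-letter prefix pattern, and the sample field `labels_tags` are assumptions made for the example.

```python
import re

def strip_language_prefix(tag):
    """Turn 'en:organic' into 'organic'; leave unprefixed tags alone."""
    return re.sub(r"^[a-z]{2}:", "", tag)

def to_camel_case(name):
    """Convert snake_case keys such as 'product_name' to 'productName'."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)

# The two renames named in the schema refinement step.
RENAMES = {"code": "barcode", "brands": "brand"}

def clean_products(raw_products):
    seen = set()
    cleaned = []
    for raw in raw_products:
        barcode = str(raw.get("code", "")).strip()
        # Skip records without a numeric barcode, and later duplicates.
        if not barcode.isdigit() or barcode in seen:
            continue
        seen.add(barcode)
        product = {}
        for key, value in raw.items():
            new_key = to_camel_case(RENAMES.get(key, key))
            if isinstance(value, str):
                value = value.strip()
            elif isinstance(value, list):
                value = [strip_language_prefix(v) if isinstance(v, str) else v
                         for v in value]
            product[new_key] = value
        product["barcode"] = barcode
        cleaned.append(product)
    return cleaned

raw = [
    {"code": " 9300000000001 ", "product_name": " Rolled Oats ",
     "brands": "Acme", "labels_tags": ["en:organic", "vegan"]},
    {"code": "9300000000001", "product_name": "Rolled Oats duplicate"},
    {"code": "abc", "product_name": "Bad barcode"},
]
cleaned = clean_products(raw)
```

Here the first record with a given barcode wins; the real script may choose differently, for example keeping the most complete record.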

Note: clean_data/ is the only cleaning folder. All cleaning scripts and examples live there.

🔎 Data investigation

Folder: database/data_investigation/

Used for exploratory analysis and validation: testing cleaning steps, comparing raw and cleaned data, and validating data before seeding. These scripts are for internal testing and reporting; they are not part of the production pipeline.

🌱 Seeding

File: database/seeding/seed_firestore.py (and seed_engine.py, seed_products.py)

1. Initialise Firebase: use serviceAccountKey.json.
2. Load cleaned data: e.g. products_XXk_XXk.json (the filename encodes the chunk range).
3. Batch upload: write in batches of 500, with retries and timestamps (dateAdded, lastUpdated).
4. Store: products go into the Firestore products collection (the default), keyed by barcode.
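The batching and retry logic can be sketched without touching Firestore. In the example below, `commit_batch` is a hypothetical callable that receives one batch of documents keyed by barcode; in a real script it would presumably wrap a Firestore batched write. The retry count and the in-memory `store` are assumptions for illustration, not values taken from `seed_firestore.py`.

```python
from datetime import datetime, timezone

BATCH_SIZE = 500  # batch size stated in the steps above

def chunks(items, size=BATCH_SIZE):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def seed_products(products, commit_batch, max_retries=3):
    """Send products to `commit_batch` in batches, retrying failed batches."""
    written = 0
    for chunk in chunks(products):
        now = datetime.now(timezone.utc).isoformat()
        docs = {
            p["barcode"]: {**p, "dateAdded": now, "lastUpdated": now}
            for p in chunk
        }
        for attempt in range(1, max_retries + 1):
            try:
                commit_batch(docs)
                written += len(docs)
                break
            except Exception:
                # Give up only after the final attempt has failed.
                if attempt == max_retries:
                    raise
    return written

# In-memory stand-in for Firestore so the sketch runs without credentials.
store = {}
batch_sizes = []

def fake_commit(docs):
    batch_sizes.append(len(docs))
    store.update(docs)

products = [{"barcode": str(n), "name": f"Product {n}"} for n in range(1200)]
total = seed_products(products, fake_commit)
```

This sketch sets both timestamps on every write. How the real script treats dateAdded when a product already exists is not covered here.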

DB012 workflow:

1. Run validation via Validation/db012_validator.py.
2. Run Firestore integration checks via db012_integration_test.py.
3. Run local cart integration: npm run test:db012:cart (see Validation/DB012-Validation-Integration-Testing.md).

Alternatively, run seeding with the validation gate: seed_firestore.py --validate (the pipeline seed stage sets validate_before_seed).
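The idea of a validation gate, where nothing is seeded unless every product passes, can be illustrated as below. The two rules in `validate_product` and the `seed_with_gate` helper are hypothetical; they are not the DB012 rule set or the logic in `db012_validator.py`. Only the flag name `validate_before_seed` comes from the workflow above.

```python
def validate_product(product):
    """Return a list of problems; an empty list means the product passed.

    These two rules are illustrative only, not the DB012 rule set.
    """
    problems = []
    if not str(product.get("barcode", "")).isdigit():
        problems.append("barcode must be numeric")
    if not product.get("productName"):
        problems.append("productName is required")
    return problems

def seed_with_gate(products, seed, validate_before_seed=True):
    """Call seed(products) only if validation is off or every product passes."""
    if validate_before_seed:
        errors = {}
        for index, product in enumerate(products):
            problems = validate_product(product)
            if problems:
                errors[index] = problems
        if errors:
            # Block the whole run so no partial data reaches the database.
            return {"seeded": 0, "errors": errors}
    seed(products)
    return {"seeded": len(products), "errors": {}}

seeded = []  # stand-in for the Firestore collection
good = [{"barcode": "9300000000001", "productName": "Rolled Oats"}]
bad = good + [{"barcode": "n/a"}]

ok_result = seed_with_gate(good, seeded.extend)
blocked_result = seed_with_gate(bad, seeded.extend)
```

Blocking the whole run on any failure is one possible policy; a gate could equally skip only the failing products and seed the rest.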

⚙️ Pipeline summary

End-to-end flow:

Scraping → Cleaning → Enrichment → Seeding

Scrape — Collect Australian food product data.
Clean — Process and standardise (consistent schema).
Enrich — Add tags, scores, categories (pipeline modules).
Seed — Upload to Firestore.

The optional investigation step (e.g. data_investigation/) checks data quality and accuracy outside the main pipeline. To run the full flow, use pipeline/run_pipeline.py with pipeline/pipeline.config.json.

Root files in database/

| File | Purpose |
| --- | --- |
| `DATABASE-README.md` | This file: structure, process, and quick reference. |
| `DB006_sample1.py` | Sample script for DB006 (QA). |
| `DB007-missing-values.md` | Notes on missing values (DB007). |
| `pipeline_checkpoints.json`, `pipeline_run_metadata.json` | Pipeline state and metadata (used by `run_pipeline.py`). |
| `__init__.py` | Makes `database` a Python package. |

Quick reference: “Where do I…?”

| I want to… | Go to… |
| --- | --- |
| Get raw Australian products | `scraping/OpenFoodFacts-DataScrape.py` |
| Clean raw data | `clean_data/cleanProductData.py` and `clean_data/normalization/` |
| Run full flow (clean → enrich → seed) | `pipeline/run_pipeline.py` and `pipeline/pipeline.config.json` |
| Upload products to Firestore | `seeding/` (e.g. `seed_firestore.py`, `seed_engine.py`) |
| Work on allergens | `Allergens/` |
| Run or improve cleaning QA | `QA/DB006_QA_cleaning.py` |
| Validate schema/product shape | `Validation/`, `seeding/schema_definition.json`, `Validation/DB012-Validation-Integration-Testing.md` |
| Explore data or examples | `data_investigation/`, `clean_data/IOExamples/` |
| Change pipeline logging | `logging_system/logger.py` |

Summary: One cleaning folder (clean_data/). One doc (this file). Flow: Scraping → Clean → Enrich → Seed. New team members can use this README to find scraping scripts, cleaning scripts, enrichment (pipeline), seeding scripts, and QA/Reports.

Trimester 2026 T1 — full local workflow (mobile app, captcha, env vars): Documents/Guides/General/t1-2026-workflow-and-local-development.md
