AURUM

Raw data in. Hallmarked golden records out.


📊 Diagrams	Pipeline · Decision model · gbrain memory · Dream cycle
❓ FAQ	Storage · Enterprise deployment · Air-gapped · Compliance
🧠 gbrain Integration	Knowledge graph · Stewardship memory · Nightly dream cycle

AURUM is a vendor-agnostic Master Data Management reference implementation. Five stages — named for the journey from raw ore to hallmarked gold — cover the full MDM lifecycle across 7 domains.

Status: v0.2.1. This is an active reference build. Some components are working code, some are scaffolded interfaces, some are planned for v0.2.0. The Component Status table tells you exactly what's what — no overpromises. See ROADMAP.md for v0.2.0 commitments.

Why AURUM?

Traditional MDM programmes share a common shape: a vendor, an SI, a long delivery timeline, and knowledge that walks out the door when the consultants do. AURUM was built because that pattern keeps repeating.

	Traditional MDM	AURUM
Licence	Annual vendor fee	CC BY-NC-SA 4.0
Time to value	12–24 months	Days
DQ rule delivery	Weeks of dev cycles	Seconds (LLM-generated)
Institutional knowledge	Lives in people	Compounds in the graph
Cross-domain visibility	Siloed by domain	7 domains, unified mastering

The Knowledge Problem

Time to Ship a DQ Rule

The Invisible Entity Problem

What You're Actually Buying

Diagrams

Visual walkthroughs of the pipeline, decision model, and knowledge graph layer — all render natively on GitHub:

📊 docs/diagrams/AURUM_DIAGRAMS.md

Includes: AURUM 5-stage pipeline · Three-tier stewardship model · gbrain memory layer · Nightly dream cycle · Knowledge compounding curve

❓ docs/AURUM_FAQ.md

Common questions answered in plain English: Where is data stored? · Enterprise deployment options · Air-gapped setup · Does data go to OpenAI? · Is it production-ready? · What does gbrain add?

The Five Stages

In metallurgy, gold starts as raw ore. It is assayed, unearthed, refined, unfurled into the world, and hallmarked with proof of provenance. MDM follows the same arc — and every stage name tells you what happens there.

Source Systems
      │
      ▼
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  ASSAY   │──▶│ UNEARTH  │──▶│  REFINE  │──▶│  UNFURL  │──▶│   MARK   │
│          │   │          │   │          │   │          │   │          │
│ Test the │   │ Surface  │   │ Fuse many│   │ Issue to │   │ Stamp    │
│ raw ore  │   │ what's   │   │ into one │   │ the world│   │ lineage  │
│          │   │ buried   │   │ golden   │   │          │   │ &        │
│          │   │          │   │ record   │   │          │   │ provenance│
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘
                                    │
                            Golden Record Store

#	Stage	Responsibility
01	ASSAY	Ingestion · Schema Mapping · Migration
02	UNEARTH	Profiling · DQ Rules · Anomaly Detection
03	REFINE	Blocking · Matching · Survivorship · Golden Record
04	UNFURL	Publication · APIs · Subscriptions
05	MARK	Reverse Integration · Lineage · Reconciliation

Domains

Domain	Core Mastering Challenge
Customer	Identity resolution across channels
Product	Variant explosion, UOM conflicts, hierarchy
Vendor	Legal entity vs trading entity, group vs subsidiary
Asset	Lifecycle state, location drift, maintenance lineage
Location	Hierarchy conflicts, geocoding drift
Employee	Org hierarchy changes, multi-role assignments
Counterparty	Dual role (vendor + customer), legal entity identifiers

Note: All 7 domain profilers ship working in v0.2.0. Sample data covers all 7 domains end-to-end (600–300 rows each, deliberately dirty). Domain-specific matchers (SKU normalization, legal-entity disambiguation, lifecycle-aware matching) remain on the v0.3.0 roadmap.

Component Status

Three statuses, no fudge:

✅ Working — runs end-to-end, tested, demoable
🔧 Stub — interface exists, full logic deferred
📋 Planned — directory exists, work scheduled (see ROADMAP.md)

ASSAY — Stage 01

Component	Status	Notes
Schema inspector	✅ Working	Field profiling, type inference, null/cardinality stats
CSV connector	✅ Working	Via pandas, used by demo
Other source connectors (REST, DB, flat file)	📋 Planned	Directory scaffolded
Migration cookbook (CHARON pattern)	📋 Planned	Documented in stage-overview, code pending

UNEARTH — Stage 02

Component	Status	Notes
Customer profiler	✅ Working	Completeness, format, consistency rules
Product profiler	✅ Working	SKU, barcode (EAN/UPC/GTIN), UOM whitelist, brand/name casing
Vendor profiler	✅ Working	Tax ID, country, legal-vs-trading name, self-parent
Asset profiler	✅ Working	Tag format, lifecycle whitelist, orphan detection
Location profiler	✅ Working	Lat/lon range, Null Island detector, self-parent
Employee profiler	✅ Working	Email, hire-date ISO format, self-manager, status whitelist
Counterparty profiler	✅ Working	LEI (ISO 17442), role flagging, jurisdiction
DQ rule engine (standalone)	🔧 Stub	Rules currently embedded in profilers
ML anomaly detector	✅ Working	Isolation Forest with generic per-column feature engineering
LLM rule generator	✅ Working	Anthropic SDK + tool-use schema + AST safety guards; rules saved for steward review, never auto-promoted

REFINE — Stage 03

Component	Status	Notes
Matcher (RapidFuzz + Jaro-Winkler + token + name-boost)	✅ Working	Composite scoring with name-boost floor
Transitive cluster builder	✅ Working	Connected components on the match graph
Survivorship engine	✅ Working	Standardize → validate → survive pipeline
Linked-tuple geography survivorship	✅ Working	Prevents (Dubai, UK) frankenrecords
Golden record assembly + trust scoring	✅ Working	0.6 × completeness + 0.4 × source diversity
Blocking engine (LSH / sorted-neighbourhood)	🔧 Stub	Naive O(n²) used inline for the demo
LLM tiebreaker for borderline matches	📋 Planned	v0.2.0

UNFURL — Stage 04

Component	Status	Notes
Publisher interface	🔧 Stub	Defines registry pattern; HTTP push not yet wired
FastAPI publication layer	📋 Planned	v0.2.0
Subscription / consumer routing	📋 Planned	v0.2.0

MARK — Stage 05

Component	Status	Notes
Lineage event tracker	✅ Working	In-memory log per record (production: persistent store)
Reverse sync engine	📋 Planned	v0.2.0 — golden-record-change → downstream sync plan
Reconciliation	📋 Planned	v0.2.0 — detect downstream drift from master

Runtimes

Component	Status	Notes
MCP server	✅ Working	3 tools: `assay_schema`, `unearth_profile`, `refine_match`
CLI (click + rich)	✅ Working	5 commands: `assay`, `unearth`, `anomaly`, `demo`, `domains`
FastAPI HTTP runtime	✅ Working	6 endpoints, multipart CSV upload, OpenAPI/ReDoc docs
Streamlit UI	📋 Planned	v0.2.0
Airflow / Prefect orchestration	📋 Planned	Stubs only

Power Platform Companion Track

Component	Status	Notes
Customer Dataverse schema (YAML)	✅ Working	Deployable via PAC CLI
Other 6 domain schemas	📋 Planned	v0.2.0
Power Automate flow JSON exports	📋 Planned	v0.2.0 — Steward Approval, Exception Routing, Daily Reconciliation
AI Builder model definitions	📋 Planned	v0.2.0 — DQ Score Predictor
Copilot Studio bot config	📋 Planned	v0.2.0 — MDM Steward Assistant

Certification Program

Component	Status	Notes
Curriculum overview (Foundation → Practitioner → Architect)	✅ Written	3-tier outline in `certification/README.md`
Module content + labs	📋 Planned	Lab directory exists, content pending

Demo

Component	Status	Notes
End-to-end pipeline demo	✅ Working	All 5 stages run with assertions; CI-verified

AI / ML in v0.1.3 — what's actually there

The repo claims AI augmentation across stages. In v0.1.3 the working AI/ML components are:

REFINE classical ML matching — RapidFuzz token scoring + Jellyfish Jaro-Winkler + composite weighting. Deterministic, fast, well-understood. Not LLM-based — and that's the right choice for matching at scale.
UNEARTH Isolation Forest anomaly detector — flags rows whose feature combinations are unlike the rest of the dataset. Generic per-column feature engineering (length, character composition, null state) so it works across all 7 domains without per-domain tuning. Deterministic.
UNEARTH LLM rule generator — translates steward prose ("phone numbers must be UAE-format if country is UAE") into deterministic Python rules via the Anthropic SDK with tool-use structured output. Saves to a generated/ directory for human review; never auto-promotes. AST safety guards reject imports, eval/exec, and missing check functions.
MCP server — exposes the AURUM pipeline as an MCP-compatible server invokable from Claude Code, Cursor, and any Hermes/Nous-based agentic runtime.

LLM-based components (rule generation, anomaly explanation, match tiebreakers) are designed and scaffolded but not yet implemented. They land in v0.2.0. The architectural choice — where LLMs add value vs. where they add risk in MDM — is documented in docs/architecture/ai-strategy.md.

Quickstart

git clone https://github.com/RajaMDM/AURUM.git
cd AURUM
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python shared/sample_data/generate_all.py
python demo/end_to_end_demo.py

The demo runs ASSAY → UNEARTH → REFINE → UNFURL → MARK end-to-end with assertions for cluster integrity, frankenrecord detection, and trust scoring.

CLI usage

python -m runtimes.cli --help
python -m runtimes.cli assay shared/sample_data/output/customers_dirty.csv
python -m runtimes.cli unearth customer shared/sample_data/output/customers_dirty.csv
python -m runtimes.cli anomaly shared/sample_data/output/customers_dirty.csv --domain customer
python -m runtimes.cli demo

Every CLI command supports --json-out for machine-readable output.

HTTP API

uvicorn runtimes.api.main:app --reload --port 8000
# OpenAPI docs:  http://localhost:8000/docs
# ReDoc:         http://localhost:8000/redoc

# Examples
curl -F file=@shared/sample_data/output/customers_dirty.csv http://localhost:8000/assay
curl -F file=@shared/sample_data/output/customers_dirty.csv http://localhost:8000/unearth/customer
curl -F file=@shared/sample_data/output/customers_dirty.csv "http://localhost:8000/anomaly?domain=customer"

Repository Structure

AURUM/
├── shared/              # Domain models · Sample data · Utilities
├── assay/               # Stage 01: Ingestion, schema mapping, migration
├── unearth/             # Stage 02: Profiling, DQ, anomaly detection
├── refine/              # Stage 03: Matching, mastering, golden record
├── unfurl/              # Stage 04: Publication, APIs, subscriptions
├── mark/                # Stage 05: Reverse integration, lineage
├── runtimes/            # CLI · FastAPI · MCP · Streamlit · Orchestration
├── power-platform/      # Dataverse schemas · Power Automate · AI Builder
├── certification/       # Foundation → Practitioner → Architect
├── docs/                # Architecture · Methodology · Roles & Agents
└── demo/                # End-to-end pipeline walkthrough

MCP Compatibility

The AURUM pipeline is invocable as an MCP server with three working tools:

python runtimes/mcp/server.py

Tool	Purpose
`assay_schema`	Inspect a CSV source file — field types, nulls, cardinality
`unearth_profile`	Profile a customer CSV for DQ issues
`refine_match`	Find duplicate candidates in a customer dataset

Compatible with Claude Code, Cursor, Hermes/Nous, and any MCP-compliant runtime.

Use-Case Library

The use_cases/ directory contains 41 real-world MDM scenario playbooks across all 7 domains — organised in three tiers: single-domain, cross-domain pairs, and a grand scenario where all 7 domains interact.

Tier	Use Cases	Description
Tier 1 — Single Domain	35	5 scenarios per domain
Tier 2 — Cross-Domain Pairs	5	2–3 domains talking to each other
Tier 3 — Grand Scenario	1	All 7 domains in one real-world event

Domain	Use Cases
Customer	Identity resolution, cross-channel merge, DQ failures, golden record conflicts, lineage audit
Product	SKU deduplication, UOM conflicts, variant explosion, barcode DQ, price conflicts
Vendor	Legal vs trading entity, group/subsidiary hierarchy, tax ID DQ, duplicate detection, vendor-customer crossover
Asset	Lifecycle state conflicts, orphaned assets, location drift, serial number dedup, maintenance lineage
Location	Hierarchy conflicts, geocoding drift, store/warehouse duplicates, address standardisation, parent-child resolution
Employee	Multi-system identity merge, org hierarchy changes, multi-role assignments, leaver/rehire detection, cost centre realignment
Counterparty	Dual-role detection, LEI validation, legal entity dedup, jurisdiction mismatch, relationship lineage

→ Browse all 41 use cases

Narratives & Deep Dives

Document	What It Is
The Intelligent Refinery	Start here. MDM in the AI era — LLM rule generation, agentic stewardship, MCP-native pipeline, the 5 questions every CTO should ask
Imagine a World	The 4-act MDM journey — single domain → all domains → cross-domain → full golden web, with before/after stories for every scenario
How We Build AURUM	The AI stack behind this project — Hermes, BMAD, gstack, Claude Code, Telegram. Running an enterprise MDM project from a phone.
Data Sovereignty & Compliance	How AURUM aligns with data laws across the GCC (UAE PDPL, ADDA, DDA, KSA PDPL, Qatar PDPL) and globally (GDPR, CCPA, DPDP, PIPL, POPIA and more)
Disclaimer	All company names, entities and scenarios in this repo are entirely fictional. Any resemblance to real organisations is purely coincidental.

Contributing

See CONTRIBUTING.md. High-priority contributions for v0.2.0:

Domain-specific matchers (Product SKU normalization, Vendor legal-entity disambiguation, Asset lifecycle-aware matching)
LLM rule generator for UNEARTH
ML anomaly detector for UNEARTH
Reverse sync engine for MARK
Power Automate flow exports
Real Power Apps (model-driven) screenshots

Author

Built and maintained by Raja Shahnawaz Soni — Enterprise Data Management Practitioner, speaker at Informatica World 2023, and builder of The MDM Lab, BrainDrop, SYNAPTIQ, QualIQ, and Agents for Good.

"Anyone can describe a system. I'd rather hand you a working one and say — here, try it."

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AURUM

Why AURUM?

Diagrams

The Five Stages

Domains

Component Status

ASSAY — Stage 01

UNEARTH — Stage 02

REFINE — Stage 03

UNFURL — Stage 04

MARK — Stage 05

Runtimes

Power Platform Companion Track

Certification Program

Demo

AI / ML in v0.1.3 — what's actually there

Quickstart

CLI usage

HTTP API

Repository Structure

MCP Compatibility

Use-Case Library

Narratives & Deep Dives

Contributing

Author

License

About

Uh oh!

Releases 3

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
.github		.github
assay		assay
certification		certification
demo		demo
docs		docs
mark		mark
power-platform		power-platform
refine		refine
runtimes		runtimes
shared		shared
tests		tests
unearth		unearth
unfurl		unfurl
use_cases		use_cases
.gitignore		.gitignore
AUTHOR.md		AUTHOR.md
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CLAUDE.md		CLAUDE.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
ROADMAP.md		ROADMAP.md
SECURITY.md		SECURITY.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

AURUM

Why AURUM?

Diagrams

The Five Stages

Domains

Component Status

ASSAY — Stage 01

UNEARTH — Stage 02

REFINE — Stage 03

UNFURL — Stage 04

MARK — Stage 05

Runtimes

Power Platform Companion Track

Certification Program

Demo

AI / ML in v0.1.3 — what's actually there

Quickstart

CLI usage

HTTP API

Repository Structure

MCP Compatibility

Use-Case Library

Narratives & Deep Dives

Contributing

Author

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages