stacknil · Jul 5, 2026
diff --git a/tools/sbom-diff-and-risk/README.md b/tools/sbom-diff-and-risk/README.md
@@ -61,7 +61,7 @@ The v1.1 implementation sequence is fixed in
 - Produce machine-friendly JSON and reviewer-friendly Markdown reports.
 - Stay fully local-file based by default.
 
-## v0.1 Internal Component Model
+## Internal Component Model
 
 The normalized schema is the core design choice for the project:
 
@@ -84,6 +84,24 @@ Diff identity is intentionally conservative and uses this precedence:
 
 When a `purl` includes a version, the tool keeps the full value in `Component.purl` for auditability but uses the versionless package coordinate for identity so upgrades still diff as `changed`.
 
+Before indexing, each component is converted to an immutable
+`CanonicalComponentIdentity` containing normalized `ecosystem`,
+`package_name`, `version`, `purl`, and `component_key` fields. PURL syntax is
+parsed with the official `packageurl-python` implementation. PyPI package
+names use PEP 503 normalization; names for ecosystems without an explicit
+project rule preserve case.
+
+The index fails closed with stable diagnostics:
+
+- `duplicate_component` when one input repeats the same canonical identity and
+  normalized metadata;
+- `conflicting_metadata` when records share an identity but disagree on
+  metadata, or when explicit ecosystem/name/version fields disagree with a
+  purl.
+
+See [docs/v1.1-input-and-policy-semantics.md](docs/v1.1-input-and-policy-semantics.md)
+for the v1.1 identity contract and compatibility boundary.
+
 ## Non-goals
 
 - No vulnerability database integration in v0.1.
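
Reviewer note: the naming rule described in the README hunk above can be illustrated with a short sketch. It is not the project's code. `canonical_package_name` and `coordinate_key` are hypothetical names, and the regular expression is the PEP 503 rule written out with the standard library as a stand-in for `packaging.utils.canonicalize_name`, which the change itself uses.

```python
import re


def canonical_package_name(ecosystem: str, name: str) -> str:
    """Normalize PyPI names per PEP 503; preserve case everywhere else."""
    stripped = name.strip()
    if ecosystem == "pypi":
        # PEP 503: collapse runs of '-', '_' and '.' into one '-' and lowercase.
        return re.sub(r"[-_.]+", "-", stripped).lower()
    # Assumption mirrored from the docs: no universal lowercase rule.
    return stripped


def coordinate_key(ecosystem: str, name: str) -> str:
    """Build the fallback 'coord:' key used when no purl or bom_ref exists."""
    normalized_ecosystem = ecosystem.strip().lower()
    return f"coord:{normalized_ecosystem}:{canonical_package_name(normalized_ecosystem, name)}"


print(coordinate_key("PyPI", "Ruamel.YAML"))  # coord:pypi:ruamel-yaml
print(coordinate_key("npm", "JSONStream"))    # coord:npm:JSONStream
```

Under this rule `Flask_SQLAlchemy` and `flask-sqlalchemy` align to one PyPI identity, while an npm name keeps the case it was observed with.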

diff --git a/tools/sbom-diff-and-risk/docs/parser-boundaries.md b/tools/sbom-diff-and-risk/docs/parser-boundaries.md
@@ -39,6 +39,16 @@ The parser does not currently constrain the SPDX version or validate the
 document against an SPDX schema. Relationships and file-level data do not
 affect component identity or policy decisions.
 
+## Component identity validation
+
+After parsing, every component is canonicalized before diff indexing. For
+purl-bearing components, the purl type and name must agree with the explicit
+component fields, and the versions must agree when both are present. Invalid or
+conflicting identity metadata fails closed as `conflicting_metadata`; repeated
+identical records fail as `duplicate_component`. See
+[v1.1-input-and-policy-semantics.md](v1.1-input-and-policy-semantics.md) for the
+typed identity contract.
+
 ## Requirements files
 
 `requirements.txt` is treated as a narrow manifest format, not as "everything pip can do in a file".
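
Reviewer note: the agreement check in the "Component identity validation" section added above can be sketched as follows. This is a simplified illustration, assuming the purl has already been parsed and both sides already normalized. `IdentityFields` and `find_conflicts` are hypothetical names, not part of the change.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class IdentityFields:
    """Hypothetical holder for already-normalized identity fields."""

    ecosystem: str
    name: str
    version: str | None = None


def find_conflicts(explicit: IdentityFields, from_purl: IdentityFields) -> list[str]:
    """Return the names of the fields on which explicit metadata and purl disagree."""
    conflicts: list[str] = []
    if explicit.ecosystem != from_purl.ecosystem:
        conflicts.append("ecosystem")
    if explicit.name != from_purl.name:
        conflicts.append("package name")
    # A missing version on either side is not a disagreement.
    if (
        explicit.version is not None
        and from_purl.version is not None
        and explicit.version != from_purl.version
    ):
        conflicts.append("version")
    return conflicts


purl_side = IdentityFields("pypi", "requests", "2.31.0")
print(find_conflicts(IdentityFields("pypi", "requests"), purl_side))            # []
print(find_conflicts(IdentityFields("pypi", "requests", "2.30.0"), purl_side))  # ['version']
```

A non-empty result corresponds to the `conflicting_metadata` diagnostic. Collecting every disagreement before failing keeps the error message complete in a single pass.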

diff --git a/tools/sbom-diff-and-risk/docs/v1.1-input-and-policy-semantics.md b/tools/sbom-diff-and-risk/docs/v1.1-input-and-policy-semantics.md
@@ -13,19 +13,21 @@ of this monorepo.
 | Policy schema identifier | Implemented as `sbom-diff-risk.policy.v1`; legacy policy files remain readable |
 | Report schema identifier and compatibility tests | Implemented as `sbom-diff-risk.report.v1` across checked-in full-report fixtures |
 | Per-decision rule, evidence, reason, and confidence | Implemented additively in report v1 policy finding objects |
-| Component identity canonicalization | Next implementation slice; target semantics are fixed below |
+| Component identity canonicalization | Implemented as a typed value object with stable duplicate/conflict diagnostics |
 
-## Component identity target
+## Component identity contract
 
 The canonical identity record will expose these dimensions separately:
 
 - `ecosystem`: trimmed and normalized to a registered ecosystem identifier.
 - `package_name`: normalized with ecosystem-aware rules. PyPI names use PEP
-  503 normalization; other ecosystems require explicit test-backed rules.
+  503 normalization; ecosystems without an explicit project rule preserve
+  case rather than inheriting a universal lowercase rule.
 - `version`: trimmed but otherwise preserved as observed. The tool will not
   infer semantic equivalence between unrelated version schemes.
-- `purl`: parsed and normalized when present, while retaining the observed purl
-  in component evidence for auditability.
+- `purl`: parsed with `packageurl-python` and normalized when present, while
+  retaining the observed purl in component evidence for auditability. An
+  explicit component version does not get invented inside a versionless purl.
 - `component_key`: versionless package identity used to align before and after
   inputs. A version change remains a change, not an add plus remove.
 
@@ -34,6 +36,10 @@ Identity authority remains `purl`, then `bom_ref`, then the normalized
 its ecosystem and package coordinate. Explicit metadata that disagrees with
 that coordinate is a conflict, not an alternative identity.
 
+Canonical identity also drives change comparison: lexical PyPI variants that
+normalize to the same identity do not create a metadata change, and a version
+change carried only by the purl is still classified as `version_changed`.
+
 Within one input:
 
 - two records with the same key and identical normalized metadata fail closed
@@ -42,13 +48,15 @@ Within one input:
   as `conflicting_metadata`;
 - conflicting ecosystem, package name, or version information between a purl
   and explicit fields also fails closed as `conflicting_metadata`;
+- an invalid purl also fails closed as `conflicting_metadata` because it cannot
+  establish an unambiguous canonical identity;
 - metadata differences across the before and after inputs remain normal diff
   evidence and do not become same-input conflicts.
 
-The next code slice should introduce a typed canonical identity object and
-diagnostic error codes before changing report presentation. Cross-format tests
-must cover CycloneDX to SPDX alignment, PyPI name normalization, versioned
-purls, exact duplicates, and conflicting metadata.
+The implementation introduces a frozen `CanonicalComponentIdentity` object and
+keeps report presentation unchanged. Tests cover CycloneDX-to-SPDX alignment,
+PyPI name normalization, case preservation for ecosystems without a declared
+name rule, versioned purls, exact duplicates, and conflicting metadata.
 
 ## Policy and decision contract
 

diff --git a/tools/sbom-diff-and-risk/pyproject.toml b/tools/sbom-diff-and-risk/pyproject.toml
@@ -24,10 +24,11 @@ classifiers = [
   "Topic :: Security",
   "Topic :: Software Development :: Libraries :: Python Modules",
 ]
-dependencies = [
-  "packaging>=24.0",
-  "PyYAML>=6.0",
-]
+dependencies = [
+  "packaging>=24.0",
+  "packageurl-python>=0.17.6,<0.18",
+  "PyYAML>=6.0",
+]
 
 [project.urls]
 Homepage = "https://github.com/stacknil/scientific-computing-toolkit"

diff --git a/tools/sbom-diff-and-risk/src/sbom_diff_risk/component_identity.py b/tools/sbom-diff-and-risk/src/sbom_diff_risk/component_identity.py
@@ -0,0 +1,115 @@
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+from packaging.utils import canonicalize_name
+from packageurl import PackageURL
+
+from .errors import ComponentIdentityDiagnosticCode, ComponentIdentityError
+from .models import Component
+
+
+@dataclass(slots=True, frozen=True)
+class CanonicalComponentIdentity:
+    ecosystem: str
+    package_name: str
+    version: str | None
+    purl: str | None
+    component_key: str
+
+
+def canonicalize_component_identity(component: Component) -> CanonicalComponentIdentity:
+    explicit_ecosystem = component.ecosystem.strip().lower()
+    explicit_name = _canonical_package_name(explicit_ecosystem, component.name)
+    explicit_version = _optional_str(component.version)
+
+    if component.purl is None:
+        if component.bom_ref:
+            component_key = f"bom-ref:{component.bom_ref.strip().lower()}"
+        else:
+            component_key = f"coord:{explicit_ecosystem}:{explicit_name}"
+        return CanonicalComponentIdentity(
+            ecosystem=explicit_ecosystem,
+            package_name=explicit_name,
+            version=explicit_version,
+            purl=None,
+            component_key=component_key,
+        )
+
+    parsed = _parse_purl(component.purl)
+    purl_ecosystem = parsed.type.strip().lower()
+    purl_name = _canonical_package_name(purl_ecosystem, parsed.name)
+    purl_version = _optional_str(parsed.version)
+
+    conflicts: list[str] = []
+    if explicit_ecosystem != purl_ecosystem:
+        conflicts.append(f"ecosystem={explicit_ecosystem!r} disagrees with purl type={purl_ecosystem!r}")
+    if explicit_name != purl_name:
+        conflicts.append(f"package name={explicit_name!r} disagrees with purl name={purl_name!r}")
+    if explicit_version is not None and purl_version is not None and explicit_version != purl_version:
+        conflicts.append(f"version={explicit_version!r} disagrees with purl version={purl_version!r}")
+    if conflicts:
+        raise ComponentIdentityError(
+            ComponentIdentityDiagnosticCode.CONFLICTING_METADATA,
+            "; ".join(conflicts),
+            component_key=_purl_component_key(parsed, purl_ecosystem, purl_name),
+        )
+
+    canonical_version = purl_version or explicit_version
+    canonical_purl = _canonical_purl(parsed, purl_ecosystem, purl_name, purl_version)
+    return CanonicalComponentIdentity(
+        ecosystem=purl_ecosystem,
+        package_name=purl_name,
+        version=canonical_version,
+        purl=canonical_purl,
+        component_key=_purl_component_key(parsed, purl_ecosystem, purl_name),
+    )
+
+
+def _parse_purl(raw_purl: str) -> PackageURL:
+    try:
+        return PackageURL.from_string(raw_purl.strip())
+    except ValueError as exc:
+        raise ComponentIdentityError(
+            ComponentIdentityDiagnosticCode.CONFLICTING_METADATA,
+            f"purl is not valid: {exc}",
+        ) from exc
+
+
+def _canonical_purl(
+    parsed: PackageURL,
+    ecosystem: str,
+    package_name: str,
+    version: str | None,
+) -> str:
+    return PackageURL(
+        type=ecosystem,
+        namespace=parsed.namespace,
+        name=package_name,
+        version=version,
+        qualifiers=parsed.qualifiers,
+        subpath=parsed.subpath,
+    ).to_string()
+
+
+def _purl_component_key(parsed: PackageURL, ecosystem: str, package_name: str) -> str:
+    identity_purl = PackageURL(
+        type=ecosystem,
+        namespace=parsed.namespace,
+        name=package_name,
+    ).to_string()
+    return f"purl:{identity_purl}"
+
+
+def _canonical_package_name(ecosystem: str, name: str) -> str:
+    stripped = name.strip()
+    if ecosystem == "pypi":
+        return canonicalize_name(stripped)
+    return stripped
+
+
+def _optional_str(value: str | None) -> str | None:
+    if value is None:
+        return None
+    stripped = value.strip()
+    return stripped or None
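
Reviewer note: one consequence of a frozen identity object is that change classification can compare canonical values instead of raw fields. The sketch below is a reduced model of that idea, not the project's code. `Identity`, `Record`, and `classify_change` are hypothetical names, and `license_id` stands in for all non-identity metadata.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """Reduced stand-in for a canonical identity: key plus canonical version."""

    component_key: str
    version: str | None = None


@dataclass(frozen=True)
class Record:
    """Identity plus one piece of non-identity metadata."""

    identity: Identity
    license_id: str | None = None


def classify_change(before: Record, after: Record) -> str | None:
    """Classify a before/after pair that shares a component key."""
    if before == after:
        return None  # identical after normalization: nothing to report
    if before.identity.version != after.identity.version:
        return "version_changed"
    return "metadata_changed"


old = Record(Identity("purl:pkg:pypi/requests", "2.30.0"), "Apache-2.0")
new = Record(Identity("purl:pkg:pypi/requests", "2.31.0"), "Apache-2.0")
print(classify_change(old, new))                          # version_changed
print(classify_change(old, Record(old.identity, "MIT")))  # metadata_changed
```

Because frozen dataclasses compare by value and are hashable, two records that differ only in pre-normalization spelling compare equal once canonicalized and produce no change entry.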
diff --git a/tools/sbom-diff-and-risk/src/sbom_diff_risk/diffing.py b/tools/sbom-diff-and-risk/src/sbom_diff_risk/diffing.py
@@ -2,51 +2,25 @@
 
 from typing import Iterable
 
+from .component_identity import canonicalize_component_identity
+from .errors import ComponentIdentityDiagnosticCode, ComponentIdentityError
 from .models import Component, ComponentChange
 
 
 def component_key(component: Component) -> str:
     """Return a stable identity with purl -> bom_ref -> (ecosystem, name)."""
-    if component.purl:
-        return f"purl:{_purl_identity(component.purl)}"
-    if component.bom_ref:
-        return f"bom-ref:{component.bom_ref.strip().lower()}"
-    ecosystem = component.ecosystem.strip().lower()
-    name = component.name.strip().lower()
-    return f"coord:{ecosystem}:{name}"
-
-
-def _purl_identity(purl: str) -> str:
-    candidate = purl.strip().lower()
-    if not candidate.startswith("pkg:"):
-        return candidate
-
-    end = len(candidate)
-    for separator in ("?", "#"):
-        position = candidate.find(separator)
-        if position != -1:
-            end = min(end, position)
-
-    base = candidate[:end]
-    version_separator = base.rfind("@")
-    name_separator = base.rfind("/")
-    if version_separator != -1 and version_separator > name_separator:
-        return base[:version_separator]
-
-    return base
+    return canonicalize_component_identity(component).component_key
 
 
 def _component_signature(component: Component) -> tuple[object, ...]:
+    identity = canonicalize_component_identity(component)
     return (
-        component.name,
-        component.version,
-        component.ecosystem,
-        component.purl,
-        component.license_id,
-        component.supplier,
-        component.source_url,
-        component.bom_ref,
-        component.raw_type,
+        identity,
+        _normalized_metadata(component.license_id),
+        _normalized_metadata(component.supplier),
+        _normalized_metadata(component.source_url),
+        _normalized_metadata(component.bom_ref, lower=True),
+        _normalized_metadata(component.raw_type, lower=True),
     )
 
 
@@ -71,9 +45,11 @@ def diff_components(
         if _component_signature(before_component) == _component_signature(after_component):
             continue
 
-        classification = "version_changed"
-        if before_component.version == after_component.version:
-            classification = "metadata_changed"
+        before_identity = canonicalize_component_identity(before_component)
+        after_identity = canonicalize_component_identity(after_component)
+        classification = (
+            "version_changed" if before_identity.version != after_identity.version else "metadata_changed"
+        )
 
         changed.append(
             ComponentChange(
@@ -90,8 +66,37 @@ def diff_components(
 def _index_components(components: Iterable[Component], side: str) -> dict[str, Component]:
     indexed: dict[str, Component] = {}
     for component in components:
-        key = component_key(component)
+        try:
+            key = component_key(component)
+        except ComponentIdentityError as exc:
+            raise ComponentIdentityError(
+                exc.code,
+                f"{exc.detail} in {side} input",
+                side=side,
+                component_key=exc.component_key,
+            ) from exc
         if key in indexed:
-            raise ValueError(f"Duplicate component identity in {side} input: {key}")
+            existing = indexed[key]
+            if _component_signature(existing) == _component_signature(component):
+                code = ComponentIdentityDiagnosticCode.DUPLICATE_COMPONENT
+                label = "duplicate component"
+            else:
+                code = ComponentIdentityDiagnosticCode.CONFLICTING_METADATA
+                label = "conflicting metadata"
+            raise ComponentIdentityError(
+                code,
+                f"{label} in {side} input for {key}",
+                side=side,
+                component_key=key,
+            )
         indexed[key] = component
     return indexed
+
+
+def _normalized_metadata(value: str | None, *, lower: bool = False) -> str | None:
+    if value is None:
+        return None
+    normalized = value.strip()
+    if lower:
+        normalized = normalized.lower()
+    return normalized or None
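
Reviewer note: the fail-closed indexing in `_index_components` above reduces to one decision when a key repeats: equal signatures mean a duplicate, unequal signatures mean a conflict. The sketch below shows that decision in isolation with the standard library only. `DiagnosticCode`, `IdentityIndexError`, `build_index`, and `diagnostic_for` are hypothetical names, and each record is assumed to arrive as a precomputed `(key, signature)` pair.

```python
from __future__ import annotations

from enum import Enum
from typing import Hashable, Iterable


class DiagnosticCode(str, Enum):
    DUPLICATE_COMPONENT = "duplicate_component"
    CONFLICTING_METADATA = "conflicting_metadata"


class IdentityIndexError(ValueError):
    """Raised when one input cannot produce an unambiguous index."""

    def __init__(self, code: DiagnosticCode, key: str) -> None:
        self.code = code
        self.key = key
        super().__init__(f"{code.value}: {key}")


def build_index(records: Iterable[tuple[str, Hashable]]) -> dict[str, Hashable]:
    """Index (key, signature) pairs, failing closed on any repeated key."""
    indexed: dict[str, Hashable] = {}
    for key, signature in records:
        if key in indexed:
            # Same key, same signature: exact duplicate. Otherwise: conflict.
            same = indexed[key] == signature
            code = DiagnosticCode.DUPLICATE_COMPONENT if same else DiagnosticCode.CONFLICTING_METADATA
            raise IdentityIndexError(code, key)
        indexed[key] = signature
    return indexed


def diagnostic_for(records: Iterable[tuple[str, Hashable]]) -> str | None:
    """Return the diagnostic code string for an input, or None if it indexes cleanly."""
    try:
        build_index(records)
    except IdentityIndexError as exc:
        return exc.code.value
    return None


key = "purl:pkg:pypi/requests"
print(diagnostic_for([(key, ("MIT",)), (key, ("MIT",))]))         # duplicate_component
print(diagnostic_for([(key, ("MIT",)), (key, ("Apache-2.0",))]))  # conflicting_metadata
```

Neither case keeps the first record and moves on. Both raise, so a caller never diffs against an index that silently dropped an entry.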
diff --git a/tools/sbom-diff-and-risk/src/sbom_diff_risk/errors.py b/tools/sbom-diff-and-risk/src/sbom_diff_risk/errors.py
@@ -1,5 +1,7 @@
 from __future__ import annotations
 
+from enum import StrEnum
+
 
 class ParseError(ValueError):
     """Raised when an input file cannot be parsed into normalized components."""
@@ -19,3 +21,26 @@ class InputSelectionError(ParseError):
 
 class PolicyError(ValueError):
     """Raised when policy parsing or evaluation inputs are invalid."""
+
+
+class ComponentIdentityDiagnosticCode(StrEnum):
+    DUPLICATE_COMPONENT = "duplicate_component"
+    CONFLICTING_METADATA = "conflicting_metadata"
+
+
+class ComponentIdentityError(ValueError):
+    """Raised when one input cannot produce an unambiguous component index."""
+
+    def __init__(
+        self,
+        code: ComponentIdentityDiagnosticCode,
+        message: str,
+        *,
+        side: str | None = None,
+        component_key: str | None = None,
+    ) -> None:
+        self.code = code
+        self.detail = message
+        self.side = side
+        self.component_key = component_key
+        super().__init__(f"{code.value}: {message}")