diff --git a/tools/sbom-diff-and-risk/README.md b/tools/sbom-diff-and-risk/README.md
index 92d2936..8f081f3 100644
--- a/tools/sbom-diff-and-risk/README.md
+++ b/tools/sbom-diff-and-risk/README.md
@@ -88,8 +88,11 @@ Before indexing, each component is converted to an immutable
 `CanonicalComponentIdentity` containing normalized `ecosystem`, `package_name`,
 `version`, `purl`, and `component_key` fields. PURL syntax is parsed with the
 official `packageurl-python` implementation. PyPI package
-names use PEP 503 normalization; names for ecosystems without an explicit
-project rule preserve case.
+names use PEP 503 normalization. npm package names follow the
+`packageurl-python` npm purl name form. Names for ecosystems without an
+explicit project rule preserve case.
+The test-backed ecosystem matrix is documented in
+[docs/component-identity-canonicalization.md](docs/component-identity-canonicalization.md).
 
 The index fails closed with stable diagnostics:
 
diff --git a/tools/sbom-diff-and-risk/docs/component-identity-canonicalization.md b/tools/sbom-diff-and-risk/docs/component-identity-canonicalization.md
new file mode 100644
index 0000000..69933de
--- /dev/null
+++ b/tools/sbom-diff-and-risk/docs/component-identity-canonicalization.md
@@ -0,0 +1,47 @@
+# Component identity canonicalization matrix
+
+`sbom-diff-and-risk` uses an explicit ecosystem matrix before it aligns
+components across inputs. The matrix is intentionally narrow: the tool does
+not apply a universal lowercase rule and does not infer package-manager
+equivalence beyond the rows below.
+
+The executable rule set lives in `sbom_diff_risk.component_identity` as
+`canonicalization_rules()`.
+
+## Matrix
+
+| Ecosystem | Package name rule | Namespace rule | Version rule | Boundary |
+| --- | --- | --- | --- | --- |
+| `pypi` | `pep503`: trim, lowercase, and collapse `-`, `_`, and `.` name runs through `packaging.utils.canonicalize_name` | Preserve any parsed purl namespace; do not invent one | Preserve the observed version string after trimming | Does not resolve Python extras, environment markers, indexes, or package availability |
+| `maven` | `preserve-observed`: trim only | Preserve the parsed purl namespace in `component_key` | Preserve the observed version string after trimming | Does not validate Maven group/artifact naming rules or registry identity |
+| `npm` | `packageurl-npm-name`: trim and lowercase the package name to match the `packageurl-python` npm purl name form | Preserve the parsed purl namespace, including scope-like namespaces when present | Preserve the observed version string after trimming | Does not query the npm registry or infer scoped package aliases |
+| `nuget` | `preserve-observed`: trim only | Preserve the parsed purl namespace | Preserve the observed version string after trimming | Does not apply NuGet registry lookup or package ID equivalence |
+| `generic` | `preserve-observed`: trim only | Preserve the parsed purl namespace | Preserve the observed version string after trimming | No package-manager semantics are inferred |
+| unknown ecosystem | `preserve-observed`: trim only | Preserve the parsed purl namespace | Preserve the observed version string after trimming | Treated as a local coordinate, not as a supported package-manager model |
+
+The ecosystem identifier itself is always trimmed and lowercased so purl type
+comparison stays deterministic. That rule does not imply package-name
+lowercasing.
+
+## Identity precedence
+
+Identity authority remains:
+
+1. purl package coordinate, without version;
+2. `bom_ref`, when no purl is present;
+3. normalized `(ecosystem, package_name)` coordinate.
+
+When a purl is present, its type, package name, and version must agree with
+explicit component fields. Disagreement fails closed as `conflicting_metadata`.
+Within a single input, repeated identical canonical identities fail as
+`duplicate_component`; repeated identities with different normalized metadata
+fail as `conflicting_metadata`.
+
+## Non-claims
+
+This matrix is a deterministic comparison contract, not a resolver:
+
+- it does not query package registries;
+- it does not decide whether two ecosystem-specific coordinates are aliases;
+- it does not rewrite versions into semantic-version equivalents;
+- it does not make safety, malware, or CVE claims.
diff --git a/tools/sbom-diff-and-risk/docs/parser-boundaries.md b/tools/sbom-diff-and-risk/docs/parser-boundaries.md
index 607658b..8ce0c01 100644
--- a/tools/sbom-diff-and-risk/docs/parser-boundaries.md
+++ b/tools/sbom-diff-and-risk/docs/parser-boundaries.md
@@ -46,6 +46,8 @@ The purl type, name, and version must agree with the corresponding explicit
 component fields. Invalid or conflicting identity metadata fails closed as
 `conflicting_metadata`; repeated identical records fail as
 `duplicate_component`. See
+[component-identity-canonicalization.md](component-identity-canonicalization.md)
+for the ecosystem-specific canonicalization matrix and
 [v1.1-input-and-policy-semantics.md](v1.1-input-and-policy-semantics.md)
 for the typed identity contract.
 
diff --git a/tools/sbom-diff-and-risk/docs/v1.1-input-and-policy-semantics.md b/tools/sbom-diff-and-risk/docs/v1.1-input-and-policy-semantics.md
index 108f660..3935f36 100644
--- a/tools/sbom-diff-and-risk/docs/v1.1-input-and-policy-semantics.md
+++ b/tools/sbom-diff-and-risk/docs/v1.1-input-and-policy-semantics.md
@@ -14,6 +14,7 @@ of this monorepo.
 | Report schema identifier and compatibility tests | Implemented as `sbom-diff-risk.report.v1` across checked-in full-report fixtures |
 | Per-decision rule, evidence, reason, and confidence | Implemented additively in report v1 policy finding objects |
 | Component identity canonicalization | Implemented as a typed value object with stable duplicate/conflict diagnostics |
+| Ecosystem-specific canonicalization matrix | Implemented in [component-identity-canonicalization.md](component-identity-canonicalization.md); package-name normalization is explicit per ecosystem rather than universal |
 
 ## Component identity contract
 
@@ -21,8 +22,9 @@ The canonical identity record will expose these dimensions separately:
 
 - `ecosystem`: trimmed and normalized to a registered ecosystem identifier.
 - `package_name`: normalized with ecosystem-aware rules. PyPI names use PEP
-  503 normalization; ecosystems without an explicit project rule preserve
-  case rather than inheriting a universal lowercase rule.
+  503 normalization; npm names use the `packageurl-python` npm purl name
+  form; ecosystems without an explicit project rule preserve case rather than
+  inheriting a universal lowercase rule.
 - `version`: trimmed but otherwise preserved as observed. The tool will not
   infer semantic equivalence between unrelated version schemes.
 - `purl`: parsed with `packageurl-python` and normalized when present, while
@@ -54,9 +56,11 @@ Within one input:
   evidence and do not become same-input conflicts.
 
 The implementation introduces a frozen `CanonicalComponentIdentity` object and
-keeps report presentation unchanged. Tests cover CycloneDX-to-SPDX alignment,
-PyPI name normalization, case preservation for ecosystems without a declared
-name rule, versioned purls, exact duplicates, and conflicting metadata.
+keeps report presentation unchanged. The executable matrix is exposed through
+`canonicalization_rules()` so tests can assert the supported ecosystem rules.
+Tests cover CycloneDX-to-SPDX alignment, PyPI name normalization, case
+preservation for ecosystems without a declared name rule, namespace retention,
+versioned purls, exact duplicates, and conflicting metadata.
 
 ## Policy and decision contract
 
diff --git a/tools/sbom-diff-and-risk/src/sbom_diff_risk/component_identity.py b/tools/sbom-diff-and-risk/src/sbom_diff_risk/component_identity.py
index 8b25ab9..371690a 100644
--- a/tools/sbom-diff-and-risk/src/sbom_diff_risk/component_identity.py
+++ b/tools/sbom-diff-and-risk/src/sbom_diff_risk/component_identity.py
@@ -9,6 +9,14 @@
 from .models import Component
 
 
+@dataclass(slots=True, frozen=True)
+class EcosystemCanonicalizationRule:
+    ecosystem: str
+    package_name_rule: str
+    namespace_rule: str
+    version_rule: str
+
+
 @dataclass(slots=True, frozen=True)
 class CanonicalComponentIdentity:
     ecosystem: str
@@ -18,6 +26,56 @@ class CanonicalComponentIdentity:
     component_key: str
 
 
+_REGISTERED_CANONICALIZATION_RULES: dict[str, EcosystemCanonicalizationRule] = {
+    "generic": EcosystemCanonicalizationRule(
+        ecosystem="generic",
+        package_name_rule="preserve-observed",
+        namespace_rule="preserve-purl-namespace",
+        version_rule="preserve-observed",
+    ),
+    "maven": EcosystemCanonicalizationRule(
+        ecosystem="maven",
+        package_name_rule="preserve-observed",
+        namespace_rule="preserve-purl-namespace",
+        version_rule="preserve-observed",
+    ),
+    "npm": EcosystemCanonicalizationRule(
+        ecosystem="npm",
+        package_name_rule="packageurl-npm-name",
+        namespace_rule="preserve-purl-namespace",
+        version_rule="preserve-observed",
+    ),
+    "nuget": EcosystemCanonicalizationRule(
+        ecosystem="nuget",
+        package_name_rule="preserve-observed",
+        namespace_rule="preserve-purl-namespace",
+        version_rule="preserve-observed",
+    ),
+    "pypi": EcosystemCanonicalizationRule(
+        ecosystem="pypi",
+        package_name_rule="pep503",
+        namespace_rule="preserve-purl-namespace",
+        version_rule="preserve-observed",
+    ),
+}
+
+
+def canonicalization_rules() -> tuple[EcosystemCanonicalizationRule, ...]:
+    return tuple(_REGISTERED_CANONICALIZATION_RULES[name] for name in sorted(_REGISTERED_CANONICALIZATION_RULES))
+
+
+def canonicalization_rule_for_ecosystem(ecosystem: str) -> EcosystemCanonicalizationRule:
+    normalized_ecosystem = ecosystem.strip().lower()
+    if normalized_ecosystem in _REGISTERED_CANONICALIZATION_RULES:
+        return _REGISTERED_CANONICALIZATION_RULES[normalized_ecosystem]
+    return EcosystemCanonicalizationRule(
+        ecosystem=normalized_ecosystem,
+        package_name_rule="preserve-observed",
+        namespace_rule="preserve-purl-namespace",
+        version_rule="preserve-observed",
+    )
+
+
 def canonicalize_component_identity(component: Component) -> CanonicalComponentIdentity:
     explicit_ecosystem = component.ecosystem.strip().lower()
     explicit_name = _canonical_package_name(explicit_ecosystem, component.name)
@@ -103,9 +161,14 @@ def _purl_component_key(parsed: PackageURL, ecosystem: str, package_name: str) -
 
 def _canonical_package_name(ecosystem: str, name: str) -> str:
     stripped = name.strip()
-    if ecosystem == "pypi":
+    rule = canonicalization_rule_for_ecosystem(ecosystem)
+    if rule.package_name_rule == "pep503":
         return canonicalize_name(stripped)
-    return stripped
+    if rule.package_name_rule == "packageurl-npm-name":
+        return stripped.lower()
+    if rule.package_name_rule == "preserve-observed":
+        return stripped
+    raise AssertionError(f"unknown package name canonicalization rule: {rule.package_name_rule}")
 
 
 def _optional_str(value: str | None) -> str | None:
diff --git a/tools/sbom-diff-and-risk/tests/test_component_identity.py b/tools/sbom-diff-and-risk/tests/test_component_identity.py
index 9fd94d6..7f051ec 100644
--- a/tools/sbom-diff-and-risk/tests/test_component_identity.py
+++ b/tools/sbom-diff-and-risk/tests/test_component_identity.py
@@ -1,12 +1,47 @@
 from __future__ import annotations
 
+from pathlib import Path
+
 import pytest
 
-from sbom_diff_risk.component_identity import CanonicalComponentIdentity, canonicalize_component_identity
+from sbom_diff_risk.component_identity import (
+    CanonicalComponentIdentity,
+    canonicalization_rule_for_ecosystem,
+    canonicalization_rules,
+    canonicalize_component_identity,
+)
 from sbom_diff_risk.errors import ComponentIdentityDiagnosticCode, ComponentIdentityError
 from sbom_diff_risk.models import Component
 
 
+def test_canonicalization_rules_expose_ecosystem_specific_matrix() -> None:
+    rules = {rule.ecosystem: rule for rule in canonicalization_rules()}
+
+    assert set(rules) == {"generic", "maven", "npm", "nuget", "pypi"}
+    assert rules["pypi"].package_name_rule == "pep503"
+    assert rules["maven"].package_name_rule == "preserve-observed"
+    assert rules["npm"].package_name_rule == "packageurl-npm-name"
+    assert rules["npm"].namespace_rule == "preserve-purl-namespace"
+    assert rules["nuget"].version_rule == "preserve-observed"
+
+
+def test_canonicalization_rules_are_documented() -> None:
+    docs_path = Path(__file__).resolve().parents[1] / "docs" / "component-identity-canonicalization.md"
+    docs_text = docs_path.read_text(encoding="utf-8")
+
+    for rule in canonicalization_rules():
+        assert f"`{rule.ecosystem}`" in docs_text
+        assert f"`{rule.package_name_rule}`" in docs_text
+
+
+def test_canonicalization_rule_for_unknown_ecosystem_preserves_observed_name() -> None:
+    rule = canonicalization_rule_for_ecosystem("CustomEcosystem")
+
+    assert rule.ecosystem == "customecosystem"
+    assert rule.package_name_rule == "preserve-observed"
+    assert rule.namespace_rule == "preserve-purl-namespace"
+
+
 def test_canonicalize_component_identity_normalizes_pypi_coordinate() -> None:
     component = Component(
         name="Requests_Test",
@@ -40,19 +75,81 @@ def test_canonicalize_component_identity_uses_coordinate_without_purl() -> None:
     assert identity.purl is None
 
 
-def test_canonicalize_component_identity_preserves_unregistered_name_case() -> None:
+@pytest.mark.parametrize(
+    ("component", "expected_package_name", "expected_purl", "expected_key"),
+    [
+        (
+            Component(
+                name="EnterpriseLibrary.Common",
+                version="6.0.1304",
+                ecosystem="nuget",
+                purl="pkg:nuget/EnterpriseLibrary.Common@6.0.1304",
+            ),
+            "EnterpriseLibrary.Common",
+            "pkg:nuget/EnterpriseLibrary.Common@6.0.1304",
+            "purl:pkg:nuget/EnterpriseLibrary.Common",
+        ),
+        (
+            Component(
+                name="CaseSensitiveArtifact",
+                version="1.2.3",
+                ecosystem="maven",
+                purl="pkg:maven/Com.Example/CaseSensitiveArtifact@1.2.3",
+            ),
+            "CaseSensitiveArtifact",
+            "pkg:maven/Com.Example/CaseSensitiveArtifact@1.2.3",
+            "purl:pkg:maven/Com.Example/CaseSensitiveArtifact",
+        ),
+        (
+            Component(
+                name="LeftPad",
+                version="1.3.0",
+                ecosystem="npm",
+                purl="pkg:npm/%40ExampleScope/LeftPad@1.3.0",
+            ),
+            "leftpad",
+            "pkg:npm/%40ExampleScope/leftpad@1.3.0",
+            "purl:pkg:npm/%40ExampleScope/leftpad",
+        ),
+        (
+            Component(
+                name="CaseSensitiveLib",
+                version="2026.7",
+                ecosystem="generic",
+                purl="pkg:generic/Vendor/CaseSensitiveLib@2026.7",
+            ),
+            "CaseSensitiveLib",
+            "pkg:generic/Vendor/CaseSensitiveLib@2026.7",
+            "purl:pkg:generic/Vendor/CaseSensitiveLib",
+        ),
+    ],
+    ids=["nuget", "maven", "npm-scope", "generic"],
+)
+def test_canonicalize_component_identity_uses_ecosystem_matrix_without_universal_lowercase(
+    component: Component,
+    expected_package_name: str,
+    expected_purl: str,
+    expected_key: str,
+) -> None:
+    identity = canonicalize_component_identity(component)
+
+    assert identity.package_name == expected_package_name
+    assert identity.purl == expected_purl
+    assert identity.component_key == expected_key
+
+
+def test_canonicalize_component_identity_preserves_unknown_ecosystem_coordinate_case() -> None:
     component = Component(
-        name="EnterpriseLibrary.Common",
-        version="6.0.1304",
-        ecosystem="nuget",
-        purl="pkg:nuget/EnterpriseLibrary.Common@6.0.1304",
+        name="CaseSensitiveLib",
+        version="2026.7",
+        ecosystem="Custom",
     )
 
     identity = canonicalize_component_identity(component)
 
-    assert identity.package_name == "EnterpriseLibrary.Common"
-    assert identity.purl == "pkg:nuget/EnterpriseLibrary.Common@6.0.1304"
-    assert identity.component_key == "purl:pkg:nuget/EnterpriseLibrary.Common"
+    assert identity.ecosystem == "custom"
+    assert identity.package_name == "CaseSensitiveLib"
+    assert identity.component_key == "coord:custom:CaseSensitiveLib"
 
 
 def test_canonicalize_component_identity_does_not_invent_purl_version() -> None: