7 changes: 5 additions & 2 deletions tools/sbom-diff-and-risk/README.md
@@ -88,8 +88,11 @@ Before indexing, each component is converted to an immutable
`CanonicalComponentIdentity` containing normalized `ecosystem`,
`package_name`, `version`, `purl`, and `component_key` fields. PURL syntax is
parsed with the official `packageurl-python` implementation. PyPI package
-names use PEP 503 normalization; names for ecosystems without an explicit
-project rule preserve case.
+names use PEP 503 normalization. npm package names follow the
+`packageurl-python` npm purl name form. Names for ecosystems without an
+explicit project rule preserve case.
+The test-backed ecosystem matrix is documented in
+[docs/component-identity-canonicalization.md](docs/component-identity-canonicalization.md).

The index fails closed with stable diagnostics:

@@ -0,0 +1,47 @@
# Component identity canonicalization matrix

`sbom-diff-and-risk` uses an explicit ecosystem matrix before it aligns
components across inputs. The matrix is intentionally narrow: the tool does
not apply a universal lowercase rule and does not infer package-manager
equivalence beyond the rows below.

The executable rule set lives in `sbom_diff_risk.component_identity` as
`canonicalization_rules()`.

## Matrix

| Ecosystem | Package name rule | Namespace rule | Version rule | Boundary |
| --- | --- | --- | --- | --- |
| `pypi` | `pep503`: trim, lowercase, and collapse `-`, `_`, and `.` name runs through `packaging.utils.canonicalize_name` | Preserve any parsed purl namespace; do not invent one | Preserve the observed version string after trimming | Does not resolve Python extras, environment markers, indexes, or package availability |
| `maven` | `preserve-observed`: trim only | Preserve the parsed purl namespace in `component_key` | Preserve the observed version string after trimming | Does not validate Maven group/artifact naming rules or registry identity |
| `npm` | `packageurl-npm-name`: trim and lowercase the package name to match the `packageurl-python` npm purl name form | Preserve the parsed purl namespace, including scope-like namespaces when present | Preserve the observed version string after trimming | Does not query the npm registry or infer scoped package aliases |
| `nuget` | `preserve-observed`: trim only | Preserve the parsed purl namespace | Preserve the observed version string after trimming | Does not apply NuGet registry lookup or package ID equivalence |
| `generic` | `preserve-observed`: trim only | Preserve the parsed purl namespace | Preserve the observed version string after trimming | No package-manager semantics are inferred |
| unknown ecosystem | `preserve-observed`: trim only | Preserve the parsed purl namespace | Preserve the observed version string after trimming | Treated as a local coordinate, not as a supported package-manager model |

The ecosystem identifier itself is always trimmed and lowercased so purl type
comparison stays deterministic. That rule does not imply package-name
lowercasing.
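
The package-name column of the matrix can be sketched in a few lines. This is a minimal illustration, not the project's implementation: `NAME_RULES` and `canonical_name` are hypothetical names, and the PEP 503 step uses the regular expression given in the PEP instead of `packaging.utils.canonicalize_name` so that the snippet needs only the standard library.

```python
import re

# Ecosystems with an explicit package-name rule, mirroring the matrix rows.
# Both this mapping and the helper below are illustrative assumptions.
NAME_RULES = {"pypi": "pep503", "npm": "packageurl-npm-name"}


def canonical_name(ecosystem: str, name: str) -> str:
    """Apply the per-ecosystem package-name rule to an observed name."""
    eco = ecosystem.strip().lower()  # the ecosystem identifier is always trimmed and lowercased
    stripped = name.strip()
    rule = NAME_RULES.get(eco, "preserve-observed")
    if rule == "pep503":
        # PEP 503: collapse runs of "-", "_", "." into one "-" and lowercase.
        return re.sub(r"[-_.]+", "-", stripped).lower()
    if rule == "packageurl-npm-name":
        return stripped.lower()
    return stripped  # preserve-observed: trim only, case is kept


print(canonical_name("PyPI", " Requests_Test "))           # requests-test
print(canonical_name("npm", "LeftPad"))                    # leftpad
print(canonical_name("NuGet", "EnterpriseLibrary.Common"))  # EnterpriseLibrary.Common
```

Lowercasing the ecosystem (`"NuGet"` to `"nuget"`) leaves the NuGet package name untouched, which is the distinction the paragraph above draws.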

## Identity precedence

Identity authority remains:

1. purl package coordinate, without version;
2. `bom_ref`, when no purl is present;
3. normalized `(ecosystem, package_name)` coordinate.

When a purl is present, its type, package name, and version must agree with
explicit component fields. Disagreement fails closed as `conflicting_metadata`.
Within a single input, repeated identical canonical identities fail as
`duplicate_component`; repeated identities with different normalized metadata
fail as `conflicting_metadata`.
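
The precedence order and the same-input checks can be sketched as follows. This is a hedged illustration under stated assumptions: `Record`, `identity_key`, and `index_records` are hypothetical names, the `bom-ref:` key prefix is invented for the example, and any field difference is treated as a conflict here, whereas the real implementation decides which metadata differences count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical, already-normalized component record."""
    ecosystem: str
    package_name: str
    version: str
    purl_coordinate: Optional[str] = None  # purl with the version removed
    bom_ref: Optional[str] = None


def identity_key(record: Record) -> str:
    """Pick the identity authority in precedence order."""
    if record.purl_coordinate is not None:       # 1. purl coordinate, without version
        return f"purl:{record.purl_coordinate}"
    if record.bom_ref is not None:               # 2. bom_ref, when no purl is present
        return f"bom-ref:{record.bom_ref}"       # prefix is an assumption
    return f"coord:{record.ecosystem}:{record.package_name}"  # 3. normalized coordinate


def index_records(records: list[Record]) -> dict[str, Record]:
    """Build a key-to-record index for one input, failing closed on repeats."""
    index: dict[str, Record] = {}
    for record in records:
        key = identity_key(record)
        if key in index:
            # Identical repeat versus same identity with different metadata.
            code = "duplicate_component" if index[key] == record else "conflicting_metadata"
            raise ValueError(f"{code}: {key}")
        index[key] = record
    return index
```

Raising instead of silently keeping the first or last record is what "fails closed" means in this sketch: an ambiguous input never produces an index.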

## Non-claims

This matrix is a deterministic comparison contract, not a resolver:

- it does not query package registries;
- it does not decide whether two ecosystem-specific coordinates are aliases;
- it does not rewrite versions into semantic-version equivalents;
- it does not make safety, malware, or CVE claims.
2 changes: 2 additions & 0 deletions tools/sbom-diff-and-risk/docs/parser-boundaries.md
@@ -46,6 +46,8 @@ The purl type, name, and version must agree with the corresponding explicit
component fields. Invalid or conflicting identity metadata fails closed as
`conflicting_metadata`; repeated identical records fail as
`duplicate_component`. See
+[component-identity-canonicalization.md](component-identity-canonicalization.md)
+for the ecosystem-specific canonicalization matrix and
[v1.1-input-and-policy-semantics.md](v1.1-input-and-policy-semantics.md) for the
typed identity contract.

14 changes: 9 additions & 5 deletions tools/sbom-diff-and-risk/docs/v1.1-input-and-policy-semantics.md
@@ -14,15 +14,17 @@ of this monorepo.
| Report schema identifier and compatibility tests | Implemented as `sbom-diff-risk.report.v1` across checked-in full-report fixtures |
| Per-decision rule, evidence, reason, and confidence | Implemented additively in report v1 policy finding objects |
| Component identity canonicalization | Implemented as a typed value object with stable duplicate/conflict diagnostics |
+| Ecosystem-specific canonicalization matrix | Implemented in [component-identity-canonicalization.md](component-identity-canonicalization.md); package-name normalization is explicit per ecosystem rather than universal |

## Component identity contract

The canonical identity record will expose these dimensions separately:

- `ecosystem`: trimmed and normalized to a registered ecosystem identifier.
- `package_name`: normalized with ecosystem-aware rules. PyPI names use PEP
-  503 normalization; ecosystems without an explicit project rule preserve
-  case rather than inheriting a universal lowercase rule.
+  503 normalization; npm names use the `packageurl-python` npm purl name
+  form; ecosystems without an explicit project rule preserve case rather than
+  inheriting a universal lowercase rule.
- `version`: trimmed but otherwise preserved as observed. The tool will not
infer semantic equivalence between unrelated version schemes.
- `purl`: parsed with `packageurl-python` and normalized when present, while
@@ -54,9 +56,11 @@ Within one input:
evidence and do not become same-input conflicts.

The implementation introduces a frozen `CanonicalComponentIdentity` object and
-keeps report presentation unchanged. Tests cover CycloneDX-to-SPDX alignment,
-PyPI name normalization, case preservation for ecosystems without a declared
-name rule, versioned purls, exact duplicates, and conflicting metadata.
+keeps report presentation unchanged. The executable matrix is exposed through
+`canonicalization_rules()` so tests can assert the supported ecosystem rules.
+Tests cover CycloneDX-to-SPDX alignment, PyPI name normalization, case
+preservation for ecosystems without a declared name rule, namespace retention,
+versioned purls, exact duplicates, and conflicting metadata.

## Policy and decision contract

67 changes: 65 additions & 2 deletions tools/sbom-diff-and-risk/src/sbom_diff_risk/component_identity.py
@@ -9,6 +9,14 @@
from .models import Component


+@dataclass(slots=True, frozen=True)
+class EcosystemCanonicalizationRule:
+    ecosystem: str
+    package_name_rule: str
+    namespace_rule: str
+    version_rule: str
+
+
 @dataclass(slots=True, frozen=True)
 class CanonicalComponentIdentity:
     ecosystem: str
@@ -18,6 +26,56 @@ class CanonicalComponentIdentity:
     component_key: str


+_REGISTERED_CANONICALIZATION_RULES: dict[str, EcosystemCanonicalizationRule] = {
+    "generic": EcosystemCanonicalizationRule(
+        ecosystem="generic",
+        package_name_rule="preserve-observed",
+        namespace_rule="preserve-purl-namespace",
+        version_rule="preserve-observed",
+    ),
+    "maven": EcosystemCanonicalizationRule(
+        ecosystem="maven",
+        package_name_rule="preserve-observed",
+        namespace_rule="preserve-purl-namespace",
+        version_rule="preserve-observed",
+    ),
+    "npm": EcosystemCanonicalizationRule(
+        ecosystem="npm",
+        package_name_rule="packageurl-npm-name",
+        namespace_rule="preserve-purl-namespace",
+        version_rule="preserve-observed",
+    ),
+    "nuget": EcosystemCanonicalizationRule(
+        ecosystem="nuget",
+        package_name_rule="preserve-observed",
+        namespace_rule="preserve-purl-namespace",
+        version_rule="preserve-observed",
+    ),
+    "pypi": EcosystemCanonicalizationRule(
+        ecosystem="pypi",
+        package_name_rule="pep503",
+        namespace_rule="preserve-purl-namespace",
+        version_rule="preserve-observed",
+    ),
+}
+
+
+def canonicalization_rules() -> tuple[EcosystemCanonicalizationRule, ...]:
+    return tuple(_REGISTERED_CANONICALIZATION_RULES[name] for name in sorted(_REGISTERED_CANONICALIZATION_RULES))
+
+
+def canonicalization_rule_for_ecosystem(ecosystem: str) -> EcosystemCanonicalizationRule:
+    normalized_ecosystem = ecosystem.strip().lower()
+    if normalized_ecosystem in _REGISTERED_CANONICALIZATION_RULES:
+        return _REGISTERED_CANONICALIZATION_RULES[normalized_ecosystem]
+    return EcosystemCanonicalizationRule(
+        ecosystem=normalized_ecosystem,
+        package_name_rule="preserve-observed",
+        namespace_rule="preserve-purl-namespace",
+        version_rule="preserve-observed",
+    )
+
+
 def canonicalize_component_identity(component: Component) -> CanonicalComponentIdentity:
     explicit_ecosystem = component.ecosystem.strip().lower()
     explicit_name = _canonical_package_name(explicit_ecosystem, component.name)
@@ -103,9 +161,14 @@ def _purl_component_key(parsed: PackageURL, ecosystem: str, package_name: str) -

 def _canonical_package_name(ecosystem: str, name: str) -> str:
     stripped = name.strip()
-    if ecosystem == "pypi":
+    rule = canonicalization_rule_for_ecosystem(ecosystem)
+    if rule.package_name_rule == "pep503":
         return canonicalize_name(stripped)
-    return stripped
+    if rule.package_name_rule == "packageurl-npm-name":
+        return stripped.lower()
+    if rule.package_name_rule == "preserve-observed":
+        return stripped
+    raise AssertionError(f"unknown package name canonicalization rule: {rule.package_name_rule}")


def _optional_str(value: str | None) -> str | None:
115 changes: 106 additions & 9 deletions tools/sbom-diff-and-risk/tests/test_component_identity.py
@@ -1,12 +1,47 @@
from __future__ import annotations

+from pathlib import Path
+
 import pytest

-from sbom_diff_risk.component_identity import CanonicalComponentIdentity, canonicalize_component_identity
+from sbom_diff_risk.component_identity import (
+    CanonicalComponentIdentity,
+    canonicalization_rule_for_ecosystem,
+    canonicalization_rules,
+    canonicalize_component_identity,
+)
from sbom_diff_risk.errors import ComponentIdentityDiagnosticCode, ComponentIdentityError
from sbom_diff_risk.models import Component


+def test_canonicalization_rules_expose_ecosystem_specific_matrix() -> None:
+    rules = {rule.ecosystem: rule for rule in canonicalization_rules()}
+
+    assert set(rules) == {"generic", "maven", "npm", "nuget", "pypi"}
+    assert rules["pypi"].package_name_rule == "pep503"
+    assert rules["maven"].package_name_rule == "preserve-observed"
+    assert rules["npm"].package_name_rule == "packageurl-npm-name"
+    assert rules["npm"].namespace_rule == "preserve-purl-namespace"
+    assert rules["nuget"].version_rule == "preserve-observed"
+
+
+def test_canonicalization_rules_are_documented() -> None:
+    docs_path = Path(__file__).resolve().parents[1] / "docs" / "component-identity-canonicalization.md"
+    docs_text = docs_path.read_text(encoding="utf-8")
+
+    for rule in canonicalization_rules():
+        assert f"`{rule.ecosystem}`" in docs_text
+        assert f"`{rule.package_name_rule}`" in docs_text
+
+
+def test_canonicalization_rule_for_unknown_ecosystem_preserves_observed_name() -> None:
+    rule = canonicalization_rule_for_ecosystem("CustomEcosystem")
+
+    assert rule.ecosystem == "customecosystem"
+    assert rule.package_name_rule == "preserve-observed"
+    assert rule.namespace_rule == "preserve-purl-namespace"
+
+
 def test_canonicalize_component_identity_normalizes_pypi_coordinate() -> None:
     component = Component(
         name="Requests_Test",
@@ -40,19 +75,81 @@ def test_canonicalize_component_identity_uses_coordinate_without_purl() -> None:
     assert identity.purl is None


-def test_canonicalize_component_identity_preserves_unregistered_name_case() -> None:
+@pytest.mark.parametrize(
+    ("component", "expected_package_name", "expected_purl", "expected_key"),
+    [
+        (
+            Component(
+                name="EnterpriseLibrary.Common",
+                version="6.0.1304",
+                ecosystem="nuget",
+                purl="pkg:nuget/EnterpriseLibrary.Common@6.0.1304",
+            ),
+            "EnterpriseLibrary.Common",
+            "pkg:nuget/EnterpriseLibrary.Common@6.0.1304",
+            "purl:pkg:nuget/EnterpriseLibrary.Common",
+        ),
+        (
+            Component(
+                name="CaseSensitiveArtifact",
+                version="1.2.3",
+                ecosystem="maven",
+                purl="pkg:maven/Com.Example/CaseSensitiveArtifact@1.2.3",
+            ),
+            "CaseSensitiveArtifact",
+            "pkg:maven/Com.Example/CaseSensitiveArtifact@1.2.3",
+            "purl:pkg:maven/Com.Example/CaseSensitiveArtifact",
+        ),
+        (
+            Component(
+                name="LeftPad",
+                version="1.3.0",
+                ecosystem="npm",
+                purl="pkg:npm/%40ExampleScope/LeftPad@1.3.0",
+            ),
+            "leftpad",
+            "pkg:npm/%40ExampleScope/leftpad@1.3.0",
+            "purl:pkg:npm/%40ExampleScope/leftpad",
+        ),
+        (
+            Component(
+                name="CaseSensitiveLib",
+                version="2026.7",
+                ecosystem="generic",
+                purl="pkg:generic/Vendor/CaseSensitiveLib@2026.7",
+            ),
+            "CaseSensitiveLib",
+            "pkg:generic/Vendor/CaseSensitiveLib@2026.7",
+            "purl:pkg:generic/Vendor/CaseSensitiveLib",
+        ),
+    ],
+    ids=["nuget", "maven", "npm-scope", "generic"],
+)
+def test_canonicalize_component_identity_uses_ecosystem_matrix_without_universal_lowercase(
+    component: Component,
+    expected_package_name: str,
+    expected_purl: str,
+    expected_key: str,
+) -> None:
+    identity = canonicalize_component_identity(component)
+
+    assert identity.package_name == expected_package_name
+    assert identity.purl == expected_purl
+    assert identity.component_key == expected_key
+
+
+def test_canonicalize_component_identity_preserves_unknown_ecosystem_coordinate_case() -> None:
     component = Component(
-        name="EnterpriseLibrary.Common",
-        version="6.0.1304",
-        ecosystem="nuget",
-        purl="pkg:nuget/EnterpriseLibrary.Common@6.0.1304",
+        name="CaseSensitiveLib",
+        version="2026.7",
+        ecosystem="Custom",
     )

     identity = canonicalize_component_identity(component)

-    assert identity.package_name == "EnterpriseLibrary.Common"
-    assert identity.purl == "pkg:nuget/EnterpriseLibrary.Common@6.0.1304"
-    assert identity.component_key == "purl:pkg:nuget/EnterpriseLibrary.Common"
+    assert identity.ecosystem == "custom"
+    assert identity.package_name == "CaseSensitiveLib"
+    assert identity.component_key == "coord:custom:CaseSensitiveLib"


def test_canonicalize_component_identity_does_not_invent_purl_version() -> None: