Merged
20 changes: 19 additions & 1 deletion tools/sbom-diff-and-risk/README.md
@@ -61,7 +61,7 @@ The v1.1 implementation sequence is fixed in
 - Produce machine-friendly JSON and reviewer-friendly Markdown reports.
 - Stay fully local-file based by default.

-## v0.1 Internal Component Model
+## Internal Component Model

 The normalized schema is the core design choice for the project:

@@ -84,6 +84,24 @@ Diff identity is intentionally conservative and uses this precedence:

 When a `purl` includes a version, the tool keeps the full value in `Component.purl` for auditability but uses the versionless package coordinate for identity so upgrades still diff as `changed`.

+Before indexing, each component is converted to an immutable
+`CanonicalComponentIdentity` containing normalized `ecosystem`,
+`package_name`, `version`, `purl`, and `component_key` fields. PURL syntax is
+parsed with the official `packageurl-python` implementation. PyPI package
+names use PEP 503 normalization; names for ecosystems without an explicit
+project rule preserve case.
+
+The index fails closed with stable diagnostics:
+
+- `duplicate_component` when one input repeats the same canonical identity and
+  normalized metadata;
+- `conflicting_metadata` when records share an identity but disagree on
+  metadata, or when explicit ecosystem/name/version fields disagree with a
+  purl.
+
+See [docs/v1.1-input-and-policy-semantics.md](docs/v1.1-input-and-policy-semantics.md)
+for the v1.1 identity contract and compatibility boundary.
+
 ## Non-goals

 - No vulnerability database integration in v0.1.
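
Reviewer sketch for the README wording on name normalization. This is a minimal illustration, not the code in this PR: the PR calls `packaging.utils.canonicalize_name`, while the sketch below uses the regular expression given in PEP 503 so it runs on the standard library alone. The helper names `pep503_normalize` and `normalize_name` are hypothetical.

```python
import re


def pep503_normalize(name: str) -> str:
    """Normalize a PyPI project name the way PEP 503 describes:
    collapse runs of '-', '_' and '.' into one '-', then lowercase."""
    return re.sub(r"[-_.]+", "-", name.strip()).lower()


def normalize_name(ecosystem: str, name: str) -> str:
    """Hypothetical helper: only PyPI has a declared rule here.
    Names from any other ecosystem are trimmed but keep their case."""
    if ecosystem.strip().lower() == "pypi":
        return pep503_normalize(name)
    return name.strip()


# Two spellings of one PyPI project collapse to the same name.
pypi_a = normalize_name("pypi", "Typing_Extensions")
pypi_b = normalize_name("PyPI", "typing.extensions")

# An npm name is left alone apart from trimming.
npm_name = normalize_name("npm", " JSONStream ")
```

Under this rule `Typing_Extensions` and `typing.extensions` would index under one identity, while `JSONStream` and `jsonstream` would stay distinct for npm.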
10 changes: 10 additions & 0 deletions tools/sbom-diff-and-risk/docs/parser-boundaries.md
@@ -39,6 +39,16 @@ The parser does not currently constrain the SPDX version or validate the
 document against an SPDX schema. Relationships and file-level data do not
 affect component identity or policy decisions.

+## Component identity validation
+
+After parsing, purl-bearing components are canonicalized before diff indexing.
+The purl type, name, and version must agree with the corresponding explicit
+component fields. Invalid or conflicting identity metadata fails closed as
+`conflicting_metadata`; repeated identical records fail as
+`duplicate_component`. See
+[v1.1-input-and-policy-semantics.md](v1.1-input-and-policy-semantics.md) for the
+typed identity contract.
+
 ## Requirements files

 `requirements.txt` is treated as a narrow manifest format, not as "everything pip can do in a file".
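
Reviewer sketch of the agreement check described in the new "Component identity validation" section. It assumes a deliberately tiny purl splitter (`split_simple_purl`, hypothetical) that only handles `pkg:type/name@version`; the PR itself delegates parsing to `packageurl-python`, which covers namespaces, qualifiers, and subpaths. `identity_conflicts` is also a hypothetical name.

```python
from __future__ import annotations


def split_simple_purl(purl: str) -> tuple[str, str, str | None]:
    """Split 'pkg:type/name@version'. Illustration only: no namespace,
    qualifiers, subpath, or percent-decoding."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a purl: {purl!r}")
    purl_type, _, rest = purl[len("pkg:"):].partition("/")
    name, _, version = rest.partition("@")
    if not purl_type or not name:
        raise ValueError(f"incomplete purl: {purl!r}")
    return purl_type.lower(), name, version or None


def identity_conflicts(
    ecosystem: str, name: str, version: str | None, purl: str
) -> list[str]:
    """Return one message per explicit field that disagrees with the purl."""
    purl_type, purl_name, purl_version = split_simple_purl(purl)
    conflicts: list[str] = []
    if ecosystem.strip().lower() != purl_type:
        conflicts.append(f"ecosystem {ecosystem!r} vs purl type {purl_type!r}")
    if name.strip() != purl_name:
        conflicts.append(f"name {name!r} vs purl name {purl_name!r}")
    # A missing version on either side is not a disagreement.
    if version is not None and purl_version is not None and version.strip() != purl_version:
        conflicts.append(f"version {version!r} vs purl version {purl_version!r}")
    return conflicts


agree = identity_conflicts("npm", "left-pad", "1.3.0", "pkg:npm/left-pad@1.3.0")
stale = identity_conflicts("npm", "left-pad", "1.2.0", "pkg:npm/left-pad@1.3.0")
```

A non-empty list is what would map to `conflicting_metadata`; an unparseable purl would take the same path.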
26 changes: 17 additions & 9 deletions tools/sbom-diff-and-risk/docs/v1.1-input-and-policy-semantics.md
@@ -13,19 +13,21 @@ of this monorepo.
 | Policy schema identifier | Implemented as `sbom-diff-risk.policy.v1`; legacy policy files remain readable |
 | Report schema identifier and compatibility tests | Implemented as `sbom-diff-risk.report.v1` across checked-in full-report fixtures |
 | Per-decision rule, evidence, reason, and confidence | Implemented additively in report v1 policy finding objects |
-| Component identity canonicalization | Next implementation slice; target semantics are fixed below |
+| Component identity canonicalization | Implemented as a typed value object with stable duplicate/conflict diagnostics |

-## Component identity target
+## Component identity contract

 The canonical identity record will expose these dimensions separately:

 - `ecosystem`: trimmed and normalized to a registered ecosystem identifier.
 - `package_name`: normalized with ecosystem-aware rules. PyPI names use PEP
-  503 normalization; other ecosystems require explicit test-backed rules.
+  503 normalization; ecosystems without an explicit project rule preserve
+  case rather than inheriting a universal lowercase rule.
 - `version`: trimmed but otherwise preserved as observed. The tool will not
   infer semantic equivalence between unrelated version schemes.
-- `purl`: parsed and normalized when present, while retaining the observed purl
-  in component evidence for auditability.
+- `purl`: parsed with `packageurl-python` and normalized when present, while
+  retaining the observed purl in component evidence for auditability. An
+  explicit component version does not get invented inside a versionless purl.
 - `component_key`: versionless package identity used to align before and after
   inputs. A version change remains a change, not an add plus remove.

@@ -34,6 +36,10 @@ Identity authority remains `purl`, then `bom_ref`, then the normalized
 its ecosystem and package coordinate. Explicit metadata that disagrees with
 that coordinate is a conflict, not an alternative identity.

+Canonical identity also drives change comparison: lexical PyPI variants that
+normalize to the same identity do not create a metadata change, and a version
+change carried only by the purl is still classified as `version_changed`.
+
 Within one input:

 - two records with the same key and identical normalized metadata fail closed
@@ -42,13 +48,15 @@ Within one input:
   as `conflicting_metadata`;
 - conflicting ecosystem, package name, or version information between a purl
   and explicit fields also fails closed as `conflicting_metadata`;
+- an invalid purl also fails closed as `conflicting_metadata` because it cannot
+  establish an unambiguous canonical identity;
 - metadata differences across the before and after inputs remain normal diff
   evidence and do not become same-input conflicts.

-The next code slice should introduce a typed canonical identity object and
-diagnostic error codes before changing report presentation. Cross-format tests
-must cover CycloneDX to SPDX alignment, PyPI name normalization, versioned
-purls, exact duplicates, and conflicting metadata.
+The implementation introduces a frozen `CanonicalComponentIdentity` object and
+keeps report presentation unchanged. Tests cover CycloneDX-to-SPDX alignment,
+PyPI name normalization, case preservation for ecosystems without a declared
+name rule, versioned purls, exact duplicates, and conflicting metadata.

 ## Policy and decision contract
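
Reviewer sketch of the change-classification rule in the identity contract. It assumes both records already share a `component_key` and are known to differ somewhere; `Identity` and `classify` are hypothetical stand-ins, not names from this PR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """Hypothetical, trimmed-down canonical identity."""
    component_key: str
    version: str | None


def classify(before: Identity, after: Identity) -> str:
    """Classify a change between two records aligned on one component_key."""
    if before.component_key != after.component_key:
        raise ValueError("records must share a component_key to be compared")
    # Compare canonical versions, so a version that appears only in the purl
    # still counts as a version change.
    if before.version != after.version:
        return "version_changed"
    return "metadata_changed"


key = "purl:pkg:pypi/requests"
upgrade = classify(Identity(key, "2.31.0"), Identity(key, "2.32.0"))
relabel = classify(Identity(key, "2.31.0"), Identity(key, "2.31.0"))
```

Because the key is versionless, the upgrade stays one `version_changed` entry instead of becoming an add plus a remove.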

9 changes: 5 additions & 4 deletions tools/sbom-diff-and-risk/pyproject.toml
@@ -24,10 +24,11 @@ classifiers = [
     "Topic :: Security",
     "Topic :: Software Development :: Libraries :: Python Modules",
 ]
-dependencies = [
-    "packaging>=24.0",
-    "PyYAML>=6.0",
-]
+dependencies = [
+    "packaging>=24.0",
+    "packageurl-python>=0.17.6,<0.18",
+    "PyYAML>=6.0",
+]

 [project.urls]
 Homepage = "https://github.com/stacknil/scientific-computing-toolkit"
115 changes: 115 additions & 0 deletions tools/sbom-diff-and-risk/src/sbom_diff_risk/component_identity.py
@@ -0,0 +1,115 @@
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+from packaging.utils import canonicalize_name
+from packageurl import PackageURL
+
+from .errors import ComponentIdentityDiagnosticCode, ComponentIdentityError
+from .models import Component
+
+
+@dataclass(slots=True, frozen=True)
+class CanonicalComponentIdentity:
+    ecosystem: str
+    package_name: str
+    version: str | None
+    purl: str | None
+    component_key: str
+
+
+def canonicalize_component_identity(component: Component) -> CanonicalComponentIdentity:
+    explicit_ecosystem = component.ecosystem.strip().lower()
+    explicit_name = _canonical_package_name(explicit_ecosystem, component.name)
+    explicit_version = _optional_str(component.version)
+
+    if component.purl is None:
+        if component.bom_ref:
+            component_key = f"bom-ref:{component.bom_ref.strip().lower()}"
+        else:
+            component_key = f"coord:{explicit_ecosystem}:{explicit_name}"
+        return CanonicalComponentIdentity(
+            ecosystem=explicit_ecosystem,
+            package_name=explicit_name,
+            version=explicit_version,
+            purl=None,
+            component_key=component_key,
+        )
+
+    parsed = _parse_purl(component.purl)
+    purl_ecosystem = parsed.type.strip().lower()
+    purl_name = _canonical_package_name(purl_ecosystem, parsed.name)
+    purl_version = _optional_str(parsed.version)
+
+    conflicts: list[str] = []
+    if explicit_ecosystem != purl_ecosystem:
+        conflicts.append(f"ecosystem={explicit_ecosystem!r} disagrees with purl type={purl_ecosystem!r}")
+    if explicit_name != purl_name:
+        conflicts.append(f"package name={explicit_name!r} disagrees with purl name={purl_name!r}")
+    if explicit_version is not None and purl_version is not None and explicit_version != purl_version:
+        conflicts.append(f"version={explicit_version!r} disagrees with purl version={purl_version!r}")
+    if conflicts:
+        raise ComponentIdentityError(
+            ComponentIdentityDiagnosticCode.CONFLICTING_METADATA,
+            "; ".join(conflicts),
+            component_key=_purl_component_key(parsed, purl_ecosystem, purl_name),
+        )
+
+    canonical_version = purl_version or explicit_version
+    canonical_purl = _canonical_purl(parsed, purl_ecosystem, purl_name, purl_version)
+    return CanonicalComponentIdentity(
+        ecosystem=purl_ecosystem,
+        package_name=purl_name,
+        version=canonical_version,
+        purl=canonical_purl,
+        component_key=_purl_component_key(parsed, purl_ecosystem, purl_name),
+    )
+
+
+def _parse_purl(raw_purl: str) -> PackageURL:
+    try:
+        return PackageURL.from_string(raw_purl.strip())
+    except ValueError as exc:
+        raise ComponentIdentityError(
+            ComponentIdentityDiagnosticCode.CONFLICTING_METADATA,
+            f"purl is not valid: {exc}",
+        ) from exc
+
+
+def _canonical_purl(
+    parsed: PackageURL,
+    ecosystem: str,
+    package_name: str,
+    version: str | None,
+) -> str:
+    return PackageURL(
+        type=ecosystem,
+        namespace=parsed.namespace,
+        name=package_name,
+        version=version,
+        qualifiers=parsed.qualifiers,
+        subpath=parsed.subpath,
+    ).to_string()
+
+
+def _purl_component_key(parsed: PackageURL, ecosystem: str, package_name: str) -> str:
+    identity_purl = PackageURL(
+        type=ecosystem,
+        namespace=parsed.namespace,
+        name=package_name,
+    ).to_string()
+    return f"purl:{identity_purl}"
+
+
+def _canonical_package_name(ecosystem: str, name: str) -> str:
+    stripped = name.strip()
+    if ecosystem == "pypi":
+        return canonicalize_name(stripped)
+    return stripped
+
+
+def _optional_str(value: str | None) -> str | None:
+    if value is None:
+        return None
+    stripped = value.strip()
+    return stripped or None
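
Reviewer sketch of why a frozen dataclass suits `CanonicalComponentIdentity`. The class below is a hypothetical cut-down copy (`IdentitySketch`), and it omits `slots=True` so it also runs on Python versions older than 3.10; that omission is an assumption of the sketch, not a suggestion for the PR.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class IdentitySketch:
    """Hypothetical reduced identity: compared and hashed by field values."""
    ecosystem: str
    package_name: str
    version: str | None
    component_key: str


a = IdentitySketch("pypi", "requests", "2.31.0", "purl:pkg:pypi/requests")
b = IdentitySketch("pypi", "requests", "2.31.0", "purl:pkg:pypi/requests")

# Value equality: two separately built identities compare equal.
same_value = a == b

# frozen=True (with the default eq=True) also generates __hash__, so
# identities can sit inside a set or inside a signature tuple.
deduplicated = len({a, b})

# Assignment after construction is rejected.
try:
    a.version = "2.32.0"
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Value equality is what lets the diff code place the identity object directly inside a comparison tuple.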
87 changes: 46 additions & 41 deletions tools/sbom-diff-and-risk/src/sbom_diff_risk/diffing.py
@@ -2,51 +2,25 @@

 from typing import Iterable

+from .component_identity import canonicalize_component_identity
+from .errors import ComponentIdentityDiagnosticCode, ComponentIdentityError
 from .models import Component, ComponentChange


 def component_key(component: Component) -> str:
     """Return a stable identity with purl -> bom_ref -> (ecosystem, name)."""
-    if component.purl:
-        return f"purl:{_purl_identity(component.purl)}"
-    if component.bom_ref:
-        return f"bom-ref:{component.bom_ref.strip().lower()}"
-    ecosystem = component.ecosystem.strip().lower()
-    name = component.name.strip().lower()
-    return f"coord:{ecosystem}:{name}"
-
-
-def _purl_identity(purl: str) -> str:
-    candidate = purl.strip().lower()
-    if not candidate.startswith("pkg:"):
-        return candidate
-
-    end = len(candidate)
-    for separator in ("?", "#"):
-        position = candidate.find(separator)
-        if position != -1:
-            end = min(end, position)
-
-    base = candidate[:end]
-    version_separator = base.rfind("@")
-    name_separator = base.rfind("/")
-    if version_separator != -1 and version_separator > name_separator:
-        return base[:version_separator]
-
-    return base
+    return canonicalize_component_identity(component).component_key


 def _component_signature(component: Component) -> tuple[object, ...]:
+    identity = canonicalize_component_identity(component)
     return (
-        component.name,
-        component.version,
-        component.ecosystem,
-        component.purl,
-        component.license_id,
-        component.supplier,
-        component.source_url,
-        component.bom_ref,
-        component.raw_type,
+        identity,
+        _normalized_metadata(component.license_id),
+        _normalized_metadata(component.supplier),
+        _normalized_metadata(component.source_url),
+        _normalized_metadata(component.bom_ref, lower=True),
+        _normalized_metadata(component.raw_type, lower=True),
     )


@@ -71,9 +45,11 @@ def diff_components(
         if _component_signature(before_component) == _component_signature(after_component):
             continue

-        classification = "version_changed"
-        if before_component.version == after_component.version:
-            classification = "metadata_changed"
+        before_identity = canonicalize_component_identity(before_component)
+        after_identity = canonicalize_component_identity(after_component)
+        classification = (
+            "version_changed" if before_identity.version != after_identity.version else "metadata_changed"
+        )

         changed.append(
             ComponentChange(
@@ -90,8 +66,37 @@ def diff_components(
 def _index_components(components: Iterable[Component], side: str) -> dict[str, Component]:
     indexed: dict[str, Component] = {}
     for component in components:
-        key = component_key(component)
+        try:
+            key = component_key(component)
+        except ComponentIdentityError as exc:
+            raise ComponentIdentityError(
+                exc.code,
+                f"{exc.detail} in {side} input",
+                side=side,
+                component_key=exc.component_key,
+            ) from exc
         if key in indexed:
-            raise ValueError(f"Duplicate component identity in {side} input: {key}")
+            existing = indexed[key]
+            if _component_signature(existing) == _component_signature(component):
+                code = ComponentIdentityDiagnosticCode.DUPLICATE_COMPONENT
+                label = "duplicate component"
+            else:
+                code = ComponentIdentityDiagnosticCode.CONFLICTING_METADATA
+                label = "conflicting metadata"
+            raise ComponentIdentityError(
+                code,
+                f"{label} in {side} input for {key}",
+                side=side,
+                component_key=key,
+            )
         indexed[key] = component
     return indexed
+
+
+def _normalized_metadata(value: str | None, *, lower: bool = False) -> str | None:
+    if value is None:
+        return None
+    normalized = value.strip()
+    if lower:
+        normalized = normalized.lower()
+    return normalized or None
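
Reviewer sketch of the fail-closed index in `_index_components`, reduced to plain tuples. It assumes each record has already been turned into a `(key, signature)` pair; `IndexSketchError`, `index_records`, and `code_for` are hypothetical names used only here.

```python
from __future__ import annotations


class IndexSketchError(ValueError):
    """Hypothetical error carrying a stable diagnostic code."""

    def __init__(self, code: str, key: str) -> None:
        self.code = code
        self.key = key
        super().__init__(f"{code}: {key}")


def index_records(records: list[tuple[str, tuple]]) -> dict[str, tuple]:
    """Index (key, signature) pairs; any repeated key aborts the run."""
    indexed: dict[str, tuple] = {}
    for key, signature in records:
        if key in indexed:
            # Same key and same signature is an exact repeat. Same key with
            # a different signature means the input contradicts itself.
            same = indexed[key] == signature
            code = "duplicate_component" if same else "conflicting_metadata"
            raise IndexSketchError(code, key)
        indexed[key] = signature
    return indexed


def code_for(records: list[tuple[str, tuple]]) -> str | None:
    """Return the diagnostic code raised for `records`, or None if clean."""
    try:
        index_records(records)
    except IndexSketchError as exc:
        return exc.code
    return None


key = "purl:pkg:pypi/requests"
clean = code_for([(key, ("2.31.0", "Apache-2.0"))])
repeat = code_for([(key, ("2.31.0", "Apache-2.0")), (key, ("2.31.0", "Apache-2.0"))])
clash = code_for([(key, ("2.31.0", "Apache-2.0")), (key, ("2.31.0", "MIT"))])
```

Neither case keeps the first record and continues, which is the "fail closed" part: the caller gets a code it can branch on instead of a silently deduplicated index.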
25 changes: 25 additions & 0 deletions tools/sbom-diff-and-risk/src/sbom_diff_risk/errors.py
@@ -1,5 +1,7 @@
 from __future__ import annotations

+from enum import StrEnum
+

 class ParseError(ValueError):
     """Raised when an input file cannot be parsed into normalized components."""
@@ -19,3 +21,26 @@ class InputSelectionError(ParseError):

 class PolicyError(ValueError):
     """Raised when policy parsing or evaluation inputs are invalid."""
+
+
+class ComponentIdentityDiagnosticCode(StrEnum):
+    DUPLICATE_COMPONENT = "duplicate_component"
+    CONFLICTING_METADATA = "conflicting_metadata"
+
+
+class ComponentIdentityError(ValueError):
+    """Raised when one input cannot produce an unambiguous component index."""
+
+    def __init__(
+        self,
+        code: ComponentIdentityDiagnosticCode,
+        message: str,
+        *,
+        side: str | None = None,
+        component_key: str | None = None,
+    ) -> None:
+        self.code = code
+        self.detail = message
+        self.side = side
+        self.component_key = component_key
+        super().__init__(f"{code.value}: {message}")
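
Reviewer sketch of how a caller might consume the new error type. `StrEnum` needs Python 3.11, so the sketch substitutes a `str`/`Enum` mix-in that exposes the same `.value`; that substitution, and the names `DiagnosticCode`, `IdentityErrorSketch`, and `describe`, are assumptions made for illustration.

```python
from __future__ import annotations

from enum import Enum


class DiagnosticCode(str, Enum):
    """Stand-in for the PR's StrEnum; members carry the same string values."""
    DUPLICATE_COMPONENT = "duplicate_component"
    CONFLICTING_METADATA = "conflicting_metadata"


class IdentityErrorSketch(ValueError):
    """Hypothetical mirror of ComponentIdentityError's constructor shape."""

    def __init__(
        self,
        code: DiagnosticCode,
        message: str,
        *,
        side: str | None = None,
        component_key: str | None = None,
    ) -> None:
        self.code = code
        self.detail = message
        self.side = side
        self.component_key = component_key
        super().__init__(f"{code.value}: {message}")


def describe(exc: IdentityErrorSketch) -> dict[str, str | None]:
    """Flatten the structured fields, e.g. for a JSON diagnostics entry."""
    return {
        "code": exc.code.value,
        "side": exc.side,
        "component_key": exc.component_key,
        "message": str(exc),
    }


err = IdentityErrorSketch(
    DiagnosticCode.CONFLICTING_METADATA,
    "purl is not valid",
    side="before",
)
summary = describe(err)
```

Keeping `code`, `side`, and `component_key` as attributes means callers do not have to parse the message string, while `ValueError` as the base class keeps existing `except ValueError` handlers working.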