diff --git a/CHANGELOG.md b/CHANGELOG.md index 3ff71c8..cc675d3 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,13 +6,23 @@ All notable user-visible changes should be recorded here. ### Added -- None yet. +- Added stable JSON finding identity fields: `finding_id` and + `episode_index`. +- Added a separated-burst syslog report-contract fixture where one source IP + produces two time-separated brute-force findings. +- Added detector regression coverage for stable episode identity under unsorted + input order and inclusive window-boundary behavior. +- Added parser regression coverage for malformed source-IP token + classification. ### Changed - Detector rules now emit separate findings for time-separated detection episodes within the same rule subject instead of collapsing each subject to a single best window. +- Bumped the JSON report artifact contract from `loglens.report.v2` / + `schema_version` 2 to `loglens.report.v3` / `schema_version` 3 for finding + identity fields. ### Fixed @@ -22,6 +32,8 @@ All notable user-visible changes should be recorded here. - Documented detection episode semantics in the rule catalog and report artifact contract notes. +- Added the v0.6 Detection Episode Semantics release note and schema v2 to v3 + migration guidance. ## v0.5.0 diff --git a/README.md b/README.md index aa821db..e753257 100644 --- a/README.md +++ b/README.md @@ -14,7 +14,9 @@ or attribution: ```json { + "finding_id": "finding:brute_force:4e6aec401a0d45ca", "rule_id": "brute_force", + "episode_index": 1, "subject_kind": "source_ip", "subject": "203.0.113.10", "grouping_key": "source_ip", @@ -31,7 +33,7 @@ or attribution: **Release posture:** Early reviewer-stable release with a narrow Linux authentication evidence contract. Parser and detection coverage remain intentionally narrow. -Reviewing the project quickly? 
Start with [`docs/reviewer-path.md`](./docs/reviewer-path.md), [`docs/reviewer-brief.md`](./docs/reviewer-brief.md), and the [`v0.5 Evidence Explainability release note`](./docs/release-v0.5.0.md). The [`quality gates map`](./docs/quality-gates.md) links claims to tests and fixtures. For detection reasoning, follow the [`one-page incident-style case`](./docs/incident-style-case.md), then use the full [`Linux auth brute-force case study`](./docs/case-study-linux-auth-bruteforce.md), [`rule catalog`](./docs/rule-catalog.md), and [`false-positive taxonomy`](./docs/false-positive-taxonomy.md) for depth. For local scale expectations, see the [`performance envelope`](./docs/performance-envelope.md). +Reviewing the project quickly? Start with [`docs/reviewer-path.md`](./docs/reviewer-path.md), [`docs/reviewer-brief.md`](./docs/reviewer-brief.md), the [`v0.5 Evidence Explainability release note`](./docs/release-v0.5.0.md), and the [`v0.6 Detection Episode Semantics release note`](./docs/release-v0.6.0.md). The [`quality gates map`](./docs/quality-gates.md) links claims to tests and fixtures. For detection reasoning, follow the [`one-page incident-style case`](./docs/incident-style-case.md), then use the full [`Linux auth brute-force case study`](./docs/case-study-linux-auth-bruteforce.md), [`rule catalog`](./docs/rule-catalog.md), and [`false-positive taxonomy`](./docs/false-positive-taxonomy.md) for depth. For local scale expectations, see the [`performance envelope`](./docs/performance-envelope.md). For a shorter external review entry point focused on uncertainty handling, read [How LogLens Treats Parser Uncertainty as Evidence](./docs/case-study-parser-uncertainty-as-evidence.md). 
@@ -62,7 +64,7 @@ LogLens includes two minimal GitHub Actions workflows: - `CI` builds and tests the project on `ubuntu-latest` and `windows-latest` - `CodeQL` runs GitHub code scanning for C/C++ on pushes, pull requests, and a weekly schedule -Both workflows are intended to stay stable enough to require on pull requests to `main`. Regression coverage is backed by sanitized parser fixture matrices plus golden report-contract fixtures for `report.md`, `report.json`, and optional CSV outputs. Release-facing documentation is split across [`CHANGELOG.md`](./CHANGELOG.md), [`docs/release-process.md`](./docs/release-process.md), [`docs/release-v0.1.0.md`](./docs/release-v0.1.0.md), [`docs/release-v0.3.0.md`](./docs/release-v0.3.0.md), [`docs/release-v0.5.0.md`](./docs/release-v0.5.0.md), and the repository's GitHub release notes. The repository hardening note is in [`docs/repo-hardening.md`](./docs/repo-hardening.md), and vulnerability reporting guidance is in [`SECURITY.md`](./SECURITY.md). +Both workflows are intended to stay stable enough to require on pull requests to `main`. Regression coverage is backed by sanitized parser fixture matrices plus golden report-contract fixtures for `report.md`, `report.json`, and optional CSV outputs. Release-facing documentation is split across [`CHANGELOG.md`](./CHANGELOG.md), [`docs/release-process.md`](./docs/release-process.md), [`docs/release-v0.1.0.md`](./docs/release-v0.1.0.md), [`docs/release-v0.3.0.md`](./docs/release-v0.3.0.md), [`docs/release-v0.5.0.md`](./docs/release-v0.5.0.md), [`docs/release-v0.6.0.md`](./docs/release-v0.6.0.md), and the repository's GitHub release notes. The repository hardening note is in [`docs/repo-hardening.md`](./docs/repo-hardening.md), and vulnerability reporting guidance is in [`SECURITY.md`](./SECURITY.md). ## Threat Model @@ -86,8 +88,9 @@ LogLens currently detects: Each rule can emit multiple findings for the same subject when matching evidence appears in time-separated detector episodes. 
Report consumers should -use `window_start`, `window_end`, and `evidence_event_ids` rather than assuming -one finding per `rule_id` and subject. +use `finding_id`, `episode_index`, `window_start`, `window_end`, and +`evidence_event_ids` rather than assuming one finding per `rule_id` and +subject. LogLens currently parses and reports these additional auth patterns beyond the core detector inputs: diff --git a/docs/quality-gates.md b/docs/quality-gates.md index f306e5a..1ae0e73 100644 --- a/docs/quality-gates.md +++ b/docs/quality-gates.md @@ -15,7 +15,8 @@ The main review principle is: | Parser coverage is visible | [`parser-coverage-notes.md`](./parser-coverage-notes.md), [`tests/fixtures/parser_matrix/noisy_auth_expected.json`](../tests/fixtures/parser_matrix/noisy_auth_expected.json) | `test_parser` compares noisy-auth coverage output to the checked-in expected summary | Reviewer can see parsed lines, skipped blanks, warnings, failure categories, and unknown-pattern buckets | | Unsupported evidence does not silently become detector evidence | [`parser-contract.md`](./parser-contract.md), [`rule-catalog.md`](./rule-catalog.md), [`case-study-linux-auth-bruteforce.md`](./case-study-linux-auth-bruteforce.md) | `test_parser` covers unknown-pattern warnings; `test_detector` covers signal-boundary behavior | Reviewer can explain why unsupported lines remain warnings instead of findings | | Report artifacts are deterministic | [`report-artifacts.md`](./report-artifacts.md), report-contract fixtures under [`tests/fixtures/report_contracts`](../tests/fixtures/report_contracts) | `test_report_contracts` compares generated `report.md`, `report.json`, `findings.csv`, and `warnings.csv` against golden fixtures | Reviewer can regenerate reports and see schema or text changes as explicit snapshot diffs | -| Findings are explainable | [`rule-catalog.md`](./rule-catalog.md), [`report-artifacts.md`](./report-artifacts.md) | `test_report` checks JSON finding fields; report-contract 
fixtures lock `rule_id`, `window_start`, `window_end`, `threshold`, `observed_count`, `grouping_key`, `evidence_event_ids`, and `verdict_boundary` | Reviewer can trace a finding from rule context back to source line IDs and see the non-verdict boundary | +| Findings are explainable | [`rule-catalog.md`](./rule-catalog.md), [`report-artifacts.md`](./report-artifacts.md) | `test_report` checks JSON finding fields; report-contract fixtures lock `finding_id`, `episode_index`, `rule_id`, `window_start`, `window_end`, `threshold`, `observed_count`, `grouping_key`, `evidence_event_ids`, and `verdict_boundary` | Reviewer can trace a finding from rule context back to source line IDs and see the non-verdict boundary | +| Detection episodes are explicit | [`release-v0.6.0.md`](./release-v0.6.0.md), [`rule-catalog.md`](./rule-catalog.md#detection-episode-semantics), [`separated_bursts_syslog`](../tests/fixtures/report_contracts/separated_bursts_syslog) | `test_detector` covers separated episodes, stable identity under unsorted input order, and inclusive boundary windows; `test_report_contracts` locks the two-finding separated-burst report | Reviewer can see one rule and subject emit two non-overlapping findings without treating either as an incident verdict | | False-positive boundaries are visible | [`rule-catalog.md`](./rule-catalog.md), [`case-study-linux-auth-bruteforce.md`](./case-study-linux-auth-bruteforce.md) | Documentation review gate; detector tests ensure unsupported evidence does not inflate counts | Reviewer can state NAT, internal scanner, lab replay, shared bastion, scheduled admin task, and malformed replay boundaries | | Parser failure taxonomy is exposed | [`parser-contract.md`](./parser-contract.md), [`parser-conformance-matrix.md`](./parser-conformance-matrix.md), [`report-artifacts.md`](./report-artifacts.md) | `test_parser`, `test_report`, `test_cli`, and `test_report_contracts` cover `failure_categories` and warning `category` output | Reviewer can 
distinguish timestamp, program, known-program unknown-message, malformed-source-IP, and unsupported-PAM failures |
 | Local scale expectations are reproducible | [`performance-envelope.md`](./performance-envelope.md), [`scripts/benchmark-performance-envelope.ps1`](../scripts/benchmark-performance-envelope.ps1) | `pwsh -File scripts/benchmark-performance-envelope.ps1` regenerates sanitized benchmark inputs and local summary artifacts | Reviewer can reproduce the 1k/10k/100k-line envelope and understand its caveats |
@@ -41,6 +42,8 @@ update the matching evidence surface in the same pull request:
 - parser behavior change: update parser tests, fixture matrices, and parser docs
 - report shape change: update report-contract fixtures and report artifact docs
 - rule behavior change: update detector tests, rule catalog, and case-study text when relevant
+- episode semantics change: update detector tests, separated-burst report
+  fixtures, schema migration notes, and rule catalog policy text
 - warning taxonomy change: update parser failure taxonomy docs and warning snapshots
 - performance-envelope change: rerun the benchmark harness and record the platform/result source
 
diff --git a/docs/release-v0.6.0.md b/docs/release-v0.6.0.md
new file mode 100644
index 0000000..cf30b8c
--- /dev/null
+++ b/docs/release-v0.6.0.md
@@ -0,0 +1,112 @@
+# LogLens v0.6.0 - Detection Episode Semantics
+
+Theme: Detection Episode Semantics.
+
+This release note describes the v0.6 report and detector contract. It does not
+add new detection rules. It makes repeated time-separated findings for the same
+rule subject explicit and reviewable.
+
+## What Changed
+
+- A single `rule_id`, `subject_kind`, and `subject` can emit multiple
+  non-overlapping findings when evidence appears in separated detector
+  episodes.
+- JSON reports now use `schema: loglens.report.v3` and `schema_version: 3`.
+- JSON findings include stable finding identity fields:
+  - `finding_id`
+  - `episode_index`
+- The separated-burst contract fixture demonstrates one source IP producing two
+  distinct brute-force findings in one report.
+
+## Stable JSON Contract
+
+`loglens.report.v3` keeps the v0.5 explainability fields and adds:
+
+| Field | Meaning |
+| --- | --- |
+| `finding_id` | Deterministic report-local identifier for the selected finding, derived from the rule, subject, selected window, counts, and evidence event IDs. |
+| `episode_index` | 1-based sequence number within the same `rule_id`, `subject_kind`, and `subject`. |
+
+Existing v2 finding fields remain part of the stable explainability surface:
+
+- `rule_id`
+- `subject_kind`
+- `subject`
+- `grouping_key`
+- `window_start`
+- `window_end`
+- `threshold`
+- `observed_count`
+- `evidence_event_ids`
+- `verdict_boundary`
+
+The optional CSV contract is unchanged in v0.6.
+
+## Episode Policy
+
+LogLens v0.6 uses cooldown-separated maximal-window episodes:
+
+| Policy point | v0.6 behavior |
+| --- | --- |
+| First threshold crossing | Used to decide that an episode candidate is eligible to emit a finding. It is not necessarily the reported window. |
+| Maximal window | The reported window is the highest-signal sliding window within the episode candidate. |
+| Non-overlapping windows | One rule and subject can emit multiple findings, but selected episode candidates do not reuse the same matching signals. |
+| Cooldown merge | Signals separated by an idle gap less than or equal to the rule window stay in the same episode candidate. A larger gap starts a new candidate. |
+
+Episode splitting is a reporting model. It is not an incident boundary.
+
+## Separated-Burst Fixture
+
+The fixture
+[`tests/fixtures/report_contracts/separated_bursts_syslog/input.log`](../tests/fixtures/report_contracts/separated_bursts_syslog/input.log)
+contains one sanitized source IP with:
+
+- five failed SSH attempts from `09:00:00` through `09:04:00`
+- five failed SSH attempts from `15:00:00` through `15:04:00`
+
+The expected
+[`report.json`](../tests/fixtures/report_contracts/separated_bursts_syslog/report.json)
+contains two `brute_force` findings for the same subject:
+
+- `episode_index: 1`, window `2026-03-10 09:00:00` to
+  `2026-03-10 09:04:00`
+- `episode_index: 2`, window `2026-03-10 15:00:00` to
+  `2026-03-10 15:04:00`
+
+This fixture locks the main v0.6 behavior: repeated separated bursts are no
+longer collapsed to one best window.
+
+## Schema v2 to v3 Migration
+
+Consumers should treat `schema` and `schema_version` as the report shape gate:
+
+- v2: `loglens.report.v2`, `schema_version: 2`
+- v3: `loglens.report.v3`, `schema_version: 3`
+
+Consumers that keyed findings by `rule_id` and `subject` should migrate to
+`finding_id`, or include `episode_index`, `window_start`, `window_end`, and
+`evidence_event_ids` in their own composite key.
+
+## Validation Surface
+
+v0.6 is covered by:
+
+- detector tests for separated brute-force, multi-user probing, and sudo-burst
+  episodes
+- detector tests for stable episode identity under unsorted input order
+- detector tests for inclusive rule-window boundaries
+- parser tests for malformed source-IP token classification
+- report tests for `finding_id`, `episode_index`, and schema v3 output
+- golden report-contract fixtures for Markdown, JSON, and optional CSV reports
+
+## Non-Claims
+
+LogLens v0.6 findings remain bounded triage signals. The release preserves
+these explicit non-claims:
+
+- no compromise verdict
+- no attribution
+- no blocking recommendation
+- no cross-host correlation
+
+Findings remain bounded triage signals over normalized local evidence.
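Reviewer note on the Episode Policy table in `docs/release-v0.6.0.md`: the cooldown merge and maximal-window steps can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in `src/detector.cpp`. `EpisodeWindow` and `split_episodes` are hypothetical names, timestamps are plain seconds, all signals are assumed to belong to one rule and one subject, and ties between equally sized windows are resolved in favor of the earliest window, which the release note does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration type; not the Finding struct in src/detector.hpp.
struct EpisodeWindow {
    std::int64_t window_start = 0;   // seconds, first signal in the selected window
    std::int64_t window_end = 0;     // seconds, last signal in the selected window
    std::size_t observed_count = 0;  // signals inside the selected window
    std::size_t episode_index = 0;   // 1-based, single rule and subject assumed
};

// Splits the signal timestamps of one subject into cooldown-separated episode
// candidates and reports the highest-count sliding window of each candidate
// that reaches the threshold.
std::vector<EpisodeWindow> split_episodes(std::vector<std::int64_t> times,
                                          std::int64_t window_seconds,
                                          std::size_t threshold) {
    assert(window_seconds >= 0);
    std::sort(times.begin(), times.end());  // input order must not matter

    std::vector<EpisodeWindow> findings;
    std::size_t begin = 0;
    while (begin < times.size()) {
        // Cooldown merge: an idle gap <= window keeps signals in one candidate.
        std::size_t end = begin + 1;
        while (end < times.size() &&
               times[end] - times[end - 1] <= window_seconds) {
            ++end;
        }

        // Maximal window: two-pointer scan inside [begin, end). The boundary
        // is inclusive, so a span of exactly window_seconds still counts.
        EpisodeWindow best;
        std::size_t left = begin;
        for (std::size_t right = begin; right < end; ++right) {
            while (times[right] - times[left] > window_seconds) {
                ++left;
            }
            const std::size_t count = right - left + 1;
            if (count > best.observed_count) {  // strict: earliest window wins ties
                best.window_start = times[left];
                best.window_end = times[right];
                best.observed_count = count;
            }
        }

        // Only candidates that reach the threshold emit a finding.
        if (best.observed_count >= threshold) {
            best.episode_index = findings.size() + 1;
            findings.push_back(best);
        }
        begin = end;  // candidates never share signals
    }
    return findings;
}
```

With the separated-burst fixture timestamps expressed as seconds since midnight, a 600-second window, and a threshold of 5 (the values implied by the "5 failed SSH attempts ... within 10 minutes" fixture summaries), the sketch yields two windows, `09:00:00` to `09:04:00` and `15:00:00` to `15:04:00`, with `episode_index` 1 and 2.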
diff --git a/docs/report-artifacts.md b/docs/report-artifacts.md index f7e0151..25f93e7 100644 --- a/docs/report-artifacts.md +++ b/docs/report-artifacts.md @@ -40,11 +40,13 @@ The JSON report keeps parser observability visible next to findings: - `findings` - `warnings` -Finding objects contain `rule_id`, `rule`, `subject_kind`, `subject`, `grouping_key`, `threshold`, `observed_count`, `event_count`, `window_start`, `window_end`, `evidence_event_ids`, `verdict_boundary`, `usernames`, and `summary`. +Finding objects contain `finding_id`, `rule_id`, `rule`, `episode_index`, `subject_kind`, `subject`, `grouping_key`, `threshold`, `observed_count`, `event_count`, `window_start`, `window_end`, `evidence_event_ids`, `verdict_boundary`, `usernames`, and `summary`. -The stable finding explainability surface for `loglens.report.v2` is: +The stable finding explainability surface for `loglens.report.v3` is: +- `finding_id` - `rule_id` +- `episode_index` - `subject_kind` - `subject` - `grouping_key` @@ -61,10 +63,20 @@ fixtures explicitly. `evidence_event_ids` are deterministic local event identifiers derived from the source line number, formatted as `line:`. They let reviewers trace a finding back to the normalized input events that satisfied the rule window without implying global event identity. +`finding_id` is a deterministic report-local finding identifier derived from +the rule, subject, selected window, counts, and evidence event IDs. It is +stable for the same normalized evidence and rule output, but it is not a global +case identifier. + +`episode_index` is a 1-based sequence number within one `rule_id`, +`subject_kind`, and `subject`. It is meant for reviewer navigation when a rule +emits more than one finding for the same subject. + Consumers should not assume that `rule_id` plus `subject` is unique within a report. A rule can emit multiple findings for the same subject when matching -evidence appears in time-separated detector episodes. 
Use `window_start`, -`window_end`, and `evidence_event_ids` to distinguish episode-level findings. +evidence appears in time-separated detector episodes. Use `finding_id`, +`episode_index`, `window_start`, `window_end`, and `evidence_event_ids` to +distinguish episode-level findings. `verdict_boundary` is a stable token that states what the finding must not be read as. It keeps machine-readable findings aligned with LogLens's triage @@ -79,7 +91,20 @@ Warning objects contain the original `line_number`, parser `category`, and parse `schema` and `schema_version` identify the report artifact contract, not the application release. They are intended for downstream tooling that needs a stable way to reject incompatible report shapes. The current JSON contract is -`loglens.report.v2` with `schema_version` set to `2`. +`loglens.report.v3` with `schema_version` set to `3`. + +### Schema v2 to v3 Migration + +`loglens.report.v3` keeps the v2 finding explainability fields and adds: + +- `finding_id` +- `episode_index` + +Downstream consumers should treat `schema` and `schema_version` as the report +shape gate. Consumers that keyed findings by `rule_id` and `subject` should +move to `finding_id`, or include `episode_index`, `window_start`, `window_end`, +and `evidence_event_ids` in their own composite key. The optional CSV contract +is unchanged in v3. 
Parser failure categories are stable reviewer-facing buckets for unsupported lines: `unknown_timestamp`, `unknown_program`, @@ -111,6 +136,7 @@ The report contracts are backed by generated fixture artifacts: | [`journalctl_short_full`](../tests/fixtures/report_contracts/journalctl_short_full) | `report.md`, `report.json`, `findings.csv`, `warnings.csv` | | [`multi_host_syslog_legacy`](../tests/fixtures/report_contracts/multi_host_syslog_legacy) | `report.md`, `report.json`, `findings.csv`, `warnings.csv` | | [`multi_host_journalctl_short_full`](../tests/fixtures/report_contracts/multi_host_journalctl_short_full) | `report.md`, `report.json`, `findings.csv`, `warnings.csv` | +| [`separated_bursts_syslog`](../tests/fixtures/report_contracts/separated_bursts_syslog) | `report.md`, `report.json`, `findings.csv`, `warnings.csv` | The enforcement lives in [`tests/test_report_contracts.cpp`](../tests/test_report_contracts.cpp). Parser or rule changes that alter report artifacts must update these snapshots explicitly. This includes changes to stable finding explainability fields, parser coverage fields, warning categories, CSV columns, or Markdown report layout. The focused report writer tests live in [`tests/test_report.cpp`](../tests/test_report.cpp). diff --git a/docs/reviewer-path.md b/docs/reviewer-path.md index 8168741..1ae2dce 100644 --- a/docs/reviewer-path.md +++ b/docs/reviewer-path.md @@ -8,6 +8,7 @@ This path is for reviewers who want to understand LogLens quickly without readin | --- | --- | --- | | What is LogLens? | [`README.md`](../README.md) and [`docs/reviewer-brief.md`](./reviewer-brief.md) | Can state scope, supported inputs, outputs, and non-goals | | What changed in v0.5? | [`docs/release-v0.5.0.md`](./release-v0.5.0.md) | Can explain the Evidence Explainability Release theme and its non-claims | +| What changed in v0.6? 
| [`docs/release-v0.6.0.md`](./release-v0.6.0.md) | Can explain repeated detection episodes, `finding_id`, `episode_index`, and schema v3 migration | | What log formats are supported? | [`docs/parser-contract.md`](./parser-contract.md) | Can name `syslog_legacy` and `journalctl_short_full` behavior | | What artifacts does it produce? | [`docs/report-artifacts.md`](./report-artifacts.md) and report-contract fixtures | Can inspect Markdown, JSON, and optional CSV outputs | | How do rules use evidence? | [`docs/rule-catalog.md`](./rule-catalog.md) | Can explain grouping keys, windows, thresholds, and unsupported-evidence boundaries | @@ -25,6 +26,7 @@ Read: - [`README.md`](../README.md) - [`docs/reviewer-brief.md`](./reviewer-brief.md) - [`docs/release-v0.5.0.md`](./release-v0.5.0.md) +- [`docs/release-v0.6.0.md`](./release-v0.6.0.md) Confirm: @@ -58,6 +60,21 @@ Use the release note's [`Release readiness checklist`](./release-v0.5.0.md#release-readiness-checklist) as the compact pass/fail map for the v0.5 scope. +## v0.6 release-facing route + +Start with [`docs/release-v0.6.0.md`](./release-v0.6.0.md), then inspect: + +- [`docs/report-artifacts.md`](./report-artifacts.md) +- [`docs/rule-catalog.md`](./rule-catalog.md#detection-episode-semantics) +- [`tests/fixtures/report_contracts/separated_bursts_syslog/input.log`](../tests/fixtures/report_contracts/separated_bursts_syslog/input.log) +- [`tests/fixtures/report_contracts/separated_bursts_syslog/report.json`](../tests/fixtures/report_contracts/separated_bursts_syslog/report.json) +- [`tests/test_detector.cpp`](../tests/test_detector.cpp) +- [`tests/test_report_contracts.cpp`](../tests/test_report_contracts.cpp) + +Good stopping point: the reviewer can explain why one rule and subject can emit +multiple findings, how `finding_id` and `episode_index` distinguish them, and +why the schema moved from `loglens.report.v2` to `loglens.report.v3`. 
+ ## 5-minute artifact review Inspect: @@ -66,7 +83,9 @@ Inspect: - [`assets/sample_journalctl_short_full.log`](../assets/sample_journalctl_short_full.log) - [`tests/fixtures/report_contracts/syslog_legacy/report.md`](../tests/fixtures/report_contracts/syslog_legacy/report.md) - [`tests/fixtures/report_contracts/syslog_legacy/report.json`](../tests/fixtures/report_contracts/syslog_legacy/report.json) +- [`tests/fixtures/report_contracts/separated_bursts_syslog/report.json`](../tests/fixtures/report_contracts/separated_bursts_syslog/report.json) - [`docs/release-v0.5.0.md`](./release-v0.5.0.md) +- [`docs/release-v0.6.0.md`](./release-v0.6.0.md) - [`docs/report-artifacts.md`](./report-artifacts.md) - [`docs/parser-contract.md`](./parser-contract.md) - [`assets/mixed_auth_corpus.log`](../assets/mixed_auth_corpus.log) diff --git a/docs/rule-catalog.md b/docs/rule-catalog.md index 90f6e0a..3c87b03 100644 --- a/docs/rule-catalog.md +++ b/docs/rule-catalog.md @@ -30,10 +30,18 @@ Metadata equivalent: Within each rule grouping key, LogLens sorts matching signals by timestamp and source line number. Consecutive signals separated by an idle gap greater than -the rule window start a new episode candidate. +the rule window start a new episode candidate. The policy is +cooldown-separated maximal-window episodes: -Inside each episode candidate, the detector keeps the best sliding window for -the rule: +| Policy point | LogLens v0.6 behavior | +| --- | --- | +| First threshold crossing | Used only to determine that an episode candidate is eligible to emit a finding. The first crossing is not necessarily the reported window. | +| Maximal window | Within each episode candidate, the detector reports the highest-signal window for the rule. | +| Non-overlapping windows | One rule and subject can emit multiple findings, but their selected episode candidates do not reuse the same matching signals. 
| +| Cooldown merge | Signals separated by an idle gap less than or equal to the rule window stay in the same episode candidate. A larger idle gap starts a new candidate. | + +For the maximal-window step, the detector keeps the best sliding window for the +rule: - `brute_force` and `sudo_burst`: highest event count - `multi_user_probing`: highest distinct username count, with event count as @@ -42,8 +50,8 @@ the rule: Each episode candidate that reaches the configured threshold emits one finding. The same `rule_id` and `subject` can therefore appear more than once in one report when the evidence contains time-separated bursts. Review -`window_start`, `window_end`, and `evidence_event_ids` to distinguish those -episodes. +`finding_id`, `episode_index`, `window_start`, `window_end`, and +`evidence_event_ids` to distinguish those episodes. Episode splitting is a detector reporting model, not an incident boundary. It does not infer compromise, attribution, causality between rules, or cross-host @@ -53,7 +61,9 @@ correlation. 
JSON findings include both the finding conclusion and the rule context used to reach it: +- `finding_id`: deterministic report-local identifier for the selected finding - `rule_id`: stable rule identifier +- `episode_index`: 1-based sequence within the same `rule_id`, `subject_kind`, and `subject` - `grouping_key`: the normalized field used to group evidence - `threshold`: configured threshold for the rule - `observed_count`: observed value compared against the threshold diff --git a/src/detector.cpp b/src/detector.cpp index eb40e0a..ea5d9b0 100644 --- a/src/detector.cpp +++ b/src/detector.cpp @@ -1,6 +1,10 @@ #include "detector.hpp" #include +#include +#include +#include +#include #include #include @@ -25,6 +29,61 @@ struct MultiUserWindowSelection { bool matched = false; }; +std::string finding_rule_id_for_identity(const Finding& finding) { + if (!finding.rule_id.empty()) { + return finding.rule_id; + } + return to_string(finding.type); +} + +std::string finding_subject_kind_for_identity(const Finding& finding) { + if (!finding.subject_kind.empty()) { + return finding.subject_kind; + } + return finding.grouping_key; +} + +bool finding_sort_less(const Finding& left, const Finding& right) { + if (left.type != right.type) { + return to_string(left.type) < to_string(right.type); + } + if (left.subject_kind != right.subject_kind) { + return left.subject_kind < right.subject_kind; + } + if (left.subject != right.subject) { + return left.subject < right.subject; + } + if (left.first_seen != right.first_seen) { + return left.first_seen < right.first_seen; + } + if (left.last_seen != right.last_seen) { + return left.last_seen < right.last_seen; + } + return left.evidence_event_ids < right.evidence_event_ids; +} + +std::string finding_episode_key(const Finding& finding) { + return finding_rule_id_for_identity(finding) + + '\x1f' + finding_subject_kind_for_identity(finding) + + '\x1f' + finding.subject; +} + +void hash_append(std::uint64_t& hash, std::string_view value) { + 
constexpr std::uint64_t fnv_prime = 1099511628211ULL; + for (const unsigned char ch : value) { + hash ^= ch; + hash *= fnv_prime; + } + hash ^= 0xffU; + hash *= fnv_prime; +} + +std::string hex64(std::uint64_t value) { + std::ostringstream output; + output << std::hex << std::setfill('0') << std::setw(16) << value; + return output.str(); +} + std::vector sort_signals_by_time(const std::vector& signals) { auto sorted = signals; std::sort(sorted.begin(), sorted.end(), [](const AuthSignal* left, const AuthSignal* right) { @@ -372,6 +431,37 @@ std::string default_verdict_boundary(FindingType type) { } } +std::string build_finding_id(const Finding& finding) { + constexpr std::uint64_t fnv_offset_basis = 14695981039346656037ULL; + std::uint64_t hash = fnv_offset_basis; + + hash_append(hash, finding_rule_id_for_identity(finding)); + hash_append(hash, finding_subject_kind_for_identity(finding)); + hash_append(hash, finding.subject); + hash_append(hash, format_timestamp(finding.first_seen)); + hash_append(hash, format_timestamp(finding.last_seen)); + hash_append(hash, std::to_string(finding.threshold)); + const auto observed_count = finding.observed_count == 0 ? 
finding.event_count : finding.observed_count;
+    hash_append(hash, std::to_string(observed_count));
+    hash_append(hash, std::to_string(finding.event_count));
+    for (const auto& event_id : finding.evidence_event_ids) {
+        hash_append(hash, event_id);
+    }
+
+    return "finding:" + finding_rule_id_for_identity(finding) + ":" + hex64(hash);
+}
+
+void assign_finding_episode_identity(std::vector& findings) {
+    std::unordered_map episode_counts;
+
+    for (auto& finding : findings) {
+        auto& episode_count = episode_counts[finding_episode_key(finding)];
+        ++episode_count;
+        finding.episode_index = episode_count;
+        finding.finding_id = build_finding_id(finding);
+    }
+}
+
 Detector::Detector(DetectorConfig config) : config_(config) {}
@@ -384,15 +474,8 @@ std::vector Detector::analyze(const std::vector& events) const {
     findings.insert(findings.end(), multi_user.begin(), multi_user.end());
     findings.insert(findings.end(), sudo.begin(), sudo.end());

-    std::sort(findings.begin(), findings.end(), [](const Finding& left, const Finding& right) {
-        if (left.type != right.type) {
-            return to_string(left.type) < to_string(right.type);
-        }
-        if (left.subject != right.subject) {
-            return left.subject < right.subject;
-        }
-        return left.first_seen < right.first_seen;
-    });
+    std::sort(findings.begin(), findings.end(), finding_sort_less);
+    assign_finding_episode_identity(findings);

     return findings;
 }
diff --git a/src/detector.hpp b/src/detector.hpp
index cf9e40a..1f02ac7 100644
--- a/src/detector.hpp
+++ b/src/detector.hpp
@@ -29,7 +29,9 @@ struct DetectorConfig {

 struct Finding {
     FindingType type = FindingType::BruteForce;
+    std::string finding_id;
     std::string rule_id;
+    std::size_t episode_index = 0;
     std::string subject_kind;
     std::string subject;
     std::string grouping_key;
@@ -46,6 +48,8 @@ struct Finding {

 std::string to_string(FindingType type);
 std::string default_verdict_boundary(FindingType type);
+std::string build_finding_id(const Finding& finding);
+void assign_finding_episode_identity(std::vector& findings);

 class Detector {
 public:
diff --git a/src/report.cpp b/src/report.cpp
index aec4526..01e08cd 100644
--- a/src/report.cpp
+++ b/src/report.cpp
@@ -215,11 +215,21 @@ std::vector sorted_findings(const std::vector& findings) {
         if (left.type != right.type) {
             return to_string(left.type) < to_string(right.type);
         }
+        if (left.subject_kind != right.subject_kind) {
+            return left.subject_kind < right.subject_kind;
+        }
         if (left.subject != right.subject) {
             return left.subject < right.subject;
         }
-        return left.first_seen < right.first_seen;
+        if (left.first_seen != right.first_seen) {
+            return left.first_seen < right.first_seen;
+        }
+        if (left.last_seen != right.last_seen) {
+            return left.last_seen < right.last_seen;
+        }
+        return left.evidence_event_ids < right.evidence_event_ids;
     });
+    assign_finding_episode_identity(ordered);
     return ordered;
 }

@@ -644,8 +654,8 @@ std::string render_json_report(const ReportData& data) {

     output << "{\n";
     output << " \"tool\": \"LogLens\",\n";
-    output << " \"schema\": \"loglens.report.v2\",\n";
-    output << " \"schema_version\": 2,\n";
+    output << " \"schema\": \"loglens.report.v3\",\n";
+    output << " \"schema_version\": 3,\n";
     output << " \"input\": \"" << escape_json(data.input_path.generic_string()) << "\",\n";
     output << " \"input_mode\": \"" << to_string(data.parse_metadata.input_mode) << "\",\n";
     if (data.parse_metadata.assume_year.has_value()) {
@@ -712,8 +722,10 @@ std::string render_json_report(const ReportData& data) {
     for (std::size_t index = 0; index < findings.size(); ++index) {
         const auto& finding = findings[index];
         output << " {\n";
+        output << " \"finding_id\": \"" << escape_json(finding.finding_id) << "\",\n";
         output << " \"rule_id\": \"" << escape_json(finding_rule_id(finding)) << "\",\n";
         output << " \"rule\": \"" << to_string(finding.type) << "\",\n";
+        output << " \"episode_index\": " << finding.episode_index << ",\n";
         output << " \"subject_kind\": \"" << escape_json(finding.subject_kind) << "\",\n";
         output << " \"subject\": \"" << escape_json(finding.subject) << "\",\n";
         output << " \"grouping_key\": \"" << escape_json(finding_grouping_key(finding)) << "\",\n";
diff --git a/tests/fixtures/report_contracts/journalctl_short_full/report.json b/tests/fixtures/report_contracts/journalctl_short_full/report.json
index 24e95ac..43aa90b 100644
--- a/tests/fixtures/report_contracts/journalctl_short_full/report.json
+++ b/tests/fixtures/report_contracts/journalctl_short_full/report.json
@@ -1,7 +1,7 @@
 {
   "tool": "LogLens",
-  "schema": "loglens.report.v2",
-  "schema_version": 2,
+  "schema": "loglens.report.v3",
+  "schema_version": 3,
   "input": "tests/fixtures/report_contracts/journalctl_short_full/input.log",
   "input_mode": "journalctl_short_full",
   "timezone_present": true,
@@ -34,8 +34,10 @@
   ],
   "findings": [
     {
+      "finding_id": "finding:brute_force:4e6aec401a0d45ca",
       "rule_id": "brute_force",
       "rule": "brute_force",
+      "episode_index": 1,
       "subject_kind": "source_ip",
       "subject": "203.0.113.10",
       "grouping_key": "source_ip",
@@ -50,8 +52,10 @@
       "summary": "5 failed SSH attempts from 203.0.113.10 within 10 minutes."
     },
     {
+      "finding_id": "finding:multi_user_probing:d63d2a332522d0e3",
       "rule_id": "multi_user_probing",
       "rule": "multi_user_probing",
+      "episode_index": 1,
       "subject_kind": "source_ip",
       "subject": "203.0.113.10",
       "grouping_key": "source_ip",
@@ -66,8 +70,10 @@
       "summary": "203.0.113.10 targeted 5 usernames within 15 minutes."
     },
     {
+      "finding_id": "finding:sudo_burst:12c5005a84ce3296",
       "rule_id": "sudo_burst",
       "rule": "sudo_burst",
+      "episode_index": 1,
       "subject_kind": "username",
       "subject": "alice",
       "grouping_key": "username",
diff --git a/tests/fixtures/report_contracts/multi_host_journalctl_short_full/report.json b/tests/fixtures/report_contracts/multi_host_journalctl_short_full/report.json
index 69f0f36..3d19049 100644
--- a/tests/fixtures/report_contracts/multi_host_journalctl_short_full/report.json
+++ b/tests/fixtures/report_contracts/multi_host_journalctl_short_full/report.json
@@ -1,7 +1,7 @@
 {
   "tool": "LogLens",
-  "schema": "loglens.report.v2",
-  "schema_version": 2,
+  "schema": "loglens.report.v3",
+  "schema_version": 3,
   "input": "tests/fixtures/report_contracts/multi_host_journalctl_short_full/input.log",
   "input_mode": "journalctl_short_full",
   "timezone_present": true,
@@ -62,8 +62,10 @@
   ],
   "findings": [
     {
+      "finding_id": "finding:brute_force:60b8cb14f1f32393",
       "rule_id": "brute_force",
       "rule": "brute_force",
+      "episode_index": 1,
       "subject_kind": "source_ip",
       "subject": "203.0.113.10",
       "grouping_key": "source_ip",
@@ -78,8 +80,10 @@
       "summary": "5 failed SSH attempts from 203.0.113.10 within 10 minutes."
     },
     {
+      "finding_id": "finding:multi_user_probing:2ca1172c80b28d2a",
       "rule_id": "multi_user_probing",
       "rule": "multi_user_probing",
+      "episode_index": 1,
       "subject_kind": "source_ip",
       "subject": "203.0.113.10",
       "grouping_key": "source_ip",
@@ -94,8 +98,10 @@
       "summary": "203.0.113.10 targeted 5 usernames within 15 minutes."
     },
     {
+      "finding_id": "finding:sudo_burst:76fbf33350f4b4fc",
       "rule_id": "sudo_burst",
       "rule": "sudo_burst",
+      "episode_index": 1,
       "subject_kind": "username",
       "subject": "alice",
       "grouping_key": "username",
diff --git a/tests/fixtures/report_contracts/multi_host_syslog_legacy/report.json b/tests/fixtures/report_contracts/multi_host_syslog_legacy/report.json
index 177dbed..102ad89 100644
--- a/tests/fixtures/report_contracts/multi_host_syslog_legacy/report.json
+++ b/tests/fixtures/report_contracts/multi_host_syslog_legacy/report.json
@@ -1,7 +1,7 @@
 {
   "tool": "LogLens",
-  "schema": "loglens.report.v2",
-  "schema_version": 2,
+  "schema": "loglens.report.v3",
+  "schema_version": 3,
   "input": "tests/fixtures/report_contracts/multi_host_syslog_legacy/input.log",
   "input_mode": "syslog_legacy",
   "assume_year": 2026,
@@ -63,8 +63,10 @@
   ],
   "findings": [
     {
+      "finding_id": "finding:brute_force:60b8cb14f1f32393",
       "rule_id": "brute_force",
       "rule": "brute_force",
+      "episode_index": 1,
       "subject_kind": "source_ip",
       "subject": "203.0.113.10",
       "grouping_key": "source_ip",
@@ -79,8 +81,10 @@
       "summary": "5 failed SSH attempts from 203.0.113.10 within 10 minutes."
     },
     {
+      "finding_id": "finding:multi_user_probing:2ca1172c80b28d2a",
       "rule_id": "multi_user_probing",
       "rule": "multi_user_probing",
+      "episode_index": 1,
       "subject_kind": "source_ip",
       "subject": "203.0.113.10",
       "grouping_key": "source_ip",
@@ -95,8 +99,10 @@
       "summary": "203.0.113.10 targeted 5 usernames within 15 minutes."
     },
     {
+      "finding_id": "finding:sudo_burst:76fbf33350f4b4fc",
       "rule_id": "sudo_burst",
       "rule": "sudo_burst",
+      "episode_index": 1,
       "subject_kind": "username",
       "subject": "alice",
       "grouping_key": "username",
diff --git a/tests/fixtures/report_contracts/separated_bursts_syslog/findings.csv b/tests/fixtures/report_contracts/separated_bursts_syslog/findings.csv
new file mode 100644
index 0000000..1392518
--- /dev/null
+++ b/tests/fixtures/report_contracts/separated_bursts_syslog/findings.csv
@@ -0,0 +1,3 @@
+rule,subject_kind,subject,event_count,window_start,window_end,usernames,summary
+brute_force,source_ip,203.0.113.40,5,2026-03-10 09:00:00,2026-03-10 09:04:00,,5 failed SSH attempts from 203.0.113.40 within 10 minutes.
+brute_force,source_ip,203.0.113.40,5,2026-03-10 15:00:00,2026-03-10 15:04:00,,5 failed SSH attempts from 203.0.113.40 within 10 minutes.
diff --git a/tests/fixtures/report_contracts/separated_bursts_syslog/input.log b/tests/fixtures/report_contracts/separated_bursts_syslog/input.log
new file mode 100644
index 0000000..24ffb61
--- /dev/null
+++ b/tests/fixtures/report_contracts/separated_bursts_syslog/input.log
@@ -0,0 +1,10 @@
+Mar 10 09:00:00 example-host sshd[3001]: Failed password for user001 from 203.0.113.40 port 52001 ssh2
+Mar 10 09:01:00 example-host sshd[3002]: Failed password for user001 from 203.0.113.40 port 52002 ssh2
+Mar 10 09:02:00 example-host sshd[3003]: Failed password for user001 from 203.0.113.40 port 52003 ssh2
+Mar 10 09:03:00 example-host sshd[3004]: Failed password for user001 from 203.0.113.40 port 52004 ssh2
+Mar 10 09:04:00 example-host sshd[3005]: Failed password for user001 from 203.0.113.40 port 52005 ssh2
+Mar 10 15:00:00 example-host sshd[3006]: Failed password for user001 from 203.0.113.40 port 53001 ssh2
+Mar 10 15:01:00 example-host sshd[3007]: Failed password for user001 from 203.0.113.40 port 53002 ssh2
+Mar 10 15:02:00 example-host sshd[3008]: Failed password for user001 from 203.0.113.40 port 53003 ssh2
+Mar 10 15:03:00 example-host sshd[3009]: Failed password for user001 from 203.0.113.40 port 53004 ssh2
+Mar 10 15:04:00 example-host sshd[3010]: Failed password for user001 from 203.0.113.40 port 53005 ssh2
diff --git a/tests/fixtures/report_contracts/separated_bursts_syslog/report.json b/tests/fixtures/report_contracts/separated_bursts_syslog/report.json
new file mode 100644
index 0000000..e737b68
--- /dev/null
+++ b/tests/fixtures/report_contracts/separated_bursts_syslog/report.json
@@ -0,0 +1,67 @@
+{
+  "tool": "LogLens",
+  "schema": "loglens.report.v3",
+  "schema_version": 3,
+  "input": "tests/fixtures/report_contracts/separated_bursts_syslog/input.log",
+  "input_mode": "syslog_legacy",
+  "assume_year": 2026,
+  "timezone_present": false,
+  "parser_quality": {
+    "total_input_lines": 10,
+    "total_lines": 10,
+    "skipped_blank_lines": 0,
+    "parsed_lines": 10,
+    "unparsed_lines": 0,
+    "parse_success_rate": 1.0000,
+    "top_unknown_patterns": [
+    ],
+    "failure_categories": [
+    ]
+  },
+  "parsed_event_count": 10,
+  "warning_count": 0,
+  "finding_count": 2,
+  "event_counts": [
+    {"event_type": "ssh_failed_password", "count": 10}
+  ],
+  "findings": [
+    {
+      "finding_id": "finding:brute_force:de1640e1c994e851",
+      "rule_id": "brute_force",
+      "rule": "brute_force",
+      "episode_index": 1,
+      "subject_kind": "source_ip",
+      "subject": "203.0.113.40",
+      "grouping_key": "source_ip",
+      "threshold": 5,
+      "observed_count": 5,
+      "event_count": 5,
+      "window_start": "2026-03-10 09:00:00",
+      "window_end": "2026-03-10 09:04:00",
+      "evidence_event_ids": ["line:1", "line:2", "line:3", "line:4", "line:5"],
+      "verdict_boundary": "triage_signal_not_compromise_or_attribution",
+      "usernames": [],
+      "summary": "5 failed SSH attempts from 203.0.113.40 within 10 minutes."
+    },
+    {
+      "finding_id": "finding:brute_force:d5c73d8bf41cdc59",
+      "rule_id": "brute_force",
+      "rule": "brute_force",
+      "episode_index": 2,
+      "subject_kind": "source_ip",
+      "subject": "203.0.113.40",
+      "grouping_key": "source_ip",
+      "threshold": 5,
+      "observed_count": 5,
+      "event_count": 5,
+      "window_start": "2026-03-10 15:00:00",
+      "window_end": "2026-03-10 15:04:00",
+      "evidence_event_ids": ["line:6", "line:7", "line:8", "line:9", "line:10"],
+      "verdict_boundary": "triage_signal_not_compromise_or_attribution",
+      "usernames": [],
+      "summary": "5 failed SSH attempts from 203.0.113.40 within 10 minutes."
+    }
+  ],
+  "warnings": [
+  ]
+}
diff --git a/tests/fixtures/report_contracts/separated_bursts_syslog/report.md b/tests/fixtures/report_contracts/separated_bursts_syslog/report.md
new file mode 100644
index 0000000..8fd9a85
--- /dev/null
+++ b/tests/fixtures/report_contracts/separated_bursts_syslog/report.md
@@ -0,0 +1,40 @@
+# LogLens Report
+
+## Summary
+
+- Input: `tests/fixtures/report_contracts/separated_bursts_syslog/input.log`
+- Input mode: syslog_legacy
+- Assume year: 2026
+- Timezone present: false
+- Total input lines: 10
+- Total lines: 10
+- Skipped blank lines: 0
+- Parsed lines: 10
+- Unparsed lines: 0
+- Parse success rate: 100.00%
+- Parsed events: 10
+- Findings: 2
+- Parser warnings: 0
+
+## Findings
+
+| Rule | Subject | Count | Window | Notes |
+| --- | --- | ---: | --- | --- |
+| brute_force | 203.0.113.40 | 5 | 2026-03-10 09:00:00 -> 2026-03-10 09:04:00 | 5 failed SSH attempts from 203.0.113.40 within 10 minutes. |
+| brute_force | 203.0.113.40 | 5 | 2026-03-10 15:00:00 -> 2026-03-10 15:04:00 | 5 failed SSH attempts from 203.0.113.40 within 10 minutes. |
+
+## Event Counts
+
+| Event Type | Count |
+| --- | ---: |
+| ssh_failed_password | 10 |
+
+## Parser Quality
+
+All analyzed lines matched a supported pattern.
+
+No parser failure categories were recorded.
+
+## Parser Warnings
+
+No malformed lines were skipped.
diff --git a/tests/fixtures/report_contracts/separated_bursts_syslog/warnings.csv b/tests/fixtures/report_contracts/separated_bursts_syslog/warnings.csv
new file mode 100644
index 0000000..07b69cf
--- /dev/null
+++ b/tests/fixtures/report_contracts/separated_bursts_syslog/warnings.csv
@@ -0,0 +1 @@
+kind,line_number,category,message
diff --git a/tests/fixtures/report_contracts/syslog_legacy/report.json b/tests/fixtures/report_contracts/syslog_legacy/report.json
index 6c66fab..4f68946 100644
--- a/tests/fixtures/report_contracts/syslog_legacy/report.json
+++ b/tests/fixtures/report_contracts/syslog_legacy/report.json
@@ -1,7 +1,7 @@
 {
   "tool": "LogLens",
-  "schema": "loglens.report.v2",
-  "schema_version": 2,
+  "schema": "loglens.report.v3",
+  "schema_version": 3,
   "input": "tests/fixtures/report_contracts/syslog_legacy/input.log",
   "input_mode": "syslog_legacy",
   "assume_year": 2026,
@@ -35,8 +35,10 @@
   ],
   "findings": [
     {
+      "finding_id": "finding:brute_force:4e6aec401a0d45ca",
       "rule_id": "brute_force",
       "rule": "brute_force",
+      "episode_index": 1,
       "subject_kind": "source_ip",
       "subject": "203.0.113.10",
       "grouping_key": "source_ip",
@@ -51,8 +53,10 @@
       "summary": "5 failed SSH attempts from 203.0.113.10 within 10 minutes."
     },
     {
+      "finding_id": "finding:multi_user_probing:d63d2a332522d0e3",
       "rule_id": "multi_user_probing",
       "rule": "multi_user_probing",
+      "episode_index": 1,
       "subject_kind": "source_ip",
       "subject": "203.0.113.10",
       "grouping_key": "source_ip",
@@ -67,8 +71,10 @@
       "summary": "203.0.113.10 targeted 5 usernames within 15 minutes."
     },
     {
+      "finding_id": "finding:sudo_burst:12c5005a84ce3296",
       "rule_id": "sudo_burst",
       "rule": "sudo_burst",
+      "episode_index": 1,
       "subject_kind": "username",
       "subject": "alice",
       "grouping_key": "username",
diff --git a/tests/test_detector.cpp b/tests/test_detector.cpp
index a85e28f..200b0ad 100644
--- a/tests/test_detector.cpp
+++ b/tests/test_detector.cpp
@@ -149,6 +149,26 @@ std::vector build_two_sudo_episode_events() {
         "Mar 10 15:02:00 example-host sudo: user001 : TTY=pts/0 ; PWD=/home/user/project ; USER=root ; COMMAND=/usr/bin/systemctl reload ssh\n");
 }

+std::vector build_bruteforce_exact_boundary_events() {
+    return parse_events(
+        make_syslog_config(),
+        "Mar 10 09:00:00 example-host sshd[2201]: Failed password for user001 from 203.0.113.30 port 52001 ssh2\n"
+        "Mar 10 09:02:00 example-host sshd[2202]: Failed password for user001 from 203.0.113.30 port 52002 ssh2\n"
+        "Mar 10 09:04:00 example-host sshd[2203]: Failed password for user001 from 203.0.113.30 port 52003 ssh2\n"
+        "Mar 10 09:06:00 example-host sshd[2204]: Failed password for user001 from 203.0.113.30 port 52004 ssh2\n"
+        "Mar 10 09:10:00 example-host sshd[2205]: Failed password for user001 from 203.0.113.30 port 52005 ssh2\n");
+}
+
+std::vector build_bruteforce_over_boundary_events() {
+    return parse_events(
+        make_syslog_config(),
+        "Mar 10 09:00:00 example-host sshd[2301]: Failed password for user001 from 203.0.113.31 port 52001 ssh2\n"
+        "Mar 10 09:02:00 example-host sshd[2302]: Failed password for user001 from 203.0.113.31 port 52002 ssh2\n"
+        "Mar 10 09:04:00 example-host sshd[2303]: Failed password for user001 from 203.0.113.31 port 52003 ssh2\n"
+        "Mar 10 09:06:00 example-host sshd[2304]: Failed password for user001 from 203.0.113.31 port 52004 ssh2\n"
+        "Mar 10 09:10:01 example-host sshd[2305]: Failed password for user001 from 203.0.113.31 port 52005 ssh2\n");
+}
+
 std::vector build_publickey_bruteforce_candidate_events() {
     return parse_events(
         make_syslog_config(),
@@ -310,6 +330,14 @@ void test_bruteforce_emits_multiple_episodes_for_same_source() {

     const auto episodes = find_findings(findings, loglens::FindingType::BruteForce, "203.0.113.10");
     expect(episodes.size() == 2, "expected two brute-force episodes for the same source IP");
+    expect(episodes[0]->episode_index == 1, "expected first brute-force episode index");
+    expect(episodes[1]->episode_index == 2, "expected second brute-force episode index");
+    expect(episodes[0]->finding_id.rfind("finding:brute_force:", 0) == 0,
+           "expected first brute-force stable finding id");
+    expect(episodes[1]->finding_id.rfind("finding:brute_force:", 0) == 0,
+           "expected second brute-force stable finding id");
+    expect(episodes[0]->finding_id != episodes[1]->finding_id,
+           "expected separated brute-force episodes to have distinct finding ids");
     expect(episodes[0]->event_count == 5, "expected first brute-force episode count");
     expect(episodes[0]->observed_count == 5, "expected first brute-force observed count");
     expect(loglens::format_timestamp(episodes[0]->first_seen) == "2026-03-10 09:00:00",
@@ -338,6 +366,14 @@ void test_multi_user_emits_multiple_episodes_for_same_source() {

     const auto episodes = find_findings(findings, loglens::FindingType::MultiUserProbing, "203.0.113.20");
     expect(episodes.size() == 2, "expected two multi-user probing episodes for the same source IP");
+    expect(episodes[0]->episode_index == 1, "expected first multi-user episode index");
+    expect(episodes[1]->episode_index == 2, "expected second multi-user episode index");
+    expect(episodes[0]->finding_id.rfind("finding:multi_user_probing:", 0) == 0,
+           "expected first multi-user stable finding id");
+    expect(episodes[1]->finding_id.rfind("finding:multi_user_probing:", 0) == 0,
+           "expected second multi-user stable finding id");
+    expect(episodes[0]->finding_id != episodes[1]->finding_id,
+           "expected separated multi-user episodes to have distinct finding ids");
     expect(episodes[0]->event_count == 3, "expected first multi-user episode event count");
     expect(episodes[0]->observed_count == 3, "expected first multi-user episode distinct username count");
     expect((episodes[0]->usernames == std::vector{"user001", "user002", "user003"}),
@@ -364,6 +400,14 @@ void test_sudo_burst_emits_multiple_episodes_for_same_user() {

     const auto episodes = find_findings(findings, loglens::FindingType::SudoBurst, "user001");
     expect(episodes.size() == 2, "expected two sudo burst episodes for the same user");
+    expect(episodes[0]->episode_index == 1, "expected first sudo episode index");
+    expect(episodes[1]->episode_index == 2, "expected second sudo episode index");
+    expect(episodes[0]->finding_id.rfind("finding:sudo_burst:", 0) == 0,
+           "expected first sudo stable finding id");
+    expect(episodes[1]->finding_id.rfind("finding:sudo_burst:", 0) == 0,
+           "expected second sudo stable finding id");
+    expect(episodes[0]->finding_id != episodes[1]->finding_id,
+           "expected separated sudo episodes to have distinct finding ids");
     expect(episodes[0]->event_count == 3, "expected first sudo episode count");
     expect(episodes[0]->observed_count == 3, "expected first sudo episode observed count");
     expect(loglens::format_timestamp(episodes[0]->first_seen) == "2026-03-10 09:00:00",
@@ -383,6 +427,62 @@ void test_sudo_burst_emits_multiple_episodes_for_same_user() {
            "expected second sudo episode evidence ids");
 }

+void test_episode_identity_is_stable_for_unsorted_input_events() {
+    const auto ordered_events = build_two_bruteforce_episode_events();
+    const std::vector shuffled_events{
+        ordered_events[7],
+        ordered_events[0],
+        ordered_events[4],
+        ordered_events[8],
+        ordered_events[1],
+        ordered_events[5],
+        ordered_events[2],
+        ordered_events[9],
+        ordered_events[3],
+        ordered_events[6]};
+
+    const loglens::Detector detector;
+    const auto ordered_findings = detector.analyze(ordered_events);
+    const auto shuffled_findings = detector.analyze(shuffled_events);
+    const auto ordered_episodes = find_findings(ordered_findings, loglens::FindingType::BruteForce, "203.0.113.10");
+    const auto shuffled_episodes = find_findings(shuffled_findings, loglens::FindingType::BruteForce, "203.0.113.10");
+
+    expect(ordered_episodes.size() == 2, "expected ordered input to produce two brute-force episodes");
+    expect(shuffled_episodes.size() == 2, "expected shuffled input to produce two brute-force episodes");
+    for (std::size_t index = 0; index < ordered_episodes.size(); ++index) {
+        expect(ordered_episodes[index]->episode_index == shuffled_episodes[index]->episode_index,
+               "expected shuffled input to preserve episode index");
+        expect(ordered_episodes[index]->finding_id == shuffled_episodes[index]->finding_id,
+               "expected shuffled input to preserve stable finding id");
+        expect(ordered_episodes[index]->first_seen == shuffled_episodes[index]->first_seen,
+               "expected shuffled input to preserve episode start");
+        expect(ordered_episodes[index]->last_seen == shuffled_episodes[index]->last_seen,
+               "expected shuffled input to preserve episode end");
+    }
+
+    auto relabeled_episode = *ordered_episodes[0];
+    relabeled_episode.episode_index = 99;
+    expect(loglens::build_finding_id(relabeled_episode) == ordered_episodes[0]->finding_id,
+           "expected finding id to remain independent of episode index");
+}
+
+void test_bruteforce_window_boundary_is_inclusive() {
+    const loglens::Detector detector;
+
+    const auto exact_findings = detector.analyze(build_bruteforce_exact_boundary_events());
+    const auto exact = find_findings(exact_findings, loglens::FindingType::BruteForce, "203.0.113.30");
+    expect(exact.size() == 1, "expected exact ten-minute boundary to count inside the window");
+    expect(exact[0]->event_count == 5, "expected exact-boundary brute-force count");
+    expect(loglens::format_timestamp(exact[0]->first_seen) == "2026-03-10 09:00:00",
+           "expected exact-boundary window start");
+    expect(loglens::format_timestamp(exact[0]->last_seen) == "2026-03-10 09:10:00",
+           "expected exact-boundary window end");
+
+    const auto over_findings = detector.analyze(build_bruteforce_over_boundary_events());
+    const auto over = find_findings(over_findings, loglens::FindingType::BruteForce, "203.0.113.31");
+    expect(over.empty(), "expected one second over the window to stay below threshold");
+}
+
 void test_auth_signal_defaults() {
     const auto events = parse_events(
         make_syslog_config(),
@@ -637,6 +737,8 @@ int main() {
     test_bruteforce_emits_multiple_episodes_for_same_source();
     test_multi_user_emits_multiple_episodes_for_same_source();
     test_sudo_burst_emits_multiple_episodes_for_same_user();
+    test_episode_identity_is_stable_for_unsorted_input_events();
+    test_bruteforce_window_boundary_is_inclusive();
     test_auth_signal_defaults();
     test_failed_publickey_contributes_to_bruteforce_by_default();
     test_accepted_publickey_success_stays_out_of_failure_signals();
diff --git a/tests/test_parser.cpp b/tests/test_parser.cpp
index 4b15cad..e3a980d 100644
--- a/tests/test_parser.cpp
+++ b/tests/test_parser.cpp
@@ -926,6 +926,28 @@ void test_parser_failure_taxonomy() {
            "expected fifth warning category");
 }

+void test_malformed_source_ip_token_corpus() {
+    const auto parser = make_syslog_parser();
+    const std::vector malformed_tokens{
+        "not_an_ip",
+        "999.0.113.10",
+        "203.0.113.300",
+        "203.0.113.10,"};
+
+    for (std::size_t index = 0; index < malformed_tokens.size(); ++index) {
+        const auto line = "Mar 10 08:00:20 example-host sshd[1002]: Failed password for root from " +
+            malformed_tokens[index] + " port 50101 ssh2";
+        std::string error;
+        loglens::ParserFailureCategory category = loglens::ParserFailureCategory::KnownProgramUnknownMessage;
+        const auto event = parser.parse_line(line, index + 1, &error, &category);
+
+        expect(!event.has_value(), "expected malformed source token to stay out of normalized events");
+        expect(category == loglens::ParserFailureCategory::MalformedSourceIp,
+               "expected malformed source token to use malformed_source_ip category");
+        expect(error == "malformed source IP", "expected malformed source token reason");
+    }
+}
+
 void test_unknown_auth_patterns_are_warnings_only() {
     const auto parser = make_syslog_parser();
     std::istringstream input(
@@ -1336,6 +1358,7 @@ int main() {
     test_journalctl_auth_family_fixture_file();
     test_malformed_line();
     test_parser_failure_taxonomy();
+    test_malformed_source_ip_token_corpus();
     test_unknown_auth_patterns_are_warnings_only();
     test_stream_warnings_and_metadata();
     test_stream_tracks_skipped_blank_lines();
diff --git a/tests/test_report.cpp b/tests/test_report.cpp
index c7d4990..4e35b6d 100644
--- a/tests/test_report.cpp
+++ b/tests/test_report.cpp
@@ -171,8 +171,12 @@ void test_json_finding_includes_explainability_fields() {

     const auto json = loglens::render_json_report(data);

+    expect(json.find("\"finding_id\": \"finding:sudo_burst:") != std::string::npos,
+           "expected json finding to include stable finding id");
     expect(json.find("\"rule_id\": \"sudo_burst\"") != std::string::npos,
            "expected json finding to include rule id");
+    expect(json.find("\"episode_index\": 1") != std::string::npos,
+           "expected json finding to include episode index");
     expect(json.find("\"grouping_key\": \"username\"") != std::string::npos,
            "expected json finding to include grouping key");
     expect(json.find("\"threshold\": 3") != std::string::npos,
@@ -190,9 +194,9 @@ void test_json_finding_includes_explainability_fields() {
 void test_json_report_includes_schema_identity() {
     const auto json = loglens::render_json_report(make_report_data());

-    expect(json.find("\"schema\": \"loglens.report.v2\"") != std::string::npos,
+    expect(json.find("\"schema\": \"loglens.report.v3\"") != std::string::npos,
            "expected json report to include schema identifier");
-    expect(json.find("\"schema_version\": 2") != std::string::npos,
+    expect(json.find("\"schema_version\": 3") != std::string::npos,
            "expected json report to include schema version");
 }

diff --git a/tests/test_report_contracts.cpp b/tests/test_report_contracts.cpp
index b9727fe..1662f98 100644
--- a/tests/test_report_contracts.cpp
+++ b/tests/test_report_contracts.cpp
@@ -159,8 +159,10 @@ std::vector extract_json_contract_lines(const std::string& json) {
             || starts_with(line, "{\"pattern\": ")
             || starts_with(line, "{\"category\": ")
             || starts_with(line, "{\"event_type\": ")
+            || starts_with(line, "\"finding_id\": ")
             || starts_with(line, "\"rule_id\": ")
             || starts_with(line, "\"rule\": ")
+            || starts_with(line, "\"episode_index\": ")
             || starts_with(line, "\"subject_kind\": ")
             || starts_with(line, "\"subject\": ")
             || starts_with(line, "\"grouping_key\": ")
@@ -337,6 +339,12 @@ int main(int argc, char* argv[]) {
             fixture_root / "multi_host_journalctl_short_full",
             output_root,
             "journalctl-short-full");
+        run_report_contract_case(
+            loglens_exe,
+            fixture_root / "separated_bursts_syslog",
+            output_root,
+            "syslog",
+            "--year 2026");
         run_report_contract_case(
             loglens_exe,
             fixture_root / "syslog_legacy",
@@ -365,6 +373,13 @@ int main(int argc, char* argv[]) {
             "journalctl-short-full",
             "--csv",
             true);
+        run_report_contract_case(
+            loglens_exe,
+            fixture_root / "separated_bursts_syslog",
+            output_root,
+            "syslog",
+            "--year 2026 --csv",
+            true);
     } catch (...) {
         std::filesystem::current_path(original_cwd);
         throw;