feat(theta): add jaccard similarity #142

Open
hawkingrei wants to merge 5 commits into apache:main from hawkingrei:feat/theta-jaccard-similarity
@hawkingrei

Summary

  • Add ThetaJaccardSimilarity for Theta sketches.
  • Return lower bound, estimate, and upper bound for the Jaccard index.
  • Port core coverage from the C++ theta_jaccard_similarity_test.cpp cases.
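For readers new to the metric, here is a minimal sketch of the quantity being estimated, computed exactly on plain hash sets. The function name `exact_jaccard` is hypothetical and is not part of this PR; treating two empty sets as identical (index 1.0) is an assumption of the sketch.

```rust
use std::collections::HashSet;

/// Exact Jaccard index |A ∩ B| / |A ∪ B| for two sets of hashed items.
/// Hypothetical helper for illustration; not the API added by this PR.
fn exact_jaccard(a: &HashSet<u64>, b: &HashSet<u64>) -> f64 {
    // Assumption: two empty sets are considered identical.
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    let intersection = a.intersection(b).count() as f64;
    let union = a.union(b).count() as f64;
    // The union is non-empty here, so the division is well defined.
    intersection / union
}
```

For example, with A = 0..100 and B = 50..150 the intersection has 50 items and the union has 150, so the index is 1/3. A sketch-based version cannot compute this exactly, which is why the API returns a lower bound and an upper bound around the estimate.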

Motivation

This closes one small P0 parity gap with apache/datasketches-cpp: Rust had Theta sketching and intersection support, but no Jaccard similarity API.

Implementation Notes

  • Reuses existing ThetaIntersection for the intersection side.
  • Builds the two-input union locally inside the Jaccard implementation without exposing a full ThetaUnion API in this PR.
  • Ports the C++ sampled-ratio bound approximation locally for Jaccard bounds.
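To illustrate the shape of the computation described in these notes, the following is a rough sketch that turns two retained-entry counts into `[lower, estimate, upper]`. It is not the code in this PR: the names `wilson_interval` and `jaccard_bounds` are hypothetical, and a Wilson score interval is swapped in for the ported C++ approximation. It also assumes the union and intersection counts were taken at a common theta, so their ratio behaves like a binomial proportion.

```rust
/// Wilson score interval for k successes in n > 0 trials.
/// Stand-in for the ported approximation; NOT the datasketches formula.
fn wilson_interval(n: u64, k: u64, num_std_devs: f64) -> (f64, f64) {
    let n_f = n as f64;
    let p_hat = k as f64 / n_f;
    let z2 = num_std_devs * num_std_devs;
    let denom = 1.0 + z2 / n_f;
    let center = (p_hat + z2 / (2.0 * n_f)) / denom;
    let spread = (p_hat * (1.0 - p_hat) / n_f + z2 / (4.0 * n_f * n_f)).sqrt();
    let half_width = num_std_devs * spread / denom;
    // Clamp to [0, 1] because the result is a proportion.
    ((center - half_width).max(0.0), (center + half_width).min(1.0))
}

/// Returns [lower, estimate, upper] for intersection / union, or None when
/// the counts are unusable (empty union, or intersection larger than union).
/// Assumption: both counts are retained entries at the same theta.
fn jaccard_bounds(
    union_retained: u64,
    intersection_retained: u64,
    num_std_devs: f64,
) -> Option<[f64; 3]> {
    if union_retained == 0 || intersection_retained > union_retained {
        return None;
    }
    let estimate = intersection_retained as f64 / union_retained as f64;
    let (lower, upper) = wilson_interval(union_retained, intersection_retained, num_std_devs);
    Some([lower, estimate, upper])
}
```

With 100 retained union entries, 50 retained intersection entries, and two standard deviations, this gives an estimate of 0.5 with bounds of roughly 0.40 and 0.60. The actual bounds in the PR come from the ported C++ routine and may differ.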

Tests

  • cargo check -p datasketches --features theta
  • cargo test -p datasketches --features theta --test theta_jaccard_similarity_test
  • cargo test -p datasketches --features theta --lib --test theta_intersection_test --test theta_jaccard_similarity_test --test theta_sketch_test

The full cargo test -p datasketches --features theta run currently hits unrelated failures in theta_serialization_test because the local datasketches/tests/serialization_test_data/... files are missing.

tisonkun requested a review from ZENOTME July 2, 2026 09:41
tisonkun added 2 commits July 2, 2026 18:13
Signed-off-by: tison <wander4096@gmail.com>
}
}

fn compute_union<A: ThetaSketchView, B: ThetaSketchView>(

Contributor

It looks like we should have a union implementation in place first.

}
}

fn approximate_lower_bound_on_p(n: u64, k: u64, num_std_devs: f64) -> f64 {

Contributor

These are general math functions that could be placed in bounds_binomial_proportions.rs.

@ZENOTME

ZENOTME commented Jul 2, 2026

Contributor

We are also missing exactly_equal, but it's OK to postpone that to the next PR.


3 participants