feat(otel-collector): add OpenTelemetryCollector CRD, controller, and render #4979

Draft
tianfeng92 wants to merge 17 commits into tigera:master from tianfeng92:PMREQ-822-otel-collector

@tianfeng92

Contributor

Summary

  • Adds OpenTelemetryCollector CRD with support for logs (OTLP receiver) and metrics (Prometheus receiver) pipelines
  • Implements controller with license gating, mTLS certificate management, and deployment override validation
  • Renders StatefulSet with dynamically generated OTel Collector config (receivers, processors, exporters, pipelines) based on CR spec
  • Extends LogCollector types to carry OTel-related fields for fluent-bit integration
  • Uses combined calico mono-image (CombinedCalicoImage) consistent with upstream refactor

Changes

| Area | Files |
|---|---|
| CRD types | api/v1/otelcollector_types.go, api/v1/logcollector_types.go, zz_generated.deepcopy.go |
| Controller | pkg/controller/otelcollector/controller.go, internal/controller/ |
| Render | pkg/render/otelcollector/component.go |
| Validation | pkg/common/validation/otelcollector/validation.go |
| LogCollector integration | pkg/controller/logcollector/logcollector_controller.go |
| Tests | pkg/render/otelcollector/component_test.go, pkg/controller/otelcollector/otelcollector_controller_test.go |
| Generated CRDs | pkg/imports/crds/operator/ |

Test plan

  • Render tests pass (22 cases): make ut UT_DIR=./pkg/render/otelcollector
  • Controller tests pass (6 cases): make ut UT_DIR=./pkg/controller/otelcollector
  • Pre-commit hooks (goimports, formatting) pass
  • Full CI (make ci)
  • Deploy to test cluster and verify collector receives logs/metrics

🤖 Generated with Claude Code

@tianfeng92 tianfeng92 requested review from a team and marvin-tigera as code owners June 29, 2026 19:52
@marvin-tigera marvin-tigera added this to the v1.44.0 milestone Jun 29, 2026
@tianfeng92 tianfeng92 marked this pull request as draft June 29, 2026 19:52
hjiawei and others added 15 commits June 30, 2026 16:37
Replace the fluentd DaemonSet with fluent-bit for log collection and
forwarding. The LogCollector controller now renders the calico-fluent-bit
DaemonSet (Linux and Windows) in calico-system, and pkg/render/fluentd.go is
replaced by fluentbit.go.

- Ship fluent-bit logs to Linseed through its built-in http output.
- Rename the FluentdDaemonSet* API types to FluentBitDaemonSet*
  (fluentd_daemonset_types.go -> fluentbit_daemonset_types.go). Preserve the
  deprecated fluentdDaemonSet override field name/json tag as an alias and
  widen its enums to accept both the new calico-fluent-bit* names and the
  legacy fluentd names so existing LogCollector specs still validate;
  translateLegacyFluentdOverrides remaps the legacy names.
- Warn on invalid fluent-bit-filters ConfigMap content (e.g. left in fluentd
  <filter> syntax) instead of silently dropping it.
- Drop the bogus "calico-fluent-bit" entry from the manager cluster-wide
  namespace list; fluent-bit runs in calico-system, which is already listed.
- Regenerate deepcopy, the operator CRD and enterprise versions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jiawei Huang <jiawei@tigera.io>
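
The legacy-name remapping described in the commit above can be illustrated with a small lookup table. This is a minimal sketch, not the PR's `translateLegacyFluentdOverrides`: the helper `remapLegacyName` and the specific legacy names in the map are assumptions chosen for illustration; only the `calico-fluent-bit` prefix comes from the commit message.

```go
package main

import "fmt"

// legacyToNew maps legacy override names to their fluent-bit replacements.
// ASSUMPTION: the legacy names on the left are illustrative placeholders;
// the real enum values live in the operator's API types.
var legacyToNew = map[string]string{
	"fluentd-node":         "calico-fluent-bit",
	"fluentd-node-windows": "calico-fluent-bit-windows",
}

// remapLegacyName (hypothetical helper) returns the new name for a legacy
// one and passes every other name through unchanged, so specs that already
// use the new names are not modified.
func remapLegacyName(name string) string {
	if replacement, ok := legacyToNew[name]; ok {
		return replacement
	}
	return name
}

func main() {
	for _, n := range []string{"fluentd-node", "calico-fluent-bit", "something-else"} {
		fmt.Printf("%s -> %s\n", n, remapLegacyName(n))
	}
}
```

Passing unknown names through unchanged keeps the remap idempotent, which matters if it runs on every reconcile.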
Forward flow, DNS and policy-activity logs from non-cluster hosts through
voltron to the in-cluster calico-fluent-bit http input, and on to Linseed.

- Grant dnslogs (alongside flowlogs and policyactivity) on the
  non-cluster-host ClusterRole so the minted host token passes voltron's
  SubjectAccessReview for the DNS ingestion path instead of 403ing.
- Set VOLTRON_LOG_COLLECTOR_CA_BUNDLE_PATH on the manager so voltron verifies
  the calico-fluent-bit http input's TLS server certificate against the
  trusted CA bundle it already mounts (the config default
  /etc/pki/tls/certs/ca.crt is not mounted, so the handshake otherwise fails).
- Pass NonClusterHost to the Windows fluent-bit configuration so the Linux
  and Windows renders produce the shared allow-calico-fluent-bit NetworkPolicy
  identically. Otherwise, on clusters with Windows nodes, the port-9880
  ingress rule (voltron -> http input) flapped on every reconcile and
  intermittently dropped voltron's access. Adds a controller regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jiawei Huang <jiawei@tigera.io>
pkg/render/fluentbit.go had grown to ~1900 lines. Move the fluent-bit /
EKS log-forwarder rendering into a new pkg/render/logcollector package,
split across focused files (logcollector core, config, outputs, daemonset,
rbac, networkpolicy, eks_log_forwarder) plus the moved tests.

A small set of symbols stays in package render (new pkg/render/logcollector.go)
to avoid a render -> render/logcollector import cycle, since Guardian, Manager,
compliance, apiserver, dex and intrusion detection reference them: the
log-collector network-policy identity (FluentBitSourceEntityRule,
EKSLogForwarderEntityRule, LogCollectorNamespace, the fluent-bit node names,
FluentBitInputService), the shared Linseed-token constants, and the
TrustedBundleVolume helper. The logcollector package aliases these. The shared
pod helper setNodeCriticalPod is exported as SetNodeCriticalPod (matching its
sibling SetClusterCriticalPod).

Pure code move; no behavior change. Build, vet, unit tests, format-check and
gen-files all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jiawei Huang <jiawei@tigera.io>
… render

Add the full OTel Collector operator support: CRD types, controller with
license gating, render component, deployment override validation, fluentd
integration, and unit tests for all layers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the fluentforward protocol with a custom fluentdhttp receiver
that accepts Fluentd's out_http JSON format, enabling future mTLS support
via the OTel confighttp framework.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pipeline

Route metrics from prometheus receiver to a dedicated prometheusremotewrite
exporter instead of sharing log exporters. Add Prometheus port 9090 to
network policy egress rules when metrics are enabled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
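
A rough sketch of the conditional egress-port logic from the commit above, using only the standard library. The helper `egressPorts` and its signature are assumptions made for illustration; the PR builds Calico `v3` rules with the operator's `networkpolicy` helpers instead. Only the Prometheus port 9090 and the "when metrics are enabled" condition come from the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// prometheusPort is the Prometheus port named in the commit message.
const prometheusPort uint16 = 9090

// egressPorts (hypothetical helper) returns the de-duplicated, sorted set of
// TCP ports that the collector's egress policy should allow. The Prometheus
// port is added only when the metrics pipeline is enabled.
func egressPorts(exporterPorts []uint16, metricsEnabled bool) []uint16 {
	seen := make(map[uint16]bool)
	var out []uint16
	add := func(p uint16) {
		if !seen[p] {
			seen[p] = true
			out = append(out, p)
		}
	}
	for _, p := range exporterPorts {
		add(p)
	}
	if metricsEnabled {
		add(prometheusPort)
	}
	// Sort so that repeated renders produce an identical rule list.
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	fmt.Println(egressPorts([]uint16{4318, 4318}, false))
	fmt.Println(egressPorts([]uint16{4318}, true))
}
```

Sorting the result is a design choice in this sketch: a stable order avoids spurious policy diffs between reconciles.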
…tests

Verify the metrics pipeline uses the dedicated prometheusremotewrite
exporter and that the network policy includes Prometheus port egress
when metrics are enabled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd mTLS

Add prometheus receiver with Kubernetes SD and mTLS for scraping calico-node
metrics. Render TLS volume mounts from certificate manager. Switch config
generation from string builder to Go template for maintainability. Add
memory_limiter processor and resource limits. Wire OTel log types through
to fluent-bit output rendering.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
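
The move from a string builder to a Go template, described in the commit above, might look roughly like the following. This is a minimal sketch under stated assumptions: `collectorSpec`, `renderConfig`, the `debug` log exporter, and the remote-write endpoint URL are all hypothetical stand-ins, not the PR's types or config. The receiver, processor, and exporter names (`otlp`, `prometheus`, `memory_limiter`, `prometheusremotewrite`) and port 4318 are taken from the commit messages.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// collectorSpec is a hypothetical, simplified stand-in for the CR spec.
type collectorSpec struct {
	LogsEnabled    bool
	MetricsEnabled bool
}

// configTmpl renders only the sections that the spec enables.
// ASSUMPTION: exporter choices and the endpoint URL are placeholders.
const configTmpl = `receivers:
{{- if .LogsEnabled }}
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
{{- end }}
{{- if .MetricsEnabled }}
  prometheus:
    config:
      scrape_configs: []
{{- end }}
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
exporters:
  debug: {}
{{- if .MetricsEnabled }}
  prometheusremotewrite:
    endpoint: http://prometheus.example:9090/api/v1/write
{{- end }}
service:
  pipelines:
{{- if .LogsEnabled }}
    logs:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [debug]
{{- end }}
{{- if .MetricsEnabled }}
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter]
      exporters: [prometheusremotewrite]
{{- end }}
`

// renderConfig executes the template against the spec and returns the YAML.
func renderConfig(spec collectorSpec) (string, error) {
	t, err := template.New("otel-config").Parse(configTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, spec); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	cfg, err := renderConfig(collectorSpec{LogsEnabled: true, MetricsEnabled: true})
	if err != nil {
		panic(err)
	}
	fmt.Print(cfg)
}
```

A template keeps the YAML shape visible in one place, whereas a string builder scatters indentation across many write calls. The sketch does not handle the case where both pipelines are disabled, which would yield an empty `receivers` section.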
Fluent-bit's native opentelemetry output plugin is already compiled in,
so we skip the bridge phase and go straight to OTLP end-to-end. Remove
LogForwarderProtocol abstraction, FluentForwardPort, and fluentdhttp
references. Fluent-bit now uses out_opentelemetry targeting port 4318.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Guard body field access with IsMap(body) to prevent "log bodies of
type Str cannot be indexed" warnings when logs arrive as plain strings.
Simplify audit classification to match all audit logs by auditID only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
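
The guard described in the commit above is written in the collector's transform language (`IsMap(body)`). The same idea, shown as a Go analogy rather than the actual collector config: check that the body is a map before indexing it, and treat any map carrying `auditID` as an audit log. The function `classify` and its return labels are hypothetical.

```go
package main

import "fmt"

// classify (hypothetical) returns "audit" when the log body is a map that
// carries an auditID key, and "other" for everything else.
func classify(body any) string {
	// Guard first: a plain-string body cannot be indexed by key.
	fields, ok := body.(map[string]any)
	if !ok {
		return "other"
	}
	// Simplified rule from the commit: presence of auditID alone decides.
	if _, hasID := fields["auditID"]; hasID {
		return "audit"
	}
	return "other"
}

func main() {
	fmt.Println(classify("plain text line"))
	fmt.Println(classify(map[string]any{"auditID": "abc-123", "verb": "get"}))
	fmt.Println(classify(map[string]any{"msg": "hello"}))
}
```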
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…te otel-collector image

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n and logcollector controller

The OTel collector has its own controller and render path — the fluentd
render code doesn't need OTelCollectorEnabled or OTelLogTypes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…metrics, and dynamic egress rules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tianfeng92 tianfeng92 force-pushed the PMREQ-822-otel-collector branch from 69eaaa5 to 48c6ba9 Compare June 30, 2026 23:37
Action: v3.Allow,
Protocol: &networkpolicy.TCPProtocol,
Destination: v3.EntityRule{
Ports: networkpolicy.Ports(uint16(p)),
…ess rule

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…prometheus

The "no metrics disabled" test checked for absence of "prometheus:" which
now always appears in the telemetry.metrics.readers block. Assert on
"scrape_configs:" instead, which is specific to the prometheus receiver.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
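
To make the reasoning in the commit above concrete, here is a small sketch of why matching on `prometheus:` gives a false positive and `scrape_configs:` does not. The helper names and the two sample configs are assumptions written for illustration; they are not the PR's test fixtures.

```go
package main

import (
	"fmt"
	"strings"
)

// telemetryOnlyConfig has no prometheus receiver, yet still contains the
// string "prometheus:" inside the self-telemetry readers block.
const telemetryOnlyConfig = `service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888
`

// withReceiverConfig additionally declares a prometheus receiver.
const withReceiverConfig = `receivers:
  prometheus:
    config:
      scrape_configs: []
` + telemetryOnlyConfig

// naiveHasPrometheus is the old, over-broad check.
func naiveHasPrometheus(config string) bool {
	return strings.Contains(config, "prometheus:")
}

// hasPrometheusReceiver keys on scrape_configs, which in these samples
// appears only under the prometheus receiver.
func hasPrometheusReceiver(config string) bool {
	return strings.Contains(config, "scrape_configs:")
}

func main() {
	fmt.Println(naiveHasPrometheus(telemetryOnlyConfig), hasPrometheusReceiver(telemetryOnlyConfig))
	fmt.Println(naiveHasPrometheus(withReceiverConfig), hasPrometheusReceiver(withReceiverConfig))
}
```

A substring check is still a coarse assertion; parsing the YAML and inspecting the `receivers` key would be stricter, at the cost of a longer test.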
4 participants