Taming Tenancy, Cost and Architecture at Collibra

Through OpenTelemetry and our Telemetry Backbone

Alex Van Boxel
Principal Systems Architect, Collibra

Almost 30 years in the sector
  • Mostly as Software Engineer: Web - 3D - Middleware - Mobile - Big Data
  • More recently as Architect: Data - SRE - Infrastructure

Community
  • Apache Beam contributor
  • OpenTelemetry Collector contributor

A data intelligence platform powered by active metadata

 

AI Governance

Data Catalog

Data Governance

Data Lineage

Data Notebook

Data Privacy

Data Quality & Observability

Protect

"How much does X cost?"

"Compare the Usage and Reservation?"

Trend analysis per tenant

Architecture and Collection at the Edge

Collibra Architecture

  • On-prem heritage: single deployable monolith → hosted SaaS on VMs (single-tenant isolation for free)
  • Kubernetes shift: microservices for team velocity & polyglot (Python now dominant in AI)
  • The pivot: "container per service per tenant" was unsustainable → shared multi-tenant on K8s

Shared multi-tenancy saves cost — but makes it harder to figure out the cost per tenant... how can we solve this?

 

B

Collector (VM)

Collector(s) on the VM. There can be several (e.g. one per signal).

A

Collector (node)

Collector(s) installed as DaemonSets.

C

Collector (cluster)

Cluster-wide collectors for telemetry that is not tied to per-node workloads.

D

Collector (ingress)

A collector hooked into the ingress gateway on a specific path, to capture telemetry from the browser and our edge.

21

Pub/Sub

Queuing system is an essential part of the backbone

OpenTelemetry Attributes

Golden Signals

Gold Attr. #1: Tenant 

collibra.tenant.environment_id
  • VMs Resource Attributes - Configured at the collector
  • Pod Resource Attributes - For single tenant pods
  • Multi-tenant Pod - Signal Attributes

 

{
  "event_name": "workflow:started",
  "tenant_environment_id": "...",
  "asset_id": "..."
}

CSTE - Collibra Structured Telemetry Event: Events are our golden signal

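import org.slf4j.MDC; // assuming SLF4J

// Put the tenant into the logging context of the current thread.
MDC.put("tenant_environment_id",
        ctx.getTenantEnvironmentId());
try {
  // all logs in this thread now carry tenant_environment_id
} finally {
  // Always clean up so the value does not leak to the next task on this thread.
  MDC.clear();
}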

Multi-tenant service? It is the devs' responsibility to add the tenant as a signal attribute in code, e.g. via the Mapped Diagnostic Context (MDC)

Gold Attr. #2: Architecture 

https://c4model.com/ - The C4 model is an easy to learn, developer friendly approach to software architecture diagramming (by Simon Brown)

  • System - logical product capability
  • Container - service, logical database, topic, module
  • Deployment Node - where it runs (can nest)

collibra.c4.system
collibra.c4.container
collibra.c4.deployment

Gold Attr. #2: Architecture 

  • Pod Resource Attributes - Easy with 1:1 mapping
  • Modular Monoliths Signal Attributes - It's not only our single-tenant core, but also k8s jobs

labels:
  c4.collibra.com/system: telemetry
  c4.collibra.com/container: colkyverno

collibra.c4.system: telemetry
collibra.c4.container: colkyverno

Modular monoliths: it becomes the responsibility of the devs
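
The label-to-attribute pairing shown on this slide can be sketched in a few lines. This is only an illustration: the naming rule (`c4.collibra.com/<x>` becomes `collibra.c4.<x>`) is inferred from the two examples on the slide, and `C4LabelMapper` is a hypothetical helper. In a real deployment this mapping would more likely live in collector configuration than in application code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: turns C4 pod labels into OpenTelemetry resource attributes. */
final class C4LabelMapper {
    // Assumed naming rule, inferred from the slide: "c4.collibra.com/<x>" -> "collibra.c4.<x>".
    static final String LABEL_PREFIX = "c4.collibra.com/";
    static final String ATTRIBUTE_PREFIX = "collibra.c4.";

    /** Keeps only the C4 labels and renames them; all other labels are ignored. */
    static Map<String, String> toResourceAttributes(Map<String, String> podLabels) {
        Map<String, String> attributes = new LinkedHashMap<>();
        for (Map.Entry<String, String> label : podLabels.entrySet()) {
            if (label.getKey().startsWith(LABEL_PREFIX)) {
                String name = label.getKey().substring(LABEL_PREFIX.length());
                attributes.put(ATTRIBUTE_PREFIX + name, label.getValue());
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("app", "colkyverno"); // unrelated label, dropped by the mapper
        labels.put("c4.collibra.com/system", "telemetry");
        labels.put("c4.collibra.com/container", "colkyverno");
        // prints {collibra.c4.system=telemetry, collibra.c4.container=colkyverno}
        System.out.println(toResourceAttributes(labels));
    }
}
```

The 1:1 case is cheap precisely because nothing in the workload has to change. For modular monoliths the same attribute names have to be set per signal in code instead.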

Enrichment and Routing

Telemetry Backbone

21

Pub/Sub

Queuing system is an essential part of the backbone

8

Master Data

Can be sourced from different systems and merged into the telemetry data

7

Pipelines and Backends

Parallel pipelines do the processing, enrichment, filtering, calculation and backup to our backends

In-Flight Enrichment from Master Data

  • JSON field promotion: body fields → signal OpenTelemetry attributes
  • Master data lookup: keyed by collibra.tenant.environment_id

Devs don't need to know contract terms or support levels — they just log the tenant environment ID, and the backbone dynamically infers and injects the rest.
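
The two enrichment steps above can be sketched as follows. This is a minimal illustration, not the actual backbone code: `TenantEnricher`, the list of promoted fields, and the master data key `collibra.tenant.support_level` are all hypothetical, and the JSON body is assumed to be parsed into a flat map already.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-flight enricher: field promotion plus master data lookup. */
final class TenantEnricher {
    static final String TENANT_KEY = "collibra.tenant.environment_id";
    // Assumed list of body fields that are promoted to signal attributes.
    static final List<String> PROMOTED_FIELDS = List.of("event_name", "asset_id");

    // Master data keyed by tenant environment ID; values are extra attributes.
    private final Map<String, Map<String, String>> masterData;

    TenantEnricher(Map<String, Map<String, String>> masterData) {
        this.masterData = masterData;
    }

    /** Returns the signal attributes derived from an already-parsed JSON body. */
    Map<String, String> enrich(Map<String, String> body) {
        Map<String, String> attributes = new HashMap<>();
        // Step 1: JSON field promotion (body field -> signal attribute).
        for (String field : PROMOTED_FIELDS) {
            if (body.containsKey(field)) {
                attributes.put(field, body.get(field));
            }
        }
        String tenant = body.get("tenant_environment_id");
        if (tenant == null) {
            return attributes; // nothing to look up without a tenant
        }
        attributes.put(TENANT_KEY, tenant);
        // Step 2: master data lookup keyed by the tenant environment ID.
        attributes.putAll(masterData.getOrDefault(tenant, Map.of()));
        return attributes;
    }

    public static void main(String[] args) {
        TenantEnricher enricher = new TenantEnricher(
            Map.of("env-1", Map.of("collibra.tenant.support_level", "premium")));
        System.out.println(enricher.enrich(
            Map.of("event_name", "workflow:started", "tenant_environment_id", "env-1")));
    }
}
```

The point of the design is visible in `main`: the emitting service only supplies the tenant environment ID, and everything else is joined in centrally.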

OpenAPI Reverse-Mapping

  • URL cardinality explosion bloats metric DBs and costs
  • Reverse-map Istio URL + method → OpenAPI operationId
  • Low-cardinality, semantic endpoint stream → automated SLOs across all microservices
  • Also: aggressively drop runtime spam & infra-sweep noise before vendors see it

"We don't measure URLs. We measure contracts."
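
As a rough sketch of the reverse-mapping idea: each OpenAPI path template is compiled into a pattern, and an observed method plus URL is resolved back to its operationId. `OperationIdResolver` and the operation names are hypothetical, and a real implementation would load the routes from the OpenAPI documents rather than register them by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical resolver: HTTP method + concrete URL -> OpenAPI operationId. */
final class OperationIdResolver {
    private static final class Route {
        final String method;
        final Pattern path;
        final String operationId;

        Route(String method, Pattern path, String operationId) {
            this.method = method;
            this.path = path;
            this.operationId = operationId;
        }
    }

    private final List<Route> routes = new ArrayList<>();

    /** Registers a template such as "/assets/{assetId}"; "{...}" matches one path segment. */
    void register(String method, String pathTemplate, String operationId) {
        StringBuilder regex = new StringBuilder();
        for (String segment : pathTemplate.split("/")) {
            if (segment.isEmpty()) {
                continue;
            }
            boolean isParameter = segment.startsWith("{") && segment.endsWith("}");
            regex.append('/').append(isParameter ? "[^/]+" : Pattern.quote(segment));
        }
        routes.add(new Route(method.toUpperCase(Locale.ROOT),
                             Pattern.compile(regex.toString()), operationId));
    }

    /** First registered match wins, so register literal paths before parameterized ones. */
    Optional<String> resolve(String method, String url) {
        int query = url.indexOf('?');
        String path = query >= 0 ? url.substring(0, query) : url; // drop the query string
        for (Route route : routes) {
            if (route.method.equalsIgnoreCase(method) && route.path.matcher(path).matches()) {
                return Optional.of(route.operationId);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        OperationIdResolver resolver = new OperationIdResolver();
        resolver.register("GET", "/assets/{assetId}", "getAsset");
        // prints Optional[getAsset]
        System.out.println(resolver.resolve("GET", "/assets/123?fields=name"));
    }
}
```

Every distinct asset URL collapses into the single value `getAsset`, which is what keeps the metric cardinality bounded by the size of the API contract rather than by the data.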

 

3

OTLP Backup

Backup of the raw data, on cheap storage.

5

Batch Imports

We import into our data lake in batch as it's cost efficient.

9

Data lake

Our data lake is where all the calculations are done for reporting, including cost attribution.

 

Retention is infinite.

Cost Attribution - Closing the Loop

  • Telemetry volume cost — aggregate signal volume per C4 system × tenant
  • Compute cost slicing — CPU / mem / disk / network by tenant and C4
  • C4-aware provisioning — Collibra Infra CRDs carry C4 metadata; cloud billing maps to logical owner

Open problem: defensible "virtual dollar" formula for cross-team chargebacks.
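
To make the first bullet concrete, here is a small sketch that aggregates signal volume per C4 system × tenant and then splits a bill proportionally. The proportional split is a deliberately naive assumption for illustration, not the "virtual dollar" formula, which the slide lists as an open problem. `TelemetryCostAttribution` and its types are hypothetical; the real calculation runs in the data lake.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: telemetry volume per C4 system x tenant, then a naive cost split. */
final class TelemetryCostAttribution {
    static final class Usage {
        final String c4System;
        final String tenant;
        final long bytes;

        Usage(String c4System, String tenant, long bytes) {
            this.c4System = c4System;
            this.tenant = tenant;
            this.bytes = bytes;
        }
    }

    /** Sums bytes per "system|tenant" key. */
    static Map<String, Long> volumeBySystemAndTenant(List<Usage> rows) {
        Map<String, Long> totals = new TreeMap<>();
        for (Usage row : rows) {
            totals.merge(row.c4System + "|" + row.tenant, row.bytes, Long::sum);
        }
        return totals;
    }

    /** Splits totalCost in proportion to volume (assumed rule, for illustration only). */
    static Map<String, Double> allocate(Map<String, Long> volumes, double totalCost) {
        long totalBytes = 0;
        for (long volume : volumes.values()) {
            totalBytes += volume;
        }
        Map<String, Double> shares = new TreeMap<>();
        for (Map.Entry<String, Long> entry : volumes.entrySet()) {
            double share = totalBytes == 0 ? 0.0 : totalCost * entry.getValue() / totalBytes;
            shares.put(entry.getKey(), share);
        }
        return shares;
    }

    public static void main(String[] args) {
        List<Usage> rows = List.of(
            new Usage("catalog", "env-1", 300),
            new Usage("catalog", "env-1", 100),
            new Usage("telemetry", "env-2", 600));
        System.out.println(allocate(volumeBySystemAndTenant(rows), 100.0));
    }
}
```

This only works because both golden attributes are present on every signal. Without the tenant and C4 dimensions there is no key to group by.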

Semantic Conventions and Wiring

Semantic Conventions

Wiring - SemConv + Weaver

Wiring - More YAML

Future and Takeaways

OpAMP — Pushing Control to the Collection Edge

  • Bandwidth problem: backbone filtering saves on vendors, but raw telemetry still costs WAN egress
  • OpAMP: dynamic management & configuration of the entire collector fleet
  • Adaptive edge sampling: normal tenant → aggressive sampling; incident → dial up fidelity at source
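
The adaptive sampling bullet could look roughly like this at the edge. It is a sketch under stated assumptions: the two ratios are made-up values, `AdaptiveSampler` is hypothetical, and `setIncident` merely stands in for a configuration change that would arrive via OpAMP. No OpAMP API is modeled here.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical edge sampler: low fidelity by default, full fidelity during an incident. */
final class AdaptiveSampler {
    static final double NORMAL_RATIO = 0.05;  // assumed: keep 5% when all is well
    static final double INCIDENT_RATIO = 1.0; // assumed: keep everything during an incident
    private static final int BUCKETS = 10_000;

    private final Set<String> tenantsInIncident = ConcurrentHashMap.newKeySet();

    /** Stand-in for a remotely pushed config update that flags a tenant. */
    void setIncident(String tenant, boolean active) {
        if (active) {
            tenantsInIncident.add(tenant);
        } else {
            tenantsInIncident.remove(tenant);
        }
    }

    double ratioFor(String tenant) {
        return tenantsInIncident.contains(tenant) ? INCIDENT_RATIO : NORMAL_RATIO;
    }

    /** Deterministic per trace ID, so all spans of one trace get the same decision. */
    boolean shouldSample(String tenant, String traceId) {
        int bucket = Math.floorMod(traceId.hashCode(), BUCKETS);
        return bucket < ratioFor(tenant) * BUCKETS;
    }

    public static void main(String[] args) {
        AdaptiveSampler sampler = new AdaptiveSampler();
        System.out.println(sampler.ratioFor("env-1")); // normal operation
        sampler.setIncident("env-1", true);
        System.out.println(sampler.ratioFor("env-1")); // fidelity dialed up
    }
}
```

Because the decision is made at the source, the dropped telemetry never crosses the WAN, which is the saving that backbone-side filtering alone cannot provide.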

 

Key Takeaways

① Golden attributes on day one
Define tenancy and architecture dimensions before you split into microservices, not after.

 

② Decouple with a backbone
Buffer-first ingestion (Pub/Sub) + centralized enrichment unlocks both ops and FinOps / BI.

 

③ Invest in semantic contracts
They structure your signals today and become the foundation for AI diagnostic agents tomorrow.