Taming Tenancy, Cost and Architecture at Collibra

Through OpenTelemetry and our Telemetry Backbone

Almost 30 years in the sector

 

Mostly as Software Engineer

Web - 3D - Middleware - Mobile - Big Data

 

More recently as Architect

Data - SRE - Infrastructure

 

Community

Apache Beam contributor

OpenTelemetry Collector contributor

 

Collibra

Principal Systems Architect

Alex Van Boxel

A data intelligence platform powered by active metadata

 

AI Governance

Data Catalog

Data Governance

Data Lineage

Data Notebook

Data Privacy

Data Quality & Observability

Protect

"How much does X cost?"

"Compare the Usage and Reservation?"

Trend analysis per tenant

Architecture and Collection at the Edge

Collibra Architecture

  • On-prem heritage: single deployable monolith → hosted SaaS on VMs (single-tenant isolation for free)
  • Kubernetes shift: microservices for team velocity & polyglot (Python now dominant in AI)
  • The pivot: "container per service per tenant" was unsustainable → shared multi-tenant on K8s

Shared multi-tenancy saves cost, but it makes the cost per tenant harder to determine. How can we solve this?

 

B

Collector (VM)

One or more collectors on the VM (e.g. one per signal).

A

Collector (node)

Collector(s) installed as DaemonSets.

C

Collector (cluster)

Cluster-wide collectors for telemetry that is not tied to the per-node workloads.

D

Collector (ingress)

A collector hooked into the ingress gateway on a specific path, to capture telemetry from the browser and our edge.

21

Pub/Sub

Queuing system is an essential part of the backbone

OpenTelemetry Attributes

Golden Signals

Gold Attr. #1: Tenant 

collibra.tenant.environment_id
  • VM Resource Attributes - Configured at the collector
  • Pod Resource Attributes - For single tenant pods
  • Multi-tenant Pod - Signal Attributes
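
The three placements above can be read as a lookup order. The following is a minimal sketch, assuming attributes are available as plain string maps and that a resource-level value takes precedence over a signal-level one; the class name and the precedence rule are illustrative assumptions, not something the slides specify.

```java
import java.util.Map;
import java.util.Optional;

/** Sketch: resolve the tenant of a signal from resource or signal attributes. */
final class TenantResolver {
    static final String TENANT_KEY = "collibra.tenant.environment_id";

    /**
     * Assumption: a resource attribute (VM or single-tenant pod) wins; otherwise
     * fall back to the per-signal attribute set by a multi-tenant service.
     */
    static Optional<String> resolve(Map<String, String> resourceAttrs,
                                    Map<String, String> signalAttrs) {
        String fromResource = resourceAttrs.get(TENANT_KEY);
        if (fromResource != null) {
            return Optional.of(fromResource);
        }
        return Optional.ofNullable(signalAttrs.get(TENANT_KEY));
    }

    public static void main(String[] args) {
        // Single-tenant pod: the tenant comes from the resource.
        System.out.println(resolve(Map.of(TENANT_KEY, "env-1"), Map.of()));
        // Multi-tenant pod: the tenant comes from the signal itself.
        System.out.println(resolve(Map.of(), Map.of(TENANT_KEY, "env-2")));
    }
}
```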

 

{
  "event_name": "workflow:started",
  "tenant_environment_id": "...",
  "asset_id": "..."
}

CSTE - Collibra Structured Telemetry Event: Events are our golden signal

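import org.slf4j.MDC; // Assumes SLF4J; the slide does not name the logging API.

// ctx is the request's tenant context, as on the slide (not defined here).
MDC.put("tenant_environment_id",
        ctx.getTenantEnvironmentId());
try {
  // All log statements on this thread now carry the tenant environment ID.
} finally {
  // Reset the MDC so the tenant does not leak into the next task on a pooled thread.
  MDC.clear();
}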

Multi-tenant service? It is the developers' responsibility to add the tenant as a signal attribute in code, e.g. via the Mapped Diagnostic Context (MDC).

Gold Attr. #2: Architecture 

https://c4model.com/ - The C4 model is an easy to learn, developer friendly approach to software architecture diagramming (by Simon Brown)

  • System - logical product capability
  • Container - service, logical database, topic, module
  • Deployment Node - where it runs (can nest)

collibra.c4.system
collibra.c4.container
collibra.c4.deployment

Gold Attr. #2: Architecture 

  • Pod Resource Attributes - Easy with a 1:1 mapping
  • Modular Monoliths Signal Attributes - It's not only our single-tenant core, but also k8s jobs

labels:
  c4.collibra.com/system: telemetry
  c4.collibra.com/container: colkyverno

collibra.c4.system: telemetry
collibra.c4.container: colkyverno

Modular monoliths: setting these attributes becomes the developers' responsibility.
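
The 1:1 mapping from labels to attributes shown above can be sketched as a simple prefix rewrite. This is an illustration only: the class and method names are hypothetical, and in a real deployment the mapping would more likely live in collector configuration than in application code.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: turn c4.collibra.com/* labels into collibra.c4.* attributes. */
final class C4LabelMapper {
    static final String LABEL_PREFIX = "c4.collibra.com/";
    static final String ATTR_PREFIX = "collibra.c4.";

    /** Copies every C4 label to an attribute of the same name; other labels are ignored. */
    static Map<String, String> toAttributes(Map<String, String> labels) {
        Map<String, String> attrs = new TreeMap<>();
        for (Map.Entry<String, String> label : labels.entrySet()) {
            String key = label.getKey();
            if (key.startsWith(LABEL_PREFIX)) {
                attrs.put(ATTR_PREFIX + key.substring(LABEL_PREFIX.length()), label.getValue());
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(toAttributes(Map.of(
                "c4.collibra.com/system", "telemetry",
                "c4.collibra.com/container", "colkyverno")));
    }
}
```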

Enrichment and Routing

Telemetry Backbone

8

Master Data

Can be sourced from different systems and merged into the telemetry data

7

Pipelines and Backends

Parallel pipelines do the processing, enrichment, filtering, calculation and backup to our backends

In-Flight Enrichment from Master Data

  • JSON field promotion: body fields → signal OpenTelemetry attributes
  • Master data lookup: keyed by collibra.tenant.environment_id

Devs don't need to know contract terms or support levels — they just log the tenant environment ID, and the backbone dynamically infers and injects the rest.
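
The two enrichment steps can be sketched as follows, assuming the JSON body has already been parsed into a string map. The promoted field list, the `collibra.asset.id` and `collibra.tenant.support_level` attribute names, and the in-memory master data map are hypothetical placeholders; only `collibra.tenant.environment_id` comes from the slides.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: promote body fields to attributes, then merge master data by tenant. */
final class Enricher {
    static final String TENANT_ATTR = "collibra.tenant.environment_id";

    // Body field -> attribute name. The asset attribute name is a made-up example.
    static final Map<String, String> PROMOTED = Map.of(
            "tenant_environment_id", TENANT_ATTR,
            "asset_id", "collibra.asset.id");

    static Map<String, String> enrich(Map<String, String> body,
                                      Map<String, String> attributes,
                                      Map<String, Map<String, String>> masterData) {
        Map<String, String> out = new HashMap<>(attributes);
        // Step 1: JSON field promotion (existing attributes are not overwritten).
        PROMOTED.forEach((bodyKey, attrKey) -> {
            String value = body.get(bodyKey);
            if (value != null) {
                out.putIfAbsent(attrKey, value);
            }
        });
        // Step 2: master data lookup keyed by the tenant environment ID.
        String tenant = out.get(TENANT_ATTR);
        if (tenant != null) {
            out.putAll(masterData.getOrDefault(tenant, Map.of()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> body = Map.of(
                "event_name", "workflow:started",
                "tenant_environment_id", "env-1");
        Map<String, Map<String, String>> masterData =
                Map.of("env-1", Map.of("collibra.tenant.support_level", "premium"));
        System.out.println(enrich(body, Map.of(), masterData));
    }
}
```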

OpenAPI Reverse-Mapping

  • URL cardinality explosion bloats metric DBs and costs
  • Reverse-map Istio URL + method → OpenAPI operationId
  • Low-cardinality, semantic endpoint stream → automated SLOs across all microservices
  • Also: aggressively drop runtime spam & infra-sweep noise before vendors see it
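
A minimal sketch of the reverse-mapping idea: each OpenAPI path template is compiled into a pattern, and a concrete method plus URL is resolved to the first matching operationId. The class, the operations and their operationIds are hypothetical, and a real implementation would read the templates from the OpenAPI documents and rank literal paths ahead of parameterised ones.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Sketch: map HTTP method + concrete URL to a low-cardinality operationId. */
final class OperationResolver {
    record Operation(String method, Pattern pattern, String operationId) {
        /** Compiles a template such as "/assets/{assetId}"; each {param} matches one path segment. */
        static Operation of(String method, String template, String operationId) {
            String regex = Arrays.stream(template.split("/"))
                    .map(s -> s.startsWith("{") && s.endsWith("}") ? "[^/]+" : Pattern.quote(s))
                    .collect(Collectors.joining("/"));
            return new Operation(method, Pattern.compile(regex), operationId);
        }
    }

    private final List<Operation> operations;

    OperationResolver(List<Operation> operations) {
        this.operations = List.copyOf(operations);
    }

    /** Returns the first operation whose method and path pattern match; list order decides ties. */
    Optional<String> resolve(String method, String url) {
        int query = url.indexOf('?');
        String path = query >= 0 ? url.substring(0, query) : url; // drop the query string
        return operations.stream()
                .filter(op -> op.method().equalsIgnoreCase(method))
                .filter(op -> op.pattern().matcher(path).matches())
                .map(Operation::operationId)
                .findFirst();
    }

    public static void main(String[] args) {
        OperationResolver resolver = new OperationResolver(List.of(
                Operation.of("GET", "/assets/{assetId}", "getAsset"),
                Operation.of("POST", "/assets", "createAsset")));
        System.out.println(resolver.resolve("GET", "/assets/123?expand=true"));
    }
}
```

Unmatched requests resolve to an empty result, which gives a natural hook for dropping or bucketing noise instead of emitting a new metric series per URL.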

"We don't measure URLs. We measure contracts."

 

3

OTLP Backup

Backup of the raw data, on cheap storage.

5

Batch Imports

We import into our data lake in batch as it's cost efficient.

9

Data lake

Our data lake is where all the calculations are done for reporting, including cost attribution.

 

Retention is infinite.

Cost Attribution - Closing the Loop

  • Telemetry volume cost — aggregate signal volume per C4 system × tenant
  • Compute cost slicing — CPU / mem / disk / network by tenant and C4
  • C4-aware provisioning — Collibra Infra CRDs carry C4 metadata; cloud billing maps to logical owner
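
As an illustration of the first bullet, one naive way to attribute telemetry volume cost is to split a bill in proportion to the signal volume of each (C4 system, tenant) pair. This is a sketch under that assumption, not the formula used in the data lake; the class, the byte counts and the total are made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: split a total cost across (C4 system, tenant) slices by signal volume. */
final class VolumeCostSplitter {
    /** One cost slice: a C4 system combined with a tenant environment ID. */
    record Slice(String system, String tenant) {}

    /** Each slice receives totalCost * (its bytes / all bytes); empty input yields an empty map. */
    static Map<Slice, Double> split(Map<Slice, Long> bytesBySlice, double totalCost) {
        long totalBytes = bytesBySlice.values().stream().mapToLong(Long::longValue).sum();
        Map<Slice, Double> costBySlice = new HashMap<>();
        if (totalBytes == 0) {
            return costBySlice;
        }
        bytesBySlice.forEach((slice, bytes) ->
                costBySlice.put(slice, totalCost * bytes / totalBytes));
        return costBySlice;
    }

    public static void main(String[] args) {
        Map<Slice, Long> volume = Map.of(
                new Slice("telemetry", "env-1"), 750L,
                new Slice("telemetry", "env-2"), 250L);
        System.out.println(split(volume, 100.0));
    }
}
```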

Open problem: defensible "virtual dollar" formula for cross-team chargebacks.

Semantic Conventions and Wiring

Wiring - SemConv + Weaver

Wiring - More YAML

Future and Takeaways

OpAMP — Pushing Control to the Collection Edge

  • Bandwidth problem: backbone filtering saves on vendors, but raw telemetry still costs WAN egress
  • OpAMP: dynamic management & configuration of the entire collector fleet
  • Adaptive edge sampling: normal tenant → aggressive sampling; incident → dial up fidelity at source
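
The adaptive sampling idea can be sketched as a per-tenant ratio plus a deterministic keep decision. This shows only the decision logic; it does not use any OpAMP API, and the ratios, the class and the hash-based bucketing are assumptions chosen for illustration. In the scenario above, the set of tenants in incident would be part of the configuration pushed to the collectors.

```java
import java.util.Set;

/** Sketch: sample aggressively for normal tenants, keep everything during an incident. */
final class EdgeSampling {
    static final double NORMAL_RATIO = 0.05;  // hypothetical: keep 5% in normal operation
    static final double INCIDENT_RATIO = 1.0; // keep everything while an incident is open

    /** Picks the sampling ratio for a tenant based on the current incident set. */
    static double ratioFor(String tenant, Set<String> tenantsInIncident) {
        return tenantsInIncident.contains(tenant) ? INCIDENT_RATIO : NORMAL_RATIO;
    }

    /** Deterministic decision: hash the trace ID into [0, 1) and compare with the ratio. */
    static boolean keep(String traceId, double ratio) {
        double bucket = Math.floorMod(traceId.hashCode(), 10_000) / 10_000.0;
        return bucket < ratio;
    }

    public static void main(String[] args) {
        Set<String> incidents = Set.of("env-2");
        System.out.println(keep("trace-a", ratioFor("env-1", incidents)));
        System.out.println(keep("trace-a", ratioFor("env-2", incidents)));
    }
}
```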

 

Key Takeaways

① Golden attributes on day one
Define tenancy and architecture dimensions before you split into microservices, not after.

 

② Decouple with a backbone
Buffer-first ingestion (Pub/Sub) + centralized enrichment unlocks both ops and FinOps / BI.

 

③ Invest in semantic contracts
They structure your signals today and become the foundation for AI diagnostic agents tomorrow.

Taming Tenancy, Cost and Architecture at Collibra

By Alex Van Boxel
