A distributed log
Append-only, immutable, ordered sequence of records.
[Diagram: a log of records at offsets 0, 1, 2, 3, …, n]
Kafka is run as a cluster of brokers.
[Diagram: a topic split into Partition 0, Partition 1, and Partition 2, each its own ordered log with offsets 0 through n, spread across the brokers]
{
  "key": "test",
  "value": {...}
}

// The producer picks a partition for a keyed record by hashing the key and
// taking it modulo the number of partitions (Kafka's default partitioner
// does this with murmur2 on the serialized key bytes).
def hash(input: String): Long = ???  // stand-in for the real hash function
val key = "test"
val keyHash = hash(key)              // let's say this returns 100, for example
val numOfPartitions = 3
val destinationPartition = keyHash % numOfPartitions // 100 % 3 == 1

[Diagram: out of Partition 0, Partition 1, and Partition 2, the record lands on Partition 1]
Fault tolerance, yay!
If you are low-key: 3 brokers with replication factor of 2.
If you have high availability requirements: 5 brokers with replication factor of 3.
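As a sketch of how you might create such a topic, here is a minimal example using the kafka-clients AdminClient from Scala (the topic name and bootstrap addresses are made up):

import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092,host2:9092")

val admin = AdminClient.create(props)
// a hypothetical "scores" topic: 3 partitions, each replicated to 3 brokers
val topic = new NewTopic("scores", 3, 3.toShort)
admin.createTopics(List(topic).asJava).all().get()
admin.close()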
Just be aware that the default maximum message size is ~1MB (1048588 bytes).
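If you really do need larger records, the limit is configurable. A hedged sketch of the producer-side knob (the broker side is message.max.bytes, the topic-level override is max.message.bytes, and the 2 MB value here is just illustrative):

import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig

val props = new Properties()
// must stay at or below the broker's message.max.bytes (or the topic's max.message.bytes)
props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, (2 * 1024 * 1024).toString)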
1) Time Based
2) Size Based (less common)
3) Infinite Retention?!?!
(OR both, actually)
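A hedged sketch of the topic-level configs behind each option (the values are just illustrative):

# 1) time based: delete segments older than 7 days
retention.ms=604800000
# 2) size based: delete the oldest segments once a partition exceeds ~1 GB
retention.bytes=1073741824
# 3) "infinite" retention: never expire on time, and/or keep only the latest value per key
retention.ms=-1
cleanup.policy=compact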
Before Compaction

offset 0: { key: one,   value: {...} }
offset 1: { key: two,   value: {...} }
offset 2: { key: one,   value: {...} }
offset 3: { key: three, value: {...} }
offset 4: { key: one,   value: {...} }
After Compaction (only the most recent record for each key is kept; offsets are preserved)

offset 1: { key: two,   value: {...} }
offset 3: { key: three, value: {...} }
offset 4: { key: one,   value: {...} }
Deleting a Record

To delete a key from a compacted topic, produce a tombstone: a record with that key and a null value. Compaction will eventually drop the older values for the key, and later the tombstone itself.

offset 0: { key: one,   value: {...} }
offset 1: { key: two,   value: {...} }
offset 2: { key: one,   value: {...} }
offset 3: { key: three, value: {...} }
offset 4: { key: one,   value: {...} }
offset 5: { key: three, value: null }   <- tombstone
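A minimal sketch of producing that tombstone from Scala with the plain producer API (topic name and bootstrap address are made up):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)
// a null value is a tombstone: on a compacted topic it deletes key "three"
producer.send(new ProducerRecord[String, String]("scores", "three", null))
producer.flush()
producer.close()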
Maybe.
Of Course!
thx Confluent
Schema Version 1

{
  "type": "record",
  "name": "Score",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "score", "type": "double" }
  ]
}

Schema Version 2

{
  "type": "record",
  "name": "Score",
  "fields": [
    { "name": "id", "type": "int" }
  ]
}
Don't leave your consumers hanging!!!
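To see why: a sketch of checking whether a consumer still on version 1 can read records written with version 2, using Avro's SchemaCompatibility helper (schemaV1Json and schemaV2Json are assumed to hold the two JSON definitions above):

import org.apache.avro.{Schema, SchemaCompatibility}

val v1 = new Schema.Parser().parse(schemaV1Json) // "Schema Version 1" above
val v2 = new Schema.Parser().parse(schemaV2Json) // "Schema Version 2" above

// reader = old consumer schema (v1), writer = new producer schema (v2);
// "score" was removed without a default, so this reports INCOMPATIBLE
val result = SchemaCompatibility.checkReaderWriterCompatibility(v1, v2)
println(result.getType)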
The Confluent Schema Registry: your new best friend and my personal recommendation.
https://docs.confluent.io/current/schema-registry/avro.html
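A hedged sketch of pointing a producer at the Schema Registry with Confluent's Avro serializer (the registry URL is a placeholder):

import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("schema.registry.url", "http://schema-registry:8081")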
[Diagram: a Producer writing to a 3-broker cluster; Partition 1, Partition 2, and Partition 3 are replicated across Broker 1, Broker 2, and Broker 3, and each broker is the leader for one of the partitions]
acks
The number of brokers that need to acknowledge a record as being received.
Default is 1; I would set it to all!
bootstrap.servers
example: `host1:port,host2:port`
These are only used for the initial discovery of the Kafka cluster, not for every subsequent connection.
From the docs: "Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records..."
Consider enabling compression, especially if you have a high volume of data.
batch.size
Trades throughput against latency: larger batches improve throughput, smaller batches get records out sooner.
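Pulling those producer settings together, a minimal configuration sketch (addresses and values are illustrative, not recommendations for your workload):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092,host2:9092")
props.put(ProducerConfig.ACKS_CONFIG, "all")
props.put(ProducerConfig.RETRIES_CONFIG, "3")
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1") // preserve ordering when retrying
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy")
props.put(ProducerConfig.BATCH_SIZE_CONFIG, (32 * 1024).toString)
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)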
[Diagram: a topic with Partition 1, Partition 2, and Partition 3, and a consumer group growing from 1 to 4 consumers. With 1 consumer it reads all three partitions; with 2, the partitions are split between them; with 3, each consumer gets one partition; with 4, one consumer sits idle, because each partition is read by at most one consumer in the group.]
Either "earliest" or "latest"
Tells the consumer where to start consuming from when no offsets have been committed.
enable.auto.commit
Defaults to true.
Many times you want to control when you commit offsets, so you would set this to false.
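A minimal consumer sketch with manual offset commits (the group id, topic, and address are made up):

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG, "score-consumers")
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(List("scores").asJava)

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
  consumer.commitSync() // commit only after the batch has been processed
}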