Kafka: A Crash Course

Me

  • Data Platform @ Pluralsight
  • @lewisjkl on Twitter
  • Questions in UGE Slack

Outline

  • What is Kafka?
  • Administering Kafka
  • Producing to Kafka
  • Consuming from Kafka

What is Kafka?

Pub Sub!

Basically...

A distributed log

The Log Data Structure

An append-only, immutable, ordered sequence of records.

[Diagram: a log with records at offsets 0, 1, 2, 3, ..., n]

Distributed

Kafka is run as a cluster of brokers.

  • Topic
  • Partition
  • Replication

Topic

A group of one or more partitions

Partition

A single log

[Diagram: a partition is a single log with records at offsets 0, 1, 2, 3, ..., n]

Which partition does my record go to?

{key, value}

hash(key) % count(partitions)

[Diagram: the hashed key routes each record to one of Partition 0, 1, or 2]

{
    "key": "test",
    "value": {...}
}
def hash(input: String): Long = ??? // Kafka's default partitioner actually uses a murmur2 hash of the serialized key

val keyHash = hash(key) // let's say this returns 100, for example
val numOfPartitions = 3

val destinationPartition = keyHash % numOfPartitions // 100 % 3 == 1

[Diagram: the record with key "test" lands in Partition 1]

Replication

Each partition is stored on n different brokers, where n is the replication factor.

If you have 5 brokers and a replication factor of 3, each partition will be stored on 3 different brokers.

 

Fault tolerance, yay!
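For example, here is a minimal sketch of creating a replicated topic with the Java AdminClient from Scala; the broker address and the "scores" topic name are assumptions:

import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import scala.jdk.CollectionConverters._

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumption: local broker

val admin = AdminClient.create(props)

// 6 partitions, each stored on 3 different brokers
val topic = new NewTopic("scores", 6, 3.toShort)
admin.createTopics(List(topic).asJava).all().get()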

Administering Kafka

Don't self-host unless you have a REALLY GOOD REASON

Confluent Cloud

AWS MSK

Many Others

What should my Kafka cluster "look" like?

How much data will you have in your cluster at once?

Example: 30TB of data with 5TB of capacity per node... a minimum of 6 nodes.

Add a replication factor of 2 and every partition is stored twice (60TB total)... now you need 12 nodes!
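A back-of-the-envelope sketch of that arithmetic:

// minimum brokers = ceil(data * replication factor / capacity per node)
def minNodes(dataTB: Double, replicationFactor: Int, nodeCapacityTB: Double): Int =
  math.ceil(dataTB * replicationFactor / nodeCapacityTB).toInt

minNodes(30, 1, 5) // 6
minNodes(30, 2, 5) // 12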

Okay, but what if I am just getting started? I don't need 30TB of capacity!

My recommendation...

If you are low-key: 3 brokers with replication factor of 2.

If you have high availability requirements: 5 brokers with replication factor of 3.

Good configs to be aware of

auto.create.topics.enable

My advice: set it to false!

delete.topic.enable

Set it to false if you rarely need to delete topics.

message.max.bytes

Probably don't change this.

Just be aware that the default is ~1MB (1048588 bytes).

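A sketch of how these might look in the broker's server.properties (the values shown are my recommendations, not the defaults):

# server.properties (broker config)
auto.create.topics.enable=false
# disable topic deletion only if you rarely need it:
delete.topic.enable=false
# message.max.bytes is left at its ~1MB default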

Retention Policies

Two Options:

1) Time Based

2) Size Based (less common)

3) Infinite Retention?!?!

Tiered Storage 4 The Win!

But what about compaction???

Kafka can "clean up" topics with deletion (time based or size based) OR compaction

(OR both, actually)
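A minimal sketch of turning on compaction plus time-based deletion for a topic, reusing the AdminClient and hypothetical "scores" topic from the earlier sketch:

import org.apache.kafka.clients.admin.{AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource
import scala.jdk.CollectionConverters._

val topicRes = new ConfigResource(ConfigResource.Type.TOPIC, "scores")
val ops: java.util.Collection[AlterConfigOp] = List(
  new AlterConfigOp(new ConfigEntry("cleanup.policy", "compact,delete"), AlterConfigOp.OpType.SET),
  new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET) // 7 days
).asJava

admin.incrementalAlterConfigs(Map(topicRes -> ops).asJava).all().get()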

Before Compaction

offset 0: { key: one,   value: {...} }
offset 1: { key: two,   value: {...} }
offset 2: { key: one,   value: {...} }
offset 3: { key: three, value: {...} }
offset 4: { key: one,   value: {...} }

After Compaction

offset 1: { key: two,   value: {...} }
offset 3: { key: three, value: {...} }
offset 4: { key: one,   value: {...} }

Only the newest record for each key survives; the older records for key "one" (offsets 0 and 2) are gone. Note that the surviving offsets do not change.

Deleting a Record

To delete a key from a compacted topic, produce a "tombstone": a record with that key and a null value.

offset 5: { key: three, value: null }

On the next compaction pass, the tombstone supersedes the earlier record for key "three", and the tombstone itself is eventually cleaned up as well.
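From code, a tombstone is just a record with a null value, e.g. using the producer configured later in this deck:

producer.send(new ProducerRecord[String, String]("scores", "three", null))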

Is compaction right for me?!?

Maybe.

log.cleaner.min.compaction.lag.ms

Good to be aware of: if this is set too low, intermediate updates for a key can be compacted away before slow consumers ever see them. You may or may not care 🤷‍♀️

What format should I store my data in?

Records in Kafka are just BYTES

Common Data Formats:

  • JSON
  • Avro
  • Protobuf
  • CSV.. Noooooo!

JSON w/o Schemas

PROS

  • Easy to work with

CONS

  • No schemas
  • Space HOG

JSON with Schemas

PROS

  • Easy to work with
  • Schemas, with rules for evolving between schema versions

CONS

  • Space HOG

Avro

PROS

  • Schemas, with rules for evolving between schema versions
  • Plenty of tooling around Kafka
  • Very compact

CONS

  • Harder to work with

Protobuf

PROS

  • Schemas, with rules for evolving between schema versions
  • Very compact

CONS

  • Harder to work with

CSV

So which one should I choose??!

It depends.

Of Course!

But most of the time, Avro or Protobuf. Pick your poison.

Schema Registry

thx Confluent

Unless you are going to have very few topics with very few people working with Kafka...

Use Schema Registry!
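Wiring clients to the registry is mostly configuration. A sketch assuming Confluent's Avro serializer and a local registry:

import java.util.Properties

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("schema.registry.url", "http://localhost:8081") // assumption: local Schema Registry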

Compatibility Strategies

  • BACKWARD (the default)
  • FORWARD
  • FULL
  • *_TRANSITIVE
Schema Version 1

{
  "type": "record",
  "name": "Score",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "score",
      "type": "double"
    }
  ]
}

Schema Version 2

{
  "type": "record",
  "name": "Score",
  "fields": [
    {
      "name": "id",
      "type": "int"
    }
  ]
}

Version 2 deletes the "score" field that consumers may rely on... don't leave your consumers hanging!!!

FULL_TRANSITIVE

Your new best friend and my personal recommendation

FULL_TRANSITIVE

  • Allows you to add and/or delete optional fields. That's it!
  • Decouples Producers from Consumers
  • Transitive means new schemas are checked against every previous version, not just the latest, so a crafty intermediate schema can't be used to sidestep the rules
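For example, instead of deleting "score" outright, a FULL_TRANSITIVE-friendly change adds an optional, defaulted field (the "comment" field here is a hypothetical illustration):

{
  "type": "record",
  "name": "Score",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "score",
      "type": "double"
    },
    {
      "name": "comment",
      "type": ["null", "string"],
      "default": null
    }
  ]
}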

If you want to dig into this more: go here!

https://docs.confluent.io/current/schema-registry/avro.html

One final note.. Use the same format for both key and value

Oh also, broker-level schema validation is a thing (a Confluent Server feature)!

Security... blah

If you KNOW you'll need it, then do it up front!

Your options:

  • SSL (for encryption and authentication)
  • SASL (Authentication)
  • ACLs (Authorization)
  • RBAC (If you use Confluent)

If you are enterprise, then probably just go with Confluent RBAC w/LDAP
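If you do go the SASL route, the client side is mostly configuration too. A sketch of SASL/PLAIN over TLS (username and password are placeholders):

import java.util.Properties

val props = new Properties()
props.put("security.protocol", "SASL_SSL")
props.put("sasl.mechanism", "PLAIN")
props.put(
  "sasl.jaas.config",
  """org.apache.kafka.common.security.plain.PlainLoginModule required username="alice" password="secret";"""
)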

Metrics... JMX (Kafka brokers and clients expose their metrics via JMX)

Recap!

  • Cluster shape and size
  • Some good configs to look at
  • Retention and Compaction
  • Data Formats
  • Security

Producers

[Diagram: a producer and three brokers, each holding Partitions 1-3. Each partition is replicated on all three brokers, with a different broker acting as leader for each partition; the producer always writes to the leader.]

Configs that you may want to look at

key.serializer & value.serializer

acks

The number of acknowledgments the producer requires before a record counts as written: 0, 1, or all in-sync replicas.

Default is 1 on older clients (newer clients default to acks=all); I would set it to all!

bootstrap.servers

example: `host1:port,host2:port`

 

These are only used for the initial connection to discover the cluster; after that, clients talk directly to whichever brokers they need.

retries & max.in.flight.requests.per.connection

 

From the docs: "Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records..."

compression.type

 

Consider enabling compression, especially if you have a high volume of data.

batch.size

A tradeoff: larger batches improve throughput; smaller batches reduce latency.
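A minimal sketch pulling these producer configs together; the local broker, the hypothetical "scores" topic, and string keys/values are assumptions:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.ACKS_CONFIG, "all")                         // wait for all in-sync replicas
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4")             // compress batches
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1") // keep ordering even with retries

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("scores", "test", """{"score": 98.6}"""))
producer.close() // flushes any buffered records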

Consumers

Consumer Groups

[Diagrams: a topic with 3 partitions read by a consumer group of 1, then 2, then 3, then 4 consumers. The partitions are divided among the group: one consumer reads all 3 partitions; three consumers read one partition each; with four consumers, one sits idle, because each partition is assigned to exactly one consumer in a group.]

Configs of which to be aware

auto.offset.reset

Either "earliest" or "latest"

 

Tells the consumer where to start consuming from when no offsets have been committed.

enable.auto.commit

Defaults to true

 

Many times, you want to control when you commit offsets so you would want this to be false.

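A minimal consumer sketch using these configs, with manual offset commits (the topic and group names are assumptions):

import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer
import scala.jdk.CollectionConverters._

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG, "score-readers")
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest") // no committed offsets? start at the beginning
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")   // we commit ourselves

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(List("scores").asJava)

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.asScala.foreach(r => println(s"${r.key}: ${r.value} (offset ${r.offset})"))
  consumer.commitSync() // commit only after the batch is processed
}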

Other consumer types

  • Kafka Connect - getting data in and out of Kafka (including replication)
  • Spark/Flink/etc - Stream Processing (joins, etc)

Load Test All The Things

Questions?

Kafka: A Crash Course

By lewisjkl
