Kafka: A Crash Course
Me
- Data Platform @ Pluralsight
- @lewisjkl on Twitter
- Questions in UGE Slack
Outline
- What is Kafka?
- Administering Kafka
- Producing to Kafka
- Consuming from Kafka
What is Kafka?
Pub Sub!
Basically...
A distributed log
The Log Data Structure
An append-only, immutable, ordered sequence of records.
[Diagram: a log with records at offsets 0, 1, 2, 3, ..., n]
Distributed
Kafka is run as a cluster of brokers.
Topic
Partition
Replication
Topic
A group of one or more partitions
Partition
A single log
[Diagram: a partition is a single log with records at offsets 0, 1, 2, 3, ..., n]
Which partition does my record end up in?
{key, value}
hash(key) % count(partitions)
Partition 0
Partition 1
Partition 2
{
  "key": "test",
  "value": {...}
}

def hash(input: String): Long = ???  // stand-in for some hash function over the key

val keyHash = hash("test")        // let's say this returns 100, for example
val numOfPartitions = 3
val destinationPartition = keyHash % numOfPartitions  // 1

So this record lands in Partition 1. (Kafka's default partitioner actually uses a murmur2 hash of the serialized key bytes, but the idea is the same.)
Replication
Each partition is stored on n different brokers where n is the replication factor
If you have 5 brokers and a replication factor of 3, each partition will be stored on 3 different brokers.
Fault tolerance, yay!
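A minimal sketch of creating a replicated topic with the Java AdminClient from Scala. The topic name, partition count, replication factor, and broker addresses below are made-up placeholders:

import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import scala.jdk.CollectionConverters._

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092") // placeholder addresses

val admin = AdminClient.create(props)

// 6 partitions, each replicated onto 3 different brokers
val topic = new NewTopic("scores", 6, 3.toShort)
admin.createTopics(List(topic).asJava).all().get()

admin.close()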
Administering Kafka
Don't self-host unless you have a REALLY GOOD REASON
Confluent Cloud
AWS MSK
Many Others
What should my Kafka cluster "look" like?
How much data will you have in your cluster at once?
Example: 30TB of data with 5TB of capacity per node... a minimum of 6 nodes
Add a replication factor of 2... now you need 12 nodes!
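The same arithmetic as a quick back-of-the-envelope helper (the numbers are just the example above):

// minimum broker count = ceil(data volume * replication factor / per-node capacity)
val dataTB         = 30.0
val nodeCapacityTB = 5.0

def minNodes(replicationFactor: Int): Int =
  math.ceil(dataTB * replicationFactor / nodeCapacityTB).toInt

minNodes(1) // 6
minNodes(2) // 12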
Okay, but what if I am just getting started? I don't need 30TB of capacity!
My recommendation...
If you are low-key: 3 brokers with replication factor of 2.
If you have high availability requirements: 5 brokers with replication factor of 3.
Good configs to be aware of
auto.create.topics.enable
My advice: set it to false!
delete.topic.enable
Set it to false if you don't need to delete topics very often.
message.max.bytes
Probably don't change this.
Just be aware that it is ~1MB (1048588 bytes)
Retention Policies
Two Options:
1) Time Based
2) Size Based (less common)
3) Infinite Retention?!?!
Tiered Storage 4 The Win!
But what about compaction???
Kafka can "clean up" topics with deletion (time based or size based) OR compaction
(OR both, actually)
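A sketch of setting these as topic-level configs at creation time; the topic name, partition count, and values are illustrative placeholders:

import org.apache.kafka.clients.admin.NewTopic
import scala.jdk.CollectionConverters._

val topicConfigs = Map(
  "cleanup.policy" -> "compact,delete",                      // compaction AND time/size-based deletion
  "retention.ms"   -> (7L * 24 * 60 * 60 * 1000).toString    // 7 days, illustrative
).asJava

val topic = new NewTopic("scores", 6, 3.toShort).configs(topicConfigs)
// pass `topic` to AdminClient.createTopics as in the earlier sketch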
Before Compaction

offset 0: { key: one,   value: {...} }
offset 1: { key: two,   value: {...} }
offset 2: { key: one,   value: {...} }
offset 3: { key: three, value: {...} }
offset 4: { key: one,   value: {...} }
After Compaction

offset 1: { key: two,   value: {...} }
offset 3: { key: three, value: {...} }
offset 4: { key: one,   value: {...} }

(Offsets 0 and 2 are gone: only the latest record for key "one" is kept.)
Deleting a Record

offset 0: { key: one,   value: {...} }
offset 1: { key: two,   value: {...} }
offset 2: { key: one,   value: {...} }
offset 3: { key: three, value: {...} }
offset 4: { key: one,   value: {...} }
offset 5: { key: three, value: null }   <- tombstone

Writing a record with a null value (a "tombstone") tells compaction to eventually remove every record with that key.
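For example, deleting key "three" from a compacted topic is just producing a null value for it. The topic name and broker address below are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // placeholder
props.put("key.serializer", classOf[StringSerializer].getName)
props.put("value.serializer", classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)

// A null value is a tombstone: compaction will eventually drop all records for key "three"
producer.send(new ProducerRecord[String, String]("scores", "three", null))
producer.flush()
producer.close()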
Is compaction right for me?!?
Maybe.
log.cleaner.min.compaction.lag.ms
Good to be aware of... if this is set too low, a slow consumer may never get to see a record before it's compacted away. You may or may not care 🤷‍♀️
What format should I store my data in?
Records in Kafka are just BYTES
Common Data Formats:
- JSON
- Avro
- protobuf
- CSV.. Noooooo!
JSON w/o Schemas
PROS
- Easy to work with
CONS
- No schemas
- Space HOG
JSON with Schemas
PROS
- Easy to work with
- Schemas, including rules for evolving between schema versions
CONS
- Space HOG
Avro
PROS
- Schemas, including rules for evolving between schema versions
- Plenty of tooling around Kafka
- Very compact
CONS
- Harder to work with
Protobuf
PROS
- Schemas, including rules for evolving between schema versions
- Very compact
CONS
- Harder to work with
CSV
So which one should I choose??!
It depends.
Of Course!
But most of the time, Avro or Protobuf. Pick your poison.
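As a sketch of what "just bytes" plus a schema looks like in practice, here's hand-rolled Avro serialization of a record. The Score schema mirrors the example on the next slides; in real code you'd usually let the Confluent Avro serializer and Schema Registry handle this:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Score","fields":[
    |  {"name":"id","type":"int"},
    |  {"name":"score","type":"double"}
    |]}""".stripMargin)

val record = new GenericData.Record(schema)
record.put("id", 42)
record.put("score", 99.5)

// Encode to compact binary -- these bytes are all Kafka ever sees
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
encoder.flush()

val bytes = out.toByteArray // a handful of bytes, far smaller than the equivalent JSON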
Schema Registry
thx Confluent
Unless you are going to have very few topics with very few people working with Kafka...
Use Schema Registry!
Compatibility Strategies
- BACKWARD (the default)
- FORWARD
- FULL
- *_TRANSITIVE
Schema Version 1

{
  "type": "record",
  "name": "Score",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "score", "type": "double" }
  ]
}

Schema Version 2

{
  "type": "record",
  "name": "Score",
  "fields": [
    { "name": "id", "type": "int" }
  ]
}
Don't leave your consumers hanging!!!
FULL_TRANSITIVE
Your new best friend and my personal recommendation
FULL_TRANSITIVE
- Allows you to add and/or delete optional fields. That's it!
- Decouples Producers from Consumers
- Transitive because the non-transitive modes only check against the previous schema version, so a crafty series of changes can still break compatibility with older versions
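Compatibility is set per subject through the Schema Registry REST API. A hedged sketch using Java's built-in HTTP client; the registry URL and subject name are placeholders:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

val registryUrl = "http://schema-registry:8081"  // placeholder
val subject     = "scores-value"                 // placeholder: <topic>-value

val request = HttpRequest.newBuilder()
  .uri(URI.create(s"$registryUrl/config/$subject"))
  .header("Content-Type", "application/vnd.schemaregistry.v1+json")
  .PUT(HttpRequest.BodyPublishers.ofString("""{"compatibility": "FULL_TRANSITIVE"}"""))
  .build()

val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
println(response.body()) // e.g. {"compatibility":"FULL_TRANSITIVE"}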
If you want to dig into this more: go here!
https://docs.confluent.io/current/schema-registry/avro.html
One final note... use the same data format for both the key and the value
Oh also, broker-level schema validation!
Security... blah
If you KNOW you'll need it, then do it up front!
Your options:
- SSL (for encryption and authentication)
- SASL (Authentication)
- ACLs (Authorization)
- RBAC (If you use Confluent)
If you are enterprise, then probably just go with Confluent RBAC w/LDAP
Metrics... JMX
Recap!
- Cluster shape and size
- Some good configs to look at
- Retention and Compaction
- Data Formats
- Security
Producers
[Diagram: a producer writing to a cluster of 3 brokers. Each broker holds a replica of Partitions 1-3, and each partition's leader lives on a different broker; the producer sends each record to the leader of that record's partition.]
Configs that you may want to look at
key/value.serializer
acks
How many brokers need to acknowledge a record before it counts as written.
The default is 1; I would set it to all (every in-sync replica)!
bootstrap.servers
example: `host1:port,host2:port`
These are only used to discover the Kafka cluster, not for repeated connections
retries & max.in.flight.requests.per.connection
From the docs: "Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records..."
compression.type
Consider enabling compression, especially if you have a high volume of data
batch.size
A trade-off: larger batches favor throughput, smaller batches favor latency.
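Pulling the producer configs together, a minimal sketch; broker addresses, topic name, and the specific compression/batch values are illustrative placeholders, and the serializers would match your chosen data format:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092,host2:9092") // placeholders
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.ACKS_CONFIG, "all")                          // wait for all in-sync replicas
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4")              // illustrative choice
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1")  // preserve ordering when retrying
props.put(ProducerConfig.BATCH_SIZE_CONFIG, "32768")                  // bytes; illustrative

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("scores", "test", """{"id": 1, "score": 99.5}"""))
producer.close()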
Consumers
Consumer Groups
[Diagram series: a topic with 3 partitions being read by one consumer group as it grows from 1 to 4 consumers. The partitions get rebalanced across the group's consumers; with 4 consumers and only 3 partitions, the fourth consumer sits idle.]
Configs of which to be aware
auto.offset.reset
Either "earliest" or "latest"
Tells the consumer where to start consuming from when no offsets have been committed.
enable.auto.commit
Defaults to true
Often you'll want to control when offsets are committed yourself, in which case set this to false.
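A minimal consumer sketch with manual offset commits; the group id, topic, and broker addresses are placeholders:

import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer
import scala.jdk.CollectionConverters._

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092,host2:9092") // placeholders
props.put(ConsumerConfig.GROUP_ID_CONFIG, "score-processor")                // placeholder
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest") // start from the beginning if no committed offsets
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")   // we commit only after processing

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(List("scores").asJava)

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
  consumer.commitSync() // commit offsets only once the batch has been handled
}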
Other consumer types
- Kafka Connect - Replication
- Spark/Flink/etc - Stream Processing (joins, etc)
Load Test All The Things
Questions?
Kafka: A Crash Course
By lewisjkl