A distributed log
Append-only, immutable, ordered sequence of records.
[Diagram: a log of records at offsets 0, 1, 2, 3, …, n]
Kafka is run as a cluster of brokers.
[Diagram: a topic split into Partition 0, Partition 1, and Partition 2, each its own ordered log with offsets 0 through n, spread across the brokers]
{
  "key": "test",
  "value": {...}
}

// The producer picks a partition for a keyed record by hashing the key and
// taking it modulo the number of partitions (Kafka's default partitioner
// does this with murmur2 on the serialized key bytes).
def hash(input: String): Long = ???  // stand-in for the real hash function
val key = "test"
val keyHash = hash(key)              // let's say this returns 100, for example
val numOfPartitions = 3
val destinationPartition = keyHash % numOfPartitions // 100 % 3 == 1

[Diagram: out of Partition 0, Partition 1, and Partition 2, the record lands on Partition 1]
Fault tolerance, yay!
If you are low-key: 3 brokers with replication factor of 2.
If you have high availability requirements: 5 brokers with replication factor of 3.
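As a sketch of how you might create such a topic, here is a minimal example using the kafka-clients AdminClient from Scala (the topic name and bootstrap addresses are made up):

import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092,host2:9092")

val admin = AdminClient.create(props)
// a hypothetical "scores" topic: 3 partitions, each replicated to 3 brokers
val topic = new NewTopic("scores", 3, 3.toShort)
admin.createTopics(List(topic).asJava).all().get()
admin.close()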
Just be aware that the default maximum message size is ~1MB (1048588 bytes).
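If you really do need larger records, the limit is configurable. A hedged sketch of the producer-side knob (the broker side is message.max.bytes, the topic-level override is max.message.bytes, and the 2 MB value here is just illustrative):

import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig

val props = new Properties()
// must stay at or below the broker's message.max.bytes (or the topic's max.message.bytes)
props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, (2 * 1024 * 1024).toString)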
1) Time Based
2) Size Based (less common)
3) Infinite Retention?!?!
(OR both, actually)
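A hedged sketch of the topic-level configs behind each option (the values are just illustrative):

# 1) time based: delete segments older than 7 days
retention.ms=604800000
# 2) size based: delete the oldest segments once a partition exceeds ~1 GB
retention.bytes=1073741824
# 3) "infinite" retention: never expire on time, and/or keep only the latest value per key
retention.ms=-1
cleanup.policy=compact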
Before Compaction

offset 0: { key: one,   value: {...} }
offset 1: { key: two,   value: {...} }
offset 2: { key: one,   value: {...} }
offset 3: { key: three, value: {...} }
offset 4: { key: one,   value: {...} }
After Compaction (only the most recent record for each key is kept; offsets are preserved)

offset 1: { key: two,   value: {...} }
offset 3: { key: three, value: {...} }
offset 4: { key: one,   value: {...} }
Deleting a Record

To delete a key from a compacted topic, produce a tombstone: a record with that key and a null value. Compaction will eventually drop the older values for the key, and later the tombstone itself.

offset 0: { key: one,   value: {...} }
offset 1: { key: two,   value: {...} }
offset 2: { key: one,   value: {...} }
offset 3: { key: three, value: {...} }
offset 4: { key: one,   value: {...} }
offset 5: { key: three, value: null }   <- tombstone
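A minimal sketch of producing that tombstone from Scala with the plain producer API (topic name and bootstrap address are made up):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)
// a null value is a tombstone: on a compacted topic it deletes key "three"
producer.send(new ProducerRecord[String, String]("scores", "three", null))
producer.flush()
producer.close()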
Maybe.
Of Course!
thx Confluent
Schema Version 1

{
  "type": "record",
  "name": "Score",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "score", "type": "double" }
  ]
}

Schema Version 2

{
  "type": "record",
  "name": "Score",
  "fields": [
    { "name": "id", "type": "int" }
  ]
}
Don't leave your consumers hanging!!!
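To see why: a sketch of checking whether a consumer still on version 1 can read records written with version 2, using Avro's SchemaCompatibility helper (schemaV1Json and schemaV2Json are assumed to hold the two JSON definitions above):

import org.apache.avro.{Schema, SchemaCompatibility}

val v1 = new Schema.Parser().parse(schemaV1Json) // "Schema Version 1" above
val v2 = new Schema.Parser().parse(schemaV2Json) // "Schema Version 2" above

// reader = old consumer schema (v1), writer = new producer schema (v2);
// "score" was removed without a default, so this reports INCOMPATIBLE
val result = SchemaCompatibility.checkReaderWriterCompatibility(v1, v2)
println(result.getType)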
The Confluent Schema Registry: your new best friend and my personal recommendation.
https://docs.confluent.io/current/schema-registry/avro.html
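A hedged sketch of pointing a producer at the Schema Registry with Confluent's Avro serializer (the registry URL is a placeholder):

import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("schema.registry.url", "http://schema-registry:8081")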
[Diagram: a Producer writing to a 3-broker cluster; Partition 1, Partition 2, and Partition 3 are replicated across Broker 1, Broker 2, and Broker 3, and each broker is the leader for one of the partitions]
acks
The number of brokers that need to acknowledge a record as being received.
Default is 1; I would set it to all!
bootstrap.servers
example: `host1:port,host2:port`
These are only used for the initial discovery of the Kafka cluster, not for every subsequent connection.
From the docs: "Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records..."
Consider enabling compression, especially if you have a high volume of data.
batch.size
Trades throughput against latency: larger batches improve throughput, smaller batches get records out sooner.
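Pulling those producer settings together, a minimal configuration sketch (addresses and values are illustrative, not recommendations for your workload):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092,host2:9092")
props.put(ProducerConfig.ACKS_CONFIG, "all")
props.put(ProducerConfig.RETRIES_CONFIG, "3")
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1") // preserve ordering when retrying
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy")
props.put(ProducerConfig.BATCH_SIZE_CONFIG, (32 * 1024).toString)
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)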
[Diagram: a topic with Partition 1, Partition 2, and Partition 3, and a consumer group growing from 1 to 4 consumers. With 1 consumer it reads all three partitions; with 2, the partitions are split between them; with 3, each consumer gets one partition; with 4, one consumer sits idle, because each partition is read by at most one consumer in the group.]
Either "earliest" or "latest"
Tells the consumer where to start consuming from when no offsets have been committed.
enable.auto.commit
Defaults to true.
Many times you want to control when you commit offsets, so you would set this to false.
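A minimal consumer sketch with manual offset commits (the group id, topic, and address are made up):

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG, "score-consumers")
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(List("scores").asJava)

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
  consumer.commitSync() // commit only after the batch has been processed
}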