A Brief Introduction To Apache Kafka
Pan Chuan
2016-04-22
Outline
- What is Kafka
- Kafka's features and design philosophy
- Comparison with other MQs
- Kafka use cases
What is Kafka
- Kafka is a distributed, high-throughput messaging system
- LinkedIn's original motivation: a unified platform for handling all the real-time data feeds a large company might have
Why create Kafka
Message Queues: RabbitMQ, ActiveMQ
Log aggregators: Flume, Scribe
Kafka
- LinkedIn was not satisfied with the existing MQ and log-aggregation systems
LinkedIn's view: flaws of existing systems
- Often focus on offering a rich set of delivery guarantees (e.g. IBM WebSphere MQ), which increases complexity and may not be needed.
- Do not treat throughput as a design constraint (no batched consumption), e.g. JMS.
- Weak distributed support.
- They assume near-immediate consumption of messages, so the set of unconsumed messages stays very small; this is not good for offline consumers.
- Or they are only suited to offline use, like Scribe.
- Most of them use a push model instead of a pull model.
- More...
Some Concepts
- Topic: Kafka maintains feeds of messages in categories called topics
- Producer: a process that publishes messages to a Kafka topic
- Consumer: a process that subscribes to topics and consumes the published messages
- Broker: Kafka always runs as a cluster; each server in the cluster is called a broker
- Producers push messages to brokers; consumers pull messages from brokers, so the broker can stay quite lazy
Kafka partition
- One topic can have multiple partitions
- Messages are ordered within a partition, but not across partitions
- A consumer instance sees messages in the order they are stored in the log
- Each message in a partition is assigned a sequential id number called the offset
Why partition
- Load balancing: distributing one topic's messages across the cluster avoids a single-machine I/O bottleneck (a keyed-send sketch follows this list)
- Each partition is replicated across a number of servers for fault tolerance
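To make the partitioning behavior concrete, here is a minimal sketch using the kafka-python client (introduced later in these slides); the broker address, topic name, and key are assumptions for illustration. Messages with the same key are hashed to the same partition, so per-key ordering is preserved.

from kafka import KafkaProducer

# Assumes a broker on localhost:9092 and a topic 'my-topic' with several partitions.
producer = KafkaProducer(bootstrap_servers='localhost:9092')
for i in range(3):
    future = producer.send('my-topic', key=b'user-42', value=('event-%d' % i).encode())
    metadata = future.get(timeout=10)
    # All three records should report the same partition, with increasing offsets.
    print(metadata.partition, metadata.offset)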
What is Zookeeper
- A high performance coordination service for distributed applications
- Centralized service for
- Configuration Management
- Naming service
- Group Membership
- Lock Synchronization
- Use case:
- Distributed Cluster Management
- Distributed Synchronization
- Leader election
- Highly reliable registry
Zookeeper in Kafka
- Electing a controller: the controller is a special broker that maintains the leader/follower relationship for all partitions. When a node shuts down, the controller tells replicas on other brokers to become partition leaders.
- Cluster membership, which brokers are alive and part of the cluster.
- Topic configuration
- Access control (ACLs): who is allowed to read from and write to which topics
In short: ZooKeeper takes care of all the metadata about Kafka.
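As a rough illustration of that metadata, the znodes can be inspected directly with the kazoo ZooKeeper client; this is only a sketch, not part of Kafka itself, and it assumes ZooKeeper on localhost:2181 with the kazoo package installed.

from kazoo.client import KazooClient

zk = KazooClient(hosts='localhost:2181')
zk.start()
print(zk.get_children('/brokers/ids'))   # ids of the brokers that are currently alive
data, stat = zk.get('/controller')       # which broker currently acts as the controller
print(data)
zk.stop()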
Kafka's good features
- Fast
- Durable
- Flexible
- Scalable
Fast
- Disk writes and reads are sequential, O(1) time to read and write. Don't fear the filesystem!
- Batched producing and consuming (a producer-config sketch follows this list)
- Gzip/Snappy compression
- Zero-copy
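A small sketch of what batching and compression look like on the producer side with the kafka-python client; the parameter values below are only examples, not recommendations.

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         compression_type='gzip',   # or 'snappy'
                         batch_size=32 * 1024,      # max bytes per partition batch
                         linger_ms=50)              # wait up to 50 ms to fill a batch
producer.send('my-topic', b'batched and compressed')
producer.flush()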
Zero-Copy
Traditional send path vs. zero-copy send path (diagram)
- Kafka uses zero-copy when consuming
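Kafka's broker is JVM code and relies on FileChannel.transferTo for this; the following is only a conceptual Python sketch of the same sendfile idea, assuming Linux, Python 3, and a hypothetical peer listening on port 9999.

import os
import socket

def send_file_zero_copy(path, host='localhost', port=9999):
    # os.sendfile lets the kernel move bytes from the page cache straight
    # to the socket, without a round trip through user-space buffers.
    sock = socket.create_connection((host, port))
    with open(path, 'rb') as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent
    sock.close()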
Kafka performance
- Kafka vs. RabbitMQ/ActiveMQ
Kafka@LinkedIn 2014
- Data types being transported through Kafka:
- Metrics: operational telemetry data
- Tracking: everything a LinkedIn.com user does
- Queuing: between LinkedIn apps, e.g for sending emails
- In total, 200 billion events/day via Kafka:
- Tens of thousands of data producers, thousands of consumers
- 7 million events/sec written, 35 million events/sec read
Durable
- Messages are persisted on disk, utilized by both offline and online consumers
- Topics/partitions are replicated across brokers; N replicas tolerate N-1 failures
Flexible
- Pull is better than push: the consumer controls its own consumption rate. With a push model, the endpoint cannot apply much business logic in real time, and there is no further (re-)consumption.
- Stateless broker: offsets are not maintained by the broker, so a consumer can deliberately rewind and re-consume data (a rewind sketch follows this list)
- Producer load balancing (random, round-robin, hash(key))
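Because the broker keeps no per-consumer state, rewinding is just a seek on the client side. A kafka-python sketch; the topic, partition, and offset are illustrative.

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
tp = TopicPartition('my-topic', 0)
consumer.assign([tp])
consumer.seek(tp, 0)             # or consumer.seek_to_beginning(tp)
for message in consumer:
    # Re-reads everything still retained in partition 0, oldest first.
    print(message.offset, message.value)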
Kafka Message Acking
- acks=0: the producer never waits for an ack from the broker (best latency, weakest durability)
- acks=1: the producer gets an ack after the leader replica has received the data
- acks=-1: the producer gets an ack after all replicas have received the data (best durability, highest latency; see the sketch below)
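In the kafka-python client this trade-off is exposed as the producer's acks setting; a sketch, with an illustrative broker address.

from kafka import KafkaProducer

fire_and_forget = KafkaProducer(bootstrap_servers='localhost:9092', acks=0)      # lowest latency
leader_acked = KafkaProducer(bootstrap_servers='localhost:9092', acks=1)         # leader has the data
fully_acked = KafkaProducer(bootstrap_servers='localhost:9092', acks='all')      # all in-sync replicas ('all' == -1)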
Scalable
- Can be elastically and transparently expanded without downtime
Kafka VS RabbitMQ
- RabbitMQ is broker-centric, focused around delivery guarantees between producers and consumers, with transient preferred over durable messages
- Kafka is producer-centric, based around partitioning a large volume of event data into durable message brokers with cursors, supporting batch consumers that may be offline as well as online consumers that want low latency
Kafka VS RabbitMQ: how to choose
Choose Kafka:
- You have a fire hose of events (100k+/sec) that needs to be delivered in partitioned order, at least once, to a mix of online and batch consumers
- You want to be able to re-consume messages
Choose RabbitMQ:
- You have messages (20k+/sec) that need to be routed in complex ways to consumers
- You want per-message delivery guarantees and don't care about ordered delivery
- You need 24x7 paid support
Kafka VS RedisMQ
- Redis needs as much memory as there are messages in flight; it is better when messages are short-lived and you want more consumer capacity
- Kafka keeps messages much longer, for batch and real-time consuming
- Quite different use cases: Redis is only useful for online operational messaging, while Kafka is best used in high-volume data-processing pipelines
RabbitMQ VS RedisMQ
- For enqueueing, Redis has higher performance with small messages, but becomes intolerably slow once the message size exceeds about 10 KB
- For dequeueing, Redis performs much better than RabbitMQ regardless of message size
Kafka Client
- Kafka uses a binary protocol over TCP that defines all APIs as request-response message pairs.
- The Kafka protocol is fairly simple, with only six core client request APIs:
Metadata, Send, Fetch, Offsets, Offset Commit, Offset Fetch
- A client is easy to implement; just follow the defined protocol.
Kafka Client
Kafka quick start
> wget http://www-us.apache.org/dist/kafka/0.9.0.0/kafka_2.11-0.9.0.0.tgz
> tar -xzf kafka_2.11-0.9.0.0.tgz
> cd kafka_2.11-0.9.0.0
> bin/zookeeper-server-start.sh config/zookeeper.properties &
> bin/kafka-server-start.sh config/server.properties
Kafka Python Client
pip install kafka-python
producer usage:

from kafka import KafkaProducer
import time

def produce():
    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    while True:
        producer.send('my-topic', b"A test message")
        time.sleep(1)

if __name__ == "__main__":
    produce()
Kafka Python Client
consumer usage:

from kafka import KafkaConsumer

def consume():
    consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                             auto_offset_reset='earliest')
    consumer.subscribe(['my-topic'])
    for message in consumer:
        print(message)

if __name__ == "__main__":
    consume()

terminal output:
ConsumerRecord(topic=u'my-topic', partition=0, offset=230, key=None, value='A test message')
ConsumerRecord(topic=u'my-topic', partition=0, offset=231, key=None, value='A test message')
ConsumerRecord(topic=u'my-topic', partition=0, offset=232, key=None, value='A test message')
Producer More Usage

from kafka import KafkaProducer
from kafka.errors import KafkaError
import json
import logging
import msgpack

log = logging.getLogger(__name__)

producer = KafkaProducer(bootstrap_servers=['broker1:1234'])

# Asynchronous by default
future = producer.send('my-topic', b'raw_bytes')

# Block for 'synchronous' sends
try:
    record_metadata = future.get(timeout=10)
except KafkaError:
    # Decide what to do if produce request failed...
    log.exception('produce request failed')

# Successful result returns assigned partition and offset
print(record_metadata.topic)
print(record_metadata.partition)
print(record_metadata.offset)

# produce keyed messages to enable hashed partitioning
producer.send('my-topic', key=b'foo', value=b'bar')

# encode objects via msgpack
producer = KafkaProducer(value_serializer=msgpack.dumps)
producer.send('msgpack-topic', {'key': 'value'})

# produce json messages
producer = KafkaProducer(value_serializer=lambda m: json.dumps(m).encode('ascii'))
producer.send('json-topic', {'key': 'value'})

# produce asynchronously
for _ in range(100):
    producer.send('my-topic', b'msg')

# block until all async messages are sent
producer.flush()

# configure multiple retries
producer = KafkaProducer(retries=5)
Consumer More Usage

from kafka import KafkaConsumer
import json
import msgpack

# To consume latest messages and auto-commit offsets
consumer = KafkaConsumer('my-topic',
                         group_id='my-group',
                         bootstrap_servers=['localhost:9092'])
for message in consumer:
    # message value and key are raw bytes -- decode if necessary!
    # e.g., for unicode: `message.value.decode('utf-8')`
    print("%s:%d:%d: key=%s value=%s" % (message.topic, message.partition,
                                         message.offset, message.key,
                                         message.value))

# consume earliest available messages, don't commit offsets
KafkaConsumer(auto_offset_reset='earliest', enable_auto_commit=False)

# consume json messages
KafkaConsumer(value_deserializer=lambda m: json.loads(m.decode('ascii')))

# consume msgpack
KafkaConsumer(value_deserializer=msgpack.unpackb)

# StopIteration if no message after 1sec
KafkaConsumer(consumer_timeout_ms=1000)

# Subscribe to a regex topic pattern
consumer = KafkaConsumer()
consumer.subscribe(pattern='^awesome.*')

# Use multiple consumers in parallel w/ 0.9 kafka brokers
# typically you would run each on a different server / process / CPU
consumer1 = KafkaConsumer('my-topic',
                          group_id='my-group',
                          bootstrap_servers='my.server.com')
consumer2 = KafkaConsumer('my-topic',
                          group_id='my-group',
                          bootstrap_servers='my.server.com')
Set up a cluster
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties
config/server-1.properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1
config/server-2.properties:
broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &
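With the three brokers above running, a client can bootstrap from any of them. A kafka-python sketch; the topic name is illustrative and assumes a topic already created with a replication factor greater than one.

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['localhost:9092',
                                            'localhost:9093',
                                            'localhost:9094'],
                         acks='all')   # wait for all in-sync replicas
metadata = producer.send('my-replicated-topic', b'replicated message').get(timeout=10)
print(metadata.topic, metadata.partition, metadata.offset)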
A tool for managing Apache Kafka
Kafka use cases
- Messaging:
- A replacement for RabbitMQ in some cases
- Decouples modules, buffers unprocessed messages
- Log Aggregation:
- Collects physical log files across servers into a file server or HDFS
- Abstracts files as a stream of messages
- Compared with Scribe/Flume: equal performance, stronger durability, much lower latency
Kafka use cases
- Website Activity Tracking:
- Rebuild the user activity tracking pipeline as a set of real-time pub-sub feeds
- Publish site activity (page views, searches)
- Load into Hadoop or a data warehouse for offline processing
Kafka use cases
- Stream Processing:
- Many users end up doing stage-wise processing of data where data is consumed from topics of raw data and then aggregated, enriched, or otherwise transformed into new Kafka topics for further consumption.
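A bare-bones sketch of one such processing stage with kafka-python: consume from a raw topic, transform, and publish to a new topic for the next stage. Topic names and the "enrichment" step are placeholders.

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer('raw-events',
                         group_id='enricher',
                         bootstrap_servers='localhost:9092')
producer = KafkaProducer(bootstrap_servers='localhost:9092')

for message in consumer:
    enriched = message.value.upper()   # stand-in for real enrichment logic
    producer.send('enriched-events', enriched)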
Kafka Ecosystem
- Stream Processing
- Hadoop Integration
- Search and Query
- Management Consoles
- AWS Integration
- Logging
- Flume - Kafka plugins
- Metrics
- Packaging and Deployment
- Kafka Camel Integration
- Misc
Kafka Tips
- LinkedIn engineers who built Kafka have founded Confluent to build a data stream platform using Kafka
Useful references