Message Queues
Log aggregators
Kafka
push
pull
broker is quite lazy
offset
- Configuration Management
- Naming service
- Group Membership
- Lock Synchronization
- Distributed Cluster Management
- Distributed Synchronization
- Leader election
- Highly reliable registry
In short: ZooKeeper keeps track of all the metadata about the Kafka cluster.
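To see that metadata directly, one can list the znodes Kafka registers in ZooKeeper. A minimal sketch, assuming ZooKeeper runs locally on port 2181 and the kazoo client library is installed (not part of the original examples):

from kazoo.client import KazooClient

# Connect to the ZooKeeper ensemble Kafka is registered with
zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Brokers register themselves as ephemeral znodes under /brokers/ids
for broker_id in zk.get_children('/brokers/ids'):
    data, _ = zk.get('/brokers/ids/%s' % broker_id)
    print(broker_id, data)  # JSON with host, port, endpoints

# Topic metadata (partitions, replica assignment) lives under /brokers/topics
print(zk.get_children('/brokers/topics'))

zk.stop()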
Traditional data path: disk -> page cache -> user-space buffer -> socket buffer -> NIC (multiple copies and context switches)
Zero-copy data path (sendfile): disk -> page cache -> NIC, bypassing user space
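The difference is an OS-level one, which a short sketch makes concrete. This is illustrative only (assumes Linux, an already-connected socket, and an open file object); Kafka itself relies on the same sendfile mechanism inside the JVM:

import os

# Traditional path: every chunk is copied page cache -> user space -> socket buffer
def send_traditional(f, sock, chunk=64 * 1024):
    while True:
        data = f.read(chunk)   # copy into user space
        if not data:
            break
        sock.sendall(data)     # copy back into the kernel socket buffer

# Zero-copy path: the kernel moves data from the page cache straight to the socket
def send_zero_copy(f, sock):
    offset = 0
    remaining = os.fstat(f.fileno()).st_size
    while remaining > 0:
        sent = os.sendfile(sock.fileno(), f.fileno(), offset, remaining)
        offset += sent
        remaining -= sent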
- Metrics: operational telemetry data
- Tracking: everything a LinkedIn.com user does
- Queuing: between LinkedIn apps, e.g. for sending emails
- Tens of thousands of data producers, thousands of consumers
- 7 million events/sec written, 35 million events/sec read
Messages are persisted on disk and can be utilized by both online and offline consumers
Topics/partitions are replicated across brokers; N replicas tolerate N-1 broker failures
Pull is better than push: the consumer controls its own consumption rate. With a push model, the endpoint can't run much business logic in real time and messages can't be consumed again later
The broker is stateless with respect to consumers: offsets are not maintained by the broker, so a consumer can deliberately rewind and re-consume data
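Because offsets live with the consumer, rewinding is just a client-side seek. A sketch with kafka-python (partition 0 of my-topic assumed to exist):

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')

# Manually assign a partition instead of using group management
tp = TopicPartition('my-topic', 0)
consumer.assign([tp])

# Rewind to the very beginning and re-consume everything
consumer.seek(tp, 0)
for message in consumer:
    print(message.offset, message.value)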
acks=1: the producer gets an ack after the leader replica has received the data (better latency)
acks=-1: the producer gets an ack after all replicas have received the data (better durability)
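In kafka-python this maps to the producer's acks setting; a quick sketch of the trade-off:

from kafka import KafkaProducer

# acks=1: leader-only acknowledgement, lower latency
fast_producer = KafkaProducer(bootstrap_servers='localhost:9092', acks=1)

# acks='all' (equivalent to -1): wait for all in-sync replicas, best durability
safe_producer = KafkaProducer(bootstrap_servers='localhost:9092', acks='all')

fast_producer.send('my-topic', b'low-latency message')
safe_producer.send('my-topic', b'durable message')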
Can be elastically and transparently expanded without downtime.
Choose RabbitMQ:
Choose Kafka:
Core protocol request types: Metadata, Send, Fetch, Offsets, Offset Commit, Offset Fetch
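The Python client wraps several of these request types. A sketch, assuming a recent kafka-python and the local broker, topic, and group used elsewhere in these notes:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(group_id='my-group',
                         bootstrap_servers='localhost:9092')

# Metadata request: what topics/partitions does the cluster know about?
print(consumer.topics())
print(consumer.partitions_for_topic('my-topic'))

# Offsets request: earliest/latest offsets for a partition
tp = TopicPartition('my-topic', 0)
print(consumer.beginning_offsets([tp]), consumer.end_offsets([tp]))

# Offset Fetch request: the group's last committed offset (None if none yet)
print(consumer.committed(tp))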
> wget http://www-us.apache.org/dist/kafka/0.9.0.0/kafka_2.11-0.9.0.0.tgz
> tar -xzf kafka_2.11-0.9.0.0.tgz
> cd kafka_2.11-0.9.0.0
> bin/zookeeper-server-start.sh config/zookeeper.properties &
> bin/kafka-server-start.sh config/server.properties
from kafka import KafkaProducer
import time

def produce():
    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    # send one test message per second to 'my-topic'
    while True:
        producer.send('my-topic', b"A test message")
        time.sleep(1)

if __name__ == "__main__":
    produce()
pip install kafka-python
Kafka Python Client
producer usage:
from kafka import KafkaConsumer

def consume():
    consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                             auto_offset_reset='earliest')
    consumer.subscribe(['my-topic'])
    # print every record received on 'my-topic'
    for message in consumer:
        print(message)

if __name__ == "__main__":
    consume()
Kafka Python Client
consumer usage:
ConsumerRecord(topic=u'my-topic', partition=0, offset=230, key=None, value='A test message')
ConsumerRecord(topic=u'my-topic', partition=0, offset=231, key=None, value='A test message')
ConsumerRecord(topic=u'my-topic', partition=0, offset=232, key=None, value='A test message')
terminal output:
import json
import logging
import msgpack

from kafka import KafkaProducer
from kafka.errors import KafkaError

log = logging.getLogger(__name__)

producer = KafkaProducer(bootstrap_servers=['broker1:1234'])

# Asynchronous by default
future = producer.send('my-topic', b'raw_bytes')

# Block for 'synchronous' sends
try:
    record_metadata = future.get(timeout=10)
except KafkaError:
    # Decide what to do if produce request failed...
    log.exception("produce request failed")

# Successful result returns assigned partition and offset
print(record_metadata.topic)
print(record_metadata.partition)
print(record_metadata.offset)

# produce keyed messages to enable hashed partitioning
producer.send('my-topic', key=b'foo', value=b'bar')

# encode objects via msgpack
producer = KafkaProducer(value_serializer=msgpack.dumps)
producer.send('msgpack-topic', {'key': 'value'})

# produce json messages
producer = KafkaProducer(value_serializer=lambda m: json.dumps(m).encode('ascii'))
producer.send('json-topic', {'key': 'value'})

# produce asynchronously
for _ in range(100):
    producer.send('my-topic', b'msg')

# block until all async messages are sent
producer.flush()

# configure multiple retries
producer = KafkaProducer(retries=5)
Producer More Usage
import json
import msgpack

from kafka import KafkaConsumer

# To consume latest messages and auto-commit offsets
consumer = KafkaConsumer('my-topic',
                         group_id='my-group',
                         bootstrap_servers=['localhost:9092'])
for message in consumer:
    # message value and key are raw bytes -- decode if necessary!
    # e.g., for unicode: `message.value.decode('utf-8')`
    print("%s:%d:%d: key=%s value=%s" % (message.topic, message.partition,
                                         message.offset, message.key,
                                         message.value))

# consume earliest available messages, don't commit offsets
KafkaConsumer(auto_offset_reset='earliest', enable_auto_commit=False)

# consume json messages
KafkaConsumer(value_deserializer=lambda m: json.loads(m.decode('ascii')))

# consume msgpack
KafkaConsumer(value_deserializer=msgpack.unpackb)

# StopIteration if no message after 1sec
KafkaConsumer(consumer_timeout_ms=1000)

# Subscribe to a regex topic pattern
consumer = KafkaConsumer()
consumer.subscribe(pattern='^awesome.*')

# Use multiple consumers in parallel w/ 0.9 kafka brokers
# typically you would run each on a different server / process / CPU
consumer1 = KafkaConsumer('my-topic',
                          group_id='my-group',
                          bootstrap_servers='my.server.com')
consumer2 = KafkaConsumer('my-topic',
                          group_id='my-group',
                          bootstrap_servers='my.server.com')
Consumer More Usage
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties
config/server-1.properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1
config/server-2.properties:
broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &
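With all three brokers running, clients can be pointed at the whole cluster. A sketch, assuming a replicated topic (here called my-replicated-topic, an illustrative name) has already been created:

from kafka import KafkaProducer

# List several brokers so the client can bootstrap even if one is down
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092', 'localhost:9093', 'localhost:9094'],
    acks='all',   # wait for all in-sync replicas
    retries=5)

producer.send('my-replicated-topic', b'message that survives a broker failure')
producer.flush()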
- Replacement for RabbitMQ in some cases.
- Decouples modules and buffers unprocessed messages.
- Collects physical log files across servers to a file server or HDFS.
- Abstracts log files as a stream of messages.
- Compared with Scribe/Flume: equal performance, stronger durability, and much lower latency.
- Rebuild the user activity tracking pipeline as a set of real-time pub-sub feeds.
- Publish site activity (page views, searches)
- Load into Hadoop or a data warehouse for offline processing.
- Many users end up doing stage-wise processing of data where data is consumed from topics of raw data and then aggregated, enriched, or otherwise transformed into new Kafka topics for further consumption.
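A minimal sketch of one such stage with kafka-python: consume a raw topic, enrich, and publish to a new topic (the topic names raw-events and enriched-events are illustrative only):

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer('raw-events',
                         group_id='enricher',
                         bootstrap_servers='localhost:9092',
                         value_deserializer=lambda m: json.loads(m.decode('utf-8')))
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         value_serializer=lambda m: json.dumps(m).encode('utf-8'))

for message in consumer:
    event = message.value          # assumes each raw event is a JSON object
    event['enriched'] = True       # stand-in for real aggregation/enrichment
    producer.send('enriched-events', event)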