Kafka

Heshani Bandaranayake
May 20, 2020


Introduction

Kafka is a distributed streaming platform, originally developed at LinkedIn, that provides a framework for storing, reading, and analyzing streaming data. It is a good fit for large-scale message-processing applications.

Kafka is designed to be run in a distributed environment, which means instead of sitting on one computer, it runs across several (or many) servers, leveraging the additional processing power and storage capacity that this brings.

It consists of the following components:

  1. Producer
  2. Consumer
  3. Broker
  4. Cluster
  5. Topic
  6. Partitions
  7. Offset
  8. Consumer Groups

Uses of Kafka

§ Build real-time streaming data pipelines

§ Enable in-memory microservices

§ Build real-time streaming applications that react to streams

§ Perform real-time data analytics

§ Transform, aggregate, join, and react to real-time data flows

Type of Kafka

To identify the type of Kafka, let's look at how it works.

The producer is an application that sends messages. A message can be anything from a simple string to a complex object. Messages are sent to a broker, which is the Kafka server, and are processed by other applications called consumers. The consumer, in turn, is an application that receives messages. Producers don't send data to a recipient address; they just send it to the Kafka server.
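
As a minimal sketch of this (the broker address and the topic name "greetings" are placeholder assumptions, not anything prescribed by Kafka), a producer using the Java client might look like this:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of a Kafka broker (placeholder; adjust for your cluster).
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The producer only names a topic; it never addresses a specific consumer.
            producer.send(new ProducerRecord<>("greetings", "key-1", "Hello, Kafka!"));
        }
    }
}
```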

Messages are stored in a topic, and consumers can subscribe to the topic and listen for those messages. A topic is an arbitrary name given to a data set.
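
The consuming side mirrors this. A minimal sketch (the topic, group name, and broker address are again placeholders): the consumer subscribes by topic name and polls for records.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "greetings-readers");       // hypothetical group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("greetings"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The offset identifies the record's position within its partition.
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```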

So, the producer and the consumer are applications that play different roles, and they never communicate directly. The type, therefore, is client-server.

Kafka has the following features:

Fault Tolerant

Kafka Streams builds on fault-tolerance capabilities integrated natively within Kafka. Kafka partitions are highly available and replicated, so when stream data is persisted to Kafka it remains available even if the application fails and needs to re-process it. Tasks in Kafka Streams leverage the fault-tolerance capability offered by the Kafka consumer to handle failures: if a task runs on a machine that fails, Kafka Streams automatically restarts that task in one of the remaining running instances of the application.
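
Replication is what makes this possible, and it is configured per topic. Here is an illustrative sketch, using the Java AdminClient, of creating a topic whose partitions are replicated across three brokers (the topic name and the partition/replica counts are assumptions for the example):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3: each partition has copies on 3 brokers,
            // so the data survives the loss of any single broker.
            NewTopic topic = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```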

Highly Available

There are some tricks that can be used to keep a stream-processing cluster highly available during a rolling upgrade. The underlying idea behind standby replicas is still valid: having hot standby machines ready to take over when the time is right is a good way to ensure high availability if and when instances die.

Even though Kafka Streams does not offer built-in functionality for high availability during a rolling upgrade of a service, it can still be achieved at the infrastructure level. Kafka Streams itself is a lightweight Java library that enables developers to write highly scalable stream-processing applications.
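
For instance, Kafka Streams can keep hot standby copies of its local state through the num.standby.replicas setting. A minimal configuration sketch (the application id and broker address are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StandbyConfig {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-app"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Keep one standby copy of each state store on another instance,
        // so a failover does not have to rebuild state from the changelog topic.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        return props;
    }
}
```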

Recoverable

There are several delivery semantics for failure recovery, such as 'at-most-once' and 'exactly-once'. Kafka provides durability and fault tolerance; however, we are responsible for configuring the corresponding parameters and for designing an architecture that can deal with failovers, in order to ensure that we never lose any data.
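
On the producer side, much of that responsibility comes down to configuration. A hedged sketch of conservative durability settings (the broker address is a placeholder):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DurableProducerConfig {
    public static Properties props() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Wait until all in-sync replicas have acknowledged each write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry transient failures without producing duplicate records.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        return props;
    }
}
```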

Consistent

In order to properly tolerate a broker outage, we must first understand the behavior of a Kafka cluster during a broker outage.

During a broker outage, all partition replicas on the broker become unavailable, so the availability of the affected partitions is determined by the existence and status of their other replicas. If a partition has no additional replicas, the partition becomes unavailable. If a partition has additional replicas that are in-sync, one of these in-sync replicas becomes the interim partition leader. Finally, if the partition has additional replicas but none are in-sync, we have a choice to make: either we wait for the partition leader to come back online, sacrificing availability, or we allow an out-of-sync replica to become the interim partition leader, sacrificing consistency.
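
This trade-off is controlled by configuration. A sketch that favors consistency over availability (the topic name and the exact values are assumptions for the example):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class ConsistentTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

        Map<String, String> configs = new HashMap<>();
        // Never elect an out-of-sync replica as leader: prefer unavailability to data loss.
        configs.put("unclean.leader.election.enable", "false");
        // Writes with acks=all must reach at least 2 replicas before they succeed.
        configs.put("min.insync.replicas", "2");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("payments", 3, (short) 3).configs(configs);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```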

Scalable

Kafka brings together the processing scale of message queues and the loosely coupled architecture of publish-subscribe models by implementing consumer groups, which allow processing to scale while supporting multiple domains and message reliability. Rebalancing in Kafka lets consumers maintain fault tolerance and scalability in equal measure.

Thus, using Kafka consumer groups in designing the message processing side of a streaming application allows users to leverage the advantages of Kafka’s scale efficiently.
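
In practice, scaling out means starting another consumer with the same group.id; Kafka rebalances the topic's partitions across the group members. A sketch (the group and topic names are hypothetical):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        // Every instance started with this group.id shares the topic's partitions.
        props.put("group.id", "orders-processors");       // hypothetical group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // Each instance receives records only from its assigned partitions;
                // adding or removing instances triggers a rebalance.
                consumer.poll(Duration.ofMillis(500))
                        .forEach(r -> System.out.println(r.partition() + ": " + r.value()));
            }
        }
    }
}
```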

Predictable Performance

Kafka relies on the file system for storage and caching. The problem is that disks are slower than RAM, because the seek time of a disk is large compared to the time required to actually read the data.

But if you can avoid seeking, you can in some cases achieve latencies as low as RAM. Kafka does this through sequential I/O.

One advantage of sequential I/O is that you get a cache without writing any caching logic in your application. Modern operating systems allocate most of their free memory to disk caching, so if you read in an ordered fashion, the OS can read ahead and store data in the cache on each disk read.
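
The effect is easy to observe. The sketch below (the file path is a placeholder; it assumes a pre-existing file of at least ~64 MB, and the exact numbers will depend on your disk and page cache) compares random seeks against sequential reads of the same file:

```java
import java.io.RandomAccessFile;
import java.util.Random;

public class SeekVsSequential {
    public static void main(String[] args) throws Exception {
        String path = "/tmp/data.bin"; // placeholder: any large existing file
        byte[] buf = new byte[4096];
        int reads = 10_000;

        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long range = file.length() - buf.length;
            Random rng = new Random(42);

            long t0 = System.nanoTime();
            for (int i = 0; i < reads; i++) {
                // Random access: every read pays for a seek.
                file.seek((rng.nextLong() & Long.MAX_VALUE) % range);
                file.readFully(buf);
            }
            long randomNs = System.nanoTime() - t0;

            file.seek(0);
            long t1 = System.nanoTime();
            for (int i = 0; i < reads; i++) {
                // Sequential access: the OS read-ahead keeps the page cache warm.
                file.readFully(buf);
            }
            long sequentialNs = System.nanoTime() - t1;

            System.out.printf("random: %d ms, sequential: %d ms%n",
                    randomNs / 1_000_000, sequentialNs / 1_000_000);
        }
    }
}
```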

Secure

Kafka has the following security features:

§ Authentication of connections to brokers from clients (producers and consumers), other brokers, and tools, using either SSL or SASL (Kerberos). SASL/PLAIN can also be used from release 0.10.0.0 onwards.

§ Authentication of connections from brokers to ZooKeeper.

§ Encryption of data transferred between brokers and clients, between brokers, or between brokers and tools, using SSL. (Note that there is a performance degradation when SSL is enabled, the magnitude of which depends on the CPU type and the JVM implementation.)

§ Authorization of read/write operations by clients.

§ Authorization is pluggable, and integration with external authorization services is supported.
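
From a client's point of view, these features appear as configuration. A hedged sketch of a SASL/PLAIN-over-SSL client setup (the host name, credentials, and truststore path are all placeholders):

```java
import java.util.Properties;

public class SecureClientConfig {
    public static Properties props() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9093"); // placeholder host
        // Encrypt traffic with SSL and authenticate with SASL/PLAIN.
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"client\" password=\"client-secret\";"); // placeholder credentials
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks"); // placeholder path
        props.put("ssl.truststore.password", "changeit"); // placeholder
        return props;
    }
}
```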
