Hadoop ecosystem, components and its architecture

Posted by GUPTA, Gagan | Published: June 27, 2021


Introduction

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Apache Hadoop is open-source software for affordable supercomputing; it provides the distributed file system and the parallel processing required to run a massive computing cluster. This learning path provides an explanation and demonstration of the most popular components in the Hadoop ecosystem. It defines and describes theory and architecture, while also providing instruction on installation, configuration, usage, and low-level use cases for the Hadoop ecosystem.

History of Hadoop ecosystem

In 2002, internet researchers Doug Cutting and Mike Cafarella just wanted a better search engine, and preferably an open-source one; they called their project Nutch. In 2004, Google published a paper describing its MapReduce algorithm for distributed computation, and the idea evolved into MapReduce, a basic component of Hadoop. Google, however, never open-sourced its MapReduce or GFS implementations; the Nutch team re-implemented the ideas, and Apache released the work as an open-source project in 2005. In 2006, when Doug Cutting joined Yahoo!, he named the project after his son's toy elephant, and Hadoop was born. In 2007, Yahoo successfully tested Hadoop on a 1,000-node cluster and started using it, which was a big thing in those days. Soon Facebook and the New York Times also started using Hadoop. In April 2008, Hadoop beat the supercomputers of the day and became the fastest system on the planet at sorting an entire terabyte of data. Doug Cutting joined Cloudera in 2009; Cloudera is a sponsor of the Apache Software Foundation and is focused on enterprise-class deployments of the technology. In December 2011, the Apache Software Foundation released Apache Hadoop version 1.0. Hortonworks merged with Cloudera in January 2019.

Three Basic Components of Hadoop Architecture

HDFS: Hadoop Distributed File System

HDFS is the primary component of Hadoop and is responsible for storing large data sets of structured or unstructured data across various nodes. HDFS is a fault-tolerant and self-healing distributed filesystem designed to turn a cluster of industry-standard servers into a massively scalable pool of storage. Developed specifically for large-scale data processing workloads where scalability, flexibility, and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, and scales to proven deployments of 100PB and beyond.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
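
To make this concrete, here is a minimal sketch of writing and inspecting a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem). The NameNode address hdfs://namenode:9000 and the path /user/demo/hello.txt are placeholders used purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder NameNode address
    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/user/demo/hello.txt");     // placeholder file path
      try (FSDataOutputStream out = fs.create(path)) {
        out.writeUTF("Hello HDFS");                     // NameNode records metadata; DataNodes store the blocks
      }
      // Block placement and replication are decided by the NameNode, based on cluster configuration
      System.out.println("Replication factor: " + fs.getFileStatus(path).getReplication());
    }
  }
}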

YARN: Yet Another Resource Negotiator

The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

The ResourceManager has two main components:
- The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues, etc. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status.
- The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
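
As a small illustration of these roles, the sketch below uses the YarnClient API to ask the ResourceManager which applications it currently knows about. It assumes the usual yarn-site.xml (with the ResourceManager address) is on the classpath; this is an illustrative sketch, not part of the original article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnListApps {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new Configuration());   // picks up yarn-site.xml from the classpath
    yarnClient.start();
    // The ResourceManager is the authority on cluster-wide application state
    for (ApplicationReport app : yarnClient.getApplications()) {
      System.out.println(app.getApplicationId() + "  " + app.getYarnApplicationState());
    }
    yarnClient.stop();
  }
}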

MapReduce: Programming based Data Processing

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster-node, and MRAppMaster per application.

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration.
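
The classic word-count job illustrates these interfaces: a Mapper emits (word, 1) pairs, the framework shuffles and sorts them by key, and a Reducer sums the counts. The sketch below follows the standard Hadoop WordCount example; the input and output paths are supplied on the command line, and the class would typically be packaged into a jar and submitted with the hadoop jar command.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);          // emit (word, 1) for every token
      }
    }
  }
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);          // emit (word, total count)
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner pre-aggregates map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}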


Applications Running Natively in the Hadoop Framework

Spark: In-Memory data processing

Spark is a unified analytics engine for large-scale data processing. It creates Resilient Distributed Datasets (RDDs), which help it process data fast. RDDs are fault-tolerant collections of elements that can be distributed and processed in parallel across multiple nodes in a cluster. Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, R, and SQL shells. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. It can access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.

At a fundamental level, an Apache Spark application consists of two main components:
- a driver, which converts the user's code into multiple tasks that can be distributed across worker nodes;
- executors, which run on those nodes and execute the tasks assigned to them.
Some form of cluster manager is necessary to mediate between the two. The advantage of Spark is its speed: Spark's in-memory data engine means that it can perform tasks up to one hundred times faster than MapReduce in certain situations, particularly in multi-stage jobs that require writing state back out to disk between stages.
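
A minimal driver program in Java shows this division of labour: the driver defines RDD transformations, the cluster manager assigns executors, and only an action such as collect() triggers distributed execution. The local[*] master below is a placeholder for illustration; on a Hadoop cluster the master would be YARN.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-sketch").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // The driver builds an RDD; its partitions are processed in parallel by the executors
      JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
      JavaRDD<Integer> squares = numbers.map(n -> n * n);   // transformation, evaluated lazily
      List<Integer> result = squares.collect();             // action, pulls results back to the driver
      System.out.println(result);
    }
  }
}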

Learning Apache Spark is easy whether you come from a Java, Scala, Python, R, or SQL background.

Pig, Hive: Query-based processing of data

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs. Pig's language layer currently consists of a textual language called Pig Latin, which is easy to use, optimized, and extensible.
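
Pig Latin scripts are usually run from the Grunt shell, but they can also be embedded in Java through the PigServer API. The sketch below is a hedged example of the latter, running a word count in local mode; the file names access.log and wordcount-out are placeholders.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
  public static void main(String[] args) throws Exception {
    // Local mode for illustration; MAPREDUCE mode would run the same script on the cluster
    PigServer pig = new PigServer(ExecType.LOCAL);
    pig.registerQuery("lines = LOAD 'access.log' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "wordcount-out");   // triggers compilation and execution of the plan
    pig.shutdown();
  }
}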

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. It provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. Hive also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
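
Applications typically reach Hive through the HiveServer2 JDBC driver and HiveQL statements. The sketch below assumes a HiveServer2 instance at jdbc:hive2://hiveserver:10000/default and a hypothetical pageviews table, with the hive-jdbc driver on the classpath; all of these are illustrative placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // Placeholder HiveServer2 URL, user, and password
    try (Connection con = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "hive", "");
         Statement stmt = con.createStatement()) {
      stmt.execute("CREATE TABLE IF NOT EXISTS pageviews (url STRING, hits INT)");
      // HiveQL is compiled into distributed jobs that run over data stored in Hadoop
      try (ResultSet rs = stmt.executeQuery("SELECT url, SUM(hits) FROM pageviews GROUP BY url")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}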

HBase: NoSQL Database

HBase is an open-source, column-oriented, non-relational distributed (NoSQL) database modeled for real-time read/write access to big data. It is built on top of HDFS and is used for exposing data on clusters to transactional platforms.

It supports all data types and so can handle any kind of data inside a Hadoop system. HBase itself is written in Java, and its applications can be written using REST, Thrift, and Avro APIs. HBase is designed for problems where a small amount of data must be looked up within a huge volume of data, and for data that does not fit a traditional relational database. It does not support SQL queries itself; however, SQL-like queries can be run against HBase through other Apache tools such as Hive, which can work on top of HBase and perform database operations. HBase is designed to store structured and semi-structured data that may have billions of rows and millions of columns.

HBase has two main components:
- HBase Master is responsible for negotiating load balancing across all Region Servers and maintaining the state of the cluster. It is not part of the actual data storage or retrieval path.
- RegionServer is deployed on each machine and hosts data and processes I/O requests.
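
As shown below, the HBase Java client works in terms of row keys, column families, and cells rather than SQL. The users table and info column family are hypothetical examples, and the client is assumed to locate the cluster through an hbase-site.xml on the classpath (which points at the ZooKeeper quorum).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {   // placeholder table
      // Write one cell: row key, column family, qualifier, value
      Put put = new Put(Bytes.toBytes("row-1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Gagan"));
      table.put(put);

      // Read the cell back by row key
      Result result = table.get(new Get(Bytes.toBytes("row-1")));
      System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}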

Mahout, Spark MLLib: Machine Learning algorithm libraries

Mahout is a distributed linear algebra framework with a mathematically expressive Scala DSL designed to let users quickly implement their own algorithms. It brings scalable machine learning algorithms to big data. With it, we can build self-learning systems that, rather than being explicitly programmed, learn from past experience, user behavior, and data patterns and, much like other artificial intelligence systems, use what they have learned to make decisions. Mahout can perform clustering, filtering, and collaborative-filtering operations.

Spark MLlib tackles the complexities of working with distributed data in machine learning. It simplifies the development and deployment of scalable machine learning pipelines.

Zookeeper: Managing cluster

ZooKeeper provides operational services for a Hadoop cluster: a distributed configuration service, a synchronization service, and a naming registry for distributed systems. Distributed applications use ZooKeeper to store and mediate updates to important configuration information, and it is used for coordinating and managing services, and for tracking nodes, in a distributed environment.

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers, known as znodes. Every znode is identified by a path, with path elements separated by a slash ('/'). Aside from the root, every znode has a parent, and a znode cannot be deleted if it has children.

This is much like a normal file system, but ZooKeeper provides superior reliability through redundant services. The service is replicated over a set of machines, each of which maintains an in-memory image of the data tree along with transaction logs. Clients connect to a single ZooKeeper server and maintain a TCP connection through which they send requests and receive responses.
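
The sketch below shows the znode model through the standard ZooKeeper Java client: it creates a persistent znode and reads it back. The ensemble address zk-host:2181 and the /app-config path are placeholders for illustration.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkSketch {
  public static void main(String[] args) throws Exception {
    // "zk-host:2181" is a placeholder ensemble address; 3000 ms is the session timeout
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> System.out.println("event: " + event));
    // Create a persistent znode holding a small piece of shared configuration
    if (zk.exists("/app-config", false) == null) {
      zk.create("/app-config", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}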

Oozie: Job Scheduling

Apache Oozie Workflow Scheduler for Hadoop is a workflow and coordination service for managing Hadoop jobs. Its major components are a workflow engine that runs workflow jobs defined as Directed Acyclic Graphs (DAGs) of actions, and a coordinator engine that triggers those workflow jobs.
- Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions; actions are Hadoop jobs (such as MapReduce, Streaming, Hive, Sqoop and so on) or non-Hadoop actions such as Java, shell, Git, and SSH.
- Oozie Coordinator jobs trigger recurrent Workflow jobs based on time (frequency) and data availability.
- Oozie Bundle jobs are sets of Coordinator jobs managed as a single job.

Oozie is an extensible, scalable and data-aware service that you can use to orchestrate dependencies among jobs running on Hadoop.
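
Jobs are usually submitted with the oozie command-line tool, but the Java client shows the same flow compactly. In the sketch below, the Oozie URL, the HDFS application path, and the nameNode/jobTracker properties are all placeholders, and the workflow.xml is assumed to already exist at that application path.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder Oozie server URL
    OozieClient client = new OozieClient("http://oozie-host:11000/oozie");
    Properties conf = client.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/my-workflow");  // placeholder path
    conf.setProperty("nameNode", "hdfs://namenode:8020");
    conf.setProperty("jobTracker", "resourcemanager:8032");
    String jobId = client.run(conf);   // submits and starts the workflow DAG
    System.out.println("Workflow status: " + client.getJobInfo(jobId).getStatus());
  }
}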

Kafka: Distributed messaging and streaming

Kafka is a real-time streaming data architecture that enables real-time analytics. Kafka is a distributed commit log service that functions much like a publish/subscribe messaging system, but with better throughput, built-in partitioning, replication, and fault tolerance. Increasingly popular for log collection and stream processing, it is often (but not exclusively) used in tandem with Apache Hadoop, Apache Storm, and Spark Streaming.

A log can be considered as a simple storage abstraction. Because newer entries are appended to the log over time, from left to right, the log entry number is a convenient proxy for a timestamp. Conceptually, a log can be thought of as a time-sorted file or table.

Kafka integrates this unique abstraction with traditional publish/subscribe messaging concepts (such as producers, consumers, and brokers), parallelism, and enterprise features for improved performance and fault tolerance. (A topic is a category of messages that share similar characteristics.)

Kafka is very fast, handling on the order of two million writes per second in published benchmarks. Kafka persists all data to disk, which essentially means that all writes go to the page cache of the OS (RAM); this makes it very efficient to transfer data from the page cache to a network socket.
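
Producers append records to a topic and consumers read them back independently; the minimal producer below illustrates the write path. The broker address broker1:9092 and the clickstream topic are placeholders, and the kafka-clients library is assumed to be on the classpath.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");   // placeholder broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Appends one record to the "clickstream" topic; Kafka partitions records by key
      producer.send(new ProducerRecord<>("clickstream", "user-42", "page=/home"));
    }
  }
}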


What are the challenges of using Hadoop?

MapReduce programming is not a good match for all problems. It's good for simple information requests and problems that can be divided into independent units, but it's not efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because the nodes don't intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between MapReduce phases and is inefficient for advanced analytic computing.

There's a widely acknowledged talent gap. It can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce. That's one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop. It is much easier to find programmers with SQL skills than MapReduce skills. And, Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware and Hadoop kernel settings.

Data security. Another challenge centers on fragmented data security, though new tools and technologies are surfacing. The Kerberos authentication protocol is a great step toward making Hadoop environments secure.

Full-fledged data management and governance. Hadoop does not have easy-to-use, full-feature tools for data management, data cleansing, governance and metadata. Especially lacking are tools for data quality and standardization.

Conclusion

All the elements of the Hadoop ecosystem are open-source Apache projects. The Hadoop ecosystem architecture is simple to understand at a high level, but each component of the architecture is a big subject in its own right. A number of companies offer commercial implementations of, or support for, Hadoop, together with their professional services; these bundles are worth your consideration. Hadoop can be deployed in a traditional onsite datacenter as well as in the cloud. The cloud allows organizations to deploy Hadoop without the need to acquire hardware or specific setup expertise.

More than 80% of the Fortune 100 companies use Hadoop; do you need a better use case than that?

Support our effort by subscribing to our YouTube channel. Update yourself with our latest videos on Data Science.

Looking forward to seeing you soon; till then, keep learning!
