Big Data Overview

Posted by GUPTA, Gagan
Published: June 23, 2021



What is Big Data?

In very simple terms, Big data is data that exceeds the processing capacity of traditional databases. The data is too big to be processed by a single machine. New and innovative methods are required to process and store such large volumes of data.

Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and extract insights from large datasets. It is a combination of structured, semi-structured, and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling, and other advanced analytics applications.

Why Every Organization May Need a Big Data Strategy

Despite the hype, many organizations don't realize they have a big data problem or they simply don't think of it in terms of big data. In general, an organization is likely to benefit from big data technologies when existing databases and applications can no longer scale to support sudden increases in volume, variety, and velocity of data.

Failure to correctly address big data challenges can result in escalating costs, as well as reduced productivity and competitiveness. On the other hand, a sound big data strategy can help organizations reduce costs and gain operational efficiencies by migrating heavy existing workloads to big data technologies, as well as by deploying new applications to capitalize on new opportunities.

Characteristics of Big Data

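In 2001, Gartner's Doug Laney first presented what became known as the 'three Vs of big data' (volume, velocity, and variety) to describe some of the characteristics that make big data different from other data processing. Further Vs, such as veracity, variability, and value, have since been added to the list: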

Volume. Ranges from terabytes to petabytes of data. Often, because the work requirements exceed the capabilities of a single computer, this becomes a challenge of pooling, allocating, and coordinating resources from groups of computers.

Velocity. Increasingly, businesses have stringent requirements on the time between when data is generated and when actionable insights are delivered to users. Data is constantly being added, massaged, processed, and analyzed to keep up with the influx of new information and to surface valuable information early, when it is most relevant. Therefore, data needs to be collected, stored, processed, and analyzed within relatively short windows, ranging from daily to real time.

Variety. Data can be ingested from internal systems like application and server logs, from social media feeds and other external APIs, from physical device sensors, and from other providers. Big data seeks to handle potentially useful data regardless of where it comes from by consolidating all information into a single system.

Veracity. Refers to the degree of accuracy in data sets and how trustworthy they are. Raw data collected from various sources can cause data quality issues that may be difficult to pinpoint. If these issues aren't fixed through data cleansing processes, bad data leads to analysis errors that can undermine the value of business analytics initiatives.

Variability. Variation in the data leads to wide variation in quality. Additional resources may be needed to identify, process, or filter low-quality data to make it more useful.

Value. Not all the data that's collected has real business value or benefits. As a result, organizations need to confirm that data relates to relevant business issues before it's used in big data analytics projects. Sometimes, the systems and processes in place are complex enough that using the data and extracting actual value can become difficult.
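
To make the veracity point more concrete, here is a minimal Python sketch of a data cleansing pass. The records, field names, and quality rules are all hypothetical and chosen only for illustration; real cleansing rules depend on the data set and the business context.

```python
# Hypothetical raw records; field names and values are illustrative only.
RAW_RECORDS = [
    {"id": 1, "amount": "19.99", "country": "US"},
    {"id": 2, "amount": "", "country": "DE"},        # missing amount
    {"id": 3, "amount": "-5.00", "country": "FR"},   # negative amount
    {"id": 1, "amount": "19.99", "country": "US"},   # duplicate id
    {"id": 4, "amount": "42.50", "country": "IN"},
]


def cleanse(records):
    """Split records into (clean, rejected) using three simple quality rules."""
    clean, rejected, seen_ids = [], [], set()
    for rec in records:
        try:
            amount = float(rec["amount"])
        except (KeyError, ValueError):
            rejected.append((rec, "amount missing or not numeric"))
            continue
        if amount < 0:
            rejected.append((rec, "negative amount"))
        elif rec["id"] in seen_ids:
            rejected.append((rec, "duplicate id"))
        else:
            seen_ids.add(rec["id"])
            # Store the amount as a number so later analysis can use it directly.
            clean.append({**rec, "amount": amount})
    return clean, rejected


clean, rejected = cleanse(RAW_RECORDS)
print(f"{len(clean)} clean records, {len(rejected)} rejected")
for rec, reason in rejected:
    print(f"  rejected id={rec['id']}: {reason}")
```

Keeping the rejected records together with a reason, rather than silently dropping them, makes it easier to trace quality problems back to their source.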

Our On-Premise Corporate Classroom Training is designed for your immediate training needs


What Does a Big Data Life Cycle Look Like?

In most cases, big data processing involves a common data flow - from collection of raw data to consumption of actionable information.

Ingesting data into the system. Collecting the raw data - transactions, logs, mobile devices and more - is the first challenge many organizations face when dealing with big data. A good big data platform makes this step easier, allowing developers to ingest a wide variety of data - from structured to unstructured - at any speed - from real-time to batch.

Persisting the data in storage. Any big data platform needs a secure, scalable, and durable repository to store data before or even after processing tasks. Depending on your specific requirements, you may also need temporary stores for data in transit.

Computing and Analyzing data. This is the step where data is transformed from its raw state into a consumable format - usually by means of sorting, aggregating, joining and even performing more advanced functions and algorithms. The resulting data sets are then stored for further processing or made available for consumption via business intelligence and data visualization tools.

Visualizing the results. Big data is all about getting high value, actionable insights from your data assets. Ideally, data is made available to stakeholders through self-service business intelligence and agile data visualization tools that allow for fast and easy exploration of datasets. Depending on the type of analytics, end-users may also consume the resulting data in the form of statistical 'predictions' - in the case of predictive analytics - or recommended actions - in the case of prescriptive analytics.
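
The four stages above can be sketched end to end in a few lines of Python. This is a toy illustration under stated assumptions, not a real platform: the event format is hypothetical, a plain list stands in for durable storage, and a text bar chart stands in for a business intelligence tool.

```python
import json
from collections import defaultdict

# Hypothetical raw events; in practice these would come from logs, devices, or APIs.
RAW_LINES = [
    '{"user": "a", "action": "view", "ms": 120}',
    '{"user": "b", "action": "buy", "ms": 340}',
    '{"user": "a", "action": "buy", "ms": 200}',
    'not valid json',
]


def ingest(lines):
    """Stage 1: parse raw input, skipping lines that cannot be parsed."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def persist(events, store):
    """Stage 2: append events to a store (a list stands in for durable storage)."""
    store.extend(events)
    return store


def compute(store):
    """Stage 3: aggregate raw events into a per-action summary."""
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0})
    for event in store:
        totals[event["action"]]["count"] += 1
        totals[event["action"]]["total_ms"] += event["ms"]
    return dict(totals)


def visualize(summary):
    """Stage 4: render one text bar per action, sorted by action name."""
    return [
        f"{action:<5} {'#' * stats['count']} ({stats['count']})"
        for action, stats in sorted(summary.items())
    ]


store = persist(ingest(RAW_LINES), [])
summary = compute(store)
report = visualize(summary)
print("\n".join(report))
```

Each stage only depends on the output of the previous one, which is what allows real platforms to swap in different ingestion, storage, or visualization tools independently.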

Tools Used in Big Data

- Apache Hadoop. A Java-based free software framework that can effectively store large amounts of data in a cluster. The framework runs in parallel on the cluster and lets us process data across all nodes. Hadoop Distributed File System (HDFS) is the storage system of Hadoop; it splits big data and distributes it across many nodes in a cluster. It also replicates data within the cluster, thus providing high availability.
- Microsoft HDInsight. A big data solution from Microsoft, powered by Apache Hadoop, which is available as a service in the cloud. HDInsight uses Windows Azure Blob storage as the default file system. This also provides high availability at low cost.
- NoSQL. Stands for Not Only SQL. While traditional SQL can be used effectively to handle large amounts of structured data, we need NoSQL to handle unstructured data. NoSQL databases store unstructured data with no particular schema. Each row can have its own set of column values. NoSQL can give better performance when storing massive amounts of data. There are many open-source NoSQL databases available for analysing big data.
- Hive. A distributed data management layer for Hadoop. It supports a SQL-like query language, HiveQL (HQL for short), to access big data. It is used primarily for data mining purposes and runs on top of Hadoop.
- Sqoop. A tool that connects Hadoop with various relational databases to transfer data. It can be used effectively to transfer structured data to Hadoop or Hive.
- PolyBase. This works on top of SQL Server 2012 Parallel Data Warehouse (PDW) and is used to access data stored in PDW. PDW is a data warehousing appliance built for processing any volume of relational data, and it provides integration with Hadoop, allowing us to access non-relational data as well.
- Big data in Excel. As many people are comfortable doing analysis in Excel, a popular tool from Microsoft, you can also connect to data stored in Hadoop using Excel 2013. Hortonworks, which focuses on providing enterprise Apache Hadoop, offers an option to access big data stored in its Hadoop platform using Excel 2013. You can use the Power View feature of Excel 2013 to easily summarise the data.
- Presto. Facebook developed and open-sourced its SQL-on-Hadoop query engine, named Presto, which is built to handle petabytes of data. Unlike Hive, Presto does not depend on the MapReduce technique and can retrieve data quickly.
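
The NoSQL idea that each row can have its own set of column values can be illustrated with a small Python sketch. A list of dictionaries stands in for a document collection here; the field names and the `find` and `find_with_tag` helpers are hypothetical and do not correspond to the API of any particular NoSQL database.

```python
# A list of dicts stands in for a schemaless document collection.
# All field names and values are hypothetical.
collection = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "name": "Ben", "tags": ["vip", "beta"]},
    {"_id": 3, "name": "Chen", "address": {"city": "Pune"}, "tags": ["beta"]},
]


def find(docs, **criteria):
    """Return documents that contain every given field with the given value."""
    return [
        doc for doc in docs
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


def find_with_tag(docs, tag):
    """Return documents whose optional 'tags' list contains the tag."""
    return [doc for doc in docs if tag in doc.get("tags", [])]


# No two documents share exactly the same set of fields.
all_fields = sorted({key for doc in collection for key in doc})
beta_names = [doc["name"] for doc in find_with_tag(collection, "beta")]

print(all_fields)
print(beta_names)
```

Because no schema is enforced, queries have to tolerate missing fields, which is why both helpers check for a field's presence before comparing its value.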


Technologies for Big Data Handling

Big data technologies are important in providing more accurate analysis, which may lead to more concrete decision-making resulting in greater operational efficiencies, cost reductions, and reduced risks for the business. To harness the power of big data, you would require an infrastructure that can manage and process huge volumes of structured and unstructured data in real-time and can protect data privacy and security. There are various technologies in the market from different vendors including Amazon, IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we examine the following two classes of technology.

Operational Big Data. This includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored. NoSQL big data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational big data workloads much easier to manage, cheaper, and faster to implement.

Analytical Big Data. This includes systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data. MapReduce provides a method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines. Big data handling techniques and tools include Hadoop, MapReduce, and Bigtable. Of these, Hadoop is one of the most widely used.
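
The MapReduce model mentioned above can be sketched with the classic word count example. This is a single-process Python illustration of the map, shuffle, and reduce phases, not Hadoop's actual API; the function names and input splits are assumptions made for the example. In a real cluster, each split would be mapped on a different node and the shuffle would move data across the network.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input, already divided into two splits.
splits = ["big data is big", "data is everywhere"]

# Each split is mapped independently, which is what makes the model parallelizable.
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

The map and reduce functions never share state, so adding more machines mainly means giving each one fewer splits or fewer keys to handle.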

Hadoop. Hadoop is an Apache open source framework written in Java. It can process high volumes of data in any structure, and it allows distributed storage and distributed processing of very large data sets. The main components of Hadoop are:
- Hadoop distributed file system (HDFS)
- MapReduce
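
To show how splitting and replicating data can provide availability, here is a simplified Python sketch of the idea behind HDFS. It is a toy model under stated assumptions: the block size, node names, and round-robin placement policy are hypothetical, and real HDFS uses much larger blocks and a more involved, rack-aware placement strategy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    The round-robin policy is an assumption made for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + offset) % len(nodes)] for offset in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a healthy node."""
    return [
        block for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


data = b"x" * 10                       # a tiny 10-byte "file"
blocks = split_into_blocks(data, 4)    # block sizes: 4, 4, 2
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), nodes, replication=3)

# With three replicas per block, losing any two nodes leaves every block readable.
survivors = readable_blocks(placement, failed_nodes={"node1", "node2"})
print(placement)
print(survivors)
```

Storing several copies costs extra disk space, but it means a single node failure does not make any part of the file unavailable.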

Big Data Challenges

Unless organizations are able to manage change effectively, they won't reap the full benefits of a transition to Big Data. Some areas are particularly important in that process.

Leadership. Companies succeed in the big data era not simply because they have more or better data, but because they have leadership teams that set clear goals, define what success looks like, and ask the right questions. Business leaders who can spot a great opportunity, understand how a market is developing, think creatively and propose truly novel offerings, articulate a compelling vision, persuade people to embrace it and work hard to realize it, and deal effectively with customers, employees, stockholders, and other stakeholders will define the success of the organization in the next decade.

Talent management. As data become cheaper, the complements to data become more valuable. Some of the most crucial of these are technical resources like ML engineers, data scientists, and other professionals skilled at working on large data science projects. Along with the data scientists, a new generation of computer scientists is bringing to bear techniques for working with very large data sets. Expertise in the design of experiments can help cross the gap between correlation and causation. The best data scientists and engineers are also comfortable speaking the language of business and helping leaders reformulate their challenges in ways that big data can tackle. Not surprisingly, people with these skills are hard to find and in great demand.

Technology. The tools available to handle the volume, velocity, and variety of big data have improved greatly in recent years. In general, these technologies are not prohibitively expensive, and much of the software is open source. Hadoop, the most commonly used framework, combines commodity hardware with open-source software. It takes incoming streams of data and distributes them onto cheap disks; it also provides tools for analyzing the data. However, these technologies do require a skill set that is new to most IT departments, which will need to work hard to integrate all the relevant internal and external sources of data. Although attention to technology isn't sufficient, it is always a necessary component of a big data strategy.

Decision making. An effective organization puts information and the relevant decision rights in the same location. In the big data era, information is created and transferred, and expertise is often not where it used to be. The artful leader will create an organization flexible enough to minimize the 'not invented here' syndrome and maximize cross-functional cooperation. People who understand the problems need to be brought together with the right data, but also with the people who have problem-solving techniques that can effectively exploit them.

Support our effort by subscribing to our YouTube channel. Update yourself with our latest videos on Data Science.

Looking forward to seeing you soon. Till then, keep learning!
