A Guide for Choosing Big Data Platform

GUPTA, Gagan       Posted by GUPTA, Gagan
      Published: June 24, 2021

Enjoy listening to this Blog while you are working with something else !


"The presumption is, self-determination is a good thing and choice is essential to self-determination." - The Paradox of Choice.

I am not quite sure when the plain "old data" became "big data". But it's pretty well established that Data powers strategic decision-making, unearths new revenue streams, and reveals hidden waste generators eating into your bottom line.

That being said, data alone can't do any of those things. Organizations still need to assemble a tech stack that connects all relevant data sources. Unfortunately, selecting the data analytics tools and platforms that put organizations on the right path isn't easy. Choosing the right big data platform from many different available choices, can be a daunting task.

Every organization seeking to make sense of big data must determine which platforms and tools, in the sea of available options, will help them to meet their business goals. Let us break down the entire process in steps.

What is a Big Data Platform

A big data platform is an enterprise class integrated IT solution that allows you to ingest, process, store, access, analyze, and visualize big data. The platform provides several tools, hardware and utilities into one packaged solution for managing and analyzing Big Data.

Depending on its use; this platform can be broken down into 3 parts:
- Data Lake or Data Warehouse: ingest, process, store, access.
- Business Intelligence: analyze and present.
- Data Science: Statistics, Machine Learning and Artificial Intelligence (a special form of analysis).

What Are The Available Platform?

Your search for major Big Data Analytics Platform will pretty much come down to following choices :
- Microsoft Azure
- Cloudera
- Google Cloud
- Tableau
- MapR
- Hortonworks
- Oracle
- MongoDB
- Amazon Web Services
- IBM Cloud and many more...

Many of them are Cloud based. One sure bet can be to look for vendors for "try and buy". But, its not that easy either. Setting up a trial test bed to test out the tool in your own processing environment, using your own data, learning the platform, data models, and algorithms, involved a lot of technical and other complications. This is where the Business Process owners turns to expert companies who can assist them in the process.

Different platform offers different services. They have different capabilities. They are configured differently to serve differently. Most of them might be applied to majority of the use cases. There are multiple steps that organizations can take on their own to make the process to choose the platform a bit easier. Its a maturity process. Because big data applies to such a broad spectrum of use cases, applications, and industries, it's hard to nail down a definitive list of selection criteria.

Businesses differ in analytics requirements and needs but should seek new platforms from the vantage point of what is already in place. With analytics and machine learning becoming an important core competency, technologists should consider deepening their understanding of the available platforms and their capabilities. The power and value of analytics platforms will only increase, as will their influence throughout the enterprise.

Our On-Premise Corporate Classroom Training is designed for your immediate training needs

A Guide for Choosing Big Data Platform
A Guide for Choosing Big Data Platform

The Criteria to Choose the Platform

In order to choose the right big data platform, it's important to understand the transactional and analytical data processing requirements of your systems and choose accordingly. The decision to choose a particular platform for a certain application usually depends on the following important factors: data size, speed or throughput optimization and model development.

On-premise vs cloud-based big data platforms

The first big decision you'll need to make is whether you want to host your big data software in your own data center or if you want to use a cloud-based solution. Currently, more organizations seem to be opting for the cloud. Cloud-based big data applications are popular for several reasons, including scalability and ease of management. The major cloud vendors are also leading the way with artificial intelligence and machine learning research, which is allowing them to add advanced features to their solutions. However, cloud isn't always the best option. Organizations with high compliance or security requirements sometimes find that they need to keep sensitive data on premises. In addition, some organizations already have investments in existing on-premises data solutions, and they find it more cost effective to continue running their big data applications locally or to use a hybrid approach.

Data size

The size of data that is being considered for processing is probably the another most important factor. If the data can fit into the system memory, then clusters are usually not required and the entire data can be processed on a single machine. The platforms such as GPU, Multicore CPUs etc. can be used to speed up the data processing in this case. If the data does not fit into the system memory, then one has to look at other cluster options such as Hadoop, Spark etc. The user has to decide if he needs to use off-the-shelf tools which are available for Hadoop or if he wants to optimize the cluster performance in which case Spark is more appropriate.

Speed & throughput optimization

Here, speed refers to the ability of the platform to process data in real-time whereas throughput refers to the amount of data that system is capable of handling and processing simultaneously. The users will need to be clear about whether the goal is to optimize the system for speed or throughput. If one needs to process large amount of data and do not have strict constraints on the processing time, then one can look into systems which can scale out to process huge amounts of data such as Hadoop, Peer-to-Peer networks, etc. These platforms can handle large-scale data but usually take more time to deliver the results. On the other hand, if one needs to optimize the system for speed rather than the size of the data, then they need to consider systems which are more capable of real-time processing such as GPU, FPGA etc. These platforms are capable of processing the data in real-time but the size of the data supported is rather limited.

Training & applying a model

In big data analytics, training of the model is typically done offline and it usually takes a significant amount of time and resources. A model is typically applied in an online environment where the user expects the results within a short period of time (almost instantaneously). This creates a strong need for investigating different platforms for training and applying a model depending on the end-user application. Usually, during the training process, the user needs to deal with large amount of training data and since training is done offline, the processing time is not critical. This makes horizontal scale out platforms suitable for training a model. When a model is applied, results are expected in real-time and vertical scale up platforms are usually preferred. GPUs are preferable over other platforms when the user has strict real-time constraints and the data can be fit into the system memory. On the other hand, the HPC clusters can handle more data compared to GPUs but are not suitable for real-time processing compared to GPUs.

Data Architecture

  The core data architecture is one of the most important aspects in a data platform architecture, but by far not the only one. The goal is to find a suitable storage and database solution to meet your requirements. There are three basic options to choose from with impact on all the other topics below:

- Relational Database

  Very mature database systems with a lot of built-in intelligence to handle large data sets efficiently. These systems have the most expressive analysis tools, but often are more complex with respect to maintenance. Nowadays, these systems are also available as distributed systems and can handle extremely large data sets.

- NoSQL Style Sharding Database

  These systems sacrifice some of the classical features of relational databases for more power in other areas. These systems offer much less analytical power by themselves, but are easier to manage, have high availability and easy backups. They are designed to handle extremely large datasets. There are efforts to mimic the analytical power of SQL that is available in the relational database world with tools on top (Impala, Hive). If you have huge amounts of data or specific requirements for streaming or real-time data you should take a look at these specialized data systems. Examples are Cassandra, the Hadoop Ecosystem, Elasicsearch, Druid, MongoDB, Kudu, InfluxDB, Kafka, neo4j, and Dgraph.

- File-Based Systems

  It is possible to design a big data strategy solely on files. File structures like the Parquet file standard enable you to use very cheap storage to store very large data sets distributed over many storage nodes or on a Cloud Object Store like Amazon S3. The main advantage is that the data storage system suffices to respond to data access requests. The two systems above need to run services on extra compute nodes to respond to data queries. With a solution like Apache Drill you can query parquet files with similar comfort that is known from SQL.

Process Automation

When you have many sources, targets, and data transformation processes in between, you also have many dependencies and with this a certain run schedule logic. The automation of processes is part of every data warehouse and involves a lot of complexity. There are dedicated tools (Apache Airflow, Automate, Control-M, Luigi etc) to handle only the scheduling of processes.

Process automation also requires you to manage the selection of data chunks to process, i.e. in an incremental load scenario every execution of a process needs to incrementally pick specific chunks of source data to pass on to the target. This Data Scope Management is usually implemented by a metadata driven approach, i.e. there are dedicated metadata tables that keep track of the process state of each chunk and can be queried to coordinate the processing of all chunks.

Import Interfaces

We categorize import interfaces into three different sections:

- Files
Still the most common form of data.

- Web Services
Plenty available on the net with relevant data.

- Databases
Although in many organizations a lot of data is stored in traditional databases, in most cases direct database access is not exposed to the internet and therefore not available for Cloud Data Platforms. Web services can be placed in between on-premise databases and cloud services to handle security aspects and access control. Another alternative is the use of ssh-tunneling over secure jump hosts.

Real-time streams Real-time data streams as delivered by messaging routers (speaking WAMP, MQTT, AMQP, etc) and are not used very much today but are going to become more important with the rise of IoT.

Our On-Premise Corporate Classroom Training is designed for your immediate training needs

Tips for Selecting a Big Data Platform

Clearly, choosing the right big data application is a complicated process that involves a myriad of factors. Experts and organizations that have successfully deployed big data software offer the following advice:

- Understand your goals
As previously mentioned, knowing what you want to accomplish is of paramount importance when choosing a big data application. If you aren't sure why you are investing in a particular technology, your project is unlikely to succeed.

- Start small
If you can demonstrate success with a small-scale big data analytics project, that will generate interest in using the tool throughout the company.

- Take a holistic approach
While a small-scale project can help you gain experience and expertise with your technology, it's important to choose an application that can ultimately be used throughout the business. IT professionals need to create a new end-to-end architecture built for agility, scale and experimentation as analytics is becoming more holistic.

- Work together
Goes beyond any explanation, the key to succeed is that organizations attempt to build a data-driven culture, and that requires a great deal of cooperation among business and IT leaders. Silos won't work.


All big data tools have a learning curve, but some tools are supported better than others. The power and value of analytics platforms will only increase, as will their influence throughout the enterprise. Create tangible deliverables that accelerate your data and analytics modernization initiatives. During the RFP process, thoroughly look into your vendor's support options. Once you have narrowed down your options, you'll need to evaluate the big data applications you are considering using the parameters that suits you organization best.
The main takeaway here is that you should have a clear business case and strategy in place before investing in any solutions.

Support our effort by subscribing to our youtube channel. Update yourself with our latest videos on Data Science.

Looking forward to see you soon, till then Keep Learning !

Our On-Premise Corporate Classroom Training is designed for your immediate training needs

A Guide for Choosing Big Data Platform

Corporate Scholarship Career Courses