7 V's of Big Data

Posted by GUPTA, Gagan
Published: June 25, 2021



The practice of gathering and storing large amounts of information, and then attempting to make sense of it, has been around for centuries. The internet of things (IoT) revolutionized big data in 2014: with an internet-connected world, more businesses shifted spending toward big data to reduce operational costs, boost efficiency, and develop new products and services. Emerging technologies like artificial intelligence and machine learning are harnessing big data for future automation and helping humans unveil new solutions. The big data market is accelerating at mind-boggling speed, and IoT is one of the main drivers of that acceleration. Making sense of all this data, and using it to derive unique, cost-effective, and potentially groundbreaking discoveries, is where the real value of big data lies. Big data is certainly not easy to grasp, especially with today's vast amounts and varieties of data, so experts have broken it down into 3 (or 5, or 7, or ...) easier-to-understand segments. Some of the most important of these V's are described below.


Volume

When we talk about Big Data we mean BIG. Unimaginably big. Simply stated, big data is too big to work on one computer. This is a relative definition: what can't run on today's computers will easily run on the computers of the future.
- One Google search uses as much computing power as the entire Apollo space mission.
- Excel used to hold up to 65k rows in a single spreadsheet. Now it holds over a million.

Big data volume defines the 'amount' of data that is produced, and the value of data is also dependent on its size. Nobody knows for sure how much data is being created today. Some experts say it amounts to roughly 2.5 quintillion bytes created every single day (there are 18 zeros in a quintillion). To get an idea of 2.5 quintillion bytes: it's like 750,000,000 HD-quality DVDs. By the year 2025, daily data creation is expected to reach 463 quintillion bytes. According to industry experts, Google, Facebook, Microsoft, and Amazon hold about 50% of all the data created daily, and since its inception around 2010-2012, big data has been doubling every 2 years or so.

Today data is generated from various sources in different formats, structured and mostly unstructured. These formats include Word and Excel documents, PDFs, and reports, along with media content such as images and videos. Due to the data explosion caused by digital channels, social media, and mobile apps, data is being produced in such large chunks that it has become challenging for enterprises to store and process it using conventional methods of business intelligence and analytics. Enterprises must implement modern business intelligence tools to effectively capture, store, and process such unprecedented amounts of data in real time.
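The scale of these numbers is easier to feel with a little back-of-the-envelope arithmetic. The sketch below (the 2.5-quintillion figure is the rough industry estimate quoted above, and the two-year doubling period is the trend mentioned in the text) projects how daily volume compounds:

```python
QUINTILLION = 10**18

# Rough industry estimate quoted in the text: ~2.5 quintillion bytes per day.
daily_bytes = 2.5 * QUINTILLION

def projected_daily_volume(base_bytes, years, doubling_period_years=2):
    """Project data volume assuming it doubles every `doubling_period_years`."""
    return base_bytes * 2 ** (years / doubling_period_years)

# Ten years out at a two-year doubling rate means 32x today's volume.
growth = projected_daily_volume(daily_bytes, 10) / daily_bytes
print(growth)  # → 32.0
```

The point of the exercise is the exponent: a fixed doubling period turns any starting volume, however it is estimated, into an unmanageable one within a decade.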


Velocity

No, data velocity doesn't mean it travels at warp speed. It means that data flows into organizations at an ever-accelerating rate, and the faster you can process and analyze that data, the faster you can respond compared with your competitors. Velocity refers to the speed at which data is generated, collected, and analyzed. Data continuously flows through multiple channels such as computer systems, networks, social media, and mobile phones. In today's data-driven business environment, the pace at which data grows can best be described as 'torrential' and 'unprecedented'.

This data should be captured as close to real time as possible, making the right data available at the right time. The speed at which data can be accessed has a direct impact on making timely and accurate business decisions. Velocity can often matter even more than volume, because it can give us a bigger competitive advantage: sometimes it's better to have limited data in real time than lots of data at low speed. Even a limited amount of data available in real time yields better business results than a large volume of data that takes a long time to capture and analyze. Several big data technologies today allow us to capture and analyze data in real time, as it is being generated. Normally, the highest-velocity data streams directly into memory rather than being written to disk. Some internet-enabled smart products operate in real time or near real time and require real-time evaluation and action. The data has to be available at the right time to make appropriate business decisions.

Why is velocity complex? Consider an example: a single jet engine generates more than 10 TB of data in 30 minutes of flight time. Now imagine how much data you would have to collect to research one small aero company. Data never stops growing, and every new day you have more information to process than yesterday.
This is why working with big data is so complicated. Sadly, the rate at which data is growing is quickly outpacing our ability to decipher it, given that the amount of data in the world doubles in size every two years. Even more unfortunate is the fact that only 3 percent of the world's data is organized, with only 0.5 percent actually ready to be analyzed. I read somewhere on the internet that the big data universe is expanding much like our physical universe of stars, planets, galaxies, and dark matter. Hardware deals primarily with volume and velocity, as these are the physical constraints of the data.
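The jet-engine figure translates into a data rate worth working out explicitly. A minimal sketch (the fleet size and flight hours below are hypothetical, chosen only to illustrate how quickly the stream compounds):

```python
TB = 10**12

# From the text: one engine produces >10 TB in 30 minutes of flight.
engine_bytes = 10 * TB
minutes = 30

bytes_per_second = engine_bytes / (minutes * 60)
print(f"{bytes_per_second / 10**6:.0f} MB/s per engine")  # ≈ 5556 MB/s

# Hypothetical small fleet: 20 engines, 8 flight hours a day.
fleet = 20
hours_per_day = 8
daily_bytes = fleet * (engine_bytes / 0.5) * hours_per_day  # 20 TB per engine-hour
print(f"{daily_bytes / 10**15:.1f} PB per day")  # → 3.2 PB per day
```

Roughly 5.5 GB arrives every second from a single engine; even a toy fleet produces petabytes per day, which is why such streams are evaluated in memory rather than landed on disk first.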

Our On-Premise Corporate Classroom Training is designed for your immediate training needs


Variety

What is the data part of Big Data? The majority of today's data does not come in neatly organized packages; it does not fit in the static tables of traditional, structured databases. In fact, more than 80% of today's data is unstructured.

Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a relational database, and there weren't a whole lot of options for using such data aside from simple classification or perhaps finding a trend. Big data has drastically changed the data landscape: with its rise, data arrives in new, unstructured types. Unstructured and semi-structured data, such as text, emails, images, audio, and video, require additional preprocessing to derive meaning and support metadata. Data sources may involve external sources as well as internal business units, and the importance of these sources varies depending on the nature of the business and the underlying problem. This data can have many layers with different values, and it is not homogeneous; in order to derive any insights from it, you need to classify and organize it first.

In data science, this is often referred to as data cleaning. This operation is frequently the most labor-intensive, as it involves all of the pre-work required to set up the high-performance compute. This is where the vast majority of errors and issues in the data are found, and it is the fundamental bottleneck in high-performance computing.
- Structured data is data that is generally well organized and it can be easily analyzed by a machine or by humans - it has a defined length and format.
- Semi-structured data is a form that only partially conforms to the traditional data structure (e.g. log files) - it is a mix between structured and unstructured data and because of that some parts can be easily organized and analyzed, while other parts need a machine that will sort it out.
- Unstructured data is unorganized information that can be described as chaotic - almost 80% of all data is unstructured in nature (e.g. texts, pictures, videos, mobile data, etc.).

Each data type has its own uniqueness in terms of size and how it's stored and classified in a cloud, database, etc. What also makes each format unique is how we analyze them to derive valuable solutions.
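The three data shapes above can be made concrete in a few lines. A minimal sketch (the record, log line, and review text below are all invented for illustration; the log layout is a hypothetical format):

```python
import re

# Structured: a fixed schema, directly queryable.
structured = {"user_id": 42, "purchase": "book", "amount": 12.99}

# Semi-structured: a log line with a loose but recoverable layout.
log_line = "2021-06-25 10:31:07 ERROR payment-service timeout after 30s"
LOG_PATTERN = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) (?P<level>\w+) (?P<service>\S+) (?P<message>.+)"
)
semi_structured = LOG_PATTERN.match(log_line).groupdict()
print(semi_structured["level"])  # → ERROR

# Unstructured: free text that needs NLP-style preprocessing before analysis.
unstructured = "Loved the product, but delivery took two weeks. Would order again."
```

Notice the gradient of effort: the structured record is usable as-is, the log line needs one regular expression, and the free text needs an entire preprocessing pipeline before any machine can analyze it.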


Veracity

Can you trust the data that you have collected? Is this data credible enough to glean insights from? Should we base our business decisions on the insights garnered from this data? All these questions, and more, are answered when the veracity of the data is known.

Big data veracity refers to the assurance of quality or credibility of the collected data. Not all data is precise or consistent, and with the growth of big data, it's becoming harder to determine which data actually brings value.

Quality and accuracy are sometimes difficult to control when it comes to gathering big data. Since big data involves a multitude of data dimensions resulting from multiple data types and sources, there is a possibility that gathered data will come with some inconsistencies and uncertainties. That is why establishing the validity of data is a crucial step that must be carried out before the data is processed.
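A veracity check of this kind is usually just a battery of rules applied before processing. A minimal sketch, assuming hypothetical sales records with an `amount` and a `country` field:

```python
def validate_record(record):
    """Return a list of veracity issues found in one hypothetical sales record."""
    issues = []
    if record.get("amount") is None:
        issues.append("missing amount")
    elif record["amount"] < 0:
        issues.append("negative amount")
    if record.get("country") not in {"IN", "US", "DE"}:
        issues.append("unknown country code")
    return issues

records = [
    {"amount": 12.5, "country": "IN"},   # clean
    {"amount": -3.0, "country": "US"},   # inconsistent value
    {"amount": None, "country": "XX"},   # missing and unknown
]

# Only records that pass every check go on to processing.
clean = [r for r in records if not validate_record(r)]
print(len(clean))  # → 1
```

Real pipelines layer many more rules (deduplication, cross-source reconciliation, freshness checks), but the shape is the same: verify first, process second.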


Validity

This refers to a model's ability to represent reality. Models, by their very nature, are idealized approximations of reality; some are very good, others dangerously flawed. Frequently, model builders simplify their models in order to make them computationally tractable. With hardware acceleration, we can remove these shackles from the model builder and let them simulate closer to reality.

With Big Data, we're not simply collecting a large number of records. We're collecting multidimensional data that spans a broadening array of variables. The secret is uncovering the latent, hidden relationships among these variables. Our first task is to assess the viability of that data because, with so many varieties of data and variables to consider in building an effective predictive model, we want to quickly and cost-effectively test and confirm a particular variable's relevance before investing in the creation of a fully featured model. And, like virtually all scientific disciplines, that process begins with a simple hypothesis.

We want to validate that hypothesis before we take further action and, in the process of determining the viability of a variable, we can expand our view to determine if other variables - those that were not part of our initial hypothesis - have a meaningful impact on our desired or observed outcomes. Make a hypothesis, test your hypothesis, and conclude.
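One cheap way to screen a candidate variable before investing in a full model is a plain correlation test. A minimal sketch (the ad-spend and sales figures are invented, and the 0.5 cutoff is an arbitrary screening threshold, not a statistical standard):

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation: enough to screen a candidate variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothesis: ad spend drives sales. Test it on a small sample first.
ad_spend = [1, 2, 3, 4, 5]
sales    = [11, 19, 31, 42, 48]

r = pearson(ad_spend, sales)
viable = abs(r) > 0.5  # keep the variable for full modelling only if the signal is strong
print(round(r, 2), viable)
```

Hypothesize, test, conclude: a strong correlation justifies the cost of building the fully featured model, while a weak one lets you discard the variable before that investment.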



Value

Big data value refers to the usefulness of gathered data for your business. Value is whatever is important to the customer; another way to define it is the removal of obstacles in the customer's path, allowing them to reach their stated destination. We often think of value in terms of cost, but we can also think of it in terms of enablement, and what that is worth to the customer.

Data by itself, regardless of its volume, usually isn't very useful - to be valuable, it needs to be converted into insights or information, and that is where data processing steps in. By using custom processing software, you can derive useful insights from gathered data, and that can add value to your decision-making process. However, the value of data is subjective. This means that, while something is valuable for one business or user, it can be worthless for another.

When we talk about the value of data, we usually talk about 2 of its final outputs:
- Its ability to generate cash flow
- Its ability to solve problems

After a significant investment in time and resources, a company that correctly uses big data gains an enormous ability to get to know its customers and monetize all that information: it can offer customers what they want or need at the right time. Think of Amazon's recommender engine suggesting more items to you, or Netflix's recommending movies to a specific user. This is only possible because they both managed to generate high value from their data.
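The recommender idea can be sketched with simple purchase co-occurrence counts. This is not Amazon's or Netflix's actual method, just a toy illustration of the principle with invented shopping baskets:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase histories; the pairing logic is the point, not the data.
baskets = [
    {"laptop", "mouse", "keyboard"},
    {"laptop", "mouse"},
    {"phone", "charger"},
    {"laptop", "keyboard"},
]

# Count how often each pair of items is bought together.
co_counts = defaultdict(int)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, k=2):
    """Return the k items most often bought together with `item`."""
    scores = {b: n for (a, b), n in co_counts.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("laptop"))
```

Production systems replace raw counts with similarity measures over millions of users, but the value-creating step is the same: turning stored transactions into a timely suggestion.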


Visualization

Relevant information should not only exist, but should also be visible to the right person at the right time. A core task for any big data processing system is to transform its immense scale into something easily comprehended and actionable. For human consumption, one of the best methods is converting it into graphical formats. Spreadsheets and even three-dimensional visualizations are often not up to the task, however, because of velocity and variety: there may be a multitude of spatial and temporal parameters, and relationships between them, to condense into visual form. In a business context, appropriate visualization of data and dashboards is critical for management to extract value from their limited time, resources, and even more limited attention span! Hence, big data must be visualized with appropriate tools that serve different parameters, to help data scientists and analysts understand it better. Plotting billions of data points is not an easy task, however, and it involves different techniques such as treemaps, network diagrams, cone trees, etc.
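At its simplest, visualization means condensing counts into bars a human can scan in a glance. A toy sketch (the page-view numbers are invented; real dashboards would use a charting library rather than text):

```python
page_views = {"home": 1200, "blog": 450, "pricing": 300, "docs": 900}

def bar_chart(data, width=40):
    """Render a dict of counts as a quick terminal bar chart, largest first."""
    peak = max(data.values())
    lines = []
    for label, value in sorted(data.items(), key=lambda kv: -kv[1]):
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:>8} | {bar} {value}")
    return "\n".join(lines)

print(bar_chart(page_views))
```

Even this crude rendering demonstrates the payoff: the ranking that was implicit in four numbers becomes visible instantly, which is exactly what dashboards do at scale with treemaps and network diagrams.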

The Road Ahead

Data is the oil of the 21st century, and organizations across industries are realizing this quickly. While most organizations today do intend to use data, many are struggling to effectively capture, store, process, or harness it. Big data will continue to grow and develop, and advancements in emerging technologies like AI and machine learning will only make it more valuable. More V's will evolve as the field expands.

Support our effort by subscribing to our YouTube channel. Keep yourself updated with our latest videos on Data Science.

Looking forward to seeing you soon; till then, Keep Learning!

