Data Mining

Posted by GUPTA, Gagan
Published: June 5, 2021


What is Data Mining

Data mining is the process of finding anomalies, patterns and correlations within large data sets to predict outcomes. Using a broad range of techniques, you can use this information to increase revenues, cut costs, improve customer relationships, reduce risks and more.
Data mining is a cornerstone of analytics, helping you develop the models that can uncover connections within millions or billions of records.

Data mining depends on effective data collection, warehousing, and computer processing.

The data sources can include databases, data warehouses, the web, and other information repositories or data that are streamed into the system dynamically.

Data Mining History & Current Advances

The process of digging through data to discover hidden connections and predict future trends has a long history. Sometimes referred to as "knowledge discovery in databases," the term "data mining" wasn't coined until the 1990s. But its foundation comprises three intertwined scientific disciplines, namely:

1.) Statistics
Statistics is the foundation of most technologies on which data mining is built, e.g. regression analysis, standard distributions, standard deviation, variance, discriminant analysis, cluster analysis, and confidence intervals. All of these are used to study data and data relationships.

2.) Artificial Intelligence
Artificial intelligence, or AI, which is built upon heuristics as opposed to statistics, attempts to apply human-thought-like processing to statistical problems. Certain AI concepts were adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS).

3.) Machine Learning
Machine learning is the union of statistics and AI. It could be considered an evolution of AI, because it blends AI heuristics with advanced statistical analysis. Machine learning attempts to let computer programs learn about the data they study, such that programs make different decisions based on the qualities of the studied data, using statistics for fundamental concepts, and adding more advanced AI heuristics and algorithms to achieve its goals.

What was old is new again, as data mining technology keeps evolving to keep pace with the limitless potential of big data and affordable computing power.

Why is data mining important?

So why is data mining important? You've seen the staggering numbers - the volume of data produced is doubling every two years. Unstructured data alone makes up 90 percent of the digital universe. But more information does not necessarily mean more knowledge.

Our On-Premise Corporate Classroom Training is designed for your immediate training needs

Data Mining Process

The data mining process is divided into two parts: data preprocessing and data mining. Data preprocessing involves data cleaning, data integration, data reduction, and data transformation. The data mining part performs data mining, pattern evaluation, and knowledge representation. Many factors determine the usefulness of data, such as accuracy, completeness, consistency, and timeliness. Data has quality if it satisfies its intended purpose. Thus, preprocessing is crucial in the data mining process.

#1) Data Cleaning
Data cleaning is the first step in data mining. It matters because dirty data, if used directly in mining, can confuse the mining procedures and produce inaccurate results.
This step involves removing noisy or incomplete data from the collection. Many methods that clean data automatically are available, but they are not robust.
This step carries out the routine cleaning work by:
(i) Filling in the missing data.
(ii) Removing the noisy data. Random error in the data is called noise.
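
As a rough illustration of the two cleaning tasks above, here is a minimal sketch in plain Python. The sample readings, the choice of the median as the fill value, and the sliding-window median smoother are all assumptions made for this example; they are one common option, not the only way to clean data.

```python
from statistics import median

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def smooth_by_median(values, window=3):
    """Damp random error by replacing each value with the median of its neighbours."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed

# Hypothetical sensor readings: None marks a missing value, 95.0 is a noisy spike.
readings = [10.0, None, 12.0, 11.0, 95.0, 13.0, None, 12.0]
filled = fill_missing(readings)
cleaned = smooth_by_median(filled)
```

The median is used instead of the mean here because a single noisy spike would otherwise pull the fill value far away from the typical reading.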

#2) Data Integration
When multiple heterogeneous data sources such as databases, data cubes or files are combined for analysis, this process is called data integration. This can help in improving the accuracy and speed of the data mining process.
Different databases use different naming conventions for variables, which causes redundancies in the combined data. Additional data cleaning can be performed to remove the redundancies and inconsistencies introduced by data integration without affecting the reliability of the data.
Data integration can be performed using data migration tools such as Oracle Data Service Integrator and Microsoft SQL.
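
To make the naming-convention problem concrete, here is a small sketch that combines two sources describing the same customers. The field names, the `FIELD_MAP` mapping, and the sample rows are hypothetical; real integration tools handle far more than renaming, so treat this only as an illustration of the idea.

```python
# Two hypothetical sources that describe the same customers with different field names.
crm_rows = [
    {"cust_id": 1, "full_name": "Ann Lee", "city": "Pune"},
    {"cust_id": 2, "full_name": "Raj Rao", "city": "Delhi"},
]
billing_rows = [
    {"CustomerID": 2, "Name": "Raj Rao", "balance": 120.5},
    {"CustomerID": 3, "Name": "Mia Chen", "balance": 0.0},
]

# Assumed mapping from the billing naming convention to the CRM one.
FIELD_MAP = {"CustomerID": "cust_id", "Name": "full_name"}

def rename_fields(row, field_map):
    """Return a copy of the row with its keys translated to the shared convention."""
    return {field_map.get(key, key): value for key, value in row.items()}

def integrate(*sources):
    """Merge rows from all sources into one record per customer id."""
    merged = {}
    for source in sources:
        for row in source:
            merged.setdefault(row["cust_id"], {}).update(row)
    return [merged[key] for key in sorted(merged)]

unified = integrate(crm_rows, [rename_fields(r, FIELD_MAP) for r in billing_rows])
```

After renaming, customer 2 appears once with fields from both sources instead of twice under two different naming schemes.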

#3) Data Reduction
This technique is applied to obtain the data relevant for analysis from the full collection. The reduced representation is much smaller in volume while maintaining the integrity of the original data. Data reduction is performed using methods such as Naive Bayes, decision trees, neural networks, etc.

Some strategies of data reduction are:
- Dimensionality Reduction: Reducing the number of attributes in the dataset.
- Numerosity Reduction: Replacing the original data volume by smaller forms of data representation.
- Data Compression: Compressed representation of the original data.
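
The three strategies can be sketched with the Python standard library. This is only an illustration under assumptions: the rows are made up, dropping zero-variance attributes stands in for dimensionality reduction, random sampling stands in for numerosity reduction, and `zlib` stands in for data compression.

```python
import json
import random
import zlib
from statistics import pvariance

def drop_low_variance(rows, threshold=0.0):
    """Dimensionality reduction: keep only attributes whose variance exceeds the threshold."""
    keep = [key for key in rows[0] if pvariance([r[key] for r in rows]) > threshold]
    return [{key: r[key] for key in keep} for r in rows]

def sample_rows(rows, n, seed=0):
    """Numerosity reduction: represent the data by a random sample of n rows."""
    return random.Random(seed).sample(rows, n)

# Hypothetical dataset: country_code is constant, so it carries no information.
rows = [{"age": 20 + i % 30, "country_code": 1, "spend": 10.0 * i} for i in range(100)]

reduced = drop_low_variance(rows)
sample = sample_rows(reduced, 10)

# Data compression: a lossless compressed representation of the original rows.
raw = json.dumps(rows).encode("utf-8")
packed = zlib.compress(raw)
```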

#4) Data Transformation
In this process, data is transformed into a form suitable for the data mining process. Data is consolidated so that the mining process is more efficient and the patterns are easier to understand. Data transformation involves data mapping and a code generation process.

Strategies for data transformation are:
- Smoothing: Removing noise from data using clustering, regression techniques, etc.
- Aggregation: Summary operations are applied to data.
- Normalization: Scaling of data to fall within a smaller range.
- Discretization: Raw values of numeric data are replaced by intervals, for example, ages replaced by age ranges.
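
Normalization and discretization are easy to show in a few lines. The sketch below uses min-max scaling and a set of age bands; the band boundaries and the sample ages are assumptions chosen for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def discretize_age(age):
    """Discretization: replace a raw age with an interval label (assumed bands)."""
    if age < 18:
        return "0-17"
    if age < 40:
        return "18-39"
    if age < 65:
        return "40-64"
    return "65+"

ages = [15, 22, 37, 48, 70]
normalized = min_max_normalize(ages)
bands = [discretize_age(a) for a in ages]
```

Here the youngest age maps to 0.0 and the oldest to 1.0, while `bands` holds one interval label per person.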

#5) Data Mining
Data mining is the process of identifying interesting patterns and knowledge in a large amount of data. In this step, intelligent methods are applied to extract data patterns. The data is represented in the form of patterns, and models are structured using classification and clustering techniques.

#6) Pattern Evaluation
This step involves identifying interesting patterns representing the knowledge based on interestingness measures. Data summarization and visualization methods are used to make the data understandable by the user.

#7) Knowledge Representation
Knowledge representation is a step where data visualization and knowledge representation tools are used to represent the mined data. Data is visualized in the form of reports, tables, etc.

Data Mining Techniques

The purpose of data mining is twofold:
i) the creation of predictive power using the current information for predicting future values,
ii) finding descriptive power for a better description of patterns in the present data.
So what data mining techniques do analysts use?

Classification
This type of data analysis is used to retrieve important and relevant information about data. It is considered one of the more complex data mining techniques. Information is classified into different classes. For instance, credit customers can be classified according to three risk categories: 'low,' 'medium,' or 'high.'
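
One simple way to sketch the credit-risk example is a k-nearest-neighbours classifier. The two features (debt ratio and late-payment rate), the labelled customers, and the choice of k-NN itself are assumptions for illustration; the text does not prescribe a particular classifier.

```python
import math
from collections import Counter

# Hypothetical labelled customers: (debt_ratio, late_payment_rate) -> risk class.
TRAINING = [
    ((0.10, 0.00), "low"), ((0.15, 0.05), "low"), ((0.20, 0.00), "low"),
    ((0.40, 0.20), "medium"), ((0.45, 0.25), "medium"), ((0.50, 0.15), "medium"),
    ((0.80, 0.60), "high"), ((0.85, 0.70), "high"), ((0.90, 0.50), "high"),
]

def knn_classify(train, query, k=3):
    """Assign the majority class among the k training points closest to the query."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

risk_a = knn_classify(TRAINING, (0.12, 0.02))
risk_b = knn_classify(TRAINING, (0.42, 0.20))
risk_c = knn_classify(TRAINING, (0.88, 0.65))
```

The classes are known in advance and come from labelled examples, which is what separates classification from clustering.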

Clustering
Cluster analysis differs from classification in that items are grouped according to their similarities rather than assigned to predefined classes. For instance, different groups of customers are clustered together to find similarities and dissimilarities in the information about them.
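
As a sketch of grouping by similarity, here is a tiny one-dimensional k-means. The monthly spending figures, the number of clusters, and the deterministic way the starting centroids are picked are assumptions made so the example stays small and repeatable.

```python
def kmeans_1d(values, k, iterations=20):
    """Group numbers into k clusters by repeatedly moving centroids to cluster means."""
    ordered = sorted(values)
    # Assumed initialisation: one starting centroid from the middle of each of k equal chunks.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

# Hypothetical monthly spending per customer.
spending = [12, 15, 14, 90, 95, 88, 300, 310]
centroids, clusters = kmeans_1d(spending, k=3)
```

No labels are supplied; the three spending groups emerge only from how close the values are to each other.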

Regression
This data mining technique is designed to identify and analyze the relationships between variables. It is used to estimate the likely value of one variable given the values of other variables, which is why it is also associated with predictive power.
Regression analysis is also used to forecast the future value of a specific entity (the relationship could be either linear or nonlinear). Regression techniques benefit from the power of neural networks, a method that emulates the neural signals in the brain. Ultimately, the goal of regression is to show the links between two pieces of information in one set.
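
For the linear case, the link between two variables can be sketched with ordinary least squares. The advertising and sales figures below are invented for the example, and a straight-line fit is only one of the regression forms the text mentions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend versus sales, in the same units.
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(ad_spend, sales)
forecast = slope * 6 + intercept  # predicted sales for a spend of 6
```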

Association
This data mining technique is used to find an association between two or more events or properties. It drills down to an underlying model in the database systems. A familiar example: when you buy a laptop, you are immediately offered a bag to go with it.
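
The laptop-and-bag example can be quantified with two standard association measures, support and confidence. The transactions below are made up for illustration.

```python
# Hypothetical shopping baskets.
transactions = [
    {"laptop", "bag", "mouse"},
    {"laptop", "bag"},
    {"laptop", "mouse"},
    {"phone", "case"},
    {"laptop", "bag", "case"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

laptop_bag_support = support({"laptop", "bag"}, transactions)
laptop_to_bag = confidence({"laptop"}, {"bag"}, transactions)
```

With this data, three of the four laptop baskets also contain a bag, so the rule "laptop implies bag" has a confidence of about 0.75.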

Outlier detection (Outlier analysis)
This is the process of identifying anomalies (outliers) in the data set. You need to be able to explain why these outliers occur amid the overall pattern. For example, among a predominantly male audience of buyers, you see a sudden peak in female buying activity.
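
A common first pass at spotting such peaks is a z-score check, sketched below. The daily order counts and the threshold of 2.5 standard deviations are assumptions; other rules (for example, ones based on quartiles) are equally valid.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.5):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical daily order counts with one sudden peak.
daily_orders = [20, 22, 19, 21, 23, 20, 85, 21, 22, 20]
outliers = zscore_outliers(daily_orders)
```

The check only flags the unusual day; explaining why it happened is still up to the analyst.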

Prediction
Prediction is considered to be an essential data mining technique. We all want to know the future value of our investments and to be protected from fraud while shopping online. Prediction is applied to forecast different types of data in the days to come. Analysis of previous events can help to make more or less accurate predictions about tomorrow.
You never know if a person will be honest two days from now, but based on their previous credit history, you can surmise that if they have been people of integrity so far, they will probably continue their honest dealings with the bank in the months to come. Do you remember receiving a call from a bank clerk asking whether you want your credit limit increased? Well, it is always pleasant to be considered a trustworthy person.

Sequential patterns
This type of data analysis seeks to find similar patterns, regularities, or transaction trends in data over a specified period. In sales, businesses can identify which items are bought together during a particular season of the year. Based on this, companies offer better deals to clients with an actual purchasing history.
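
A very small version of this idea counts how many customers bought one item before another. The purchase histories and the minimum support of 2 are invented for the sketch, and it only looks at ordered pairs, which is much simpler than a full sequential pattern mining algorithm.

```python
from collections import Counter
from itertools import combinations

def frequent_sequences(histories, min_support):
    """Count ordered pairs (earlier item, later item) and keep those seen in enough histories."""
    counts = Counter()
    for purchases in histories:
        seen = set()
        # combinations() keeps list order, so `first` was always bought before `later`.
        for first, later in combinations(purchases, 2):
            if first != later:
                seen.add((first, later))
        counts.update(seen)  # each history counts a pair at most once
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical purchase histories, each in chronological order.
histories = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "stove"],
    ["sleeping bag", "tent", "stove"],
    ["boots"],
]
patterns = frequent_sequences(histories, min_support=2)
```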

Decision trees
This type of data mining tool is used quite often because it is the simplest to understand. At the root of a decision tree there is a simple question with several possible answers. Each answer leads to a further question or to a final answer to the central question. For example, we can attempt to answer the following question: Should we play golf today?
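
The golf question can be sketched as a hand-built tree that is walked from the root to a leaf. The questions and branches below (outlook, humidity, wind) are assumptions chosen for illustration; in practice the tree would be learned from historical data rather than written by hand.

```python
# A hypothetical, hand-written tree: inner nodes ask a question, leaves hold the answer.
GOLF_TREE = {
    "question": "outlook",
    "branches": {
        "sunny": {"question": "humidity", "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"question": "windy", "branches": {True: "no", False: "yes"}},
    },
}

def decide(tree, sample):
    """Follow the branch matching each answer until a leaf (a plain string) is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][sample[node["question"]]]
    return node

today = {"outlook": "sunny", "humidity": "normal", "windy": False}
play_today = decide(GOLF_TREE, today)
```

Each path from the root to a leaf reads as a plain rule, which is why decision trees are considered easy to understand.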

Neural networks
A neural network is a specific type of machine learning model that is often used with AI and deep learning. Neural networks are so named because their layers resemble the way neurons work in the human brain, and they are among the more accurate machine learning models used today.
Although a neural network can be a powerful tool in data mining, organizations should take caution when using it: some neural network models are incredibly complex, which makes it difficult to understand how the network arrived at an output.

Long-term memory processing
Long-term memory processing refers to the ability to analyze data over extended periods of time. The historic data stored in data warehouses is useful for this purpose. When an organization can perform analytics over an extended period of time, it is able to identify patterns that might otherwise be too subtle to detect. For example, by analyzing attrition over a period of several years, an organization may find subtle clues that could lead to reducing churn in finance.

Looking forward to seeing you soon. Till then, keep learning!
