What is the difference between Data Analytics, Data Analysis, Data Mining, Data Science, Machine Learning, and Big Data?

75 Answers
Debidatta Dwibedi, works at Carnegie Mellon University

The following graphic nicely summarizes what is involved in data science.

(from Data science)
Focus on three bubbles here: scientific method, math and statistics. These are aspects of data science that are closest to machine learning.

If I had to summarize machine learning in one sentence, I would say it is a collection of algorithms and techniques used to design systems that learn from data. But the algorithms of ML are very general, in the sense that they usually have a strong mathematical and statistical basis that does not take into account domain knowledge or data pre-processing. That is the key difference.

If you talk to a data scientist, they will tell you how, after acquiring the data, they cleaned it (data cleansing), transformed it into a useful form, and then used domain knowledge to decide which statistical method or ML algorithm would best solve the problem they are tackling. This process may require a certain amount of 'hacking' skill to speed up the work of getting meaningful data on which processing can be carried out. But a data scientist's job does not end there. Visualization is becoming a very important aspect. Representing data in a form that mere mortals can both understand and draw valuable insights from is as much a science as an art.

So a data scientist needs to know first how to decide which machine learning method will best help them, and then how to apply it. They do not necessarily need to know how that method works, although knowing that is always an asset.

There is a nice bit about the difference between ML and data mining on Machine learning:

These two terms are commonly confused, as they often employ the same methods and overlap significantly. They can be roughly defined as follows:

  • Machine learning focuses on prediction, based on known properties learned from the training data.
  • Data mining (which is the analysis step of Knowledge Discovery in Databases) focuses on the discovery of (previously) unknown properties on the data.
Gam Dias, Product Manager on NLP & Semantic Web products

Lots of good answers already - however the question is such that I think perhaps a business rather than technical description might be warranted.

First things first: doing stuff with data, whatever you want to call it, is going to require some investment. Fortunately the entry price has come right down and you can do pretty much all of this at home with a reasonably priced machine and online access to a host of free or purchased resources. Commercial organizations have realized that there is huge value hiding in the data and are employing the techniques you ask about to realize that value. Ultimately what all of this work produces is insights, things that you may not have known otherwise. Insights are the items of information that cause a change in behavior.

Let's begin with a real-world example, looking at a farm that is growing strawberries (here are some simple backgrounders: The Secret Life Of California's World-Class Strawberries, This High-Tech Greenhouse Yields Winter Strawberries, and Growing Strawberry Plants Commercially).

What would a farmer need to consider if they are growing strawberries? The farmer will be selecting the types of plants, fertilizers and pesticides, and also looking at machinery, transportation, storage and labor. Weather, water supply and pestilence are also likely concerns. Ultimately the farmer is also watching the market price, so supply and demand and the timing of the harvest (which will determine the dates to prepare the soil, to plant, to thin out the crop, to nurture and to harvest) are also concerns.

So the objective of all the data work is to create insights that will help the farmer make a set of decisions that will optimize their commercial growing operation.

Let's think about the data available to the farmer, here's a simplified breakdown:

1. Historic weather patterns

2. Plant breeding data and productivity for each strain

3. Fertilizer specifications

4. Pesticide specifications

5. Soil productivity data

6. Pest cycle data

7. Machinery cost, reliability and fault data

8. Water supply data

9. Historic supply and demand data

10. Market spot price and futures data

Now to explain the definitions in context (with some made-up insights, so if you're a strawberry farmer, this might not be the best set of examples):

Big Data: Using all of the data available to provide new insights to a problem. Traditionally the farmer may have made their decisions based on only a few of the available data points, for example selecting the breeds of strawberries that had the highest yield for their soil and water table. The Big Data approach may show that the market price slightly earlier in the season is a lot higher and that local weather patterns are such that a new breed variation of strawberry would do well. So the insight would be that switching to a new breed would allow the farmer to take advantage of higher prices earlier in the season, and the cost of labor, storage and transportation at that time would be slightly lower. There's another thing you might hear in the Big Data marketing hype: Volume, Velocity, Variety, Veracity. So there is a huge amount of data here, a lot of data is being generated each minute (weather patterns, stock prices and machine sensors), and the data is liable to change at any time (e.g. a new source of social media data that is a great predictor of consumer demand).

Data Analysis: Analysis is really a heuristic activity, where scanning through all the data the analyst gains some insight. Looking at a single data set - say the one on machine reliability, I might be able to say that certain machines are expensive to purchase but have fewer general operational faults leading to less downtime and lower maintenance costs. There are other cheaper machines that are more costly in the long run. The farmer might not have enough working capital to afford the expensive machine and they would have to decide whether to purchase the cheaper machine and incur the additional maintenance costs and risk the downtime or to borrow money with the interest payment, to afford the expensive machine.
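The machine purchase trade-off can be made concrete with a little arithmetic. The sketch below is only an illustration: every figure is made up, the function name total_cost is hypothetical, and the loan is modelled as simple interest on the full purchase price, which is a simplifying assumption.

```python
def total_cost(purchase_price, annual_maintenance, annual_downtime_cost,
               years, loan_rate_pct=0):
    """Total cost of owning a machine over `years`.

    Assumes simple interest on the whole purchase price when the machine
    is bought with borrowed money (loan_rate_pct > 0).
    """
    interest = purchase_price * loan_rate_pct * years / 100
    running_costs = years * (annual_maintenance + annual_downtime_cost)
    return purchase_price + interest + running_costs


YEARS = 5  # hypothetical planning horizon

# Cheap machine: low price, high maintenance and downtime (made-up figures).
cheap = total_cost(20_000, 4_000, 3_000, YEARS)

# Expensive machine: paid in cash, or financed at a hypothetical 5% per year.
expensive_cash = total_cost(40_000, 1_000, 500, YEARS)
expensive_loan = total_cost(40_000, 1_000, 500, YEARS, loan_rate_pct=5)

for name, cost in [("cheap", cheap),
                   ("expensive (cash)", expensive_cash),
                   ("expensive (loan)", expensive_loan)]:
    print(f"{name}: {cost:,.0f}")
```

With these particular numbers the expensive machine wins if the farmer can pay cash, but loses once the interest is added, which is exactly the kind of judgment call the analysis is meant to inform.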

Data Analytics: Analytics is about applying a mechanical or algorithmic process to derive the insights, for example running through various data sets looking for meaningful correlations between them. Looking at the weather data and pest data, we see that a certain type of fungus is highly correlated with the humidity level reaching a certain point. The future weather projections for the next few months (during planting season) predict a low humidity level and therefore a lowered risk of that fungus. For the farmer this might mean being able to plant a certain type of strawberry with a higher yield and a higher market price, without needing to purchase a certain fungicide.
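As a rough illustration of what "looking for correlations" means mechanically, here is a minimal sketch that computes the Pearson correlation coefficient between two series. The humidity and fungus readings are invented for the example, and the helper name pearson is just a label chosen here.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up weekly observations: relative humidity (%) and fungus outbreaks.
humidity = [55, 60, 65, 70, 75, 80, 85, 90]
fungus_cases = [1, 0, 2, 3, 5, 8, 12, 15]

r = pearson(humidity, fungus_cases)
print(f"correlation between humidity and fungus cases: {r:.2f}")
```

A value close to 1 says the two series rise together. It does not by itself say that humidity causes the fungus, so the farmer would still want domain knowledge before acting on it.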

Data Mining: this term was most widely used in the late 90's and early 00's when a business consolidated all of its data into an Enterprise Data Warehouse. All of that data was brought together to discover previously unknown trends, anomalies and correlations such as the famed 'beer and diapers' correlation (Diapers, Beer, and data science in retail). Going back to the strawberries, assuming that our farmer was a large conglomerate like Cargill, then all of the data above would be sitting ready for analysis in the warehouse so questions such as this could be answered with relative ease: What is the best time to harvest strawberries to get the highest market price? Given certain soil conditions and rainfall patterns at a location, what are the highest yielding strawberry breeds that we should grow?

Data Science: a combination of mathematics, statistics, programming, the context of the problem being solved, ingenious ways of capturing data that may not be captured right now, plus the ability to look at things 'differently' (like this: Why UPS Trucks Don't Turn Left) and of course the significant and necessary activity of cleansing, preparing and aligning the data. So in the strawberry industry we're going to be building some models that tell us when the optimal time is to sell, which gives us the time to harvest, which gives us a combination of breeds to plant at various times to maximize overall yield. We might be short of consumer demand data, so maybe we figure out that when strawberry recipes are published online or on television, demand goes up, and that Tweets and Instagram or Facebook likes provide an indicator of demand. Then we need to align demand data with market price to give us the final insights, and maybe to create a way to drive up demand by promoting certain social media activity.

Machine Learning: this is one of the tools used by data scientists, where a model is created that mathematically describes a certain process and its outcomes. The model then provides recommendations, monitors the results once those recommendations are implemented, and uses the results to improve itself. When Google provides a set of results for the search term "strawberry", people might click on the first 3 entries and ignore the 4th one. Over time, that 4th entry will not appear as high in the results because the machine is learning what users are responding to. Applied to the farm, when the system creates recommendations for which breeds of strawberry to plant, and collects the results on the yields for each berry under various soil and weather conditions, machine learning will allow it to build a model that can make a better set of recommendations for the next growing season.

I am adding this next one because there seem to be some popular misconceptions as to what this means. My belief is that 'predictive' is much overused and hyped.
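The feedback loop in the search example can be sketched in a few lines. This is a toy, not a description of how any real search engine ranks results: the class ClickRanker, the page names and the smoothed click-through score are all assumptions made for illustration.

```python
from collections import defaultdict


class ClickRanker:
    """Toy ranker that reorders results by observed click-through rate."""

    def __init__(self):
        self.shown = defaultdict(int)    # times each result was displayed
        self.clicked = defaultdict(int)  # times each result was clicked

    def record(self, result, was_clicked):
        """Feed one observation back into the model."""
        self.shown[result] += 1
        if was_clicked:
            self.clicked[result] += 1

    def score(self, result):
        # Smoothed click-through rate, so unseen results start at 0.5.
        return (self.clicked[result] + 1) / (self.shown[result] + 2)

    def rank(self, results):
        # sorted() is stable, so ties keep their original order.
        return sorted(results, key=self.score, reverse=True)


results = ["page_a", "page_b", "page_c", "page_d", "page_e"]
ranker = ClickRanker()

# Hypothetical behaviour: users click every result except the 4th one.
for _ in range(10):
    for page in results:
        ranker.record(page, was_clicked=(page != "page_d"))

ranked = ranker.rank(results)
print(ranked)
```

After the simulated feedback, the ignored entry drops below the one that originally sat beneath it. The same record-then-rescore loop is what the farm system would run each season with yields in place of clicks.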

Predictive Analytics: Creating a quantitative model that allows an outcome to be predicted based on as much historical information as can be gathered. In this input data, there will be multiple variables to consider, some of which may be significant and others less significant in determining the outcome. The predictive model determines what signals in the data can be used to make an accurate prediction. The models become useful if there are certain variables that can be changed to increase the chances of a desired outcome.

So what might be useful for our strawberry farmer to want to predict? Let's go back to the commercial strawberry grower who is selling product to grocery retailers and food manufacturers. The supply deals are in the tens and hundreds of thousands of dollars and there is a large salesforce. How can they predict whether a deal is likely to close or not? To begin with, they could look at the history of that company and the quantities and frequencies of produce purchased over time, the most recent purchases being stronger indicators. They could then look at the salesperson's history of selling that product to those types of companies. Those are the obvious indicators. Less obvious ones would be which competing growers are also bidding for the contract (perhaps certain competitors always win because they always undercut), how many visits the rep has paid to the prospective client over the year, and how many emails and phone calls there have been. How many complaints has the prospective client made regarding product quality? Have all our deliveries been the correct quantity, delivered on time?

All of these variables may contribute to the next deal being closed. If there is enough historical data, we can build a model that will predict whether a deal will close or not. We can use a sample of the historic data, set aside for the purpose, to test whether the model works. If we are confident, then we can use it to predict the next deal.
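To make the "build a model, then test it on data set aside" step concrete, here is a minimal sketch using logistic regression trained by plain gradient descent. Everything in it is an assumption for illustration: the deal history is synthetic, the three features (visits, complaints, past orders) and their made-up influence on the outcome are invented, and logistic regression is simply one common choice of model, not necessarily what a real grower would use.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_deals(n, rng):
    """Generate synthetic (features, closed) pairs; all numbers are made up."""
    deals = []
    for _ in range(n):
        visits = rng.randint(0, 10)
        complaints = rng.randint(0, 5)
        past_orders = rng.randint(0, 20)
        signal = 0.4 * visits - 0.8 * complaints + 0.1 * past_orders - 1.5
        closed = 1 if signal + rng.gauss(0, 0.5) > 0 else 0
        # Scale each feature to the range 0..1 so training behaves well.
        deals.append(([visits / 10, complaints / 5, past_orders / 20], closed))
    return deals


def train(deals, epochs=1000, lr=1.0):
    """Fit logistic regression weights with batch gradient descent."""
    weights = [0.0] * len(deals[0][0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in deals:
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(deals) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(deals)
    return weights, bias


def predict(model, x):
    weights, bias = model
    return 1 if sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5 else 0


rng = random.Random(0)                            # fixed seed for repeatability
deals = make_deals(200, rng)
train_set, test_set = deals[:150], deals[150:]    # hold out 50 deals for testing

model = train(train_set)
accuracy = sum(predict(model, x) == y for x, y in test_set) / len(test_set)
print(f"held-out accuracy: {accuracy:.2f}")
print("learned weights (visits, complaints, past orders):", model[0])
```

The held-out accuracy is the confidence check described above, and the signs of the learned weights hint at which variables the sales team might try to change.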

Aditya Singh, worked at Impulse Media
Usually data skills are divided into two broad categories -

1. Engineering Skills - Setting up database systems, writing queries, integrating with applications etc.

2. Analysis Skills - These can be very wide ranging, from mathematical statistics, multivariate applied statistics and matrix algebra to data mining and machine learning.

A lot of Data Engineers and Architects have the same skillsets (#1) but different work profiles. "Data Scientists" and "Data Analysts" have the same mission in an organization but usually have different skills (reasons below). Note that some organizations use both terms, scientists and analysts, interchangeably, which might add to the confusion. We will go by how the four job profiles were invented and how *most* people use them.



Data Architect

Large enterprises generate huge amounts of data from many different sources, which can be grouped into two:

1. Internal Sources  - Existing systems (CRM, HRMS, Web Analytics etc.)
2. External Sources - Stock market feeds etc.

A Data Architect is someone who can understand all the sources of data and work out a plan for integrating, centralizing and maintaining all the data. They must be able to understand how the data relates to the current operations and the effects that any future process changes will have on the use of data in the organization. They need to have an end-to-end vision, and to see how a logical design will translate into one or more physical databases, and how the data will flow through the successive stages involved.

This may include things like designing relational databases, developing strategies for data acquisition and archive recovery, implementing a database, and cleaning and maintaining the database by removing old data.



Data Engineer

Data engineers are hard-core engineers who know the internals of database software. A data engineer compiles and installs database systems, writes complex queries, scales them to multiple machines, ensures backups and puts disaster recovery systems in place. They usually have deep knowledge and expertise in one or more database systems (SQL / NoSQL).



Data Analyst

The primary tasks of a data analyst are the compilation and analysis of numerical information. They usually have a computer science and business degree. They get analytical insights that make sense for the organization out of all the data it has (in database software or just Excel sheets), and compile them into decent reports so that non-technical folks can understand them and decide their course of action.

An analyst usually works to get analytical insights out of data. This job profile does not (usually) include working with statistics and has nothing to do with "BigData" in particular.

A decent mid-sized organization can have many analysts. For example, a sales analyst may look at all the sales in the past quarter and figure out a proper sales strategy (where to sell and whom to sell to in order to maximize profits). They will then communicate the report to the leadership.



Data Scientist

"Data Scientist" is a very recent phenomenon and is usually associated with BigData. The overall mission of a scientist is the same as an analyst's, but once the volume and velocity of data cross a certain level, it requires really sophisticated skills to get those insights out.

A "Data Scientist" usually has many overlapping skills: database engineering, handling BigData systems like Hadoop or Netezza, knowledge of Python/R, and knowledge of statistics / data mining.

Whereas a traditional data analyst may look only at data from a single source (CRM etc.) a data scientist will most likely explore and examine data from multiple disparate sources. The data scientist will sift through all incoming data with the goal of discovering a previously hidden insight, which in turn can solve a business problem. Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization.

More about data scientists: A Arun Prasath's answer to What is a data scientist?
Jordan Frank, Datamaker at Facebook
The way I see it, machine learning is concerned with algorithms whose performance at some task improves as they gain experience at that task, while data mining is concerned with analysing data for the purpose of discovering unforeseen patterns or properties.

So the similarities are obvious, they both look at data, and hope to extract something of value from it. As I see it, the main difference is whether the goal is to reproduce known knowledge (I know that some of these pictures are cats, and some are dogs, now can some algorithm learn that?), or if the goal is to discover unknown knowledge (is there any interesting structure in this data set?).  The two are, unsurprisingly, intertwined, as many of the properties or structure one may be searching for in data mining can be identified by machine learning algorithms. For instance, in data mining, one might be interested in determining if clusters of a certain form appear in the data, and could use a machine learning algorithm like k-means. K-means is a learning algorithm, in that if data has a known structure, it can learn it (under specific conditions, blah blah blah).
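For readers who have not seen k-means, here is a minimal sketch of the standard iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). The sample points are made up, and starting from the first k points is a deliberately naive initialization chosen to keep the example deterministic; practical implementations usually pick starting centroids more carefully.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm with a naive start: the first k points."""
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters


# Two made-up groups of 2-D points, one near (1, 1) and one near (9, 8).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (2.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0), (9.5, 8.0)]

centroids, clusters = kmeans(points, k=2)
print(centroids)
```

Nobody told the algorithm which points belong together. It recovers the two groups from the data alone, which is the exploratory, "is there any structure here?" use described above.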

So data mining is exploratory, machine learning is focused on solving specific tasks well. That's my take on it, anyway.
Toshi Takeuchi, Data Science Geek, Ex community TA for Coursera's Machine Learning course
Let's use the type of data itself to draw some comparison.

  • Do you deal with aggregated data?
  • Does your data include both "good" and "bad" samples?

To do data mining, or machine learning, you need non-aggregated data that contains individual samples and those should include both positive and negative cases.

Let's say you want to detect fraud in a financial dataset. You need individual transaction records that show examples of both legitimate transactions and fraud. If you want to score lead quality, you need to retain both the leads that resulted in sales and those that didn't. This is because we need both kinds of example to learn the difference.
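A tiny sketch shows why both kinds of example are needed. It learns a single cut-off on the transaction amount from labelled records, and refuses to learn anything when only one class is present. The amounts, labels and function name best_threshold are all hypothetical, and a one-feature threshold is far simpler than any real fraud model.

```python
def best_threshold(records):
    """Learn a cut-off on amount from (amount, is_fraud) records.

    Predicts fraud when amount > threshold and returns the threshold with
    the highest accuracy on the given records.
    """
    labels = {is_fraud for _, is_fraud in records}
    if labels != {True, False}:
        raise ValueError("need both fraudulent and legitimate examples")

    amounts = sorted({amount for amount, _ in records})
    # Candidate cut-offs sit halfway between neighbouring amounts.
    candidates = [(lo + hi) / 2 for lo, hi in zip(amounts, amounts[1:])]
    if not candidates:
        raise ValueError("amounts do not vary, nothing to learn from")

    def accuracy(threshold):
        hits = sum((amount > threshold) == is_fraud for amount, is_fraud in records)
        return hits / len(records)

    return max(candidates, key=accuracy)


# Made-up individual transaction records: (amount, is_fraud).
legitimate = [(20, False), (35, False), (50, False), (80, False)]
fraudulent = [(900, True), (1200, True), (950, True)]

threshold = best_threshold(legitimate + fraudulent)
print("flag transactions above:", threshold)

# With only the "good" records kept, there is no boundary to learn.
try:
    best_threshold(legitimate)
except ValueError as err:
    print("cannot learn:", err)
```

An aggregated figure such as the day's total transaction value would be even less useful, because it contains no individual examples at all.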

However, in traditional IT systems, we tend to store aggregated data with only "good" outcomes, because people tend not to perceive value in storing "bad" data. People say: "What is the point of spending money on storing records of fraudulent transactions or leads that didn't become customers?"

Data analytics, data analysis, and data science are broader terms and they may deal with all kinds of data, including aggregated data.

Big data starts with raw unaggregated data. Often it is used to produce aggregated summaries, but it can also be used for data mining and machine learning.