What is Data Mining (Data Science)?
`We are drowning in information but starved for knowledge.'
John Naisbitt
Data mining (now rebranded as data science) is the non-trivial process of identifying valid,
novel, potentially useful, and ultimately understandable patterns, structures, models,
trends, or relationships in data to enable data-driven decision making.
What is meant by these terms?
- `Non-trivial': it is not a
straightforward computation of predefined quantities like computing the
average value of a set of numbers.
- `Valid': the patterns hold in general, i.e. they remain valid on
new data in the face of uncertainty.
- `Novel': the patterns were not known beforehand.
- `Potentially useful': the patterns lead to some benefit for the user.
- `Understandable': the patterns are interpretable and comprehensible -
if not immediately, then after some postprocessing.
Is data mining (data science) `statistical déjà vu'?
Statistics is the science of learning from data (or making sense out of
data), and of measuring, controlling and
communicating uncertainty.
If you want to know more about what statistics is, please click
here.
Like statistical thinking and statistics, data mining (data science) is neither just modelling and
prediction, nor a product that can be bought, but a
whole iterative problem-solving cycle that must
be mastered through interdisciplinary and transdisciplinary team effort.
Data mining (data science) projects are not simple. They usually start with high
expectations but may end in failure if the engaged team is not guided by a
clear methodological framework. We follow a methodology called
CRISP-DM (`CRoss Industry Standard Process for Data
Mining'). If you want to know
more about CRISP-DM, please click
here.
`Coming together is a beginning. Keeping together is progress. Working together is success.'
Henry Ford
What distinguishes data mining (data science) from statistics?
Statistics traditionally is concerned
with analysing primary (e.g. experimental)
data that have been collected to explain and check the validity of specific existing ideas
(hypotheses). As such, statistics is
`primary data analysis', top-down (explanatory and confirmatory) analysis, or
`idea (hypothesis) evaluation or testing'.
Data mining (data science), on the other hand, typically is concerned with analysing
secondary (e.g. observational or `found') data that have been collected for other
reasons (and not `under control' of the investigator). These
data are used to create new ideas (hypotheses).
As such, data mining (data science) is
`secondary data analysis', bottom-up (exploratory and predictive) analysis, or
`idea (hypothesis) generation' (or `knowledge discovery').
The two approaches to `learning from data' or `turning
data into knowledge' are complementary and should proceed side by
side in order to enable proper data-driven decision making.
- The information obtained from a bottom-up analysis, which identifies
important relations and tendencies, cannot explain why these discoveries are useful
or to what extent they are valid.
The confirmatory tools of top-down analysis
need to be used to confirm the discoveries and evaluate the quality of decisions based
on those discoveries.
- Performing a top-down analysis, we think
up possible explanations for the
observed behaviour and let those hypotheses dictate the data to be
analysed.
Then, performing a bottom-up analysis, we let the data suggest new
hypotheses (ideas) to test.
We have already applied this
complementary view successfully several times
in client projects.
For example, when historical data were available, the idea to be
generated from a bottom-up analysis (e.g. using a mixture of so-called `ensemble techniques')
was: which are the most important factors (from a predictive point of
view, among a `large' list of candidate factors) that impact a given
process output (or a given KPI, `Key Performance Indicator')? Mixed with subject-matter knowledge, this
idea resulted in a list of a `small' number of factors (i.e. `the critical ones'). The
confirmatory tools of top-down analysis (statistical `Design Of Experiments',
DOE, in most cases) were then used to confirm and evaluate the
idea. By doing this, new data were collected (about `all' factors) and
a bottom-up analysis could be applied again, letting the data suggest
new ideas to test.
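As a minimal illustration of such a bottom-up step (a sketch, not the actual client analysis or its data), the code below ranks a `large' list of candidate factors by predictive importance. It uses a small bagging ensemble of least-squares models together with permutation importance; all data, factor counts, and parameters are synthetic assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic `historical' data: 10 candidate factors, but only
# factors 0 and 3 actually drive the process output y (assumption).
n, p = 500, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

def fit_bagged(X, y, n_models=25):
    """Fit a simple bagging ensemble of least-squares models,
    each on a bootstrap resample of the data."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))
        coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        models.append(coef)
    return np.array(models)

def predict(models, X):
    # Ensemble prediction: average the fitted coefficients.
    return X @ models.mean(axis=0)

def permutation_importance(models, X, y):
    """Importance of factor j = increase in mean squared error
    when column j is randomly shuffled (breaking its link to y)."""
    base = np.mean((y - predict(models, X)) ** 2)
    imps = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        imps[j] = np.mean((y - predict(models, Xp)) ** 2) - base
    return imps

models = fit_bagged(X, y)
imp = permutation_importance(models, X, y)
ranking = np.argsort(imp)[::-1]
print("Most important factors:", ranking[:2])  # recovers factors 0 and 3
```

The `small' list of critical factors produced this way would then feed the confirmatory top-down step (e.g. a designed experiment varying only those factors).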
Want to know more about the relation between data mining (data science) and statistics?
Check out some additional papers in our `Publications' section.
Interested in our data mining (data science) services?
Are you drowning in uncertainty and starving for
knowledge? Interested in getting Statooed?
Have a question about our data mining (data science) services? Contact us
and let us help you.