University of Minnesota
Institute of Technology


Electrical and Computer Engineering

Predictive Data Modeling and The Nature of Scientific Discovery

Vladimir Cherkassky
University of Minnesota
Electrical and Computer Engineering


Scientific discovery involves interaction between two major components:
     • Facts, or observations of the Real World (or Nature);
     • Scientific theories (models), i.e., mental constructs that explain the observed data.
In classical science, the primary role belongs to a well-defined scientific hypothesis, which
drives data collection and generation. Experimental data is then used simply to confirm or
refute a scientific theory.

Since the late twentieth century, the balance between facts and models in scientific research has
shifted substantially, due to the growing use of digital technology for data collection and recording.
Nowadays, there is an abundance of available data describing physical, biological and
social systems. Several new technologies, such as machine learning and data mining,
hold the promise of ‘discovering’ new knowledge hidden in the sea of data. Much recent
research in the life sciences is data-driven: for example, researchers try to establish ‘associations’
between certain genetic variables and a disease. This is completely different from the
classical approach to scientific discovery. Whereas many machine learning and statistical
methods can easily detect correlations present in empirical data, it is not clear whether
such dependencies constitute new biological knowledge. This is known as the problem
of demarcation in the philosophy of science, i.e. differentiating between true scientific
theories and metaphysical theories (beliefs).
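
As a rough illustration of how easily correlations can be detected, the following Python sketch correlates a purely random outcome with many purely random variables. It is not taken from the talk: the function names, the sample sizes and the choice of Gaussian noise are all assumptions made for illustration only.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def max_spurious_correlation(n_samples, n_features, seed=0):
    """Largest |correlation| between a noise outcome and noise variables.

    Every value is drawn independently, so no real dependency exists;
    any correlation found here is an artifact of searching many variables.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


if __name__ == "__main__":
    # Hypothetical setting: 20 subjects, 1000 unrelated variables.
    print(max_spurious_correlation(n_samples=20, n_features=1000))
```

With few samples and many candidate variables, the strongest correlation found this way is typically sizable even though nothing in the data is related, which is one reason why detecting a correlation is not by itself evidence of new knowledge.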

Knowledge that can be extracted from empirical data is statistical in nature, as opposed
to deterministic first-principle knowledge in classical science. Modern science is mainly
about such empirical knowledge, yet there seems to be no clear demarcation between
true empirical knowledge and beliefs (supported by empirical data).

My talk will discuss methodological issues important for predictive data modeling, namely:
     • first-principle knowledge, empirical knowledge and beliefs;
     • understanding of uncertainty and risk;
     • predictive data modeling;
     • interpretation of predictive models.
These methodological issues are closely related to philosophical ideas, dating back to
Plato and Aristotle. The main points will be illustrated by specific examples from an
ongoing project on prediction of transplant-related mortality for bone-and-marrow transplant
patients, in collaboration with the University of Minnesota Medical School and the Mayo Clinic.