The concept of big data, commonly characterized by volume, variety, velocity, and veracity, goes far beyond the data type and includes aspects of data analysis, such as hypothesis generation rather than hypothesis testing. Big data focuses on the temporal stability of an association rather than on causal relationships, and assumptions about the underlying probability distribution are frequently not required. Medical big data, as material to be analyzed, have various features that are distinct not only from big data of other disciplines but also from traditional clinical epidemiology. Big data technology has many areas of application in healthcare, such as predictive modeling and clinical decision support, disease or safety surveillance, public health, and research. Big data analytics frequently exploits analytic methods developed in data mining, including classification, clustering, and regression. Medical big data analyses are complicated by many technical issues, such as missing values, the curse of dimensionality, and bias control, and they share the inherent limitations of observational studies, namely the inability to test causality because of residual confounding and reverse causation. Recently, propensity score analysis and instrumental variable analysis have been introduced to overcome these limitations, and they have accomplished a great deal. Many challenges, such as the absence of evidence of practical benefits of big data, methodological issues including legal and ethical issues, and clinical integration and utility issues, must be overcome to realize the promise of medical big data as the fuel of a continuous learning healthcare system that will improve patient outcomes and reduce waste in areas including nephrology.

The recent rapid increase in the generation of digital data and the rapid development of computational science enable us to extract new insights from massive data sets, known as big data, in various disciplines, including internet business and finance. In healthcare, the discovery of new actionable insights has not been as common, although several success stories have been published in the media and in academic journals. This delayed progress of big data technology in the healthcare sector is somewhat surprising, considering an earlier prediction that the application of big data technology was inevitable and that healthcare would be one of the sectors expected to benefit the most from big data technology [

The increasing gap between healthcare costs and outcomes is one of the most important issues, and many efforts to close this gap are under way in many developed countries. The gap has been attributed to poor management of insights from research, poor usage of available evidence, and poor capture of care experience, all of which lead to missed opportunities, wasted resources, and potential harm to patients. It has been suggested that the gap could be overcome by the development of a “continuous learning healthcare system (

In this review, we discuss what big data is, what is special about medical big data, what medical big data are for, how medical big data can be analyzed, and what challenges medical big data face.

Big data are data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage them and to extract value and hidden knowledge from them [

The complexity of healthcare results from the diversity of health-related ailments and their co-morbidities; the heterogeneity of treatments and outcomes; and the subtle intricacies of study designs, analytical methods and approaches for collecting, processing, and interpreting healthcare data [

Medical big data have several distinctive features that are different from big data from other disciplines. Medical big data are frequently hard to access and most investigators in the medical arena are hesitant to practice open data science for reasons such as the risk of data misuse by other parties and lack of data-sharing incentives [

A big data project involves making sense out of all accumulated data on as many variables as possible due to increasing availability and decreasing expense of computing technology [

Medical big data can be broadly classified into three common forms, such as large

It has been pointed out that the pressing need to improve healthcare quality and patient outcomes, increasing data availability, and increasing analytic capabilities are the three drivers of the big data era in healthcare, and that the potential of big data analytics lies in improving the value of healthcare by improving outcomes and reducing wasted resources [

The potential value of medical big data has been demonstrated in: 1) the delivery of personalized medicine; 2) the use of clinical decision support systems such as automated analysis of medical images and the mining of medical literature; 3) tailoring diagnostic and treatment decisions and educational messages to support desired patient behaviors using mobile devices; 4) big data-driven population health analyses revealing patterns that might have been missed if smaller batches of uniformly formatted data had been analyzed instead; and 5) fraud detection and prevention [

Big data analysis exploits various algorithms of data mining, which can be defined as the automatic extraction of useful, often previously unknown information from large databases or datasets using advanced search techniques and algorithms to discover patterns and correlations in large pre-existing databases [

Data mining algorithms are categorized as supervised, unsupervised, or semi-supervised learning. Supervised learning predicts a known output, or target, using a training set of already classified data to draw inferences about, or to classify, new test data. In unsupervised learning, there is no output to predict, so analysts try to find naturally occurring patterns or groupings within unlabeled data. Semi-supervised learning balances performance and precision by using a small set of labeled or annotated data together with a much larger unlabeled data collection [
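The difference between supervised and unsupervised learning can be sketched with a toy one-dimensional example. This is only an illustration under stated assumptions: the values, the labels "low" and "high", and the function names `fit_centroids`, `predict`, and `two_means_1d` are hypothetical and do not come from any study discussed here. A nearest-centroid classifier stands in for supervised learning and a two-cluster k-means for unsupervised learning.

```python
from statistics import fmean

# Hypothetical measurements; the labels are used only by the supervised step.
VALUES = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
LABELS = ["low", "low", "low", "high", "high", "high"]


def fit_centroids(values, labels):
    """Supervised: labels are known, so average the values of each class."""
    return {
        c: fmean(v for v, lab in zip(values, labels) if lab == c)
        for c in set(labels)
    }


def predict(centroids, value):
    """Classify a new value by its nearest class centroid."""
    return min(centroids, key=lambda c: abs(value - centroids[c]))


def two_means_1d(values, iters=20):
    """Unsupervised: no labels; alternate assignment and centre updates."""
    centers = [min(values), max(values)]
    for _ in range(iters):
        groups = ([], [])
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Keep the old centre if a cluster happens to be empty.
        centers = [fmean(g) if g else c for g, c in zip(groups, centers)]
    return centers


centroids = fit_centroids(VALUES, LABELS)
print(predict(centroids, 1.1))   # class predicted from labelled training data
print(two_means_1d(VALUES))      # cluster centres found without any labels
```

The supervised step needs the labels to define what "low" and "high" mean, whereas the clustering step only recovers two groups and leaves their interpretation to the analyst.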

Analytic goals of medical big data are prediction, modeling, and inference; classification, clustering, and regression are common methods exploited in these contexts [

Iavindrasana et al [

Medical big data have several issues related to the data themselves which, although not specific to big data, need to be considered during analyses. The issue of multiple comparisons will not be discussed in this review.

Medical big data analytics deal with data collected for other purposes, such as patient care in the case of electronic medical records, and these data inherently have many variables with missing values.

Although the simplest and most overused way to handle missing values is to remove the cases with missing values, or complete-case analysis, it is valid only when missing values are assumed to be independent of both observed and unobserved data (see below). This assumption is not realistic in most situations. Therefore, complete-case analysis in these cases may bias the conclusion. Another major drawback of the complete-case analysis is that reducing the number of data points available for analysis generally is very inefficient [

Missingness may exhibit various relationships with observed or unobserved data. Missing data are classified into three types: 1) missing completely at random (MCAR), 2) missing at random (MAR), and 3) not missing at random (NMAR). MCAR is missingness whose probability does not depend on either observed or unobserved data. If data are MCAR, the probability of a missing observation is the same for all entities. In these situations, complete-case analysis does not bias the scientific inference. This condition is rarely met in practice. MAR is missingness whose probability does not depend on unobserved data but does depend on observed data. In these cases, the analysis should adjust for all the variables that affect the probability of missingness. NMAR is missingness whose probability depends on unobserved data. There are many tool kits to handle these types of missingness, including NMAR, such as those in SAS, R, Stata, and WinBUGS [
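The effect of the three mechanisms on complete-case analysis can be illustrated with a small simulation. This is a minimal sketch under hypothetical assumptions: `x` is an always-observed covariate, `y` is the variable of interest with a true mean of 0, and the missingness probabilities in `MECHANISMS` are invented for illustration only.

```python
import math
import random
from statistics import fmean


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical probability that y is missing under each mechanism.
MECHANISMS = {
    "MCAR": lambda x, y: 0.3,               # same for every record
    "MAR": lambda x, y: sigmoid(2.0 * x),   # depends on observed x only
    "NMAR": lambda x, y: sigmoid(2.0 * y),  # depends on the unobserved y itself
}


def complete_case_mean(mechanism, n=20000, seed=0):
    """Mean of y among records where y was observed (complete-case analysis)."""
    rng = random.Random(seed)
    p_missing = MECHANISMS[mechanism]
    observed = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)  # true population mean of y is 0
        if rng.random() >= p_missing(x, y):
            observed.append(y)
    return fmean(observed)


for name in MECHANISMS:
    print(name, round(complete_case_mean(name), 3))
```

In this setup only the MCAR estimate stays close to the true mean of 0. Under MAR and NMAR, records with larger values are dropped more often, so the complete-case mean is pulled downward.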

High dimensional data are data with too many attributes compared to the number of observational units. Microarray data or next generation sequencing data are typically high dimensional datasets. In high-dimension datasets, many numerical analyses, data sampling protocols, combinatorial inference, machine learning methods, and data managing processes are susceptible to the “curse of dimensionality” [

Sparsity, multicollinearity, model complexity, the computational cost of fitting a model, and model overfitting are issues that accompany high-dimensional datasets [
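One commonly cited symptom of the curse of dimensionality is distance concentration: as the number of attributes grows, the nearest and the farthest neighbours of a point become almost equally distant, which weakens distance-based methods. The sketch below assumes uniformly distributed random points, and the function name `distance_contrast` is hypothetical; it illustrates this general phenomenon rather than any analysis from the cited literature.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest Euclidean distance from a random query point.

    Values near 0 mean neighbours are well separated; values near 1 mean
    all points are almost equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return min(distances) / max(distances)


for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 3))
```

The ratio moves toward 1 as the dimension increases, so "nearest" carries less and less information when the number of attributes is large relative to the number of observations.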

Randomized controlled trials minimize bias and control confounding and are therefore considered the gold standard of design validity [

Big data analyses of data from administrative claims databases or national registries can be used to overcome these limitations. Big data studies provide real-world healthcare information from a broader, population-based perspective. Administrative claims data have broad generalizability, large numbers of patient records, and less attrition than clinical trials; they are faster and less costly than primary data collection, and can often be linked with other datasets [
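Because observational sources such as claims data are not randomized, confounding remains a concern. The simulation below is a hedged sketch of that problem, not a method taken from the cited studies: the confounder `severe`, the treatment probabilities, and the true treatment effect of 1.0 are all hypothetical. Simple stratification on the confounder is used here as a stand-in for adjustment; it is not a propensity score or instrumental variable analysis.

```python
import random
from statistics import fmean


def simulate(n=20000, randomized=False, seed=0):
    """Return rows of (severe, treated, outcome) with a true effect of 1.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5
        if randomized:
            p_treat = 0.5                     # treatment ignores severity
        else:
            p_treat = 0.8 if severe else 0.2  # sicker patients treated more
        treated = rng.random() < p_treat
        outcome = 1.0 * treated + 3.0 * severe + rng.gauss(0.0, 1.0)
        rows.append((severe, treated, outcome))
    return rows


def naive_difference(rows):
    """Unadjusted difference in mean outcome, treated minus untreated."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return fmean(treated) - fmean(control)


def stratified_difference(rows):
    """Average the within-stratum differences, weighted by stratum size."""
    total = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        total += naive_difference(stratum) * len(stratum) / len(rows)
    return total


observational = simulate(randomized=False)
trial = simulate(randomized=True)
print("observational, naive:     ", round(naive_difference(observational), 2))
print("observational, stratified:", round(stratified_difference(observational), 2))
print("randomized, naive:        ", round(naive_difference(trial), 2))
```

In the observational setting the naive comparison overstates the effect because severity drives both treatment and outcome. Stratifying on the measured confounder recovers an estimate close to the true value, but this works only when the confounder is actually recorded, which is why residual confounding cannot be ruled out in observational data.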

Although the potential of big data analytics is promising, it is important to assess the “state of science” and to recognize that, at present, the application of big data analytics is largely promissory [

This work was funded by the Korea Meteorological Administration Research and Development Program (grant number KMIPA 2015-5120).

All authors have no conflicts of interest to declare.

A continuous learning healthcare system.

Medical big data analysis vs. classical statistical analysis

| | Medical big data analysis | Classical statistical analysis |
|---|---|---|
| Application | Hypothesis-generating | Hypothesis-testing |
| Questions of interest | Overcoming the limitation of locally or temporally stable association with continually updating the data and algorithm | Trying to prove causal relationships |
| Domain knowledge | More important in interpretation of the results | Important both in collection of data and interpretation of the results |
| Sources of data | Any kind of sources; frequently multiple sources | Carefully specified collection of data; usually single source |
| Data collection | Recording without the direct supervision of a human | Human-based measurement recording |
| Coverage of data to be analyzed | Substantial fraction of entire population | Small data samples from a specific population with some assumptions of their distribution |
| Data size | Frequently huge | Relatively small |
| Nature of data | Unstructured and structured | Mainly structured |
| Data quality | Rarely clean | Quality controlled |
| Research questions of data analysis | May be different from those of data collection | Same as those of data collection |
| Underlying assumption of the model | Frequently absent | Based on various underlying probability distribution function |
| Analytic tools | Frequently automated with data mining algorithm | Manually by expert with classical statistics |
| Main outputs of analysis | Prediction, models, patterns identified | Statistical score contrasted against random chance |
| Privacy & ethics | Concerns about privacy and ethical issues | Data collection according to the pre-approved protocol; informed consent from the participants |