The Importance of Reproducibility in High-Throughput Biology: Case Studies in Forensic Bioinformatics
Keith A. Baggerly, PhD
Professor, Department of Bioinformatics and Computational Biology, Division of Quantitative Sciences
University of Texas MD Anderson Cancer Center
Modern high-throughput biological assays let us ask detailed questions about how diseases operate, and promise to let us personalize therapy. Our intuition about what the answers “should” look like in high dimensions is very poor, so careful data processing is essential. When documentation of such processing is absent or incomplete, we must apply “forensic bioinformatics” to work backwards from the raw data and the reported results to infer what the methods must have been. Such explorations occasionally reveal errors. The most common errors we uncover are simple ones, often involving mislabeling of rows, columns, or variables. These errors are easy to make, but if documentation is adequate, they may be easy to fix. Incomplete documentation is, however, pervasive in much of the scientific literature. Fortunately, new tools (many from the open-source community) have been introduced in the past few years which make documentation much easier.
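As a hedged illustration of the kind of simple mislabeling error described above (the data, sample names, and group labels here are hypothetical, not drawn from any specific study), consider how an off-by-one shift in a column of sample labels silently scrambles group assignments, and how re-deriving the labels from an independent source can catch it:

```python
# Hypothetical example: six samples with known treatment-response labels.
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
labels = ["sensitive", "sensitive", "sensitive",
          "resistant", "resistant", "resistant"]

# A common spreadsheet mistake: the label column is pasted one row off,
# so each sample silently inherits its neighbor's label.
shifted = labels[-1:] + labels[:-1]

# Forensic check: re-derive labels from an independent record
# (here, a lookup keyed by sample ID) and count disagreements.
truth = dict(zip(samples, labels))
mismatches = [s for s, lab in zip(samples, shifted) if truth[s] != lab]
print(f"{len(mismatches)} of {len(samples)} samples mislabeled")
# → 2 of 6 samples mislabeled
```

Note that most samples still carry plausible labels, which is exactly why such shifts are hard to spot by eye and why cross-checks against raw data are needed.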
Dr. Baggerly is a professor of bioinformatics and computational biology at the UT MD Anderson Cancer Center, where he has worked since 2000. He is a statistician by training and a fellow of the American Statistical Association. He is perhaps best known for his explorations in “forensic bioinformatics”: deep dives into raw data to identify key features of the underlying experiments. His work has been covered by Science, Nature, and the New York Times. Today, like the rest of us, he's still struggling with how best to use complex assay data to improve patient care.