Tool detects patterns hidden in vast data sets | Broad Institute of MIT and Harvard
Researchers from the Broad Institute and Harvard University have developed a tool that can tackle large data sets in a way that no other software program can. Part of a suite of statistical tools called MINE, it can tease out multiple patterns hidden in health information from around the globe, statistics amassed from a season of major league baseball, data on the changing bacterial landscape of the gut, and much more.
nsf.gov - National Science Foundation (NSF) News - Tool Enables Scientists to Uncover Patterns in Vast Data Sets - US National Science Foundation (NSF)
Relationships discovered in data will shed light on vexing problems and increase human understandingOne of the greatest strengths of this newly discovered tool within MINE is its ability to detect and analyze a broad spectrum of patterns and characterize them according to a number of different parameters a researcher might be interested in. Other statistical tools work well for searching for a specific pattern in a large data set, but cannot score and compare different kinds of possible relationships. Researchers can also use MINE to generate new ideas and connections.
Learn more about MINE, MIC and patterns identified in biological and health data, as well as statistics from the 2008 baseball season by visiting the Broad Institute website. A video about this work also is available on the website.
This graphic depicts the top 0.25 percent of the relationships that the researchers' techniques found in data on the concentration of microbes in the human gut.
Image courtesy of David Reshef
The procedure that their algorithm follows can be interpreted visually. Effectively, the algorithm considers every pair of variables in a dataset and plots them against each other. It then overlays each graph with a series of denser and denser grids and identifies the grid cells that contain data points. Using principles borrowed from information theory, the algorithm assesses how orderly the patterns produced by the data-containing cells are. The score for each pair of variables is based on the score of its most orderly pattern.
“The fundamental idea behind this approach is that if a pattern exists in the data, there will be some gridding that can capture it,” Reshef says. And because the cells in a grid can track a curve as easily as they can a straight line, the method isn’t tied to any particular type of relationship.
Related