Research

Materials Science and Engineering at MIT (PDF)
The Scientific Data Flood: A Case Study of "How Much Information?"

Stuart Madnick, John Norris Maguire Professor of Information Technology, MIT Sloan School of Management & Professor of Engineering Systems, MIT School of Engineering

MacKenzie Smith, Associate Director of Technology, MIT Libraries

Kate Clopeck, Masters of Science, Technology and Policy Program, MIT

June 2009

Abstract:
This case study gives examples of how materials scientists and engineers at MIT create and store data. The amount of data depends on the specific research goals and on the tools, experimental techniques, and computational methods each researcher employs. Both simulations and experiments are used, with the simulations producing more data in the cases reported here. The ratio of computation to data produced varies widely. For example, a hundred-million-atom simulation might produce only a few kilobytes of data. However, if the researcher wants to track the system at every time step, a much “smaller” simulation (fewer atoms) could generate petabytes of data. Data retention practices also differ from lab to lab. For example, in one lab, research data is stored on the students’ and postdocs’ personal computers, with each person in charge of the data they generate. The first author listed on the final publication is responsible for backing up the data onto a CD or portable hard drive at the time of publication. Each year, the faculty member assigns one of her students to purge old data. Other papers examine other labs at MIT.
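The contrast between kilobytes and petabytes in the abstract can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not taken from the case study: the atom counts, step counts, and the storage layout (three double-precision coordinates per atom per saved frame) are assumptions chosen only to show how saving every time step drives the data volume, and the function names are hypothetical.

```python
def trajectory_bytes(n_atoms, n_steps, values_per_atom=3,
                     bytes_per_value=8, save_every=1):
    """Estimate the raw size of a saved trajectory.

    Assumes (hypothetically) that each saved frame stores
    `values_per_atom` numbers per atom, e.g. x, y, z positions
    as 8-byte floats, with no compression or file overhead.
    """
    if save_every < 1:
        raise ValueError("save_every must be at least 1")
    n_frames = n_steps // save_every
    return n_atoms * values_per_atom * bytes_per_value * n_frames


def summary_bytes(n_values, bytes_per_value=8):
    """Size of an output that keeps only a few summary numbers."""
    return n_values * bytes_per_value


def human_readable(n_bytes):
    """Format a byte count using decimal (power-of-1000) units."""
    units = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]
    size = float(n_bytes)
    for unit in units:
        if size < 1000 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


# Assumed scenario A: a very large simulation that writes out only
# 500 summary values (energies, averages, and so on) at the end.
large_run = summary_bytes(500)

# Assumed scenario B: a "smaller" one-million-atom simulation that
# saves all atomic positions at each of 100 million time steps.
tracked_run = trajectory_bytes(n_atoms=1_000_000, n_steps=100_000_000)

print(human_readable(large_run))    # 4.0 kB
print(human_readable(tracked_run))  # 2.4 PB
```

Under these assumed numbers, the stored volume is set by how often the state is written out rather than by the size of the simulated system; raising `save_every` to 1,000 would cut the second figure by a factor of 1,000.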