Anyone with a personal computer knows about the problem with data, not least its tendency to grow at an unnerving rate, take up space, demand upgrades, and clog bandwidth as it gets moved around. Emails and tweets are data; so are picture files, music tracks, and movies. Sometimes we store it; sometimes it flows through our world like a stream. Data is on the move.
We pull data off the web, and it gets pushed at our phones, TV sets, mobile computers and workstations. We generate it as well, even without knowing. Then there’s all that data of which we are scarcely aware but that affects us anyway: financial, travel, meteorological, surveillance, medical, remote sensing, monitoring, security and scientific data.
I remember when people used to talk about this mass of data as a problem — the scale of it, its demands on storage and bandwidth, its redundancy, its inconsistency, and its heterogeneity (it’s in many different formats). But since about 2008 the data maelstrom has changed its hue. That to me is the Big Data revolution.
Instead of worrying about where to store all this data, we now have “the cloud.” The data gets stored, backed up and ferried between massive file servers, and data users don’t have to know how or where that happens.
The scale, flow, and heterogeneity of all that data now constitute a positive challenge, the mastery of which is already yielding results in commerce. More significantly, it’s starting to show results in urban and environmental control systems, the sciences, policy formation, and aspects of design.
It’s apparent that all that data also has uses beyond its immediate purpose. All those retail transactions not only move goods and money (IOUs) around; they can also be mined for information about consumer habits in general. Medical records contain information about individuals, but they can also be correlated with data from other sources to yield insights into the state of people’s health and the distribution of diseases over time and space.
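To make the idea of secondary use concrete, here is a minimal sketch in Python. The records, region names and diagnoses are all invented for illustration, and the function name is hypothetical; real health data would need proper anonymisation and far more careful handling. The point is only that records collected about individuals can be re-counted to show how a condition is spread over space and time.

```python
from collections import Counter

# Hypothetical, anonymised records: (region, month, diagnosis).
# Every value here is invented for illustration.
records = [
    ("north", "2014-01", "flu"),
    ("north", "2014-01", "flu"),
    ("north", "2014-02", "asthma"),
    ("south", "2014-01", "flu"),
    ("south", "2014-02", "flu"),
    ("south", "2014-02", "flu"),
]

def cases_by_region_and_month(records, diagnosis):
    """Count how often a diagnosis appears in each (region, month) cell."""
    return Counter(
        (region, month)
        for region, month, dx in records
        if dx == diagnosis
    )

flu_counts = cases_by_region_and_month(records, "flu")

# Print the distribution of flu cases over space and time.
for (region, month), n in sorted(flu_counts.items()):
    print(region, month, n)
```

Nothing in the individual records mentions a distribution; it only appears once the records are aggregated for a purpose other than the one they were collected for.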
A 2009 US report edited by Tony Hey and colleagues, The Fourth Paradigm: Data-Intensive Scientific Discovery, is available as a PDF online. It makes the case for improving digital infrastructures to support the flow, storage and analysis of massive quantities of scientific data.
A few of the articles in that volume reference an article in Wired Magazine by its then editor-in-chief, Chris Anderson. Most commentators read the article sympathetically, as a provocation. It overstates the case that data in quantity ushers in a step change in how we think about science and the growth of knowledge.
Data versus theory
The strong claim is that big data constitutes a repository developed without bias or favour, to be deployed for whatever purpose we choose. The data is there, streaming around us, to be captured and accessed as if automatically, even if we can’t yet think of a use for it or of what it might show.
On this view, data is neutral. If there’s enough of it then it covers every individual case and accommodates just about every way of looking at the phenomena it represents. Statisticians don’t need to sample. They have the complete set. It can be probed to identify the most typical, marginal and exceptional cases, and even to provide generalisations, theories and models, though these are no longer necessary.
Anderson’s article tackles the status of theories. Theories are an efficient substitute for stored data, but they leave out a lot, namely all those exceptional cases. But now we can retain the raw data to be interrogated over and over again.
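The idea of a theory as a compact substitute for data can be sketched in a few lines of Python. This is my own toy illustration, not anything taken from Anderson’s article: the observations are invented, and a straight line fitted by ordinary least squares stands in for a “theory.” Two numbers (a slope and an intercept) replace seven observations, and the exceptional case shows up only as a large residual, which is visible only if the raw data has been kept.

```python
# Hypothetical observations (x, y), invented for illustration.
# Most follow y = 2x + 1; the point at x = 5 is an exceptional case.
observations = [
    (0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0),
    (4, 9.0), (5, 20.0), (6, 13.0),
]

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# The "theory": two numbers standing in for all the observations.
slope, intercept = fit_line(observations)

# What the theory leaves out: the gap between each observation and the line.
residuals = [(x, y - (slope * x + intercept)) for x, y in observations]

# The case the theory fits worst, recoverable only from the raw data.
exceptional = max(residuals, key=lambda r: abs(r[1]))
print("theory: y = %.2f * x + %.2f" % (slope, intercept))
print("worst-fitting case at x =", exceptional[0])
```

With these invented numbers the worst-fitting case is the observation at x = 5. Note also that the single outlier drags the fitted slope away from 2, so the summary is shaped by the very case it fails to describe.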
The health care singularity
Big data doesn’t presume theories, and the biases they entail. It ought to be freely available (anonymised) to different communities of researchers, hence the attention to open access to data. Big data is democratic. It also lends itself to crowdsourcing: recruiting armies of volunteers to generate, interpret and deploy the data, and giving everyone access.
As a case in point, Gillam et al. refer to the “health care singularity.” This is the moment when medical knowledge becomes “liquid” (61). Vast numbers of patient records will aggregate in patient data clouds, there to be tapped, and from them hypotheses can be extracted and predictions made.
This ascendancy of data will also change academic research practices. Gillam et al. assert: “To enable instantaneous translation, journal articles will consist of not only words, but also bits. Text will commingle with code, and articles will be considered complete only if they include algorithms” (61).
The Big Data exemplars are familiar to most of us observers of digital media: the flow of Twitter feeds, Facebook news feeds, bus and train travel information, airline schedules, real-time weather information. But that’s just the tip of a massive iceberg, or glacial tributary, or spray from a deluge of data gathered from the natural environment, oceans, cities, people, outer space, and so on.
Next post: What’s wrong with Big Data metaphysics.
- Anderson, Chris. 2008. The end of theory: The data deluge makes the scientific method obsolete. Wired Magazine, (16) 07, http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory.
- Bowker, Geoffrey C. 2014. The theory/data thing. International Journal of Communication, (8)1795-1799.
- Gillam, Michael, Craig Feied, Jonathan Handler, Eliza Moody, Ben Shneiderman, Catherine Plaisant, Mark Smith, and John Dickason. 2009. The healthcare singularity and the age of semantic medicine. In Tony Hey, Stewart Tansley, and Kristin Tolle (eds.), The Fourth Paradigm: Data-Intensive Scientific Discovery: 57-64. Redmond, WA: Microsoft Research.
- Hargittai, Eszter. 2015. Is bigger always better? Potential bias of big data derived from social network sites. Annals, AAPSS, (659)63-76.
- Kitchin, Rob. 2014. The real-time city? Big data and smart urbanism. GeoJournal, (79)1-14.
- Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. The parable of Google flu: Traps in big data analysis. Science, (343) March, 1203-1205.
- Lyon, David. 2014. Surveillance, Snowden, and Big Data: Capacities, consequences, critique. Big Data and Society, (DOI: 10.1177/2053951714541861)1-13.
- Marquet, Pablo A., Andrew P. Allen, James H. Brown, Jennifer A. Dunne, Brian J. Enquist, James F. Gillooly, Patricia A. Gowaty, Jessica L. Green, John Harte, Steve P. Hubbell, James O’Dwyer, Jordan G. Okie, Annette Ostling, Mark Ritchie, David Storch, and Geoffrey B. West. 2014. On theory in ecology. BioScience, (64)701-710.
- Miller, Harvey J., and Michael F. Goodchild. 2015. Data-driven geography. GeoJournal, (80)449-461.
- Rabari, Chirag, and Michael Storper. 2014. The digital skin of cities: urban theory and research in the age of the sensored and metered city, ubiquitous computing and big data. Cambridge Journal of Regions, Economy and Society, (doi:10.1093/cjres/rsu021)1-16.