The Promises and the Challenges of Big Social Data
text author: Lev Manovich
article version: 1
posted March 31, 2011
[This is the first part of a longer article – the second part will be posted in the next few days]
The emergence of social media in the middle of 2000s created a radically new opportunity to study social and cultural processes and dynamics. For the first time, we can follow imagination, opinions, ideas, and feelings of hundreds of millions of people. We can see the images and the videos they create and comment on, eyes drop on the conversations they are engaged in, read their blog posts and tweets, navigate their maps, listen to their tracklists, and follow their trajectories in physical space.
In the 20th century, the study of the social and the cultural relied on two types of data: “surface data” about many (sociology, economics, political science) and “deep data” about a few (psychology, psychoanalysis, anthropology, ethnography, art history; methods such as “thick description” and “close reading”). For example, a sociologist worked with census data that covered most of the country’s citizen; however, this data was collected only every 10 year and it represented each individual only on a “macro” level, living out her/his opinions, feelings, tastes, moods, and motivations. In contrast, a psychologist was engaged with a single patient for years, tracking and interpreting exactly the kind of data which census did not capture.
In the middle between these two methodologies of “surface data” and “deep data” was statistics and the concept of sampling. By carefully choosing her sample, a researcher could expand certain types of data about the few into the knowledge about the many. For example, starting in 1950s, Nielsen Company collected TV viewing data in a sample of American homes (via diaries and special devices connected to TV sets in 25,000 homes), and then used this sample data to tell TV networks their ratings for the whole country for a particular show (i.e. percentage of the population which watched this show). But the use of samples to learn about larger populations had many limitations.
For instance, in the example of Nelson’s TV ratings, the small sample used did not tell us anything about the actual hour by hour, day to day patterns of TV viewing of every individual or every family outside of this sample. Maybe certain people watched only news the whole day; others only tuned in to concerts; others had TV on not never paid attention to it; still others happen to prefer the shows which got very low ratings by the sample group; and so on. The sample data could not tell any of this. It was also possible that a particular TV program would get zero shares because nobody in the sample audience happened to watch it – and in fact, this happened more than once.
Think of what happens then you take a low-res image and make it many times bigger. For example, lets say you stat with 10x10 pixel image (100 pixels in total) and resize it to 1000x1000 (one million pixels in total). You don’t get any new details – only larger pixels. This is exactly happens when you use a small sample to predict the behavior of a much larger population. A “pixel” which originally represented one person comes to represent 1000 people who all assumed to behave in exactly the same way.
The rise of social media along with the progress in computational tools that can process massive amounts of data makes possible a fundamentally new approach for the study of human beings and society. We no longer have to choose between data size and data depth. We can study exact trajectories formed by billions of cultural expressions, experiences, texts, and links. The detailed knowledge and insights that before can only be reached about a few can now be reached about many – very, very many.
In 2007, Bruno Latour summarized these developments as follows: “The precise forces that mould our subjectivities and the precise characters that furnish our imaginations are all open to inquiries by the social sciences. It is as if the inner workings of private worlds have been pried open because their inputs and outputs have become thoroughly traceable.” (Bruno Latour, “Beware, your imagination leaves digital traces”, Times Higher Education Literary Supplement, April 6, 2007.)
Two years earlier, in 2005, Nathan Eagle at MIT Media Lab already was thinking along the similar lines. He and Alex Pentland put up a web site “reality mining” (reality.media.mit.edu) and wrote how the new possibilities of capturing details of peoples’ daily behavior and communication via mobile phones can create “Sociology in the 21st century.” To put this idea into practice, they distributed Nokia phones with special software to 100 MIT students who then used these phones for 9 months – which generated approximately 60 years of “continuous data on daily human behavior”.
Finally, think of Google search. Google’s algorithms analyze text on all web pages they can find, plus “PDF, Word documents, Excel spreadsheets, Flash SWF, plain text files, and so on,” and, since 2009, Facebook and Twitter content. (en.wikipedia.org/wiki/Google_Search). Currently Google does not offer any product that would allow a user to analyze patterns in all this data the way Google Trends does with search queries and Google’s Ngram Viewer does with digitized books – but it is certainly technologically conceivable. Imagine being able to study the collective intellectual space of the whole planet, seeing how ideas emerge and diffuse, burst and die, how they get linked together, and so on – across the data set estimated to contain at least 14.55 billion pages (as of March 31, 2011; see worldwidewebsize.com).
Does all this sounds exiting? It certainly does. What maybe wrong with these arguments? Quite a few things.
[The second part of the article will be posted here within next few days.]
I am grateful to UCSD faculty member James Fowler for an inspiring conversation a few years about the depth/surface questions. See his pioneering social science research at jhfowler.ucsd.edu.