Steve Lohr on the Origins of the Term “Big Data”

Data hounds will appreciate reading Steve Lohr’s concise but informative article in the February 1 edition of the New York Times, in which he takes a look at the origins of the moniker “big data.” It’s fun insofar as the term has drifted into common parlance after being mentioned here and there, but it may not be so easy to find a single individual to credit for its creation. The first time I ever regarded it seriously was when it appeared in an NBER Working Paper that addressed future career opportunities for economists in big data (I’ll add the cite once I track it down again).

It reminds me of a local story involving moniker-manufacturing on a grand scale. During the late 1970s, the Oakland-Berkeley regional newspaper East Bay Express published an article by humorist Alice Kahn. In the article, Ms. Kahn coined the term “Yuppie.” So far as anyone could tell, she was the first person to use the term, which meme-exploded across the USA in a few months. In subsequent issues of the Express she turned it into an ongoing gag, because everybody she knew kept telling her, “We think you should sue” for rights to the term. Humor being an “open source” product first and foremost, she didn’t sue, but did “work it” for what it was worth.

Back to big data.  Here’s a quote from the article, given by Fred R. Shapiro, Associate Librarian at Yale Law School and editor of the Yale Book of Quotations:

“The Web…opens up new terrain. What you’re seeing is a marriage of structured databases and novel, less structured materials. It can be a powerful tool to see far more.”

This is exactly the point that Autonomy and other e-discovery firms such as Recommind make: to analyze the full output of a given company or the full record of a legal case, you now have to look at all of the data. That includes the easier-to-parse world of structured data, but more and more it includes social media, email, recorded telephone conversations and many other casual (but critical) information resources.


Big Data Meets Literary Scholarship

The New York Times published a very interesting update on how humanists are applying big data approaches to their scholarship (see The New York Times, January 27, 2013, p. B3). The article begins with a description of research by Matthew L. Jockers at the University of Nebraska-Lincoln. He conducted word- and phrase-level textual analysis on thousands of novels, enabling longer-term patterns to emerge in how authors use words and find inspiration. This kind of textual analysis revealed the impact of a few major authors on many others, and identified the outsized impact of Jane Austen and Sir Walter Scott.

Jockers said, “Traditionally, literary history was done by studying a relative handful of texts…what this technology does is let you see the big picture–the context in which a writer worked–on a scale we’ve never seen before.”

The implications for comparative literature and other fields that bump up against disciplinary boundaries are compelling. This kind of data analysis has long been the domain of sociologists, linguists and other social scientists, but it is increasingly finding a home in the humanities.

Steve Lohr, the Times article’s author, provides a number of other examples. One of my favorites is the research conducted by Jean-Baptiste Michel and Erez Lieberman Aiden, who are based at Harvard. They used Google Books’ graph utility–open to the public–to chart the evolution of word use over long periods of time. One interesting example: for centuries, references to “men” vastly outnumbered references to “women,” but in 1985 references to women began to lead references to men (Betty Friedan, are you there?).
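For readers curious what this kind of word counting looks like in practice, here is a minimal Python sketch. It is my own illustration, not the method Michel and Aiden used: the three-snippet corpus is made up, and the function names (term_share_by_year, first_year_leading) are hypothetical. It computes each term’s share of all words per year and then finds the first year in which one term overtakes another.

```python
from collections import Counter
import re

# Hypothetical toy corpus of (publication year, text) pairs. A real study
# would draw on a huge digitized collection, not three invented snippets.
TOY_CORPUS = [
    (1900, "The men of the town met the men of the port. One woman watched."),
    (1950, "Men and women worked side by side, though men held most posts."),
    (2000, "Women led the council; women and men voted together."),
]


def term_share_by_year(corpus, terms):
    """Return {year: {term: fraction of that year's words equal to term}}."""
    words_by_year = {}
    for year, text in corpus:
        # Lowercase and split into simple alphabetic tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        words_by_year.setdefault(year, []).extend(tokens)

    shares = {}
    for year, words in words_by_year.items():
        counts = Counter(words)
        total = len(words)
        shares[year] = {term: counts[term] / total for term in terms}
    return shares


def first_year_leading(shares, leader, trailer):
    """Return the earliest year in which `leader` is more frequent than `trailer`."""
    for year in sorted(shares):
        if shares[year][leader] > shares[year][trailer]:
            return year
    return None


shares = term_share_by_year(TOY_CORPUS, ["men", "women"])
crossover = first_year_leading(shares, "women", "men")
print(crossover)
```

Using shares rather than raw counts matters because the number of books published varies enormously from year to year; a raw count would mostly track the size of the collection.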

Studying literature on this scale is indicative of the power and potential of big data to revolutionize how scholarship is done. Indeed, the availability of useful data is subtly transforming humanist scholars to the point that interested humanists are gaining a new identity as computer programmers.

Lohr also points out that quantitative methods are most effective when experts with deep knowledge of the subject matter guide the analysis, and even second-guess the algorithms.

What is new and distinctive is the ability to ramp up the study of a few texts to a few hundred texts. The trick will be to keep the “humanity” in humanism.

I also draw considerable inspiration from the growing awareness that pattern recognition–a daily exercise for information professionals–is gaining new attention as part of the research process in general.

Perhaps it’s time for some of us to collaborate as co-principal investigators….

Metrics and Management: New Book, New Implications

The Org: The Underlying Logic of the Office, by Ray Fisman and Tim Sullivan (Twelve, 2013), was featured today on the first page of the New York Times Business Day section. The article starts out by comparing British Petroleum’s record as a government enterprise with its later record as a private corporation. (Pop quiz: what are the two biggest disasters BP created, and when did they occur?) This alone is intriguing and suggests the book is a good read, but the following quotes really caught my attention:

“The more we reward those things we can measure, and not reward the things we care about but don’t measure, the more we will distort behavior.” –Burton Weisbrod, Northwestern University

“If what gets measured is what gets managed, then what gets managed is what gets done.” –Fisman and Sullivan

The implication is that what is not managed not only will not get done, but may go wrong in unforeseen ways (Deepwater Horizon, anyone?).

These insights apply to bibliometrics as well as to management, particularly when it comes to the quality and impact of library services that are not (sufficiently) measured.