Big Data Meets Literary Scholarship

The New York Times published a very interesting update on how humanists are applying big data approaches to their scholarship (see The New York Times, January 27, 2013, p. B3). The article begins with a description of research by Matthew L. Jockers at the University of Nebraska-Lincoln. He conducted word- and phrase-level textual analysis on thousands of novels, which allowed long-term patterns to emerge in how authors use words and find inspiration. The analysis revealed the influence of a few major authors on many others, in particular the outsized impact of Jane Austen and Sir Walter Scott.

Jockers said that “Traditionally, literary history was done by studying a relative handful of texts…what this technology does is let you see the big picture–the context in which a writer worked–on a scale we’ve never seen before.”

The implications for comparative literature and other fields that bump up against disciplinary boundaries are compelling. This kind of data analysis has long been the domain of sociologists, linguists and other social scientists, but it is increasingly finding a home in the humanities.

Steve Lohr, the Times article’s author, provides a number of other examples. One of my favorites is the research conducted by Jean-Baptiste Michel and Erez Lieberman Aiden, who are based at Harvard. They used Google Books’ graph utility–open to the public–to chart the evolution of word use over long periods of time. One interesting example: for centuries, references to “men” vastly outnumbered references to “women,” but in 1985 references to women began to lead references to men (Betty Friedan, are you there?).
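For readers curious about what this kind of word counting looks like in practice, here is a minimal sketch in Python. It is not the Google Books tool or the researchers' method. The three-sentence corpus is invented purely for illustration, and the function names (tokenize, relative_frequency_by_year, first_year_leading) are my own hypothetical helpers. The idea is simply to count how often each target word appears per year, divide by the total number of words that year, and look for the first year in which one word overtakes the other.

```python
import re
from collections import Counter, defaultdict

# Toy corpus of (publication year, text) pairs. These sentences are invented
# placeholders for illustration only; they are not real data and are not
# drawn from Google Books.
TOY_CORPUS = [
    (1900, "The men spoke and the men voted while the women waited."),
    (1950, "Men and women worked, though the men were counted first."),
    (2000, "Women led the study and women wrote it up with the men."),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency_by_year(corpus, targets):
    """Return {year: {word: occurrences / total tokens in that year}}."""
    counts = defaultdict(Counter)
    totals = Counter()
    for year, text in corpus:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        # Whole-token matching, so "women" is never counted as "men".
        counts[year].update(t for t in tokens if t in targets)
    return {
        year: {word: counts[year][word] / totals[year] for word in targets}
        for year in sorted(totals)
    }


def first_year_leading(freqs, word, other):
    """Return the first year in which `word` is more frequent than `other`."""
    for year in sorted(freqs):
        if freqs[year][word] > freqs[year][other]:
            return year
    return None


freqs = relative_frequency_by_year(TOY_CORPUS, {"men", "women"})
for year, row in freqs.items():
    print(year, {word: round(f, 3) for word, f in sorted(row.items())})
print("First year 'women' leads:", first_year_leading(freqs, "women", "men"))
```

Dividing by the total word count per year matters because far more text survives from recent decades than from earlier ones, so raw counts alone would exaggerate any recent trend.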

Studying literature on this scale shows the power and potential of big data to revolutionize how scholarship is done. Indeed, the availability of useful data is subtly transforming humanist scholarship, to the point that interested humanists are taking on a new identity as computer programmers.

Lohr also points out that quantitative methods are most effective when experts with deep knowledge of the subject matter guide the analysis, and even second-guess the algorithms.

What is new and distinctive is the ability to ramp up the study of a few texts to a few hundred texts. The trick will be to keep the “humanity” in humanism.

I also draw considerable inspiration from the growing awareness that pattern recognition–a daily exercise for information professionals–is gaining new attention as part of the research process in general.

Perhaps it’s time for some of us to collaborate as co-principal investigators….

Metrics and Management: New Book, New Implications

The Org: The Underlying Logic of the Office, by Ray Fisman and Tim Sullivan (Twelve, 2013), was featured today on the first page of the New York Times Business Day section. The article starts out by comparing British Petroleum’s record as a government enterprise with its later record as a private corporation. (Pop quiz: what are the two biggest disasters BP created, and when did they occur?) This alone is intriguing and suggests the book is a good read, but the following quotes really caught my attention:

“The more we reward those things we can measure, and not reward the things we care about but don’t measure, the more we will distort behavior.” –Burton Weisbrod, Northwestern University

“If what gets measured is what gets managed, then what gets managed is what gets done.” –Fisman and Sullivan

The implication is that what is not managed not only will not get done, but may go wrong in unforeseen ways (Deepwater Horizon, anyone?).

These insights apply to bibliometrics as well as to management, particularly when it comes to the quality and impact of library services that are not (sufficiently) measured.

My December Column in Computers in Libraries: “Making Our Own Futures.”

Computers in Libraries magazine usually devotes its December issue to trend-spotting and future-casting. In that end-of-year spirit, I take a look at “additive manufacturing” (aka 3D printing), “zoomable user interfaces” (ZUIs), “unsourcing” (I love that use of irony), and widely distributed print on demand. Have a look at the December issue of CIL.