Data Discovery and Data Curation Go Hand in Hand

In just a few short years, data curation has been widely embraced by the profession and is recognized by many as an emerging core competency. The reasons are many, but the power of the web as a platform for mashing up diverse data sources is certainly a key factor. New government regulations require researchers to share data compiled in grant-funded research, which also provides a powerful incentive for taking a fresh look at how data can be preserved. In 2011, the Association of Research Libraries published an excellent summation of the potential of data curation for the library profession, titled “New Roles for New Times: Digital Curation for Preservation.” This report was prescient in arguing that the volume of data and the need to preserve it are opening new opportunities for librarians to take center stage as collaborators.

Exciting times to be sure, but with all the new energy surrounding data curation of web- and crowd-sourced information, it is important to remember that new discovery techniques can also uncover fresh value in conventional data resources, particularly those that are generated by public mandate. For my part, I believe that there are significant “sleeper cells” of useful data—much of it gathered by public institutions—and these data can add value when they are added to born digital, linked data sets.

Many public information databases are compiled with a single need in mind: regulating construction permits, monitoring the growth of electrical grids, and so on. These data are often in digital formats, and they can be added to web- or cloud-based resources and used in ways that may not have been foreseen by the agencies that compile the data. The trick is to recognize not only what the primary goal for collecting is, but also to discover what value the data might have in different contexts. With that in mind, I will offer two examples of how data resources can empower new ideas in the broadest sense, and I will also share an old-fashioned data acquisition story “from the trenches.” The story shows how local data gathered by a public agency made the crucial difference in a research project—and suggests how it might gain value as part of larger-scale data analysis.

Big Data, Big Results

One of the best aspects of working with linked data is the ability to combine diverse sources of information and then extrapolate more nuanced meaning from the improved data set. This trend is accelerating, and currently it focuses on “new” and exciting areas such as crowd-sourced data generation and online consumer behavior-tracking. Rightly so: President Obama’s reelection campaign used data-driven strategies alongside its political and rhetorical vision, to considerable advantage. The 2012 U.S. elections proved beyond a doubt that smart data, carefully deployed, was worth more than the hundreds of millions of dollars that were hurled at the general electorate. The overall electoral cycle demonstrated that big data is recognized by politicians and entrepreneurs, as well as academics.

In the academic sphere, big data have created all-new approaches to research. The New York Times published an interesting update on how humanists can now analyze thousands of online novels (see The New York Times, January 27, 2013, p. B3). The article describes how Matthew L. Jockers at the University of Nebraska-Lincoln conducted word- and phrase-level textual analysis of digital books to study long-term language patterns. The much larger sample revealed not only how authors use words, but also how they inspire other authors over the years. One surprise finding: a relatively small number of authors have had an outsized impact on other writers, with Jane Austen and Sir Walter Scott at the forefront. This analytical approach is groundbreaking, insofar as it goes beyond the limitations imposed by much smaller samples of literature. The data application enables researchers to place authors in a larger historical context in ways that were not possible before.

Data-driven political campaigns and large-scale literature analysis demonstrate the blue-sky nature of big data, along with the attendant opportunities to curate the data that are being produced. Yet even as the new frontier expands at a rapid rate, it is still possible to find value in existing data sources. In my opinion, big data applications and data curation will reach their fullest potential when all sources, both old and new, are reexamined with the new tools.

New Value from Not-So-New Data

Not all data worth curating are born on the web. Agencies that oversee construction variances, hospitals, nursing homes, public works, and public health all gather data, but in many cases, their charge is to gather data for a single, specific purpose. The expected “data deliverable” might be tabular information for policy makers and urban planners, flowing from the stream of new construction permits, or other relatively mundane activities. It is easy to assume that such data may be well-targeted, but do not have transferable value. The following example of wage research proves the opposite.

During the 2012 election season, one of our researchers was monitoring “living wage” campaigns across the country and was very interested to see how they would fare. In the political discourse surrounding this issue, many voices argue that increasing the minimum wage is bad for business, raising costs and placing a burden on small firms in particular. Others argue that increasing low wages in nominal increments (75 cents, for example) has a negligible effect on the economy, yet helps household incomes significantly. Our researcher wanted to assess the actual performance and policy ramifications of living wages to shed light on the debate, and needed help.

He needed to gather employee data on every fast food restaurant in a specific metropolitan region. Easily accessible sources indicated that there were more than 3500 establishments in all. Yet within that category, movie theaters, gas station convenience stores, and other purveyors of food-on-the-go needed to be winnowed out. None of the obvious data sources could provide such a pinpointed sample.

One of the library staff contacted the county agency that monitors food safety in restaurants, and eventually got through to their information technology department. She learned that the agency had detailed data on every establishment, including the exact number of employees at each location. This was the data our researcher needed to analyze low wage market dynamics and write a policy brief—just three weeks before the election.

The agency monitors restaurants for compliance with public health regulations. But (and this is a big but) that is literally all they are concerned about. They gather detailed data, but the data are only of interest when they find a safety infraction and must fine the offending restaurant. In our case, we had no interest in restaurant health and safety, but we very much wanted to know employee counts at every restaurant location. This sample would be useful as a basis for testing how living wage policies played out “on the ground.” The agency had exactly what we wanted, and we asked if they would be willing to share the data set with us.

The IT manager agreed, with the proviso that no information about regulatory compliance would be sent to us: just the whole list of restaurants and their employee counts. Once this was agreed upon, it took a few days to receive a data file that had all of what we wanted.

These data provide a comprehensive resource for labor economists, and they will retain their value over the long term. Moreover, good relations with the regulatory agency have established a foundation for receiving data updates periodically. The dataset will also have added value if it is mashed together with other resources, such as state- and national-level employee data, or coupled with Web- and cloud-based news and information about restaurants in the region.
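For readers who like to see the mechanics, the winnowing step in this story can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (`name`, `category`, `employees`), the category labels, and the sample rows are all hypothetical, since the agency's actual file layout is not described here.

```python
import csv
import io

# Hypothetical sample in the shape an agency export might take.
# Column names, category labels, and rows are invented for illustration.
SAMPLE_CSV = """name,category,employees
Burger Stop,fast_food,14
Cineplex Snack Bar,movie_theater,6
Quick Gas Mart,convenience_store,3
Taco Corner,fast_food,9
"""

# Categories to winnow out of the sample (assumed labels).
EXCLUDED = {"movie_theater", "convenience_store"}


def winnow(rows, excluded=EXCLUDED):
    """Keep only establishments whose category is not in the excluded set."""
    return [row for row in rows if row["category"] not in excluded]


def total_employees(rows):
    """Sum the employee counts across the given establishments."""
    return sum(int(row["employees"]) for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
restaurants = winnow(rows)
print(len(restaurants), "establishments,", total_employees(restaurants), "employees")
```

The same filtered rows could then be joined to state- or national-level tables on a shared key, which is where a locally gathered list starts to earn its keep in larger-scale analysis.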

Curate—But Counsel Too

This reference story drives home the fact that even while we are moving full-speed into an era when crowd-sourced, web-crawled, and tagged data are creating wholly new avenues for research, value still remains in ongoing data acquisition programs. Many public agencies produce data, and more often than not, they are well-managed and have a service mentality. When locally-gathered data of this nature are obtained and merged with other larger sources, the specificity of the local enriches the “big picture” that big data can reveal.

The emergence of big data research practices, which is revolutionizing how people parse data sets large and small, can actually strengthen the impact of library discovery skills. As a result, information professionals stand to benefit not only through digital curation and getting involved in big data analysis, but also through the ongoing practice of reference and resource discovery. Because of this, I believe that it is important to promote our research and discovery acumen in the same manner that we are currently promoting the library as the “solution lab” for data curation. As admirable as that effort is, curation alone is, in my opinion, just half of the needed strategy. The crucial balance may be found by remembering that the skills inherent in reference work (discovery, pattern recognition, and analysis) offer a powerful means to convey our value proposition not only as data curators, but also as information counselors with advanced data acquisition skills.

This column appeared in Computers in Libraries, Vol. 33 (No. 3), April 2013.

A Trailblazer’s Second Thoughts on Big Data

First the Bad News

Big data enthusiasts will want to read Janet Maslin’s review in The New York Times of Jaron Lanier’s newest book, Who Owns the Future?, and perhaps the book itself. Many of us have a tendency to look for the upside of social media and crowd-sourced information, so it can be helpful to be reminded by someone who knows best about the “dark” side.

Read all about it: “Fighting Words Against Big Data”

And Now the Long View:

BUT–when you are done with the review and (e)book, don’t miss the extremely interesting and highly useful “Big Data Compendium” that the Times has organized for folks like us:

Big Data Compendium:

Hari Seldon Lives: Revenge of the Original ‘Psychohistorians’

The February 23rd, 2013 issue of The Economist carried a brief but provocative article on the rapid development of massive data analysis by means of social media, and the potential to develop much better models to discover patterns of predictability: in other words, Isaac Asimov’s concept of psychohistory, as conceived in his Foundation novels.

Sci-Fi junkies nearly to a person would rank Asimov’s Foundation trilogy as one of the seminal works of science fiction. With a flair for “space opera” on a galactic level, Asimov sculpts a story in which science meets social sciences, and the resulting “Seldon Plan” would enable “psychohistory”—the forecast of society’s ups and downs—to steer humanity through and beyond a collapse of galactic civilization. In the course of the story he fleshes out the idea of the “scientist as hero,” later popularized by Kim Stanley Robinson in his Mars trilogy. This brand of hero essentially saves us from ourselves—whether the crisis at hand is a collapse in galactic civilization, or a mere well-organized expansion of human beings to Mars.

Well, if The Economist has captured an emerging scientific process, and if what is past is prologue, we may soon get a version of psychohistory in real time, although it might be a tad more primitive than Hari Seldon’s Plan.

The Economist profiles a number of projects underway that use Big Data to predict social outcomes, ranging from using cell phone records to chart “where” we are at any given time, to using epidemiology to forecast future vectors.

Politics (and we could use some serious, big-time help in that arena) is the next frontier for the data crunchers. Boleslaw Szymanski of the Rensselaer Polytechnic Institute is analyzing the role of “catalytic minorities,” groups that make up only about ten percent of a given population but can suddenly swing public opinion in their direction.

The authors go on to speculate whether ultimately it might be possible to develop a theory of society, much in the same manner that physicists are exploring a “theory of everything.” Now we are truly getting into Seldon territory, as our correspondent at The Economist calls it. But modeling something as complex as society will not be easy. Our correspondent says: “Small errors can quickly snowball to produce wildly different outcomes.”

In the Foundation series, the little glitch in the Seldon Plan appears in the form of an individual known as The Mule, who has the extrasensory ability to influence people’s actions and minds. He tipped the Seldon Plan off course, driving the story forward into unknown territory. The heroic psychohistorians labor to control for The Mule’s impact on their complex, formula-driven plan for humanity. In the end (Whew!) they pull it off, and galactic civilization does not fall into a long dark age.

I think research along these lines is quite worthwhile, and in light of the 2012 election season, I can’t help wondering if we are seeing some baby steps in the direction that the Hari Seldons of the future might dare to tread. In the meantime, I’m left with the real-world record of social scientists in the here and now, and how they must “control” for every known factor as they set up models. Perhaps a theory of society is possible to some degree, but it would be better overall if we would just behave as told, to keep the models on track. Somehow I think we will be as recalcitrant as ever, to the dismay and disappointment of our current posse of scientific heroes….

Steve Lohr on the Origins of the Term “Big Data”

Data hounds will appreciate reading Steve Lohr’s concise but informative article in the February 1 edition of the New York Times, in which he takes a look at the origins of the moniker “big data.” It’s fun insofar as the term has drifted into common parlance after being mentioned here and there, but it may not be so easy to find a single individual to credit for its creation. The first time I ever regarded it seriously was when it appeared in an NBER Working Paper that addressed future career opportunities for economists in big data (I’ll add the cite once I track it down again).

It reminds me of a local story involving moniker-manufacturing on a grand scale. During the late 1970s, the Oakland-Berkeley regional newspaper East Bay Express published an article by humorist Alice Kahn. In the article, Ms. Kahn coined the term “Yuppie.” So far as anyone could tell, she was the first person to use the term, which meme-exploded across the USA in a few months. In subsequent issues of the Express she turned it into an ongoing gag, because everybody she knew kept telling her, “We think you should sue” for rights to the term. Humor being an “open source” product first and foremost, she didn’t sue, but did “work it” for what it was worth.

Back to big data.  Here’s a quote from the article, given by Fred R. Shapiro, Associate Librarian at Yale Law School and editor of the Yale Book of Quotations:

“The Web…opens up new terrain. What you’re seeing is a marriage of structured databases and novel, less structured materials. It can be a powerful tool to see far more.”

This is exactly the point that Autonomy and other e-discovery firms such as Recommind make:  to analyze the full output of a given company, corporation or legal case, you now have to look at all of the data. That includes the easier-to-parse world of structured data, but more and more it includes social media, email, recorded telephone conversations and many other casual (but critical) information resources.


Big Data Meets Literary Scholarship

The New York Times published a very interesting update on how humanists are applying big data approaches to their scholarship (see The New York Times, January 27, 2013, p. B3). The article begins with a description of research by Matthew L. Jockers at the University of Nebraska-Lincoln. He conducted word- and phrase-level textual analysis on thousands of novels, enabling longer-term patterns to emerge in how authors use words and find inspiration. This kind of textual analysis revealed the impact of a few major authors on many others, and identified the outsized impact of Jane Austen and Sir Walter Scott.
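To give a feel for what word- and phrase-level analysis involves at its simplest, here is a minimal Python sketch. It is not the method used in the research described above, which the article does not detail; it only counts words and finds three-word phrases shared by two texts. The two sample sentences are invented stand-ins for full novels, and the function names are hypothetical.

```python
import re
from collections import Counter

# Invented stand-ins for two novels; a real study would load thousands of full texts.
TEXT_A = "The old house stood at the edge of the moor."
TEXT_B = "She walked to the edge of the moor at dawn."


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a single phrase string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shared_phrases(text_a, text_b, n=3):
    """List the n-word phrases that appear in both texts, in alphabetical order."""
    phrases_a = set(ngrams(tokenize(text_a), n))
    phrases_b = set(ngrams(tokenize(text_b), n))
    return sorted(phrases_a & phrases_b)


word_counts = Counter(tokenize(TEXT_A))
print(word_counts.most_common(1))
print(shared_phrases(TEXT_A, TEXT_B))
```

Scaled up to thousands of books, counts and overlaps like these become the raw material for spotting long-term patterns, though expert judgment is still needed to decide which patterns mean anything.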

Jockers said that “Traditionally, literary history was done by studying a relative handful of texts…what this technology does is let you see the big picture–the context in which a writer worked–on a scale we’ve never seen before.”

The implications for comparative literature and other fields that bump up against disciplinary boundaries are compelling. This kind of data analysis has long been the domain of sociologists, linguists and other social scientists, but it is increasingly finding a home in the humanities.

Steve Lohr, the Times article’s author, provides a number of other examples. One of my favorites is the research conducted by Jean-Baptiste Michel and Erez Lieberman Aiden, who are based at Harvard. They utilized Google Books’ graph utility (open to the public) to chart the evolution of word use over long periods of time. One interesting example: for centuries, the references to “men” vastly outnumbered references to “women,” but in 1985 references to women began to lead references to men (Betty Friedan, are you there?).

Studying literature on this scale is indicative of the power and potential of big data to revolutionize how scholarship is done. Indeed, the availability of useful data is subtly transforming humanist scholars to the point that interested humanists are gaining a new identity as computer programmers.

Lohr also points out that quantitative methods are most effective when experts with deep knowledge of the subject matter guide the analysis, and even second-guess the algorithms.

What is new and distinctive is the ability to ramp up the study of a few texts to a few hundred texts. The trick will be to keep the “humanity” in humanism.

I also draw considerable inspiration from the growing awareness that pattern recognition–a daily exercise for information professionals–is gaining new attention as part of the research process in general.

Perhaps it’s time for some of us to collaborate as co-principal investigators….

Metrics and Management: New Book, New Implications

The Org: The Underlying Logic of the Office, by Ray Fisman and Tim Sullivan (Twelve, 2013), was featured today on the first page of the New York Times Business Day section. The article starts out by comparing British Petroleum’s record as a government enterprise with its later record as a private corporation. (Pop quiz: what are the two biggest disasters BP created, and when did they occur?) This alone is intriguing and suggests the book is a good read, but the following quotes really caught my attention:

“The more we reward those things we can measure, and not reward the things we care about but don’t measure, the more we will distort behavior.”  -Burton Weisbrod, Northwestern University.

“If what gets measured is what gets managed, then what gets managed is what gets done.”  -Fisman and Sullivan

–With the implication that what is not managed not only will not get done, but may go wrong in unforeseen ways (Deep Water, anyone?).

These insights apply to bibliometrics as well as to management, particularly when it comes to measuring the quality and impact of library services that are not (sufficiently) measured.

The “Smart Campaign” of Election 2012: The “Visceral” Trumps the Data

Zeynep Tufekci published an excellent opinion piece in the New York Times on Sunday, November 16, 2012. She joined a cast of thousands (millions?) who are trying to make sense of the 2012 election. I liked her piece because she drills deep into the heart of big data and big politics, and focuses on the acknowledged leaders: the number crunchers of President Obama’s reelection campaign.

By now anyone who can read a blog on the Web has heard a lot about the Obama “ground game” and how, unbeknownst to all manner of pundits, particularly on the right, a huge swath of hitherto unknown voters, many of whom hail from nice towns in Ohio, could be induced to get out and vote. The problem, according to Ms. Tufekci, is how they got there: namely, by using state-of-the-art data analyses and harnessing the entire fire hose of personal data that is constantly aimed at us.

The striking success of Team Obama has left Gov. Romney himself, and many of the leading names in the right-wing aristocracy of punditry, gasping for breath. Well, it’s hard to resist a moment of schadenfreude after all the Rove years that we have survived. Even so, there was something poignant about the cluelessness of the Republican campaign; all of the tools available to the President, ranging from human know-how to data modeling, were available to them too.

But Ms. Tufekci caught my attention by focusing on the “personhood” of people, versus their value as digital factoids and a mere means to an end. She spells out a very well-stated argument for the importance of retaining individual liberty in every possible way, including liberty for our “digital footprints.” It would appear that the need for liberty goes further and applies to the Big Data Machine that is forever crunching away at our identities.

I agree with Ms. Tufekci on nearly all counts. After all, it’s patriotic to believe in checks and balances no matter where you fall on the political spectrum. Moreover, I admire the president and his team for the way they won, virtually in the face of a rhetorical tidal wave that seemed to suggest that the Republican ticket had a chance, when it turns out that it did not.

But here’s where I part company with Ms. Tufekci: despite all of the high tech wizardry of the president’s reelection team, I still strongly believe that it was the president himself who won the race, not the numbers racket. That said, it is quite possible that the Republican campaign did not do itself any favors by pushing a platform that seemed to have no place for gay, lesbian or transgender folks, African Americans, Latin Americans, Asian Americans, women of any political inclination, or progressive white males. There was a visceral sense of high stakes for many people in this election, and every person I spoke with (primarily in the San Francisco Bay Area) had a deep personal stake in the president’s victory.

Interestingly, the searing nature of the political rhetoric was so severe in its effects that normally indifferent voters and even conservative voters, in some cases, simply could not bring themselves to vote for the punitive platform of the Republicans. It just seemed so obvious that you, whoever you are—add any descriptors you wish, including blue-stater white male—couldn’t fit in the Republican view of the world. Karl Rove’s well-publicized Fox “meltdown” seemed to suggest that he didn’t know how appalling he had become to—well, everybody I ever met. So while the big data numbers game was important, in my view, nothing was more motivating than simply having to endure hearing the sorts of people you respect and love being basically regarded as unwanted “takers” and worse.

In the contest for what America was going to choose to be, it is very clear that the data helped, along with perhaps some pretty steely nerves in the Chicago headquarters. But in the end, the “informed electorate” of the Information Society exercised critical thinking. The result ran parallel with the data curve, but the gut-level motivation-derived-from-information was: vote for President Obama or say goodbye to any reasonable facsimile of a nation you once thought you were a citizen of.

So even while the Smart Campaign was a foundation for success, the 2012 election, in my opinion, was won on a more visceral level. And at least in the case of this election season, “personhood” seemed to be alive and kicking.

“From Bibliometrics to ‘Altmetrics’”

The latest C&RL News has a very useful article that describes how we are quickly moving beyond the traditional “journal impact factor” as a single, definitive means for ranking scholarly works. The article also explores new resources and strategies to rank and evaluate works in new media. This is not a new concept, but the ascendance of social media and new ways to publish online has accelerated, and as a result faculty members are much more concerned about how to establish credit for their work than they were just a few years ago.  Have a look at:
Robin Chin Roemer & Rachel Borchardt

“From bibliometrics to altmetrics: A changing scholarly landscape.”  C&RL News 73, No.10, November 2012, pp. 596-600.