Data Discovery and Data Curation Go Hand in Hand

In just a few short years, data curation has been widely embraced by the profession and is recognized by many as an emerging core competency. The reasons are many, but the power of the web as a platform for mashing up diverse data sources is certainly a key factor. New government regulations require researchers to share data compiled in grant-funded research, which also provides a powerful incentive for taking a fresh look at how data can be preserved. In 2011, the Association of Research Libraries published an excellent summation of the potential of data curation for the library profession, titled “New Roles for New Times: Digital Curation for Preservation” (see http://www.arl.org/bm~doc/nrnt_digital_curation17mar11.pdf). This report was prescient in arguing that the volume of data and the need to preserve it are opening new opportunities for librarians to take center stage as collaborators.

Exciting times to be sure, but with all the new energy surrounding data curation of web- and crowd-sourced information, it is important to remember that new discovery techniques can also uncover fresh value in conventional data resources, particularly those that are generated by public mandate. For my part, I believe that there are significant “sleeper cells” of useful data—much of it gathered by public institutions—and these data can add value when they are added to born digital, linked data sets.

Many public information databases are compiled with a single need in mind: regulating construction permits, monitoring the growth of electrical grids, and so on. These data are often in digital formats, and they can be added to web- or cloud-based resources and used in ways that may not have been foreseen by the agencies that compile the data. The trick is to recognize not only what the primary goal for collecting is, but also to discover what value the data might have in different contexts. With that in mind, I will offer two examples of how data resources can empower new ideas in the broadest sense, and I will also share an old-fashioned data acquisition story “from the trenches.” The story shows how local data gathered by a public agency made the crucial difference in a research project—and suggests how it might gain value as part of larger-scale data analysis.

Big Data, Big Results

One of the best aspects of working with linked data is the ability to combine diverse sources of information and then extrapolate more nuanced meaning from the improved data set. This trend is accelerating, and currently it focuses on “new” and exciting areas such as crowd-sourced data generation and online consumer behavior-tracking. Rightly so: President Obama’s reelection campaign used data-driven strategies alongside its political and rhetorical vision, to considerable advantage. The 2012 U.S. elections proved beyond a doubt that smart data, carefully deployed, was worth more than the hundreds of millions of dollars that were hurled at the general electorate. The overall electoral cycle demonstrated that big data is recognized by politicians and entrepreneurs, as well as academics.

In the academic sphere, big data have created all-new approaches to research. The New York Times published an interesting update on how humanists can now analyze thousands of online novels (see The New York Times, January 27, 2013, p. B3). The article describes how Matthew L. Jockers at the University of Nebraska-Lincoln conducted word- and phrase-level textual analysis of digital books to study long-term language patterns. The much larger sample revealed not only how authors use words, but also how they inspire other authors over the years. One surprise finding: a relatively small number of authors have had an outsized impact on other writers, with Jane Austen and Sir Walter Scott at the forefront. This analytical approach is groundbreaking, insofar as it goes beyond the limitations imposed by much smaller samples of literature. The data application enables researchers to place authors in a larger historical context in ways that were not possible before.

Data-driven political campaigns and large-scale literature analysis demonstrate the blue-sky nature of big data, along with the attendant opportunities to curate the data that are being produced. Yet even as the new frontier expands at a rapid rate, it is still possible to find value in existing data sources. In my opinion, big data applications and data curation will reach their fullest potential when all sources, both old and new, are reexamined with the new tools.

New Value from Not-So-New Data

Not all data worth curating are born on the web. Agencies that oversee construction variances, hospitals, nursing homes, public works, and public health all gather data, but in many cases, their charge is to gather data for a single, specific purpose. The expected “data deliverable” might be tabular information for policy makers and urban planners, flowing from the stream of new construction permits, or other relatively mundane activities. It is easy to assume that such data may be well-targeted, but do not have transferable value. The following example of wage research proves the opposite.

During the 2012 election season, one of our researchers was monitoring “living wage” campaigns across the country and was very interested to see how they would fare. In the political discourse surrounding this issue, many voices argue that increasing the minimum wage is bad for business, raising costs and placing a burden on small firms in particular. Others argue that increasing low wages in nominal increments (75 cents, for example) has a negligible effect on the economy and yet helps household incomes significantly. Our researcher wanted to assess the actual performance and policy ramifications of living wages to shed light on the debate, and needed help.

He needed to gather employee data on every fast food restaurant in a specific metropolitan region. Easily accessible sources indicated that there were more than 3500 establishments in all. Yet within that category, movie theaters, gas station convenience stores, and other purveyors of food-on-the-go needed to be winnowed out. None of the obvious data sources could provide such a pinpointed sample.

One of the library staff contacted the county agency that monitors food safety in restaurants, and eventually got through to their information technology department. She learned that the agency had detailed data on every establishment, including the exact number of employees at each location. This was the data our researcher needed to analyze low wage market dynamics and write a policy brief—just three weeks before the election.

The agency monitors restaurants for compliance with public health regulations. But (and this is a big but) that is literally all they are concerned about. They gather detailed data, but the data are only of interest when they find a safety infraction and must fine the offending restaurant. In our case, we had no interest in restaurant health and safety, but we very much wanted to know employee counts at every restaurant location. This sample would be useful as a basis for testing how living wage policies played out “on the ground.” The agency had exactly what we wanted, and we asked if they would be willing to share the data set with us.

The IT manager agreed, with the proviso that no information about regulatory compliance would be sent to us: just the whole list of restaurants and their employee counts. Once this was agreed upon, it took a few days to receive a data file that had all of what we wanted.

These data provide a comprehensive resource for labor economists, and they will retain their value over the long term. Moreover, good relations with the regulatory agency have established a foundation for receiving data updates periodically. The dataset will also have added value if it is mashed together with other resources, such as state- and national-level employee data, or coupled with web- and cloud-based news and information about restaurants in the region.

Curate—But Counsel Too

This reference story drives home the fact that even while we are moving full-speed into an era when crowd-sourced, web-crawled, and tagged data are creating wholly new avenues for research, value still remains in ongoing data acquisition programs. Many public agencies produce data, and more often than not, they are well-managed and have a service mentality. When locally-gathered data of this nature are obtained and merged with other larger sources, the specificity of the local enriches the “big picture” that big data can reveal.

The emergence of big data research practices, which is revolutionizing how people parse data sets large and small, can actually strengthen the impact of library discovery skills. As a result, information professionals stand to benefit not only through digital curation and involvement in big data analysis, but also through the ongoing practice of reference and resource discovery. Because of this, I believe that it is important to promote our research and discovery acumen in the same manner that we are currently promoting the library as the “solution lab” for data curation. As admirable as that effort is, curation alone is, in my opinion, just half of the needed strategy. The crucial balance may be found by remembering that the skills inherent in reference work (discovery, pattern recognition, and analysis) offer a powerful means to convey our value proposition not only as data curators, but also as information counselors with advanced data acquisition skills.

This column appeared in Computers in Libraries, Vol. 33 (No. 3), April 2013.

Duking it Out in the E-Book’s “Wild West” Marketplace

(NOTE: This article appeared in Computers in Libraries, Vol. 33 (No. 1), Jan.-Feb. 2013. In light of current litigation, I’m posting it to information | mixology.)

 

The e-book is a new medium, but it follows many other breakthrough products with histories of disruption, adoption, market acceptance, and the forging of new business relationships. Perennials such as CD-ROMs, DVDs, and iPods come to mind, as each of these new technologies triggered important changes in commerce and entertainment. The disruption was real and has caused serious distress for publishers, but there is no getting around the fact that we are in a new era now. Publishers have gained expertise in digital media and are engaged in intensive experimentation. They are taking big risks with e-books and trying innovative new pricing models. And they are playing a tough game to protect revenue.

The e-book market is moving at “warp speed” and it is hard to stay abreast of events. Fortunately, librarians have been lucky in our leadership. The American Library Association has been very assertive in advocating for the most expansive model for acquiring and loaning e-books. The result has been a lot of “dialogue,” some tough new policies from the largest publishers, and a sense that it is hard to know what is going to happen next. Authors are involved too, and have their own turf to protect.

With so much ferment, what strategies should librarians adopt to become central to the e-book market? Also, what are the best avenues for revitalizing our long-term relationships with publishers? I see two fundamental strengths that might inform our actions. The first is our close relationship with our user communities. The second is a combination of two related sources of knowledge: how to perceive the e-book “market” from a user perspective, and how to collaborate. A collaborator understands the importance of keeping a balance between open access and making a profit, and that kind of awareness may be the “glue” that keeps libraries and publishers in conversation. Even so, the next few years might be bumpy for e-book collaborators. Here are a few signposts of the times, and some thoughts about where we are collectively going.

Borrowing, Buying and Both

2012 has been a year for learning a lot about e-books, and for recognizing that we need to know even more. We need better data, too. Blogger Jeremy Greenfield is one great source of intelligence. He is a journalist who follows the e-book industry, both on his own blog and for Forbes (see digitalbookworld.com). In June 2012 he reported on something many of us probably know: libraries and publishers don’t understand each other. Publishers don’t “get” the operational side of libraries, and ALA President Molly Raphael allowed that librarians have more to learn about the e-book market and its effect on publishers and distributors. The result has been a great deal of dialogue, and that’s a good thing. We are talking intensively to publishers to advocate for better access to e-books and reasonable curbs on pricing.

We have good company in our quest to understand how to deploy e-books, too. Greenfield looks to the research findings of the Pew Internet & American Life Project, which has long been a leading voice in analyzing disruptive technologies. Pew reports that e-book consumers are likely both to buy and borrow e-books. What a fascinating approach; it suggests that there is personal joy and comfort in “owning” a digital copy of, say, Tolkien’s The Silmarillion, while you might just want to “borrow” a new book by Dean Koontz to have an enjoyable, one-time read.

So: buying, borrowing, and both. It’s a user’s solution to a complex market, and it works. As an iPad Mini user, I enjoy checking out what e-books other people keep on their tablets whenever they allow me to have a look. Here’s what I find: a collection of favorite books that tends to grow, slowly but surely. I also see a smart shopper’s independent streak concerning “where” they do their buying and “borrowing.” The Kindle app is widely used on iPads, even though it is an arch-competitor to iBooks. This open-minded “collection building” and shopping suggests a deep love for the artifact, even in digital form, as well as interest in value-shopping in every direction.

It’s hard not to love books that enrich our lives, and it looks like the digital version, read on a retina screen, preserves that crucial value. But “process” matters too. Some people favor bookstores, others prefer Amazon Prime, and still more troll through Apple’s ecosystem of goodies. Many are still in the process of deciding what they like best. That’s a very big unknown factor for publishers, and our familiarity with user behavior gives us an edge. For example, a friend of mine recently bought a Nook e-reader, and set herself up with a library of favorite titles and authors. Within a month, she was back to print, because she didn’t enjoy the process and experience of e-reading.

The Risk of Rhetoric

It would be an understatement to say that libraries and publishers are worlds apart in how they approach the challenge of e-reading. One of the single biggest risks is that frank and committed dialogue might give way to rhetorical warfare. Many advocates of open access already feel that large publishers, such as HarperCollins with its 26-times-only policy, or Random House, with its mammoth price increase to recover for simultaneous and persistent access, have gone over to the “dark side.” Fortunately, ALA has taken a lead in trying to forge common understandings, which is helpful, since Jeremy Greenfield reports that 67.2 percent of libraries have been loaning e-books since 2011. Moreover, experimentation is essential, but publishers face a serious obstacle: they cannot collaborate to set prices without facing antitrust litigation. In the resulting free-for-all, every e-book publisher must come up with its own pricing plan. In some ways, the current e-book market has a wild west, “Dodge City” feel.

What strategies can librarians use in the “Dodge City” of e-book pricing? I can think of two. First, stay close to our user communities and make sure they know that we are advocating for them. Second, keep a place at the table to debate a fair balance that addresses the needs of publishers, distributors, and libraries as collaborators.

“Windowing”

Other entertainment industries offer some guidance on how to sell and how to price, particularly cinema and music. But once again it is worth noting that conditions change fast. The iTunes Library faces competition from subscription services such as Spotify, and the market may change in the near term. But some of the lessons learned may be worth a look. Blogger Mike Shatzkin reports that Hollywood has perfected the art of “windowing,” that is, delaying the release of DVDs until new movies have had a chance to earn their keep at the box office. Movie studios are reluctant to hand over their entire catalogs to Amazon and Netflix, for good reason, if they can still sell DVDs first.

The windowing approach is an intriguing alternative for publishers, distributors, and libraries, but it has some built-in shortcomings. Most people want to read their favorite authors right away, and many people (myself included) reserve copies of new releases months before they appear in print. Would library patrons accept waiting one or two years to borrow an e-book? That seems like a stretch. My theory, therefore, is that publishers can certainly try a windowing approach, delaying the release of e-books and perhaps employing a significant markdown, but I think they may face a reader backlash. Social media give activists a very handy tool for registering their dissatisfaction. Perhaps the e-book market will spawn a “reader’s guild” of activists, who could use the power of social media to shape policy.

What’s Needed: Unity

My first career was in independent bookselling, and for that reason I follow the publishing business closely. I find that the many “year of the e-book” debates that are running at full steam follow common threads that go back as far as the release of the mass-market paperback, which was seen as a force of doom for publishers but proved to be anything but that. The eventual outcome of the e-book debate carries high stakes for publishers, distributors, and libraries, but there is some good news too, showing up among all three stakeholders. Publishers have become much more skilled in handling digital media, and this is making them less conservative. We can now expect some healthy innovation from them. Distributors are crucial players in the sales process, and they have gained more clout. Perhaps they too will push back on pricing and access restrictions as a form of self-preservation. If so, this may help consumers. Librarians have become the most articulate advocates for the importance of open access and fair use; we have done our homework and have a compelling “social good” to use as a rallying cry.

Each group has gained through innovation, and yet each has more to learn about a very important function of markets: mutual benefit. At a time when the e-books debate threatens to push players into armed camps, it is vital to find common ground and build unity. If we fail to do that, we should have the courage to admit that the real losers will be readers themselves, who rightly expect us to do a better job of managing the emerging e-book market.