North Ridge Room
Wednesday July 11, 2012
Validating Quality in Large-Scale Digitization: Findings from Research on Image Error
From Project Gutenberg to Google Books, the large-scale digitization of books and serials is generating extraordinary collections of intellectual content that are transforming teaching and scholarship. Questions are being raised, however, regarding the quality and usefulness of digital surrogates produced by third-party vendors and deposited in digital repositories for preservation and access. For such repositories and their communities of users to trust that digital documents can meet the uses envisioned for them, repositories must validate the quality of these objects and their fitness for those uses.
The purpose of this paper is to synthesize some of the initial findings and highlight the implications of a major ongoing research project, “Validating Quality in Large-Scale Digitization,” which is being undertaken at the University of Michigan’s School of Information. The two-phase project is exploring the relationship between quality (or its absence in the form of unacceptable error) and usability of digitized books at scale. The paper will report on the first phase of the project (2011-12), in which we have designed and tested a model of digitization error at the data, page, and volume levels, and have applied the measures to statistically valid samples of digitized books. The error model measures the gap between the digitization ideal, represented by digitization best practices and standards, and the realities of repositories’ acceptance of digitized content produced by third parties.
The paper will present and interpret data gathered from three random samples of 1,000 digital surrogates that represent the full range of source volumes digitized by Google and other third-party vendors. Proportional and systematic sampling of page images within each volume in the samples produced a study set of over 350,000 page images, which were evaluated visually by highly trained coders working in two university libraries in different parts of the country. Using a web-enabled database system, coders assigned error severity scores for up to eleven possible errors in a carefully constructed model. The paper will describe the error model and then present findings on three research issues: (1) the ability of multiple coders to detect and rate page-image error reliably; (2) the distribution of page-image error across digitized volumes and sequentially within volumes; and (3) the co-occurrence of errors in the model. The findings on page-image error also provide a foundation for assessing the impact of error on underlying full-text data, on the readability of digitized books in an online environment, and on the subsequent management of physical print collections.
This research is funded by the US Institute of Museum and Library Services and The Andrew W. Mellon Foundation. Additional details about the project, including its metrics and progress, may be found on the project’s website: http://hathitrust-quality.projects.si.umich.edu/
As documentary photographs, real photo postcard (RPPC) street scenes are rich sites of cultural and social memory, valuable visual artefacts that deserve preservation and study. Street scenes are broad views of business districts and roadside attractions, and those from the period under study, 1930-1950, offer insights into popular culture, business activity, entertainment venues, population, race, gender and numerous aspects of social life. Walker Evans, a pioneer of American documentary photography whose renowned work appears in the 1941 Depression-era classic Let Us Now Praise Famous Men, said that postcard street scenes are “a veritable catalog of the American experience” (Willens, 2009). Evans began collecting postcards at the age of twelve, and amassed a collection of over 9,000 exemplars. Street scenes represented the largest category in his collection. My presentation will examine the research value of RPPC street scenes in the context of a method borrowed from anthropology called rephotography. Rephotography is “a method used to monitor physical changes to places and people” (Slyomovics, 2009, 460). Postcards are a record and mirror of movement, of growth, of flux of people and society (Black & Klein, 1996). RPPC street scenes are rapidly disappearing, and this presentation is intended to draw attention to the need to preserve them, physically and digitally, so that the social memories they represent are not lost to future generations.
Social Collecting as Community Archives
This presentation will explore the everyday personal and community archives built in social media. It draws on my dissertation project, in which I employ extended case study methodology to analyze collecting and curation on social media sites such as Twitter, Tumblr, Pinterest, YouTube, Flickr, and Facebook. The cases examined include an intergenerational local history group formed on Facebook; an unprecedented activist effort on YouTube; an accidental archive grown out of Twitter; and a vibrant community built around self-portraits on Flickr. My project is grounded in critical theories of archival science, but its findings can also make an important contribution to the design and development of social computing software.
The project I present is organized around these questions:
1. How do social media platforms facilitate collaborative collections? What are the prominent types of collections, and how do they function?
2. How do social media collections function as community archives?
3. What is the relationship between social media platforms and the collections built in them? How can archival practice accommodate these collections, and how can platforms be better designed to facilitate preservation?