Preparing Digital Surrogates for RLG Cultural Materials
Recommendations for digitizing
- Textual materials
- Pictorial materials
- Complex digital objects
- File naming
These are general recommendations, not absolute requirements. Since each digitization project is unique, there may be very good reasons for using alternative quality guidelines or for choosing a different approach (although we strongly recommend that you be consistent within a project). For RLG-funded digitization, please discuss intended variations from these recommendations with Ricky.Erway@oclc.org. For other projects, consider this guidance in the context of your organizational requirements and proceed accordingly.
- Avoid device-specific color space, format, headers, etc.
- Size and save page images at 1:1 scale to the dimensions of the original pages.
- For optimal sharpness, view images on the monitor at 100 percent (i.e., each pixel on the screen representing each captured pixel of the image). Evaluate an area of the image that depicts details and edges.
- Be sure the whole image area (with edges) has been scanned and no part of it has been cropped.
- Scan the image in the correct orientation or correct the image orientation in postprocessing.
- Avoid skew by placing the originals squarely on the scanner. Rescan a skewed image rather than rotating it after scanning.
- Check for artifacts such as dropout lines or pixels, banding, lack of uniformity, poor color registration, aliasing, flaring, and contouring.
- You can create just a digital image, or a digital image and a machine-readable text.
- Determine in advance if blank pages will be scanned.
- Create images that meet or exceed these characteristics:
|Printed texts and/or line drawings||600 dpi, 1-bit|
|Grayscale, half-tone, and other black-and-white illustrations||300 dpi, 8-bit|
|Color illustrated texts||300 dpi, 24-bit|
|Rare/early printed texts||300 dpi, 8- or 24-bit|
- Use Intel TIFF v 5.0 or 6.0 uncompressed or with lossless compression (ITU Group 4 for 1-bit or LZW for 8- or 24-bit).
- RGB or PhotoYCC are recommended as acceptable color spaces for digital masters.
- For machine-readable text, key in or use OCR text in ASCII, UTF-8, or Unicode, preferably corrected to at least 99.995% accuracy, and encoded (e.g., as specified in TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices, Version 1.0 (Digital Library Federation, July 1999)).
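The resolution and bit-depth targets above translate directly into pixel dimensions and storage needs. The following is a minimal sketch of that arithmetic, assuming a hypothetical 6 x 9 inch page scanned at 1:1 scale; the function names are illustrative and the size estimate ignores TIFF headers and any compression.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page scanned at 1:1 scale."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate size of the raw image data, ignoring file headers."""
    width_px, height_px = pixel_dimensions(width_in, height_in, dpi)
    # Each row is rounded up to a whole number of bytes, which matters
    # for 1-bit images whose width is not a multiple of 8 pixels.
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


# Hypothetical 6 x 9 inch page (an assumption, not a figure from the guidelines).
print(pixel_dimensions(6, 9, 600))        # printed text at 600 dpi
print(uncompressed_bytes(6, 9, 600, 1))   # 600 dpi, 1-bit
print(uncompressed_bytes(6, 9, 300, 24))  # 300 dpi, 24-bit color
```

For this assumed page size, the 300 dpi 24-bit image holds about six times as much raw data as the 600 dpi 1-bit image, which is worth knowing when planning storage for a mixed collection.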
Use these resolutions whether scanning from originals or intermediates:
|Black-and-white photos||400 dpi, 8-bit|
|Color photos||400 dpi, 24-bit|
|Slides or small negatives||Effective resolution of 400 dpi, 8- or 24-bit|
- Use Intel TIFF v 5.0 or 6.0 uncompressed or with lossless compression (LZW).
- RGB or PhotoYCC are recommended as acceptable color spaces for digital masters.
Where these recommendations offer a choice, make your decision based on the nature of the original. For example, spoken-word conversion requirements are sometimes lower than those for other recorded sound such as music. However, an old music recording may not merit such high-quality capture as an excellent spoken-word recording.
96 or 48 kHz; 24 bits
Bitstream: Uncompressed PCM
MP3 (aka MPEG-1/2 layer 3 audio)
|Master file||Component digital video bitstream (4:2:2 sampling rate) uncompressed. Note: the data rate for 4:2:2 is 270 Mbits/sec.|
|Service file||Compressed MPEG-2 files at pixel dimensions and data rates determined by contributor, possibly from a low of 1.2 Mbits/sec to a high of 15 Mbits/sec.|
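The data rates quoted above determine how much storage a video file needs. As a rough sketch, assuming a constant bit rate, decimal units (1 Mbit = 1,000,000 bits, 1 GB = 1,000,000,000 bytes), and no container overhead, the estimate could be computed like this; the function name is illustrative only.

```python
def storage_gigabytes(mbits_per_sec, minutes):
    """Estimate storage in decimal gigabytes for a constant-bit-rate stream."""
    total_bits = mbits_per_sec * 1_000_000 * minutes * 60
    return total_bits / 8 / 1_000_000_000


# One hour of video at the rates mentioned in the table above.
for label, rate in [("master, 4:2:2 uncompressed", 270),
                    ("service, low MPEG-2 rate", 1.2),
                    ("service, high MPEG-2 rate", 15)]:
    print(f"{label}: {storage_gigabytes(rate, 60):.2f} GB per hour")
```

Under these assumptions, an hour of uncompressed master video comes to roughly 121.5 GB, while the service files fall between about 0.5 GB and 6.75 GB per hour.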
Complex digital objects
- When digitizing component parts of an object, take care to maintain their relationships. For example, when capturing an album, consider, What is the relationship of the parts to the whole? Should each page be captured separately or should two pages be captured at once? Do the album pages have intrinsic significance, or is it sufficient to capture the images from each page? Is there a relationship between the spreads that should be maintained, or is an indication of sequence enough to recreate the experience of looking through the album?
- Provide structural metadata for complex digital objects to allow for navigation within the object. Preferably, use the Metadata Encoding and Transmission Standard (METS). If you do not use METS, include a link in the record to the text file (if there is one), and a start image and end image.
- Use file-naming schemes that are compatible across platforms and systems. Minimize the length of the name. Use only lower case characters a-z, numerical digits, and the following special characters: . _ - (period, underscore, and hyphen). Do not use spaces or any other special characters.
- Prefer a numbering scheme that reflects numbers already used in an existing cataloging system; if scanning precedes cataloging, use serial file names that will be incorporated into the catalog record.
- When developing a file-naming scheme, have a good understanding of the whole project. How many images will be scanned? Will they be stored in different directories? Are the files part of larger complex objects?
- Use standard file extensions (e.g., .tif, .wav, .mp3, .mpg, .rm, .txt, .sgm, .xml) in lower case only.
- Make sure the file references in your descriptive records match the file names (the extension may be omitted only if it is the same for every image). The case of file references in the descriptive records must match the case of the actual file names.
- Replicate the directory structure as referenced in the descriptive records.
- Don't overload directories with too many files.
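The character rule above is easy to check automatically before files are delivered. The following is a minimal sketch, not an official validation tool: the function names are hypothetical, and applying the rule to directory names as well as file names is an assumption made here for cross-platform safety.

```python
import re

# Lower-case letters a-z, digits, period, underscore, and hyphen only.
ALLOWED_NAME = re.compile(r"[a-z0-9._-]+")


def is_portable_name(name):
    """True if the name uses only the recommended characters."""
    return ALLOWED_NAME.fullmatch(name) is not None


def nonconforming_paths(paths):
    """Return the paths in which any component breaks the character rule."""
    bad = []
    for path in paths:
        parts = path.split("/")
        if not all(is_portable_name(part) for part in parts):
            bad.append(path)
    return bad


# Upper-case letters and spaces are both flagged.
print(nonconforming_paths(["mas014/004000g.tif", "mas014/Title Page.TIF"]))
```

Running a check like this over the file references in the descriptive records, as well as over the files themselves, would also surface case mismatches between the two.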
Naming a collection of thousands of simple objects (e.g., a photographic collection)
- Subdivide them in a meaningful way (by series or group) or by chunk (same prefix or in groups of a thousand).
- Use the reproduction, accession, or a serial number as the stem of the file name.
- Add a code for special features:
b for back, if scanning information on the back of a print
d for a detail of a larger image
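One possible way to combine these rules is sketched below. The six-digit zero-padded stem, the use of the leading digits as a per-thousand directory name, and the function name are all assumptions chosen for illustration; the guidelines leave these details to each project.

```python
def simple_object_path(serial, feature="", extension="tif", width=6):
    """Build 'chunk/stem[code].ext' for one image in a large flat collection."""
    if feature not in ("", "b", "d"):
        raise ValueError("feature must be '', 'b' (back), or 'd' (detail)")
    if serial < 0 or width < 4:
        raise ValueError("serial must be non-negative and width at least 4")
    stem = str(serial).zfill(width)
    # Dropping the last three digits groups the files by thousand.
    chunk = stem[:-3]
    return f"{chunk}/{stem}{feature}.{extension}"


print(simple_object_path(12345))       # front of print 12345
print(simple_object_path(12345, "b"))  # back of the same print
```

With this layout no directory holds more than a thousand stems, and the back or detail image always sorts next to its parent image.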
Naming a complex object such as a book
- Create a directory for each object, using an identifying string for the object as the directory name.
- If there is a text file for the whole object (e.g., an SGML file), use the same string in its file name.
- For page image file names, use a sequential image number followed by the printed page number (when present), both with leading zeros, to fit the pattern "cccpppf", where:
- "ccc" is the image control number. These first three digits are used to assign a set of sequential numbers to all of the images for the book. The first image from the book is assigned control number 001; it reproduces the book cover. Control number 002 might be the illustrated end paper, 003 might be a title page, etc. depending on the book. If a document-start target is provided, scan it and give it the file name stem, 000000. If missing pages are encountered, scan a "missing page" target and assign the relevant control number.
- "ppp" is the printed page number. These next three digits carry the actual printed page number with leading zeros. If the number is Roman, provide the Arabic translation. If there is no printed page number, use 000.
- Assign a code for special features:
g — Title Page (if the work has more than one, indicate the main title page)
n — Table of Contents (if more than one page, indicate all pages)
l — List of Illustrations (if more than one page, indicate all pages)
f — Illustration (not a page image including an illustration, but an additional image cropped to include only the illustration)
x — Index (if more than one page, indicate all pages)
y — Missing page or other irregularity target
Example: a book with the ID "mas 014" would be in a directory named mas014; it might contain these files:
mas014/004000g.tif (title page)
mas014/007000n.tif (contents cont.)
mas014/008003.tif (first numbered page)
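The "cccpppf" pattern can be generated mechanically once the control number, printed page number, and feature code are known. This is a minimal sketch with a hypothetical function name; it assumes at most one feature code per image and uses only the codes listed above.

```python
FEATURE_CODES = {"g", "n", "l", "f", "x", "y"}


def page_image_name(control, printed_page=0, feature="", extension="tif"):
    """Build a 'cccpppf' file name; printed_page is 0 when no number is printed."""
    if feature and feature not in FEATURE_CODES:
        raise ValueError(f"unknown feature code: {feature!r}")
    if not (0 <= control <= 999 and 0 <= printed_page <= 999):
        raise ValueError("control and printed page numbers must fit in three digits")
    # Roman page numbers should be converted to Arabic before calling.
    return f"{control:03d}{printed_page:03d}{feature}.{extension}"


# Reproduces the example files for the book in directory mas014.
print(page_image_name(4, 0, "g"))  # title page
print(page_image_name(7, 0, "n"))  # contents, continued
print(page_image_name(8, 3))       # first numbered page
```

Calling `page_image_name(0, 0)` yields the 000000 stem reserved for a document-start target.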
Naming a manuscript collection
- Create a directory for the collection, and subdirectories for each series, box, and/or folder.
- If there is a text file for the whole collection (or for each series, box, or folder) use the same string in its file name and place it in that directory.
- The page image file names will consist of a sequential image number with leading zeros. Since folders generally contain fewer than a thousand pages, you can use a three-digit number (including leading zeroes) for page-image naming. If a document-start target is provided, scan it and give it the file name stem, 000.
- Assign a code for special features:
b for back side of a page
s for start of a new document—since documents and pages are not equivalent, indicate when a new document (report, letter, etc.) begins by adding an s at the end of the file name for each image that represents the start of a new document
Example: a manuscript collection with the collection identifier stw would be in the directory stw, with the following subdirectories and files:
stw/corresp/81/23/001s.tif (first page)
stw/corresp/81/23/004s.tif (start of new document)
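The manuscript scheme can be sketched in the same way. The function name is hypothetical, and the sketch assumes a single feature code per image and forward slashes as the path separator, as in the example above.

```python
def manuscript_image_path(collection, subdirs, image_number,
                          feature="", extension="tif"):
    """Build 'collection/series/box/folder/nnn[code].ext' for one page image."""
    if feature not in ("", "b", "s"):
        raise ValueError("feature must be '', 'b' (back), or 's' (document start)")
    if not 0 <= image_number <= 999:
        raise ValueError("image number must fit in three digits")
    name = f"{image_number:03d}{feature}.{extension}"
    return "/".join([collection, *subdirs, name])


# Reproduces the example paths for the stw collection.
folder = ["corresp", "81", "23"]
print(manuscript_image_path("stw", folder, 1, "s"))
print(manuscript_image_path("stw", folder, 4, "s"))
```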
Choose one of these methods:
- On media: ISO 9660 CDs or TAR on DLT.
- For RLG to pick up via FTP: provide access to the directory structure as referenced in the records.
- By FTP to RLG: copy the file directory structure referenced in the records.
- METS: Metadata Encoding & Transmission Standard Official Web site
- Digital Library Federation, "Benchmark for Faithful Digital Reproductions of Monographs and Serials" (December 2002)
- Cornell University Library/Department of Preservation and Conservation, "Moving Theory into Practice: Digital Imaging Tutorial" (2000-2001)
- RLG, Digital Library Federation, "Guides to Quality in Visual Resource Imaging" (July 2000)
- Library of Congress, "Digital Audio-Video Repository System Support: Attachments to an RFQ, 5. Special Considerations For Digital Video and Audio"
- Digital Library Federation, "TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices," Version 1.0 (30 July 1999)
Recommended background for digitizing decisions
Selecting Library and Archive Collections for Digital Reformatting. Proceedings from an RLG Symposium Held November 5-6, 1995 in Washington, DC.
RLG Guidelines for Creating a Request for Proposal for Digital Imaging Services (pdf) RLG, 1997 (May 1998).
RLG Model Request for Information for Digital Imaging Services (pdf) RLG, 1997.
RLG Model Request for Proposal for Digital Imaging Services (pdf) RLG, 1997.
RLG Worksheet for Estimating Digital Reformatting Costs (pdf) RLG, 1997 (May 1998).
Anne R. Kenney and Oya Y. Rieger, Moving Theory Into Practice: Digital Imaging for Libraries and Archives RLG, 2000 (see RLG Programs Books and Reports).
Guides to Quality in Visual Resource Imaging Digital Library Federation (DLF) and RLG, 2000.
Steven Puglia, "The Costs of Digital Imaging Projects", RLG DigiNews vol. 3, no. 5 (October 15, 1999).
Imaging halftones: Anne R. Kenney and Louis H. Sharpe II, "Illustrated Book Study: Digital Conversion Requirements of Printed Illustrations", The Library of Congress Preservation (July, 1999).
Imaging from microfilm: Louis H. Sharpe II, et al., Library of Congress Manuscript Digitization Demonstration Project Final Report October 1998.
Selection, preparation, capture, metadata, archiving: Joint RLG and NPO Preservation Conference: Guidelines for Digital Imaging, September 1998.
RLG Working Group on Preservation Issues of Metadata, Final Report RLG, May 1998.
Franziska Frey, "Digital Imaging for Photographic Collections: Foundations for Technical Standards", RLG DigiNews, vol. 1 no. 3 (December 15, 1997).
Howard Besser and Jennifer Trant, An Introduction to Imaging, Getty Information Institute, 1995.
TEI: The TEI Guidelines TEI, 2001.
TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices Version 1.0 Digital Library Federation, July 1999.
Alan Morrison, Michael Popham, and Karen Wilkander, Creating and Documenting Electronic Texts: A Guide to Good Practice AHDS Guides to Good Practice, 1998.
Bruce Fries with Marty Fries, The MP3 and Internet Audio Handbook TeamCom Books, 2000: Chapter 11, "A Digital Audio Primer" and Chapter 12, "Digital Audio Formats"
Dave Anderson, The PC Technology Guide: Digital Video (2002).
Digital conversion service bureaus
RLG did not endorse these service providers, but received positive reports from those who had used them.
Apex CoVantage ePublishing Solutions
120 Presidents Plaza
198 Van Buren Street
Herndon, VA 20170
Contacts: Margaret Boryczka or Tom O'Brien
text conversion, SGML markup, EAD
Backstage Library Works
1180 South 800 East
Orem, Utah 84097
Contact: Jodi Moore, Marketing Manager
on-site/off-site scanning; text/prints/transparencies/realia; oversize; bound; data conversion; metadata processing; OCR
Bar-Hama Blumenthal Digital Photography
450 Park Avenue
New York, NY 10022
Contacts: Ardon Bar-Hama or George Blumenthal
on-site, high resolution digital photography of rare books & manuscripts
Boston Photo Imaging
20 Newbury Street
Boston, MA 02116
Contact: David Sempberger
Data Conversion Laboratory, Inc.
61-18 190th St., 2nd Floor
Fresh Meadows, NY 11365
Contact: Shavy Schwimmer, firstname.lastname@example.org
scanning, OCR and text entry, SGML
Direct Data Capture Ltd (UK and NY)
73 B Ormskirk Business Park
New Court Way
L39 2YT, UK
Phone: 01695 570707
bound volume/microfilm scanning, text conversion
Higher Education Digitisation Service
University of Hertfordshire
AL10 9AB UK
Phone: +44 1707 286078
digitization of all manner of originals
Innodata Content Services
Three University Plaza
Hackensack, New Jersey 07601
Contact: Joan Meyer, email@example.com, or Steven Keyes, firstname.lastname@example.org, or Jan Palmen
data aggregation and conversion, XML transformation, OCR, and image scanning
Input Solutions, Inc (ISI)
Contact: John Solomon
scanning and conversion, microfilm, oversize, text, SGML
New York Production Center
231 W. 29th Street
New York, NY 10001
Contact: Anthony Troncale
high-quality digital reproductions of pictorial works, including line and photographic images and manuscripts; specializing in conversion of large collections
Kirtas Technologies, Inc.
7620 Omnitech Place
Victor, New York 14564-9782
Phone: (585) 924-2420, ext. 3008
Contact: Michael Maxwell, Director of Worldwide Sales
Non-destructive, high quality, inexpensive, bound document scanning (on and off-site) of books, journals, magazines, lab notebooks, etc. with OCR and metadata capture capabilities
Luna Imaging, Inc.
3542 Hayden Ave., Bldg. One
Culver City, CA 90232-2413
film and print scanning, direct digital photography, image editing and post-production, on-site services, image studio/workflow consulting
9 Commerce Way
Bethlehem, PA 18017
6700 Corporate Dr.
Kansas City, MO 64120
text conversion, SGML
Systems Integration Group, Inc.
9701 Philadelphia Court
Building 17, Suite A
Lanham, Maryland 20706
on-site/off-site document scanning, text conversion, SGML
Two Cat Digital, Inc.
14717 Catalina Street
San Leandro, CA 94577
Contact: Howard Brainen, email@example.com
film and print scanning, direct digital photography, image editing, bulk image processing services, automated systems, image databases, on-site services, digital imaging consulting
Suggestions if you've already created your digital surrogates
The following were suggestions for the quality and format of the files already digitized. These were not requirements.
Formats and compression: In general, you'll probably want to keep a TIFF (Tagged Image File Format, version 5 or 6 with Intel headers) version of the image with lossless compression (ITU 4 for black and white or LZW for grayscale or color) or no compression, but a JPEG compressed image will suffice for contribution to RLG Cultural Materials. Alternatively, PhotoCD images may meet your local needs, and JPEGs can be created from those images for contribution to RLG Cultural Materials.
|Black-and-white text and line art||300-600 dpi bitonal|
|Halftone illustrations||300-400 dpi, 8 bpp or 24bpp|
|Oversized (e.g., maps or posters)||300 dpi bitonal, 8 bpp or 24 bpp|
|Manuscript page images||300-400 dpi, 8 bpp (24 bpp for color, tinted, or discolored originals)|
|35mm photographic negatives or slides (reverse polarity if negative)||3000 pixels in long dimension, 8 bpp or 24 bpp|
|Photographic prints and transparencies (4x5, 6x8, 8x10)||4000-6000 pixels in long dimension, 8 bpp or 24 bpp|
|Printed page (OCR or rekey)||99.95% accuracy as compared to original||ASCII 7- or 8-bit||HTML, XML, SGML, RTF|
|Compound document, in Portable Document Format (PDF)||Text and images as indicated above|
Audio and motion
Formats and compression: Any of these are acceptable: Microsoft Wave (.wav), MPEG (.mp3, .mpg, .mpeg), "Audio Video Interleave" for Windows (.avi), QuickTime (.qt, .mov), RealMedia (.rm, .ra, .ram).
|Spoken word||11-22 kHz sampling, 16 bit, mono|
|Music||44.1 kHz sampling, 16 bit, stereo|
|Video||320x240 30 fps/1.2kbps|
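For uncompressed PCM audio, such as a typical .wav file, the sampling figures in the table above determine file size directly. The sketch below shows the arithmetic; the function name is illustrative, the 11.025 kHz spoken-word rate is an assumed value at the low end of the stated range, and the result ignores file headers and does not apply to compressed formats such as MP3.

```python
def pcm_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Size in bytes of uncompressed PCM audio data, excluding file headers."""
    return sample_rate_hz * bits_per_sample * channels * seconds // 8


# One minute of audio at the settings listed in the table above.
print(pcm_bytes(11025, 16, 1, 60))  # spoken word, mono (assumed 11.025 kHz)
print(pcm_bytes(44100, 16, 2, 60))  # music, stereo
```

At these settings a minute of stereo music takes about 10.6 MB, eight times the space of a minute of low-rate mono speech, which is one reason to match the capture settings to the nature of the original.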