Library of Congress Partnership to Preserve Federal Agency Websites

The Library of Congress, the California Digital Library, the University of North Texas Libraries, the Internet Archive and the U.S. Government Printing Office recently announced a collaborative project to preserve public United States Government web sites at the end of the current presidential administration, which ends January 19, 2009. This harvest is intended to document federal agencies’ online presence during the transition of government and to enhance the existing collections of the five partner institutions.

As part of this collaboration, the Internet Archive will undertake a comprehensive crawl of the .gov domain. The Library of Congress has been preserving congressional Web sites on a monthly basis since December 2003 and will focus on development of this archive for the project. The University of North Texas and the California Digital Library will focus on in-depth crawls of specific government agencies.

The project will also call upon government information specialists (including librarians, political and social science researchers, and academics) to assist in selecting and prioritizing the web sites to be included in the collection, and in determining how often and how deeply each site is collected. The Government Printing Office will lend expertise to the curation process, along with libraries in its Federal Depository Library Program. A tool designed by the project team and developed by the University of North Texas will facilitate the collaborative work of these specialists; it will be made available to participants in August 2008.

“Digital government information is considered at-risk, with an estimated life span of 44 days for a web site. This collection will provide an historical record of value to the American people,” said Director of Program Management Martha Anderson of the Library of Congress’ National Digital Information Infrastructure and Preservation Program (NDIIPP).

The Library of Congress, the world’s preeminent reservoir of knowledge, is leading a nationwide program to collect and preserve at-risk digital content of cultural and historical importance. The program, formally called the National Digital Information Infrastructure and Preservation Program, is building a digital preservation network of partners. More information about the Library’s Web Capture program is available online.

The California Digital Library leads the NDIIPP-funded Web-at-Risk project, which is developing tools that enable librarians and archivists to capture, curate, preserve, and provide access to web-based government and political information. In partnership with the University of California libraries, the California Digital Library established the digital preservation program to ensure long-term access to the digital information that supports and results from research, teaching and learning at the university.

The University of North Texas Libraries, as part of the Federal Depository Library Program, created the CyberCemetery in 1997 to capture and provide permanent public access to the web sites and publications of defunct U.S. government agencies and commissions. The University of North Texas participates in NDIIPP as a partner with the California Digital Library in the Web-at-Risk project, focusing on the selection of materials for capture and preservation. The U.S. Government Printing Office manages the Federal Depository Library Program and is charged with providing permanent public access to government publications.

The Internet Archive is a high-tech nonprofit, founded in 1996 by Brewster Kahle as an “Internet library” to provide universal and permanent access to digital information for educators, researchers, historians, and the general public. The Internet Archive captures, stores and provides access to born-digital and digitized content, and leads the development of Heritrix, the open-source archival web crawler that is being used to collect web data for this project.

On March 27, 2008, the National Archives and Records Administration (NARA) issued a memorandum stating that it would not conduct a “harvest” of federal agency websites as they exist at the end of President Bush’s term, as it did in 2001 and 2005. In response to concerns expressed by stakeholders, on April 15 the National Archives issued a further clarification, stating that “each agency is now responsible, in coordination with NARA, for determining how to manage its web records, including whether to preserve a periodic snapshot of its entire web page.”

The March 27 memo cited cost concerns and the fact that non-governmental websites provide federal website snapshots as reasons for NARA’s decision. The memo also pointed out that there is no NARA requirement for agencies to perform their own web snapshot at the end of the Administration if NARA does not conduct one.

In its April 15 statement, NARA expressed concern that some federal agencies would assume, “If NARA is taking a web snapshot, why does our agency need to manage our own records?” NARA also questioned “the permanent archival value of a Federal agency web snapshot taken on one random day near the end of a Presidential term.”

NARA stated that it would conduct a web harvest of Congressional web sites, since the Federal Records Act does not cover them. In addition, NARA will receive a web snapshot of the White House website, since it is considered a permanent record under the Presidential Records Act.