
WASHINGTON (AP) - The Library of Congress says it has amassed about 170 billion tweets since it began collecting an archive of all Twitter messages in 2010. Twitter is donating its archive to the library, going back to the first tweet posted in 2006. Library Director of Communications Gayle Osterberg wrote in a blog post Friday that the volume of tweets the library receives has grown from 140 million daily in February 2011 to nearly half a billion tweets each day in late 2012. Librarians have been developing a system to preserve and organize the collection. Now the library is shifting its focus to the technical challenges of making such a massive archive available to researchers. The library may work with a private partner to provide access because its own search technology is slow.
Inside The Library of Congress's Mission To Organize 170 Billion Tweets
By Sarah Kessler | January 4, 2013
Turning 170 billion tweets into a usable archive is a challenge that the nation's largest library has never seen before. The Deputy Librarian of Congress talks with Fast Company about where to start.
When the Library of Congress began archiving new tweets in February 2011, it would transfer about 140 million new tweets each day from temporary servers onto magnetic tape. By the next October, that number had soared to about 400 million tweets per day. There are now about 170 billion tweets in the archive, which also includes a collection of all tweets going back to 2006 that the library acquired from Twitter in 2010. The library’s two compressed copies of the data total 133.2 terabytes.
“It’s a few cabinets of tape,” explains Robert Dizard, the deputy librarian at the Library of Congress. “It’s not a roomful or roomfuls. And that’s just a testament of the storage capacity of tape.”
Since the library signed an agreement with Twitter in 2010 that gave it access to historical tweets, a small team of its staff has been working to establish a sustainable process for acquiring and storing tweets. Now the team is turning to the significantly more difficult challenge of providing access to the archive it has built.
Running a search term through a body of data as big as the tweet archive, Dizard says, could take more than 24 hours. The archived tweets are already indexed by time, but an hour of tweets could contain millions of 140-character snippets, which is not much help to someone researching a specific topic.
The library has experience with large digital collections. It regularly archives, for instance, websites, government databases, and policy events. But Twitter is new territory. “It’s not only very large,” Dizard says. “It’s expanding daily and at an increasing velocity. The variety of tweets is high.”
Not even Twitter, which employs some of the best engineers in Silicon Valley, has attempted to create a searchable archive of tweets. That’s partly because the commercial demand for historical access pales in comparison to that for real-time advertising. But the massive server space and resources such a project would consume are certainly another factor.
Jamie de Guerre, VP of Product at Topsy, a private company that provides some access to the Twitter archive, compares the task of indexing Twitter to indexing the entire Internet. “Google’s index of the entire Internet ranges from about, in some estimates, 45 billion web pages to 125 billion web pages,” he told Fast Company in a recent interview. “So the size of Twitter is on the order of the size of the Internet, just in tweets instead of web pages. Having all of that data available, being able to query across and return a large data file to a user is definitely quite a challenge.”
As a first step toward addressing the challenge, the library is talking with third-party companies that could potentially manage access to the archive. Given current resources, a solution for access is not likely to be a search engine that can locate a specific tweet. Dizard says he has no idea how a viable solution might look, but that abandoning the project isn't an option.
“Our mission is to collect, preserve, and provide access to creative and historical record of America,” he says. “We’re looking at Twitter from a research and scholarship perspective as providing a reflection of everyday life as well as showing the development and impact of significant events. You also have the record and recordings of individuals. Which are also valuable.”
“I look at Twitter as a start of what the Library will be doing in the medium and long-term,” Dizard says. “Not a test of whether we’ll collect social media at all.”
--
Warm regards.
Jayadev P Hiremath
Independent Libraries Professional (Former Librarian - IBS, Kuwait)
E-mail : jayadevh@hotmail.com
LinkedIn : http://www.linkedin.com/pub/jayadev-p-hiremath/44/12a/2a0
Facebook : http://www.facebook.com/people/Jayadev-P-Hiremath/603802230