[LIS-Forum] Inside The Library of Congress's Mission To Organize 170 Billion Tweets

5 Jan 2013

      WASHINGTON (AP) — The Library of Congress says it's amassed about 170
billion tweets since it began collecting an archive of all Twitter
messages in 2010.
Twitter is donating its archive to the library, going back to the
first one posted in 2006.
Library Director of Communications Gayle Osterberg wrote in a blog
post Friday that the volume of tweets the library receives has grown
from 140 million daily in February 2011 to nearly half a billion each
day in late 2012.
Librarians have been developing a system to preserve and organize the
collection. Now the library is shifting its focus to handle the
technical challenges of making such a massive archive available to
researchers.
The library may work with a private partner to provide access because
its own search technology is slow.
Inside The Library of Congress's Mission To Organize 170 Billion Tweets
By Sarah Kessler
January 4, 2013
Turning 170 billion tweets into a usable archive is a challenge that
the nation's largest library has never seen before. The Deputy
Librarian of Congress talks with Fast Company about where to start.

When the Library of Congress began archiving new tweets in February
2011, it would transfer about 140 million new tweets each day from
temporary servers onto magnetic tape. By the next October, that number
had soared to about 400 million tweets per day. There are now about
170 billion tweets in the archive, which also includes a collection of
all tweets going back to 2006 that the library acquired from Twitter
in 2010. The library’s two compressed copies of the data total 133.2
terabytes.
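As a rough back-of-the-envelope check on those figures, the sketch below
estimates the compressed size of a single archived tweet. It assumes the
133.2 terabytes are decimal terabytes, that the two copies are the same
size, and that the figure covers all 170 billion tweets; none of these
assumptions is stated in the article, and the function name is made up
for illustration.

```python
# Figures quoted in the article.
TOTAL_BYTES = 133.2e12   # 133.2 TB, assumed to mean decimal terabytes
COPIES = 2               # "two compressed copies"
TWEET_COUNT = 170e9      # about 170 billion tweets


def bytes_per_tweet(total_bytes, copies, tweet_count):
    """Average compressed bytes per tweet in one copy of the archive."""
    return total_bytes / copies / tweet_count


estimate = bytes_per_tweet(TOTAL_BYTES, COPIES, TWEET_COUNT)
print(f"about {estimate:.0f} bytes per tweet")  # about 392 bytes per tweet
```

Under those assumptions each tweet takes a few hundred bytes on tape,
which is more than 140 characters of text alone. That would be consistent
with metadata being stored alongside each message, though the article
does not say what the archive records contain.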
“It’s a few cabinets of tape,” explains Robert Dizard, the deputy
librarian at the Library of Congress. “It’s not a roomful or roomfuls.
And that’s just a testament of the storage capacity of tape.”
Since the library signed an agreement with Twitter in 2010 that gave
it access to historical tweets, a small team of its staff has been
working to establish a sustainable process for acquiring and storing
tweets. Now they’re transitioning their efforts to the significantly
more difficult challenge of providing access to the archive they've
built. Running a search term through a body of data as big as the
tweet archive, Dizard says, could take more than 24 hours. The
archived tweets are already indexed by time, but an hour of tweets
could contain millions of 140-character snippets--not so helpful for
someone doing research about a specific topic.
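To illustrate why an index by time alone does not help a topic search,
here is a minimal sketch comparing a full scan with a lookup in a term
(inverted) index. It is not the library's system, which the article does
not describe; the sample tweets, the hour-bucket key, and all function
names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical sample data: tweet id -> (timestamp, text).
TWEETS = {
    1: ("2012-11-06T20:15", "Election night results coming in"),
    2: ("2012-11-06T20:40", "Watching the game tonight"),
    3: ("2012-11-06T21:05", "Election results: too close to call"),
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_time_index(tweets):
    """Group tweet ids by hour, the kind of index the article describes."""
    index = defaultdict(set)
    for tweet_id, (timestamp, _text) in tweets.items():
        index[timestamp[:13]].add(tweet_id)  # "YYYY-MM-DDTHH"
    return index


def build_term_index(tweets):
    """Map each word to the ids of the tweets that contain it."""
    index = defaultdict(set)
    for tweet_id, (_timestamp, text) in tweets.items():
        for term in tokenize(text):
            index[term].add(tweet_id)
    return index


def scan_search(tweets, term):
    """Without a term index, every tweet has to be read and checked."""
    term = term.lower()
    return {tid for tid, (_ts, text) in tweets.items() if term in tokenize(text)}


def index_search(term_index, term):
    """With a term index, a search is a single dictionary lookup."""
    return set(term_index.get(term.lower(), set()))


time_index = build_time_index(TWEETS)
term_index = build_term_index(TWEETS)
print(sorted(time_index["2012-11-06T20"]))         # [1, 2]
print(sorted(scan_search(TWEETS, "election")))     # [1, 3]
print(sorted(index_search(term_index, "election")))  # [1, 3]
```

The hour bucket returns every tweet from that hour regardless of topic,
while the term index goes straight to the matching tweets. The trade-off
is that a term index over 170 billion tweets has to be built, stored, and
kept current, which is the kind of resource cost the article goes on to
discuss.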
The library has experience with large digital collections. It
regularly archives, for instance, websites, government databases, and
policy events. But Twitter is new territory. “It’s not only very
large,” Dizard says. “It’s expanding daily and at an increasing
velocity. The variety of tweets is high.”
Not even Twitter, which employs some of the best engineers in Silicon
Valley, has attempted to create a searchable archive of tweets. That’s
partly because the commercial demand for historical access pales in
comparison to that for real-time advertising. But the massive server
space and resources such a project would consume are certainly another
factor. Jamie de Guerre, VP Product at Topsy, a private company that
provides some access to the Twitter archive, compares the task of
indexing Twitter to indexing the entire Internet.
“Google’s index of the entire Internet ranges from about, in some
estimates, 45 billion web pages to 125 billion web pages,” he told
Fast Company in a recent interview. “So the size of Twitter is on the
order of the size of the Internet, just in tweets instead of web
pages. Having all of that data available, being able to query across
and return a large data file to a user is definitely quite a
challenge.”
In its first step to addressing the challenge, the library is talking
with third-party companies that could potentially manage access to the
archive. Given current resources, a solution for access is not likely
to be a search engine that can locate a specific tweet. Dizard says he
has no idea how a viable solution might look, but that abandoning the
project isn't an option.
“Our mission is to collect, preserve, and provide access to creative
and historical record of America,” he says. “We’re looking at Twitter
from a research and scholarship perspective as providing a reflection
of everyday life as well as showing the development and impact of
significant events. You also have the record and recordings of
individuals. Which are also valuable.”
“I look at Twitter as a start of what the Library will be doing in the
medium and long-term,” Dizard says. “Not a test of whether we’ll
collect social media at all.”

-- 
Warm regards.

Jayadev P Hiremath
Independent Libraries Professional
(Former Librarian - IBS,Kuwait)
E-mail      :  jayadevh@hotmail.com

LinkedIn  :  http://www.linkedin.com/pub/jayadev-p-hiremath/44/12a/2a0
Facebook  : http://www.facebook.com/people/Jayadev-P-Hiremath/603802230
