The Library of Congress Loves Every Kind of Tweet, but It Can’t Search-Optimize Every Tweet
by Susana Polo | 4:19 pm, January 4th, 2013
Back in 2010, the U.S. Library of Congress announced that it would be archiving every public tweet made since 2006. They’re back again today, to say “Yeah, so. You guys tweet, like, a lot.”
In the nearly three years since announcing the initiative, the Library of Congress has actually managed to create its archive of every public tweet between 2006 and 2010, and has developed a stable system for ingesting everything that comes out of Twitter and saving it to its servers. That means that these days, it’s taking in about half a billion tweets per day. So what’s the problem, you ask?
Well, the Library of Congress, being a place where people do research, doesn’t feel like the archive is really up to its standards yet. Primarily because performing one search query on it can take about twenty-four hours to complete. From their announcement:
The Library has assessed existing software and hardware solutions that divide and simultaneously search large data sets to reduce search time, so-called “distributed and parallel computing.” To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost-prohibitive and impractical for a public institution.
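For the curious, the basic idea behind “distributed and parallel” search is simple enough to sketch in a few lines of Python. This is a toy with three made-up “servers” holding a handful of fake tweets, not anything resembling the Library’s actual setup: each shard is scanned at the same time, and the hits are merged at the end.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical mini-archive: each "server" holds one shard of tweets.
shards = [
    ["the doctor regenerates", "end of time spoilers"],
    ["library of congress", "archiving every tweet"],
    ["half a billion tweets a day", "search takes 24 hours"],
]

def search_shard(shard, term):
    """Scan a single shard for tweets containing the term."""
    return [tweet for tweet in shard if term in tweet]

def parallel_search(term):
    """Fan the query out to every shard at once, then merge the hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard_hits = pool.map(search_shard, shards, [term] * len(shards))
    return [hit for hits in per_shard_hits for hit in hits]

print(parallel_search("tweet"))
# → ['archiving every tweet', 'half a billion tweets a day']
```

With three tiny lists the parallelism buys nothing, which is rather the Library’s point: to make this trick meaningfully faster on hundreds of billions of tweets, you need hundreds or thousands of real machines, each holding its own shard.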
The Library’s ultimate goal is to create an archive that offers “free, indexed, and searchable access” to legislative researchers and scholars, and the fact is that the technology simply isn’t there yet. But the Library is working on it, with folks from Twitter and Gnip, the company that collects their tweets for them, and with researchers themselves.
But until they work things out, all that embarrassing stuff you tweeted while watching “The End of Time” and hope no one ever finds again is safe.