Some weeks you talk to your colleagues and watch a documentary and feel unprepared to understand the world. This was one of those weeks. Ross was back in Camden and the last time he was here he had helped me take baby cloud steps while he got started on WikiReverse.
Two and a bit years later WikiReverse has been launched, and this is what I learnt.
There is this thing. It's called the Common Crawl.
Really – you have crawled the internet again. Again? Haven't we been doing this since gopher was mistaken for a red squirrel and everything was still in black and white, or was that black and green?
Unlike Ross, I wasn't really that excited when, back in 2012, Common Crawl, using Nutch, started dumping everything it found on the Internet into Amazon Web Services (AWS), making it accessible for free (yes, as in freedom).
But today I am much more interested in the Common Crawl project because…
Having this Data in the Public Domain Matters.
Yes, we understand that the Western view of the Internet is monopolised by Google: where once we used Yahoo and Excite, we now just use Google. But it's not actually true the world over. In China they use Baidu, and in Prague (while trying to find the Fifty away-day venue) the concierges used a local engine, telling me that Google was too slow. Still, the idea that Common Crawl may provide an alternative to the now-evil Google seems to intrigue the press and people wanting to fight back.
Rather than being a "rebel fighter sitting in an unmarked, unconnected, no-power cave wearing a tinfoil hat", you could be a rebel fighter sitting in your unmarked, ungooglable cave wearing a tinfoil hat and white Y-fronts (don't forget those), writing the next search engine. Unlikely. Not least the white Y-fronts.
Actually, my issue (no, it's not underwear) is that the pace of technology change is too slow and a bit hit-and-miss. Therefore any attempt to democratise and enable change should be supported. If this means the coders need historic web data, then let them have it. Especially if you are giving it to a twelve-year-old hacker who…
Does Not Want to Work at Google.
We have all watched The Internship – who wants to go through that? Well, actually, at some point clever people with new web-data ideas felt compelled to work at Google, as it was the only place to get access to search data. A monopoly on work is something I do have a problem with; humanity needs diversity and research. Research. In some ways…
Common Crawl is for the Academics
My initial lack of excitement was because I like applications, and I couldn't really conceive of one built on Common Crawl with its limited data sets. But I can now see a world of new academic research based on Common Crawl, which should give rise to lots of new ideas.
Academia asks questions, and here we can start to ask questions that a straight search of current content on the web would never answer. In fact…
It's about Time-Travel Matrix-Algebra Searching
A mathematical matrix is an array, and if you could organise a search array where one of the dimensions is a web page changing over time, then what questions would you ask?
Maybe you need to compare multiple editions of the exact same page. Or maybe you need to monitor how a meme has rippled out through the web over time. Lisa Green, Director of Common Crawl, wants to determine the etymology of words like the fabled "selfie". Like Kleppman, you could collate the "best" articles on a subject and the best counter-arguments. Like adwf, I have always wanted to fact-check the top news sites in a quick search. Or, as Kleppman puts it, know in real time why the thing you're reading is full of shit.
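To make the "time dimension" idea concrete, here is a toy sketch: a tiny "matrix" of page text indexed by (URL, crawl date), with a query that scans along the time axis to find when a term first appears. Everything in it – the URLs, dates and text – is invented for illustration; a real version would run over actual crawl snapshots.

```python
from datetime import date

# Hypothetical snapshots: one axis is the URL, the other is the crawl date.
# All entries are made up purely to illustrate the shape of the data.
snapshots = {
    ("example.org/news", date(2008, 1, 1)): "local elections coverage",
    ("example.org/news", date(2013, 1, 1)): "best selfie of the year",
    ("example.org/blog", date(2013, 6, 1)): "why selfies matter",
}

def first_sighting(term):
    """Scan the time axis: return the earliest crawl date where `term` appears."""
    hits = [d for (_, d), text in snapshots.items() if term in text]
    return min(hits) if hits else None

print(first_sighting("selfie"))  # earliest date the word shows up in this toy corpus
```

The point is not the trivial loop but the index shape: once page content is keyed by time as well as URL, etymology-style questions become ordinary queries.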
OK, excited? Don't be. Let's put to one side how hard this would be (it's OK, we shall get those clever postdocs for free, right?), and also leave aside that it's time-consuming. We have a real limitation:
We Need More Crawl Data, More Often
As Kleppman says, with a nice graph:
“The CommonCrawl data set is about 35 TB in size (unparsed, compressed HTML). That’s a lot, but Google says they crawl 60 trillion distinct pages, and the index is reported as being over 100 PB, so it’s safe to assume that the CommonCrawl data set represents only a small fraction of the web”
Zhao notes that the interesting and valuable parts of the web won't be well represented in Common Crawl's data: anything behind passwords or inside social media can't be crawled. On the flip side we are ….
This is overly melodramatic and conjures up strange images of data warlords. But one does have to worry that the most fundamental daily tool we use to understand our world is being manipulated to remove perceived controversial results and present a sanitised view.
Which just means that the twelve-year-old with her code base and Common Crawl might have a much truer view of the web. Therefore ….
We need more people using Common Crawl
Which was Ross's stated goal behind WikiReverse. He wanted to process an entire crawl and then release both the data and the code used to generate it.
Alongside him and an online community is Factual, who have a commercial view of the Common Crawl project. (Video)
Common Crawl Should Remain Accessible to All
The most interesting thing about Common Crawl was its move to Amazon.
Any transfer within the Amazon network is cost-free. This allows you to process terabytes of data, and all you pay for is the AWS machine instances and the storage cost of your analysis data. Neat!
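Common Crawl publishes its archives as gzipped WARC files on S3, so "processing a crawl" mostly means streaming those files and pulling records out of them. Here is a deliberately minimal, standard-library-only sketch of what a WARC record iterator does, run against a tiny two-record file built in memory; real jobs would stream actual crawl files from S3 and would use a proper library such as warcio rather than this toy parser.

```python
import gzip
import io

def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.
    A minimal parser for illustration only; real code should use warcio."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if not line.strip():
            continue                    # skip blank lines between records
        assert line.startswith(b"WARC/"), "expected a WARC version line"
        headers = {}
        while True:                     # read header lines until the blank separator
            line = stream.readline()
            if not line.strip():
                break
            key, _, value = line.decode("utf-8").partition(":")
            headers[key.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

def record(rtype, body):
    """Build one illustrative WARC record (only the headers this sketch needs)."""
    head = (f"WARC/1.0\r\nWARC-Type: {rtype}\r\n"
            f"Content-Length: {len(body)}\r\n\r\n").encode()
    return head + body + b"\r\n\r\n"

# A tiny two-record "crawl file", gzipped like the real ones on S3.
raw = record("warcinfo", b"software: demo") + record("response", b"<html>hi</html>")
compressed = gzip.compress(raw)

records = list(iter_warc_records(io.BytesIO(gzip.decompress(compressed))))
for headers, payload in records:
    print(headers["WARC-Type"], len(payload))
```

The free in-network transfer is what makes this economical: the terabytes only ever move between S3 and your EC2 instances, and the only bytes you pay to store are your (much smaller) analysis output.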
For me the genius of WikiReverse has been the way Ross has utilised Amazon to do his project on…
$65 of hosting, and you learn so much.
There is More To Do
If you have got this far and want more (because I do), then can I suggest you check out Ross's blog, follow Ross on Twitter, and come see us at Fifty, where you won't find any tin hats or data tyrants, but you will find us using Common Crawl (as well as Google).