Do The Robot: How Publishers Dodge The Spiders

Content protectionism is heating up. After Rupert Murdoch last year complained about sites that “steal” News Corp (NYSE: NWS) stories, his UK papers this week stopped the British aggregator NewsNow from indexing their sites.

These big walls are being built with one small file – robots.txt, the file at the heart of the robots exclusion standard, which lets website owners like News Corp tell search crawlers which parts of their sites they may visit.
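For readers who want to see the mechanics, here is a minimal sketch of how a well-behaved crawler reads such a file, using Python's standard urllib.robotparser module. The crawler name ExampleBot, the example.com URLs and the /private/ path are made up for illustration; none of them comes from a real publisher's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one crawler is shut out entirely,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


story = "https://example.com/news/story.html"
print(can_fetch(ROBOTS_TXT, "ExampleBot", story))  # False
print(can_fetch(ROBOTS_TXT, "OtherBot", story))    # True
print(can_fetch(ROBOTS_TXT, "OtherBot", "https://example.com/private/notes.html"))  # False
```

Compliance is voluntary: the file states what the site owner wants, and it is up to each crawler to check it and obey.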

Google (NSDQ: GOOG) specifically advises any copy-averse publishers to roll out the robots. And, against some search engines (though not yet Google), News Corp is doing just that. Here’s how Murdoch is already using the protocol to block more than just NewsNow – and how other publishers are deploying robots.txt, too…

Times Online, Sun Online and NOTW.co.uk: All now block…
— NewsNow
— The Alexa search engine
— Discussion search engine Omgili
— Web spider WebVac, used by Stanford University
— WebZip, an application to save entire websites offline
— PicSearch, a Swedish image search site
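
A block list like the one above might be written as a single group of User-agent lines sharing one Disallow rule. The sketch below is an assumption about the general shape, not a copy of News Corp's files, and the user-agent tokens are placeholders taken from the service names; the tokens these crawlers actually announce may differ.

```python
from urllib.robotparser import RobotFileParser

# Placeholder tokens based on the service names; real crawler tokens may differ.
BLOCKED = ["NewsNow", "Alexa", "Omgili", "WebVac", "WebZip", "PicSearch"]


def build_robots_txt(blocked_agents):
    """Return a robots.txt body that shuts the named crawlers out of the whole site."""
    lines = [f"User-agent: {agent}" for agent in blocked_agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots_txt = build_robots_txt(BLOCKED)
print(robots_txt)

story = "https://example.com/news/story.html"
print(is_allowed(robots_txt, "NewsNow", story))    # False
print(is_allowed(robots_txt, "Googlebot", story))  # True
```

Because no catch-all group is present, any crawler not named in the list, Googlebot included, is free to index everything.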

WSJ.com: Blocks only MSNPTC 1.0, a spider believed to belong to Microsoft’s adCenter. Marketwatch.com also bars it.

New York Post: Blocks MetaCarta, a service that crawls news stories to extract geographic information for use in maps and other services.

Other habits…

— None of the other main UK newspaper sites (Guardian.co.uk, Telegraph.co.uk, Mail Online, Independent.co.uk and Mirror.co.uk) outright block any search services.

— But some don’t want to show off their mobile versions – Telegraph.co.uk blocks crawling of its cut-down story pages, and FT.com blocks Google’s mobile spider.

— Mirror.co.uk uses robots.txt to hide several racy stories that it seems to have removed for legal reasons.

NYTimes.com: Blocks everyone from indexing content it publishes from numerous syndication partners (including paidContent.org) – and doubles up its wall against Google by repeating the same “disallow” directives specifically for Googlebot. Also makes a point of explicitly allowing Google’s AdSense crawler onto the site.
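
The repetition for Googlebot may have a practical logic: under the robots exclusion standard, a crawler that finds a group addressed to it by name follows that group and ignores the catch-all * group, so rules meant for everyone have to be restated there. A hedged sketch of that arrangement follows. The /syndicated/ path is hypothetical and the file is not NYTimes.com's actual file; Googlebot and Mediapartners-Google are the tokens Google documents for its search and AdSense crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: the path is made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /syndicated/

User-agent: Googlebot
Disallow: /syndicated/

User-agent: Mediapartners-Google
Disallow:
"""


def is_allowed(user_agent, path):
    """Return True if ROBOTS_TXT lets user_agent fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "https://example.com" + path)


print(is_allowed("SomeBot", "/syndicated/story.html"))               # False
print(is_allowed("Googlebot", "/syndicated/story.html"))             # False
print(is_allowed("Mediapartners-Google", "/syndicated/story.html"))  # True
print(is_allowed("Googlebot", "/news/story.html"))                   # True
```

An empty Disallow line means "nothing is off limits", which is how the AdSense crawler is waved through while everyone else is kept away from the syndicated pages.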

WashingtonPost.com: Stops many spiders from crawling photos, syndicated stories, audio and video files, text files and files that make up individual parts of story pages.

USAToday.com: Blocks everyone from crawling sports stories from before 1999, old Election 2008 pages and numerous system files.