Head-To-Head: ACAP Versus Robots.txt For Controlling Search Engines

Danny Sullivan put up a great (and very long) post comparing ACAP and robots.txt in the context of the current discussion around paid content and the Hamburg Declaration. I urge you to read it in full if you want to know more about the current situation, and why ACAP will not help the publishers pursue their Hamburg Declaration goals.

Disclosure: After some critical posts regarding ACAP, and because I work at a news agency, I was invited to join the ACAP technical working group. I attended one face-to-face meeting and a couple of phone conferences; the main interest was to integrate news agency use cases into ACAP. I stopped active work in the TWG about a year ago, mainly for the following reasons:

  • dpa has no B2C business, and FTP / satellite (not HTTP) are still the major news delivery modes :-(
  • The general situation regarding ACAP is as Danny describes it
  • Hence there are more efficient uses of my precious time than the ACAP TWG

To give you an idea of what Danny is talking about, I'll include a quote from his post showing that even the protagonists are not really using ACAP:

Sounds easy enough to use ACAP, right? Well, no. ACAP, in its quest to provide as much granularity to publishers as possible, offers what I found to be a dizzying array of choices. REP explains its parts on two pages. ACAP’s implementation guide alone (I’ll get to links on this later on) is 37 pages long.

But all that granularity is what publishers need to reassert control, right? Time for that reality check. Remember those 1,250 publishers? Google News has something like over 20,000 news publishers that it lists, so relatively few are using ACAP. ACAP also positions itself as (I’ve bolded some key parts):

an open industry standard to enable the providers of all types of content (including, but not limited to, publishers) to communicate permissions information (relating to access to and use of that content) in a form that can be readily recognized and interpreted by a search engine (or any other intermediary or aggregation service), so that the operator of the service is enabled systematically to comply with the individual publisher’s policies.

Well, anyone with a web site is a publisher, and there are millions of web sites out there. Hundreds of millions, probably. Virtually no publishers use ACAP.

Even ACAP Backers Don’t Use ACAP Options

Of course, there’s no incentive to use ACAP. After all, none of the major search engines support it, so why would most of these people do so. OK, then let’s look at some people with a real incentive to show the control that ACAP offers. Even if they don’t yet have that control, they can still use ACAP now to outline what they want to do.

Let’s start with the ACAP file for the Irish Independent. Don’t worry if you don’t understand it, just skim, and I’ll explain:

##ACAP version=1.0

# Allow all
User-agent: *
Disallow: /search/
Disallow: /*.ece$
Disallow: /*startindex=
Disallow: /*from=*
Disallow: /*service=Print
Disallow: /*action=Email
Disallow: /*comment_form
Disallow: /*r=RSS
Sitemap: http://www.independent.ie/sitemap.xml.gz

# Changes in Trunk
ACAP-crawler: *
ACAP-disallow-crawl: /search/
ACAP-disallow-crawl: /*.ece$
ACAP-disallow-crawl: /*startindex=
ACAP-disallow-crawl: /*from=*
ACAP-disallow-crawl: /*service=Print
ACAP-disallow-crawl: /*action=Email
ACAP-disallow-crawl: /*comment_form
ACAP-disallow-crawl: /*r=RSS

OK, see that top part? Those are actually commands using the robots.txt syntax. They exist because if a search engine doesn’t understand ACAP, the robots.txt commands serve as backup. Basically those lines tell all search engines not to index various things on the site, such as print-only pages.

Now the second part? This is where ACAP gets to shine. It’s where the Irish Independent — which is part of the media group run by ACAP president Gavin O’Reilly — gets to express what they wish search engines would do, if they’d only recognize all the new powers that ACAP provides. And what do they do? EXACTLY the same blocking that they do using robots.txt.

So much for demonstrating the potential power of ACAP.
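
If you want to check Danny's observation mechanically, a few lines of Python are enough. The sketch below inlines the Irish Independent file quoted above, compares the robots.txt "Disallow" rules with the "ACAP-disallow-crawl" rules, and also shows how a stock crawler library reads the robots.txt part. Note that Python's urllib.robotparser does plain prefix matching only; the "*" and "$" wildcards in rules like "Disallow: /*.ece$" are search-engine extensions it does not interpret, so only the simple "/search/" rule takes effect in the two URL checks at the end.

import urllib.robotparser

# The Irish Independent file quoted above, inlined so the snippet is self-contained.
acap_file = """\
##ACAP version=1.0
# Allow all
User-agent: *
Disallow: /search/
Disallow: /*.ece$
Disallow: /*startindex=
Disallow: /*from=*
Disallow: /*service=Print
Disallow: /*action=Email
Disallow: /*comment_form
Disallow: /*r=RSS
Sitemap: http://www.independent.ie/sitemap.xml.gz
# Changes in Trunk
ACAP-crawler: *
ACAP-disallow-crawl: /search/
ACAP-disallow-crawl: /*.ece$
ACAP-disallow-crawl: /*startindex=
ACAP-disallow-crawl: /*from=*
ACAP-disallow-crawl: /*service=Print
ACAP-disallow-crawl: /*action=Email
ACAP-disallow-crawl: /*comment_form
ACAP-disallow-crawl: /*r=RSS
"""

# 1) Do the ACAP rules express anything beyond the robots.txt rules? (No.)
robots_rules = {line.split(":", 1)[1].strip()
                for line in acap_file.splitlines() if line.startswith("Disallow:")}
acap_rules = {line.split(":", 1)[1].strip()
              for line in acap_file.splitlines() if line.startswith("ACAP-disallow-crawl:")}
print(robots_rules == acap_rules)  # True: identical blocking, just different syntax

# 2) How a standard crawler library reads the robots.txt part (the ACAP lines are ignored).
rp = urllib.robotparser.RobotFileParser()
rp.parse(acap_file.splitlines())
print(rp.can_fetch("*", "http://www.independent.ie/search/?q=acap"))  # False: blocked
print(rp.can_fetch("*", "http://www.independent.ie/world-news/"))     # True: allowed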

The Schizophrenia of Kai Diekmann

Case 1: An evil internet company makes hard-to-access information easily accessible, caches it where applicable, and shovels traffic, and with it revenue, towards the content creators.

Consequence: the Hamburg Declaration, etc.

Case 2: A good, big newspaper quotes a small magazine, interest is generated, and with a bit of luck as many newspaper readers as possible also buy the original magazine. (See KAI DIEKMANN'S BLOG.)

BILD naturally wanted to know more precisely for its readers. After all, the fact that the interview caused such a stir also had to do with the fact that hardly anyone knew the context in which Sarrazin's quotes stood.

That is why we printed a longer excerpt in the newspaper and put the complete, unabridged text on the internet, of course with a detailed source reference (online there was even the very long addendum: "The magazine Lettre International is published quarterly and is available in selected bookshops and at newsstands in airports and railway stations. Single issue price: 17 euros, annual subscription: 41 euros. Subscribe here: http://www.lettre.de/aboheft.html").

So far, so normal: a big newspaper quotes a small magazine, interest is generated, and with a bit of luck as many newspaper readers as possible also buy the original magazine.

Apparently not. Because now everything looks somewhat different. Editorial director Berberich is ranting and raving about BILD, speaks of the "theft" of his text, wants damages, and is having himself represented in this by my friend Jony Eisenberg....

Dear Mr Berberich: you must have overlooked something. Of course we did not steal the interview; we obtained permission to publish it beforehand. My colleague Hans-Jörg Vehlewald from the politics desk called your office to ask for the complete text, which he subsequently also received by fax, marked with the handwritten note: "Attn. Mr Vehlewald, with attribution of the source: Lettre International" (see at the very bottom). Now, one does not have to like BILD. And one can always change one's mind. But to first allow us to print it and then not want to know anything about it anymore strikes me as... odd.

Commented on “How (and why) to replace the AP” at buzzmachine.com

My comment on Jeff Jarvis's blog post on AP's news registry and tracking proposal:

Update: This comment never made it through moderation :-(

Jeff, some remarks from Germany:

1) Regarding the Hamburger Erklärung: In contrast to the US, in Germany there is a supreme court decision from 2000 (or was it 2001?), the so-called "Paperboy / Paperball" decision, that OKs the use of headlines, text snippets and deep links on news aggregator sites. Things are completely different for images.

Hence the German publishers have no "attack vector" from this angle. So they lobby hard to establish new rights as "Werkmittler" and to change the law in order to get new rulings that overcome the Paperboy decision. That is basically the essence / background of the Hamburg Declaration.

2) You can find more about the Hamburger Erklärung, as well as the differences between the English and the German version, on my blog: http://relations.ka2.de/2009/07/14/hamburg-declaration-wording/ and http://relations.ka2.de/2009/06/10/hamburger-erklaerung-wortlaut/ (German). BTW: the English version is already way better than the German version.

3) Tracking the use and sharing of the content and then sharing the revenues, as described in the blog post, is IMHO the basic idea of fairsyndication.org. Running the infrastructure is expensive, though, and getting the ad networks to sign up is difficult. Still, fairsyndication.org has already gained some momentum.

So from a foreigner's perspective, in the US it is now an arms race between the AP news registry and fairsyndication.org.

4) I think the best way to do reverse syndication between media outlets is to go the NYT and especially the Guardian route and offer an API for accessing the content. The problem with this is that each media company is currently coming up with its own API and content / wire format.
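
To illustrate what that API route looks like from a consumer's side, here is a rough sketch of querying the Guardian's content API from Python. The endpoint, parameter names and response fields below reflect the Guardian Open Platform as I remember it, and the API key is a placeholder, so treat the details as assumptions rather than a spec.

import json
import urllib.parse
import urllib.request

API_KEY = "YOUR-API-KEY"  # placeholder: obtained by registering for the Guardian Open Platform

params = urllib.parse.urlencode({
    "q": "ACAP",          # full-text search term
    "format": "json",
    "api-key": API_KEY,
})
url = "http://content.guardianapis.com/search?" + params

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

# Print headline and URL for each matching article.
for item in data["response"]["results"]:
    print(item["webTitle"], item["webUrl"])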

Hence the microformat for news proposed by AP and the Media Standards Trust is very interesting to me, and it should not be intermixed with the active tracking of the content, which IMHO is a difficult thing to achieve if you have to deal with non-cooperative "customers".

The worst thing that could happen is that the tracking part has a negative impact on the microformat proposal. Unfortunately, it is also the most likely thing to happen.

5) There is already a de facto news registry: it is called Google News. They also have the technology for tracking content re-use at web scale, as the paper "Detecting the origin of text segments efficiently" (www2009.eprints.org/7/1/p61.pdf, PDF) from this year's WWW conference shows.
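
For a rough idea of how re-use tracking works at all, here is a minimal sketch of the classic shingling approach. It is explicitly not the algorithm from that paper, just the common baseline idea: hash overlapping word n-grams of each text and compare the resulting fingerprint sets.

import hashlib

def shingles(text, n=5):
    # Hash every overlapping run of n words (a "shingle") into a compact fingerprint.
    words = text.lower().split()
    return {hashlib.md5(" ".join(words[i:i + n]).encode("utf-8")).hexdigest()
            for i in range(len(words) - n + 1)}

original = "the quick brown fox jumps over the lazy dog near the river bank"
suspect  = "the quick brown fox jumps over the lazy dog close to the river"

a, b = shingles(original), shingles(suspect)
jaccard = len(a & b) / len(a | b)  # share of shingles the two texts have in common
print(round(jaccard, 2))           # the higher the value, the more likely one text re-uses the other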

Bold / crazy idea: Maybe Google should consider moving Google News under the non-profit google.org (setting up a dedicated non-profit rights registry, ideally avoiding the problems they have with the BRR). They could even leave their AP-licensed news at news.google.com and monetise it.