Head-To-Head: ACAP Versus Robots.txt For Controlling Search Engines

Danny Sullivan put up a great (and very long) post comparing ACAP and robots.txt in the context of the current discussion around paid content and the Hamburg Declaration. I urge you to read it in full if you want to know more about the current situation, and why ACAP will not help publishers pursue their Hamburg Declaration goals.

Disclosure: After some critical posts regarding ACAP, and because I work at a news agency, I was invited to join the ACAP technical working group. I attended one face-to-face meeting and a couple of phone conferences; the main interest was in integrating news agency use cases into ACAP. I stopped active work in the TWG about a year ago, mainly for the following reasons:

  • dpa has no B2C business, and FTP / satellite (not HTTP) is still the major news delivery mode :-(
  • The general situation regarding ACAP is as Danny describes it
  • Hence there are more efficient uses of my precious time than the ACAP TWG

To give you an idea of what Danny is talking about, I'll include a quote from his post showing that even the protagonists are not really using ACAP:

Sounds easy enough to use ACAP, right? Well, no. ACAP, in its quest to provide as much granularity to publishers as possible, offers what I found to be a dizzying array of choices. REP explains its parts on two pages. ACAP’s implementation guide alone (I’ll get to links on this later on) is 37 pages long.

But all that granularity is what publishers need to reassert control, right? Time for that reality check. Remember those 1,250 publishers? Google News has something like over 20,000 news publishers that it lists, so relatively few are using ACAP. ACAP also positions itself as (I’ve bolded some key parts):

an open industry standard to enable the providers of all types of content (including, but not limited to, publishers) to communicate permissions information (relating to access to and use of that content) in a form that can be readily recognized and interpreted by a search engine (or any other intermediary or aggregation service), so that the operator of the service is enabled systematically to comply with the individual publisher’s policies.

Well, anyone with a web site is a publisher, and there are millions of web sites out there. Hundreds of millions, probably. Virtually no publishers use ACAP.

Even ACAP Backers Don’t Use ACAP Options

Of course, there’s no incentive to use ACAP. After all, none of the major search engines support it, so why would most of these people do so. OK, then let’s look at some people with a real incentive to show the control that ACAP offers. Even if they don’t yet have that control, they can still use ACAP now to outline what they want to do.

Let’s start with the ACAP file for the Irish Independent. Don’t worry if you don’t understand it, just skim, and I’ll explain:

##ACAP version=1.0

# Allow all
User-agent: *
Disallow: /search/
Disallow: /*.ece$
Disallow: /*startindex=
Disallow: /*from=*
Disallow: /*service=Print
Disallow: /*action=Email
Disallow: /*comment_form
Disallow: /*r=RSS
Sitemap: http://www.independent.ie/sitemap.xml.gz

# Changes in Trunk
ACAP-crawler: *
ACAP-disallow-crawl: /search/
ACAP-disallow-crawl: /*.ece$
ACAP-disallow-crawl: /*startindex=
ACAP-disallow-crawl: /*from=*
ACAP-disallow-crawl: /*service=Print
ACAP-disallow-crawl: /*action=Email
ACAP-disallow-crawl: /*comment_form
ACAP-disallow-crawl: /*r=RSS

OK, see that top part? Those are actually commands using the robots.txt syntax. They exist because if a search engine doesn’t understand ACAP, the robots.txt commands serve as backup. Basically those lines tell all search engines not to index various things on the site, such as print-only pages.

Now the second part? This is where ACAP gets to shine. It’s where the Irish Independent — which is part of the media group run by ACAP president Gavin O’Reilly — gets to express what they wish search engines would do, if they’d only recognize all the new powers that ACAP provides. And what do they do? EXACTLY the same blocking that they do using robots.txt.

So much for demonstrating the potential power of ACAP.
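To make that fallback behaviour concrete: any plain REP parser simply skips the ACAP-* lines as unknown directives and applies the ordinary robots.txt rules. Here is a minimal sketch using Python's standard library parser. I've reduced the file to one plain rule, because the stdlib parser does not expand the Google-style * wildcards used above, and the /breaking-news/ URL is just an invented example:

import urllib.robotparser

ACAP_STYLE_ROBOTS = """\
User-agent: *
Disallow: /search/

ACAP-crawler: *
ACAP-disallow-crawl: /search/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ACAP_STYLE_ROBOTS.splitlines())   # the ACAP-* lines are ignored as unknown directives

print(rp.can_fetch("*", "http://www.independent.ie/search/"))         # False
print(rp.can_fetch("*", "http://www.independent.ie/breaking-news/"))  # True

So whether or not a crawler knows about ACAP, it ends up doing exactly what the robots.txt part already says, which is Danny's point.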

ACAP – The strawman proposals

In a couple of hours I'm going to attend the "International Conference on the Conclusion of the ACAP Pilot" at AP's premises here in NYC. (This has given me the chance to see some of my AP friends :-) ).

For those of you who don't know ACAP, here is how the website describes it:

Following a successful year-long pilot project, ACAP (Automated Content Access Protocol) has been devised by publishers in collaboration with search engines to revolutionise the creation, dissemination, use, and protection of copyright-protected content on the worldwide web.

ACAP is set to become the universal permissions protocol on the Internet, a totally open, non-proprietary standard through which content owners can communicate permissions for access and use to online intermediaries.

IMHO, ACAP is first and foremost a much-needed "neutral ground" where publishers can meet, exchange ideas, start joint lobbying efforts, etc. Don't get me wrong, I really think that such a place was missing, and hence I more or less talked my employer, the German news agency dpa, into becoming an ACAP member.

I also think that there is a lack of a "standard" way to communicate commercial content rights to end users, search engines, etc. My primary reason for joining was therefore to take a closer look at the technology being developed, in order to judge whether it would help close this gap. I have already written a couple of times about ACAP on this blog.

ACAP – the promise

The following is stated on the invitation to the conference:

As ACAP reaches the final phase of its 12-month pilot, representatives of the publishing and online community will be showcasing the successful development of the new, open standard through which the owners of content published on the World Wide Web can provide permissions information (relating to access and use of their content) in a form that can be recognised and interpreted automatically, so that search engine operators and other online intermediaries are enabled systematically to comply with policies established by content owners.

ACAP will allow publishers, broadcasters and indeed any other publisher of content on the network to express their individual access and use policies in a language that search engine robots and similar automated tools can read and understand.

This conference will demonstrate beyond all doubt, the need for ACAP and the potential disaster for the global publishing industry should it fail to embrace new technology to protect its future.

Big words. So, later today a lot of publishing bigwigs will be at the conference, things will be announced, politicians are going to speak, and "I will be one of the few attendees who actually cared to read the technical documents".

I’ve taken the time to read the:

  • Strawmans proposals (part I and II), and the
  • Usage definitions

Unfortunately I only had the September documents with me, so I wasn't able to check (until now) whether there are any significant differences in the October version and/or the final versions, which were put up some 30 hours ago.

Too bad that there are neither documents highlighting the edits between versions nor an easy way to run a diff on them. (Hint: there is a reason why RFCs are still plain text.)

So, to make it very clear: what follows is based on my loose acquaintance with the project (i.e. I attended the first conference in London) and a single thorough reading of the September draft specs (on a flight, while having a terrible headache). So I might be terribly wrong. If so, please tell me.

ACAP – My verdict

While it has been very successful at organising a common platform for publishers, ACAP fails big time to convince me that the proposed technical solution is actually going to solve the problems stated above.

And while i’m the first to agree that the publishing industry has to embrace new technologies in order to avoid potential disaster, i think ACAP carries more of a backward oriented, let’ s protect our territory, attitude than a forward looking, let’s explore new worlds thinking.

Single use case only

On the one hand, it is too focused on a single use case: telling search engines what they are allowed to do with content residing on websites.

What about readers? They also want to know what they are allowed to do with the content they are reading. Are they allowed to put the article about them on their personal website, on their blog?

What about content not residing as HTML/XHTML pages on websites? News agencies still deliver their content mainly via wires (via satellite or FTP). What about RSS and Atom feeds, the standard content delivery formats in "the developed countries" of the internet (see the sketch after these questions)? What about content reuse in Facebook / OpenSocial apps?

What about images, audio, and video, and a way to embed the rights into the original data?
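At least for feeds, lightweight machine-readable rights information already exists: RFC 4946 defines a "license" link relation for Atom, so a feed or a single entry can point at its licence terms. A minimal sketch (the story title, URLs, and licence choice are invented for illustration):

import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)

entry = ET.Element(f"{{{ATOM}}}entry")
ET.SubElement(entry, f"{{{ATOM}}}title").text = "Some wire story"
ET.SubElement(entry, f"{{{ATOM}}}link", rel="alternate",
              href="http://example.org/stories/42")
# This is the part an aggregator or search engine could evaluate automatically:
ET.SubElement(entry, f"{{{ATOM}}}link", rel="license",
              href="http://creativecommons.org/licenses/by-nc-nd/3.0/")

print(ET.tostring(entry, encoding="unicode"))

Nothing ACAP-specific is needed for this; the open question is which licence vocabularies publishers would agree to point such links at.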

Update: In the talks at the conference it became clear that there is some kind of roadmap for the other use cases, especially the syndication use case, which is the one I'm most interested in for my daily work. This use case is going to be based on an XML format (which itself is to be based on the ONIX proposals). Hopefully this will be drastically reduced in complexity (see my earlier posts on ONIX). If everything works out well, I might even contribute to that format.

Wrong granularity

Instead of looking at these broader issues, ACAP focuses on defining, at a very fine level of granularity, what search engines are allowed to do with the content residing on web pages. This allows the creation of very complex permission sets, whose computational complexity makes them practically (maybe even theoretically) impossible for search engines to implement. The permissions and restrictions defined on the "present" verb in particular are very fine-grained and lead to very complex renderings, which, given the presumably striking differences between the permissions granted by different publishers, would also turn the search result pages into a visual disaster.
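A toy sketch of what I mean, in plain Python rather than ACAP syntax (the policy fields are hypothetical simplifications; real ACAP offers far more knobs): even with four switches per publisher, the result renderer has to branch on every one of them, for every result.

from dataclasses import dataclass
from typing import Optional

@dataclass
class PresentPolicy:
    # Hypothetical, heavily simplified stand-in for one publisher's
    # presentation permissions.
    allow_snippet: bool = True
    max_snippet_chars: Optional[int] = None   # None = no limit
    allow_thumbnail: bool = True
    allow_cached_copy: bool = True

def render_result(title: str, snippet: str, policy: PresentPolicy) -> dict:
    """Build one search result entry under a publisher-specific policy."""
    result = {"title": title}
    if policy.allow_snippet:
        text = snippet
        if policy.max_snippet_chars is not None:
            text = text[:policy.max_snippet_chars]
        result["snippet"] = text
    if policy.allow_thumbnail:
        result["thumbnail"] = True
    if policy.allow_cached_copy:
        result["cached_link"] = True
    return result

print(render_result("Story A", "Lorem ipsum dolor sit amet ...",
                    PresentPolicy(max_snippet_chars=20)))
print(render_result("Story B", "Lorem ipsum dolor sit amet ...",
                    PresentPolicy(allow_snippet=False, allow_cached_copy=False)))

Multiply those branches by tens of thousands of publishers, each with its own combination of permissions, and both the implementation effort and the visual consistency of the result page suffer.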

Bad technical quality of the specs

I tried to read the documentation the same way I used to read students' term papers or colleagues' research papers back in my teaching and research days. That means: trying very hard to understand what was written, scribbling remarks and question marks wherever I didn't get it on first reading, checking the presentation for completeness, looking out for contradictions, checking that all necessary information is self-contained, etc. You know the drill.

And I have to say that I can't remember a term paper, research paper, or thesis that, even in its earliest versions, was of such bad technical quality as the September ACAP documentation. Maybe I'm getting old and don't remember correctly, and there have been some, but definitely not very many.

So I hope for an improvement in the final documentation, because the September release fails miserably at fulfilling its self-proclaimed primary requirement:

Fundamentally, ACAP requires consistent and unambiguous interpretation of all its permissions.

Update: I have now had the chance to look at the 1.0 specs, and things definitely look better. In talks with the project participants it also became clear that (as usual) the docs had to be pushed out in a rush and that more refined documentation is on its way. I was also able to get an answer to the basic question of whether the underlying model is permissive or restrictive, information that is missing from the documents: it is permissive, like the model of robots.txt. This may change with a 2.0 version, though, when ACAP is no longer so closely intertwined with the REP.
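In case that distinction sounds abstract, here is a toy illustration (plain Python, not ACAP syntax) of what the two models mean for paths a publisher never mentioned:

rules = {"/print/": "disallow"}              # whatever the publisher spelled out

def allowed_permissive(path: str) -> bool:   # robots.txt / ACAP 1.0 style
    return rules.get(path) != "disallow"     # silence means "go ahead"

def allowed_restrictive(path: str) -> bool:  # the alternative model
    return rules.get(path) == "allow"        # silence means "hands off"

print(allowed_permissive("/sports/"))        # True  - not mentioned, so allowed
print(allowed_restrictive("/sports/"))       # False - not mentioned, so blocked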

Where to go from here

A year ago, directly after the ACAP announcement, I wrote the following on this blog:

As often noticed, there is already a de facto standard protocol (the robots exclusion protocol) which is machine readable and tells search engines which content (not) to spider. So if a newspaper wants a search engine not to index its pages, all it has to do is include an appropriate robots.txt file. Furthermore, there are also machine-readable means (e.g. the Creative Commons license framework) for automatically communicating the terms under which content can be used.

Unfortunately the robots exclusion protocol is not an "official" standard, e.g. by the W3C, and the Creative Commons framework doesn't cover possibilities to list exceptions to the various restrictions imposed by the license, or to somehow ease the waiving of those restrictions by (semi-)automatically getting permission from the rightsholder.

So there definitely is room for improvement on both. As long as ACAP builds on these lightweight and broadly accepted standards, I'm interested in it. It might be useful and it might even be used.

Looking at these sentences today, I have to say that ACAP at least tried to build on the REP. But with its stated goals broadened to a solution for the whole publishing industry, a REP-based solution is definitely not enough (see above).

And they completely neglected Creative Commons, IMHO a major mistake.

Creative Commons – the better ACAP?

CC tries to define common use cases on the scale from "all rights reserved" to "public domain", leaning clearly towards the half where more rights are granted than reserved. The rights publishers typically have in mind lie in the other half, making for a perfect complementary fit.

And years and years of development have gone into tools supporting CC, search engine enhancements supporting CC, etc., not to mention all the work that has gone into adapting the licenses to local jurisdictions.
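The machine-readable core of a CC licence is deliberately tiny: the recommended markup is essentially a rel="license" link pointing at the licence deed (optionally enriched with RDFa). A sketch of how little code a licence-aware tool needs to pick it up (the page fragment is invented):

from html.parser import HTMLParser

class LicenseFinder(HTMLParser):
    """Collect the href of every rel="license" link in a page."""
    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("a", "link") and "license" in (attrs.get("rel") or "").split():
            self.licenses.append(attrs.get("href"))

finder = LicenseFinder()
finder.feed('<p>Article teaser ... <a rel="license" '
            'href="http://creativecommons.org/licenses/by/3.0/">CC BY</a></p>')
print(finder.licenses)   # ['http://creativecommons.org/licenses/by/3.0/']

Compare that to a 37-page implementation guide.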

In addition, publishers will sooner or later publish CC-licensed content themselves, so they have to know CC and implement it in their processes anyway.

Hence, IMHO, building on top of CC would definitely have been the better way to create ACAP. But I guess that this path was politically not feasible for the publishing industry.

Update: I was happy to hear that ACAP is going to talk to Creative Commons soon, and that it also recognizes that Creative Commons is especially interesting for the non-search-engine use cases.

PS: I will try to write a second post with a technical critique of the September ACAP documentation. But since I'm leaving for a 2 1/2 week holiday tomorrow, I'm not sure whether this is going to happen soon.