Head-To-Head: ACAP Versus Robots.txt For Controlling Search Engines

Danny Sullivan put up a great (and very long) post comparing ACAP and robots.txt in the context of the current discussion around paid content and the Hamburg Declaration. I urge you to read it in full if you want to know more about the current situation, and why ACAP will not help publishers pursue their Hamburg Declaration goals.

Disclosure: After some critical posts regarding ACAP, and because I’m working at a news agency, I was invited to join the ACAP technical working group (TWG). I attended one face-to-face meeting and a couple of phone conferences; the main interest was in integrating news agency use cases into ACAP. I stopped active work in the TWG about a year ago, mainly for the following reasons:

  • dpa has no B2C business, and FTP and satellite, not HTTP, are still the major news delivery modes :-(
  • The general situation regarding ACAP is as Danny describes it
  • Hence there are more efficient uses of my precious time than the ACAP TWG

To give you an idea of what Danny is talking about, I’ll include a quote from his post showing that even the protagonists are not really using ACAP:

Sounds easy enough to use ACAP, right? Well, no. ACAP, in its quest to provide as much granularity to publishers as possible, offers what I found to be a dizzying array of choices. REP explains its parts on two pages. ACAP’s implementation guide alone (I’ll get to links on this later on) is 37 pages long.

But all that granularity is what publishers need to reassert control, right? Time for that reality check. Remember those 1,250 publishers? Google News has something like over 20,000 news publishers that it lists, so relatively few are using ACAP. ACAP also positions itself as (I’ve bolded some key parts):

an open industry standard to enable the providers of all types of content (including, but not limited to, publishers) to communicate permissions information (relating to access to and use of that content) in a form that can be readily recognized and interpreted by a search engine (or any other intermediary or aggregation service), so that the operator of the service is enabled systematically to comply with the individual publisher’s policies.

Well, anyone with a web site is a publisher, and there are millions of web sites out there. Hundreds of millions, probably. Virtually no publishers use ACAP.

Even ACAP Backers Don’t Use ACAP Options

Of course, there’s no incentive to use ACAP. After all, none of the major search engines support it, so why would most of these people do so. OK, then let’s look at some people with a real incentive to show the control that ACAP offers. Even if they don’t yet have that control, they can still use ACAP now to outline what they want to do.

Let’s start with the ACAP file for the Irish Independent. Don’t worry if you don’t understand it, just skim, and I’ll explain:

##ACAP version=1.0

# Allow all

User-agent: *

Disallow: /search/

Disallow: /*.ece$

Disallow: /*startindex=

Disallow: /*from=*

Disallow: /*service=Print

Disallow: /*action=Email

Disallow: /*comment_form

Disallow: /*r=RSS

Sitemap: http://www.independent.ie/sitemap.xml.gz

# Changes in Trunk

ACAP-crawler: *

ACAP-disallow-crawl: /search/

ACAP-disallow-crawl: /*.ece$

ACAP-disallow-crawl: /*startindex=

ACAP-disallow-crawl: /*from=*

ACAP-disallow-crawl: /*service=Print

ACAP-disallow-crawl: /*action=Email

ACAP-disallow-crawl: /*comment_form

ACAP-disallow-crawl: /*r=RSS

OK, see that top part? Those are actually commands using the robots.txt syntax. They exist because if a search engine doesn’t understand ACAP, the robots.txt commands serve as backup. Basically those lines tell all search engines not to index various things on the site, such as print-only pages.

Now the second part? This is where ACAP gets to shine. It’s where the Irish Independent — which is part of the media group run by ACAP president Gavin O’Reilly — gets to express what they wish search engines would do, if they’d only recognize all the new powers that ACAP provides. And what do they do? EXACTLY the same blocking that they do using robots.txt.

So much for demonstrating the potential power of ACAP.
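The claim in the quote is easy to check mechanically. What follows is a minimal Python sketch (an illustration added here, not part of Danny's post) that pulls the paths out of the file quoted above and compares the REP section with the ACAP section. The helper name collect is made up for this example, and the parsing is deliberately naive: it only looks at "field: value" lines.

```python
# The Irish Independent file as quoted above, without the blank lines.
ROBOTS_FILE = """\
##ACAP version=1.0
# Allow all
User-agent: *
Disallow: /search/
Disallow: /*.ece$
Disallow: /*startindex=
Disallow: /*from=*
Disallow: /*service=Print
Disallow: /*action=Email
Disallow: /*comment_form
Disallow: /*r=RSS
Sitemap: http://www.independent.ie/sitemap.xml.gz
# Changes in Trunk
ACAP-crawler: *
ACAP-disallow-crawl: /search/
ACAP-disallow-crawl: /*.ece$
ACAP-disallow-crawl: /*startindex=
ACAP-disallow-crawl: /*from=*
ACAP-disallow-crawl: /*service=Print
ACAP-disallow-crawl: /*action=Email
ACAP-disallow-crawl: /*comment_form
ACAP-disallow-crawl: /*r=RSS
"""


def collect(text, field):
    """Return the set of values of every line of the form 'field: value'.

    Hypothetical helper for this sketch; it ignores user-agent grouping.
    """
    prefix = field.lower() + ":"
    values = set()
    for line in text.splitlines():
        line = line.strip()
        if line.lower().startswith(prefix):
            values.add(line[len(prefix):].strip())
    return values


rep_rules = collect(ROBOTS_FILE, "Disallow")
acap_rules = collect(ROBOTS_FILE, "ACAP-disallow-crawl")

# True: the ACAP section blocks exactly the same paths as the REP section.
print(rep_rules == acap_rules)
```

Both sets contain the same eight path patterns, which is exactly the point of the quote: the ACAP fields add nothing here that the plain robots.txt lines do not already say.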

ACAP – some reviews

Unfortunately I haven’t found the time to take a detailed look at the 1.0 versions of the ACAP spec and do a technical review as promised last year. But others have done it. And I have to say that I largely agree with them and might have been harsher. See below for my favourite quotes from the reviews and some comments on their remarks.

Even more questions about the ACAP spec (at least about part 2) arise if you compare it with a very similar effort of the W3C called “Access Control for Cross-site Requests”. This working draft was brought to my attention only today (via John Resig’s blog post). But since it has been around since 2005, and I dearly hope that the ACAP project at least looked at the W3C to check for related work when embarking on its mission, I’m wondering about ACAP’s position wrt. this effort.

Reviews

A couple of (technical) reviews of ACAP have shown up on the net (relatively few, I have to say, given the buzz ACAP tried to generate). Their findings are strikingly similar to my major points of critique:

  • The lack of information in the spec. Because of this, a well-meaning search engine has to assume a worst-case scenario in its decision procedure.
  • This in turn raises the computational complexity of the decision procedure that a search engine has to implement. This complexity is triggered especially by the following aspects of the specification:
    • The allowance of wildcard characters in the resource paths
    • The complexity of the present construct

I’m deliberately going to focus on the technical parts of the various reviews and skip the non-technical parts (while mostly agreeing with them).

Martin Belam: ACAP – flawed and broken from the start?

Let’s start with Martin Belam’s review titled: ACAP – flawed and broken from the start?

It isn’t technically sound

When I say technically sound, I don’t mean it will actually break things, but it doesn’t appear to have a technically robust background to it.

Well, I say it won’t actually break things, but the technical framework document itself admits that the specification includes:

“extensions for which there are possible security vulnerability or other issues in their implementation on the web crawler side, such as creating possible new opportunities for cloaking or Denial of Service attack.”

I’ve no doubt that there has been technical input into the specification. It certainly doesn’t seem, though, to have been open to the round-robin peer review that the wider Internet community would expect if you were introducing a major new protocol you effectively intended to replace robots.txt.

I definitely agree with Martin on this one. IMHO especially the door for DoS attacks is left wide open (e.g. by serving ACAP files with an insane number of entries etc.).

The ACAP website tools don’t work

So just to re-cap – the consortium behind ACAP are proposing a new technical standard that unilaterally extends HTML and robots.txt, and wants to redefine the machine-to-machine relationship between every web publisher and every search engine or content aggregator. According to the ACAP FAQ: “The primary drivers of ACAP are the World Association of Newspapers (WAN), the European Publishers Council (EPC) and the International Publishers Association (IPA)”

And yet it appears that between them they haven’t been able to hire a Perl developer who can write a CGI script that outputs plain text…?

I actually don’t think that this is essential. A tool to convert a robots.txt into an ACAP spec isn’t really needed. The fact that it is suboptimal IMHO just shows that the technical ACAP people saw it the same way and that this tool was only needed for marketing purposes.

An editor critiques the publishing industry’s Automated Content Access Protocol

Second, there is a review by Andy Oram on O’Reilly Radar called “An editor critiques the publishing industry’s Automated Content Access Protocol”:

Technical demands of ACAP

Lauren Weinstein presciently demonstrates that publishers are likely to turn ACAP from a voluntary cooperation into a legal weapon, and suggests that it shifts the regulatory burden for copyright infringement (as well as any other policy defined at the whim of the publishers) from the publishers to the search engines. I would add that a non-trivial technical burden is laid on search engines too.

First, the search engine must compile a policy that could be a Cartesian product of a huge number of coordinates, such as:

  • Whether to index the actual page found, or another source specified by the publisher as a proxy for that page, or just to display some fixed text or thumbnail provided by the publisher
  • When to take down the content or recrawl the site
  • Whether conversions are permitted, such as from PDF to HTML
  • Whether translations to another language are permitted


Seasoned computer programmers and designers by now can recognize the hoary old computing problem of exponential complexity–the trap of trying to apply a new tool to every problem that is currently begging for a solution. Compounding the complexity of policies is some complexity in identifying the files to which policies apply. ACAP uses the same format for filenames as robots.txt does, but some rarely-used extensions of that format interact with ACAP to increase complexity. Search engines decide which resources to apply a policy to by checking a filename such as:

/news/*/image*/

The asterisks here can refer to any number of characters, including the slashes that separate directory names. So at whatever level in the hierarchy the image*/ subdirectories appear, the search engine has to double back and figure out whether it’s part of /news/. The calculation involved here shouldn’t be as bad as the notorious wildcard checks that can make a badly designed regular expression or SQL query take practically forever. For a directory pathname, there are ways to optimize the check–but it still must be performed on every resource. And if there are potentially competing directory specifications (such as /news/*.jpg) the search engine must use built-in rules to decide which specification applies, a check that I believe must be done at run-time.

100% agreed. ACAP gives a very rough description of how to do conflict resolution between multiple matching rules in section 2.4.5. Again, this description has nowhere near the level of detail I expect from the description of a crucial algorithm within a specification. Just compare it to the specification of the algorithms in the W3C spec mentioned above.

It further has a “smell” that the procedure does not come close to capturing the semantics it is intended to capture. Especially the rule “If one pattern contains a dollar sign $ where the other pattern contains any other character (including the asterisk *), the other pattern has the narrower scope.” looks wrong and IMHO contradicts the classical regular expression semantics.
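A small experiment makes the objection concrete. The sketch below is only an illustration under assumptions: it takes '*' to match any run of characters (slashes included, as Andy Oram describes), a trailing '$' to anchor the end of the path, and a pattern without '$' to be a prefix match. The function names pattern_to_regex and matches are made up for this example and do not come from the ACAP spec.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt-style path pattern into a compiled regex.

    Assumed semantics (not taken from the ACAP spec): '*' matches any run
    of characters including '/', a trailing '$' anchors the end of the
    path, and a pattern without '$' is a prefix match.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def matches(pattern, path):
    """True if the path falls within the scope of the pattern."""
    return pattern_to_regex(pattern).match(path) is not None


# The wildcard example from the quoted review: '*' may span several directories.
print(matches("/news/*/image*/", "/news/2009/07/images/photo.jpg"))  # True

# Compare the scopes of "/a$" and "/a*" on a few sample paths.
paths = ["/a", "/ab", "/a/b"]
exact = [p for p in paths if matches("/a$", p)]
wild = [p for p in paths if matches("/a*", p)]
print(exact)  # ['/a']
print(wild)   # ['/a', '/ab', '/a/b']
```

Under these assumptions "/a$" matches a strict subset of what "/a*" matches, so in the usual regular expression sense the pattern with the dollar sign is the narrower one. The quoted rule, as read here, would declare "/a*" the narrower pattern instead.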

The ACAP committee continues to pile on new demands. Part 2 of their specification adds the full power of their protocol to META tags in HTML files. This means each HTML file could potentially have its own set of policies, and the content of the file must be read to determine whether it does.

Finally comes the problems of standardization that I described three years ago in the article From P2P to Web Services: Addressing and Coordination. Standards reflect current needs and go only so far as the committee’s imagination allows. They must be renegotiated by all parties whenever a new need arises. New uses may be held up until the specification is amended. And the addition of new use cases exacerbate the complexity from which ACAP already suffers.

Also 100% agreed. IMHO the spec is full of lock-ins into today’s technology, especially wrt. the present usage type.

James Grimmelmann: Automated Content Access Problems and Automated Content Access Progress

Last but not least, James Grimmelmann did a review and a follow-up, with Francis Cave (ACAP’s technical project manager) responding in the comments. The review also focuses on the technical quality (and goes into the specifics). Anybody interested in ACAP should read them completely (the same is true for the other reviews mentioned above).

Hence I just pick one quote that relates to one part of the specs I also found “strange” while reading them:

Take 2.5.3.1, which allows sites to express time limits for how long a document may be indexed. The time limits are expressed in days. Days. How hard would it have been to add hours and minutes? To use UTC times? To express what time zone a given date refers to? Not hard at all. But no one did, which says to me that the working group wasn’t pushed very hard by people who really design Internet software for a living.

100% agreed. Using standards like ISO 8601 / RFC 3339 or the date formats in RFC 822 or RFC 2616 is imperative for me. There are already more than enough formats, and IMHO it is better to choose one (or more) of them instead of suffering from the NIH syndrome. I would also like to know what is missing from the standard ISO formats. Omitting the time zone from a standard that is supposed to be used globally is simply not acceptable.
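To show how little it takes, here is a minimal Python sketch using only the standard library. The timestamp and the helper start_of_day are hypothetical and are not ACAP fields; the sketch merely illustrates that an RFC 3339 style timestamp pins down one instant, while a bare day can start up to a day apart depending on where you are.

```python
from datetime import datetime, timedelta, timezone

# A hypothetical expiry time written in RFC 3339 / ISO 8601 style with an offset.
expires = datetime.fromisoformat("2009-07-01T00:00:00+02:00")
expires_utc = expires.astimezone(timezone.utc)
print(expires_utc.isoformat())  # 2009-06-30T22:00:00+00:00


def start_of_day(date_text, utc_offset_hours):
    """Midnight of the given date in a fixed-offset zone (hypothetical helper)."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromisoformat(date_text).replace(tzinfo=zone)


# The same calendar day begins at very different instants around the globe.
gap = start_of_day("2009-07-01", -8) - start_of_day("2009-07-01", 12)
print(gap)  # 20:00:00
```

A limit expressed only in days therefore leaves a crawler guessing by many hours when a deadline actually falls, unless the spec also says which time zone counts.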

What next?

I sincerely hope that the ACAP specs will improve towards the standard I expect from specifications. If not, I think there is every reason to just ignore them and to conclude that the whole ACAP project was just a publicity stunt. Even Francis Cave admitted that it was wrong to call these specs 1.0. I personally would rather put them in the 0.1 to 0.3 range.

IMHO there are at least two areas where, besides adding examples and doing more work on the details, basic research has to be done:

Conflict Resolution: Algorithm and complexity

A very close look at the conflict resolution algorithm, which implements an essential part of the ACAP semantics, is definitely necessary in order to better understand whether ACAP’s approach is feasible. But section 2.4.5 is technically a mess, and I can’t help getting the feeling that the ACAP project took a “this is not our problem” approach to it. At least for me this attitude is expressed in the following sentence:

In the event that a crawler is unable to determine which ACAP permission or prohibition has the narrowest scope – regardless of whether or not such determination is theoretically possible – the usage is prohibited on that resource.

One could read that sentence the following way: “We don’t know if it is actually possible to implement the ACAP semantics, but you have to comply or take the risk of legal action if your interpretation of narrowest scope differs from our intended semantics”.

In order for a search engine to take this potentially huge risk, first the intended semantics must be expressed crystal clearly, or better, formally defined, and second it must be possible to implement the semantics in a way that is computationally feasible on a large scale.

But the semantics are in no way expressed crystal clearly or formally defined. For example, I couldn’t find a single explicit statement in the whole specs clarifying whether the underlying model of ACAP is permissive or restrictive. Hence I asked Francis Cave at the ACAP conference. His answer was that at the moment it is permissive, but that this may change in the future. But reading the above sentence, the opposite seems to be the case to me. I couldn’t find anything in section 2.4.5 clarifying what is supposed to happen when there is no matching permission rule for an actual URL.
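Why the missing default matters can be shown in a few lines of Python. This is a sketch under stated assumptions, not ACAP's algorithm: rules are plain path prefixes, the longest matching prefix wins, and the function decide with its default_allow parameter is invented for the illustration.

```python
def decide(path, rules, default_allow):
    """Decide whether a path may be crawled.

    rules is a list of (verb, prefix) pairs with verb 'allow' or 'disallow'.
    Assumption for this sketch: the longest matching prefix wins. If no rule
    matches at all, the answer is whatever the underlying model prescribes.
    """
    best = None
    for verb, prefix in rules:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best[1])):
            best = (verb, prefix)
    if best is None:
        return default_allow
    return best[0] == "allow"


rules = [("disallow", "/search/"), ("allow", "/news/")]

# No rule mentions /archive/, so the outcome depends purely on the model.
permissive = decide("/archive/2007/", rules, default_allow=True)
restrictive = decide("/archive/2007/", rules, default_allow=False)
print(permissive, restrictive)  # True False
```

The same rule set yields opposite answers for every URL that no rule mentions, which on most sites is the bulk of the content. A spec that leaves this open leaves the crawler guessing.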

Wrt. the implementation complexity of the intended ACAP semantics, my (educated) gut feeling is that the inclusion of the wildcard characters + and * completely changes the game. Prefix string matching (like in robots.txt) and subsumption testing for regular expressions are two completely different beasts wrt. computational complexity. AFAIR the latter is already PSPACE-complete for ordinary regular expressions and harder still for generalized regular expressions (which at first glance are needed because of numerous allow and deny rules for the same pattern). And subsumption checking is the intended behavior, or at least that is what I associate with the non-technical description of the conflict resolution algorithm:

If at least one permission field and at least one prohibition field apply to the same usage type and are applicable to the same resource, the permission or prohibition with the narrowest effective scope is applied and the others are ignored. Determining which permission or prohibition field has the narrowest effective scope depends upon a comparison of the resource path patterns of the conflicting fields.

Looking at regular expressions, I automatically, like most computer scientists, associate the phrase “scope” with the languages that these regular expressions accept. Hence the “narrowest scope” has to be determined by subsumption checking, in this case testing set inclusion of the accepted languages, meaning that there is only a partial order of the scopes. Two clauses “a” and “b” cannot be ordered wrt. scope; they are incomparable wrt. the partial order.
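The partial order can be made tangible with a brute-force Python sketch. It is only an approximation under loud assumptions: a toy alphabet of three characters, strings of at most four characters, '*' matching any run of characters, a trailing '$' anchoring the end, and prefix matching otherwise. A real subsumption test would have to compare automata rather than enumerate strings, and the names language and compare_scope are invented here.

```python
import re
from itertools import product

ALPHABET = "ab/"  # assumed toy alphabet, far smaller than real URLs
MAX_LEN = 4       # only strings up to this length are enumerated


def pattern_to_regex(pattern):
    """Assumed semantics: '*' is any run of characters, trailing '$' anchors."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def language(pattern):
    """All enumerated strings that the pattern matches (bounded approximation)."""
    rx = pattern_to_regex(pattern)
    accepted = set()
    for length in range(MAX_LEN + 1):
        for chars in product(ALPHABET, repeat=length):
            text = "".join(chars)
            if rx.match(text):
                accepted.add(text)
    return accepted


def compare_scope(p, q):
    """Compare two patterns by set inclusion of their (bounded) languages."""
    lp, lq = language(p), language(q)
    if lp == lq:
        return "equal"
    if lp.issubset(lq):
        return "narrower"
    if lp.issuperset(lq):
        return "wider"
    return "incomparable"


print(compare_scope("/a$", "/a*"))  # narrower
print(compare_scope("/a", "/a*"))   # equal
print(compare_scope("/a", "/b"))    # incomparable
```

The last line is the case from the text: neither pattern's language contains the other's, so no "narrowest scope" exists for that pair, and any procedure that always returns a winner must be computing something other than set inclusion.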

Since the following pseudo algorithm claims to determine for each pair of clauses which one has the narrowest scope, it has to implement something different. But even as a description of an algorithm that operationally implements the official ACAP conflict resolution scheme (and may be implementing something completely different from what is said above), it misses essential information, e.g.:

  • What is the underlying alphabet?
  • How is an actual URL then matched against this ordering of the scopes?
  • If there is an allow and a deny rule for the same pattern which one takes precedence?
  • What happens when no matching rule is to be found?
  • What happens when two matching rules are incomparable?

In order to clear up this mess I think the following has to be done:

  1. ACAP has to either formally define the intended semantics in an unambiguous way, or operationally define the semantics by the definition of an algorithm that actually solves the problem at hand.
  2. In the second case ACAP has to provide a reference implementation
  3. In either case the complexity of the intended semantics has to be determined
  4. If the complexity proves to be too high (and I guess any algorithm that is more than quadratic in the number of rules will be), measures have to be taken to cut the complexity down, e.g. restricting the number of rules, restricting the expressiveness of the rules etc.

The present verb

The second area to look at is the present verb. In its current form the present verb is too complex, confusing and, I think, just unnecessary. It adds too much complexity and creates lock-in to current technology. Most if not all use cases can also be implemented by presenting to the crawler the version of the content it should index. It’s just another user-agent to detect and to optimize the content for. Same game as with desktop browsers or mobile browsers, so every publisher seriously thinking about implementing ACAP should already have the necessary technology in place.

My personal consequence

With this article I’ll stop looking at ACAP in my spare time until the problems above are tackled / solved. I might return to ACAP in an official function representing dpa (and I’m looking forward to doing so).

ACAP – The strawman proposals

In a couple of hours I’m going to attend the “International Conference on the Conclusion of the ACAP Pilot” at AP’s premises here in NYC. (This has given me the chance to see some of my AP friends :-) ).

For those of you who don’t know ACAP, according to the website:

Following a successful year-long pilot project, ACAP (Automated Content Access Protocol) has been devised by publishers in collaboration with search engines to revolutionise the creation, dissemination, use, and protection of copyright-protected content on the worldwide web.

ACAP is set to become the universal permissions protocol on the Internet, a totally open, non-proprietary standard through which content owners can communicate permissions for access and use to online intermediaries.

IMHO, ACAP is first and foremost a much needed “neutral ground” where publishers can meet, exchange ideas, start joint lobbying efforts etc. Don’t get me wrong, I really think that such a place was missing, and hence I more or less talked my employer, the German news agency dpa, into becoming an ACAP member.

I also think that there is a lack of a “standard” way to communicate commercial content rights to end-users, search engines etc. Hence my primary reason for joining was to be able to take a closer look at the developed technology, in order to judge whether it would help to close this gap. That is also why I have already written a couple of times on the topic of ACAP on this blog.

ACAP – the promise

The following is stated on the invitation to the conference:

As ACAP reaches the final phase of its 12-month pilot, representatives of the
publishing and online community will be showcasing the successful development of
the new, open standard through which the owners of content published on the World
Wide Web can provide permissions information (relating to access and use of their
content) in a form that can be recognised and interpreted automatically, so that
search engine operators and other online intermediaries are enabled systematically
to comply with policies established by content owners.

ACAP will allow publishers, broadcasters and indeed any other publisher of content
on the network to express their individual access and use policies in a language that
search engine robots and similar automated tools can read and understand.

This conference will demonstrate beyond all doubt, the need for ACAP and the
potential disaster for the global publishing industry should it fail to embrace new
technology to protect its future.

Big words. So, later today a lot of publishing bigwigs will be at the conference, things will be announced, politicians are going to speak, and “I will be one of the few attendees who actually cared to read the technical documents”.

I’ve taken the time to read the:

  • Strawman proposals (part I and II), and the
  • Usage definitions

Unfortunately I had only the September documents with me. So I haven’t been able to check (until now) whether there are any significant differences in the October version and/or the final versions that were put up just some 30 hours ago.

Too bad that there are neither documents highlighting the edits between versions nor some easy way to run a diff on the versions. (Hint: There is a reason why RFCs are still plain text.)

So, to make it very clear: What follows is based on my loose acquaintance with the project (i.e. I attended the first conference in London) and a one-time thorough reading of the September draft specs (on a flight while having a terrible headache). So I might be terribly wrong. If so, please tell me.

ACAP – My verdict

While being very successful at organising a common platform for publishers, ACAP fails big time to convince me that the proposed technical solution is actually going to be the solution to the problems stated above.

And while I’m the first to agree that the publishing industry has to embrace new technologies in order to avoid potential disaster, I think ACAP carries more of a backward-oriented, let’s-protect-our-territory attitude than a forward-looking, let’s-explore-new-worlds way of thinking.

Single use case only

On the one hand it is too focused on a single use case: telling search engines what they are allowed to do with content residing on web sites.

What about readers? They also want to know what they are allowed to do with the content they are reading. Are they allowed to put the article about them on their personal website, on their blog?

What about content not residing as HTML/XHTML pages on websites? News agencies still deliver their content mainly via wires (via satellite or FTP). What about RSS and Atom feeds, the standard content delivery format in “the developed countries” of the internet? What about content reuse in Facebook / OpenSocial apps?

What about images, audio, video and a way to embed the rights into the original data?

Update: In the talks at the conference it became clear that there is some kind of roadmap for the other use cases, especially the syndication use case, which is the one I’m most interested in professionally. This use case is going to be based on an XML format (which itself is to be based on the ONIX proposals). Hopefully this will be drastically reduced in complexity (see my earlier posts on ONIX). If everything works out well, I might even contribute to that format.

Wrong granularity

Instead of looking at these broader issues, ACAP focuses on ways to define, at a very fine level of granularity, what search engines are allowed to do with the content residing on webpages. This leads to the possible creation of very complex permission sets, whose computational complexity makes them practically (maybe even theoretically) impossible for search engines to implement. Especially the permissions / restrictions defined on the present verb are very fine-grained and lead to very complex renderings which, given the presumably striking differences between the permissions given by different publishers, in addition lead to visual disaster on the search result pages.

Bad technical quality of the specs

I tried to read the documentation the same way I read term papers of students or research papers of colleagues when reviewing them back in my teaching / researching days. That means: trying very hard to understand what was written, scribbling remarks and question marks when I didn’t get it on first reading, checking for completeness of the presentation, looking out for contradictions, checking that all necessary information is self-contained etc. You know the drill.

And I have to say that I can’t remember a term paper, research paper or thesis that, even in its earliest versions, was of such bad technical quality as the September ACAP documentation. Maybe I’m getting old and do not remember correctly, and there have been some, but definitely not very many.

So I hope for an improvement in the final documentation, because in the September release the documentation fails miserably in fulfilling its self-proclaimed primary requirement:

Fundamentally, ACAP requires consistent and unambiguous interpretation of all its
permissions.

Update: I have now had the chance to look at the 1.0 specs and things definitely look better. In talks with the project participants it also became clear that (as usual) the docs had to be pushed out in a rush and that other, more refined documentation is on its way. I was also able to get an answer to the basic question of whether the basic model is permissive or restrictive, a piece of information that is missing from the documents. It is permissive, like the model of robots.txt. But this may change with a 2.0 version when ACAP is no longer that closely intertwined with the REP.

Where to go from here

A year ago, directly after the ACAP announcement, I wrote the following on this blog:

As often noticed, there is already a de facto standard protocol (the robots exclusion protocol) which is machine readable and tells search engines which content (not) to spider. So if a newspaper wants a search engine not to index its pages, all it has to do is include an appropriate robots.txt file. Furthermore there are also machine readable means (e.g. the creative commons license framework) for automatically communicating the terms under which content can be used.

Unfortunately the robots exclusion protocol is not an “official” standard, e.g. by the W3C, and the “creative commons” framework doesn’t cover possibilities to list exceptions to the various restrictions imposed by the license or in some way ease the waiving of the restrictions by (semi-)automatically getting the permission from the rightsholder.

So there definitely is room for improvement on both. As long as ACAP builds on these lightweight and broadly accepted standards, I’m interested in it. It might be useful and it might even be used.

Looking at these sentences today, I have to say that ACAP at least tried to build on the REP. But since they broadened their stated goals to a solution for the whole publishing industry, a REP-based solution is definitely not enough (see above).

And they completely neglected Creative Commons, IMHO a major mistake.

Creative Commons – the better ACAP?

CC tries to define different common use cases on the scale from “All rights reserved” to “Public domain”, leaning definitely towards the half where more rights are granted than reserved. The rights publishers typically have in mind are traditionally in the other half, making a perfect complementary fit.

And years and years of development have gone into supporting tools for CC, search engine enhancements supporting CC etc, not to mention all the work that has been spent in adapting the licenses to the local jurisdictions.

In addition to that, publishers will sooner or later publish CC-licensed content, so they have to know CC and implement it in their processes.

Hence, IMHO building on top of CC would definitely have been the better way to create ACAP. But I guess that this way was politically not feasible for the publishing industry.

Update: I was happy to hear that ACAP is going to talk to Creative Commons soon and also recognizes that Creative Commons is especially interesting for the non-search-engine use cases.

PS: I’ll try to write a second post with a technical critique of the September ACAP documentation. But since I’m leaving for a 2 1/2 week holiday tomorrow, I’m not sure if this is going to happen soon.