Unfortunately I haven’t found the time to take a detailed look at the 1.0 versions of the ACAP spec and do a technical review as promised last year. But others have done it, and I have to say that I largely agree with them and might have been harsher. See below for my favourite quotes from the reviews and some comments on their remarks.
Even more questions wrt. the ACAP spec (at least part 2) arise if you compare it with a very similar effort of the W3C called “Access Control for Cross-site Requests”. This working draft was brought to my attention only today (via John Resig’s blog post). But since it has been around since 2005, and I dearly hope that the ACAP project at least looked at the W3C to check for related work when embarking on its mission, I’m wondering about ACAP’s position wrt. this effort.
A couple of (technical) reviews of ACAP have shown up on the net (relatively few, I have to say, given the buzz ACAP tried to generate). Their findings are strikingly similar to my major points of critique:
- The lack of information in the spec. Because of this lack, a well-meaning search engine has to assume a worst-case scenario in its decision procedure.
- This in turn raises the computational complexity of the decision procedure that a search engine has to implement. The complexity is driven especially by the following aspects of the specification:
  - The allowance of wildcard characters in the resource paths
  - The complexity of the present construct
I’m deliberately going to focus on the technical parts of the various reviews and skip the non-technical parts (while mostly agreeing with them).
Martin Belam: ACAP – flawed and broken from the start?
Let’s start with Martin Belam’s review, titled “ACAP – flawed and broken from the start?”
It isn’t technically sound
When I say technically sound, I don’t mean it will actually break things, but it doesn’t appear to have a technically robust background to it.
Well, I say it won’t actually break things, but the technical framework document itself admits that the specification includes:
“extensions for which there are possible security vulnerability or other issues in their implementation on the web crawler side, such as creating possible new opportunities for cloaking or Denial of Service attack.”
I’ve no doubt that there has been technical input into the specification. It certainly doesn’t seem, though, to have been open to the round-robin peer review that the wider Internet community would expect if you were introducing a major new protocol you effectively intended to replace robots.txt.
I definitely agree with Martin on this one. IMHO especially the door for DoS attacks is left wide open (e.g. by serving ACAP files with an insane number of entries).
The ACAP website tools don’t work
So just to re-cap – the consortium behind ACAP are proposing a new technical standard that unilaterally extends HTML and robots.txt, and wants to redefine the machine-to-machine relationship between every web publisher and every search engine or content aggregator. According to the ACAP FAQ: “The primary drivers of ACAP are the World Association of Newspapers (WAN), the European Publishers Council (EPC) and the International Publishers Association (IPA)”
And yet it appears that between them they haven’t been able to hire a Perl developer who can write a CGI script that outputs plain text…?
I actually don’t think that this is essential. A tool to convert a robots.txt into an ACAP spec isn’t really needed. The fact that it is suboptimal IMHO just shows that the technical ACAP people saw it the same way and that the tool was only needed for marketing purposes.
An editor critiques the publishing industry’s Automated Content Access Protocol
Second, there is a review by Andy Oram on O’Reilly Radar called “An editor critiques the publishing industry’s Automated Content Access Protocol”:
Technical demands of ACAP
Lauren Weinstein presciently demonstrates that publishers are likely to turn ACAP from a voluntary cooperation into a legal weapon, and suggests that it shifts the regulatory burden for copyright infringement (as well as any other policy defined at the whim of the publishers) from the publishers to the search engines. I would add that a non-trivial technical burden is laid on search engines too.
First, the search engine must compile a policy that could be a Cartesian product of a huge number of coordinates, such as:
- Whether to index the actual page found, or another source specified by the publisher as a proxy for that page, or just to display some fixed text or thumbnail provided by the publisher
- When to take down the content or recrawl the site
- Whether conversions are permitted, such as from PDF to HTML
- Whether translations to another language are permitted
Seasoned computer programmers and designers by now can recognize the hoary old computing problem of exponential complexity–the trap of trying to apply a new tool to every problem that is currently begging for a solution. Compounding the complexity of policies is some complexity in identifying the files to which policies apply. ACAP uses the same format for filenames as robots.txt does, but some rarely-used extensions of that format interact with ACAP to increase complexity. Search engines decide which resources to apply a policy to by checking a filename such as: /news/*/image*/
The asterisks here can refer to any number of characters, including the slashes that separate directory names. So at whatever level in the hierarchy the image*/ subdirectories appear, the search engine has to double back and figure out whether it’s part of /news/. The calculation involved here shouldn’t be as bad as the notorious wildcard checks that can make a badly designed regular expression or SQL query take practically forever. For a directory pathname, there are ways to optimize the check–but it still must be performed on every resource. And if there are potentially competing directory specifications (such as /news/*.jpg) the search engine must use built-in rules to decide which specification applies, a check that I believe must be done at run-time.
100% agreed. ACAP gives a very rough description of how to do conflict resolution between multiple matching rules in section 2.4.5. Again, this description does not nearly have the level of detail I expect from the description of a crucial algorithm within a specification. Just compare it to the specification of the algorithms in the W3C spec mentioned above.
It further has a “smell” that the procedure does not nearly capture the semantics it is intended to capture. Especially the rule “If one pattern contains a dollar sign $ where the other pattern contains any other character (including the asterisk *), the other pattern has the narrower scope.” looks wrong and IMHO contradicts the classical regular expression semantics.
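To make the matching problem concrete, here is a minimal Python sketch of how a crawler might test a path against such a pattern. It assumes that `*` matches any run of characters (including slashes), that a trailing `$` anchors the end of the path, and that everything else is a plain prefix match. The function names `pattern_to_regex` and `matches` are made up for illustration and are not taken from the ACAP spec.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt-style path pattern into a compiled regex.

    Assumed semantics (not taken verbatim from the ACAP spec):
    '*' matches any run of characters, a trailing '$' anchors the end
    of the path, and without '$' the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape the literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    if anchored:
        regex += r"\Z"
    return re.compile(regex)


def matches(pattern: str, path: str) -> bool:
    """Return True if the path falls under the pattern."""
    # re.match anchors at the start, which gives the prefix behaviour.
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    path = "/news/2008/images/a.jpg"
    # Both patterns from the quoted review apply to the same path,
    # so a conflict resolution step is needed to pick one of them.
    print(matches("/news/*/image*/", path))
    print(matches("/news/*.jpg", path))
```

Even in this small form, every rule has to be tested against every resource, and two rules can easily apply to the same path, which is exactly where the conflict resolution of section 2.4.5 comes in.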
The ACAP committee continues to pile on new demands. Part 2 of their specification adds the full power of their protocol to META tags in HTML files. This means each HTML file could potentially have its own set of policies, and the content of the file must be read to determine whether it does.
Finally comes the problems of standardization that I described three years ago in the article From P2P to Web Services: Addressing and Coordination. Standards reflect current needs and go only so far as the committee’s imagination allows. They must be renegotiated by all parties whenever a new need arises. New uses may be held up until the specification is amended. And the addition of new use cases exacerbate the complexity from which ACAP already suffers.
Also 100% agreed. IMHO the spec is full of lock-ins into today’s technology, especially wrt. the present usage type.
James Grimmelmann: Automated Content Access Problems and Automated Content Access Progress
Last but not least, James Grimmelmann did a review and a follow-up, with Francis Cave (ACAP’s technical project manager) responding in the comments. The review also focuses on the technical quality (and goes into the specifics). Anybody interested in ACAP should read them completely (the same is true for the other reviews mentioned above).
Hence I just pick one quote that relates to a part of the specs I also found “strange” while reading them:
Take 18.104.22.168, which allows sites to express time limits for how long a document may be indexed. The time limits are expressed in days. Days. How hard would it have been to add hours and minutes? To use UTC times? To express what time zone a given date refers to? Not hard at all. But no one did, which says to me that the working group wasn’t pushed very hard by people who really design Internet software for a living.
100% agreed. Using standards like ISO 8601 / RFC 3339 or the date formats in RFC 822 or RFC 2616 is imperative for me. There are already more than enough formats, and IMHO it is better to choose one (or more) of them instead of suffering from the NIH syndrome. I would also like to know what is missing from the standard ISO formats. Omitting the time zone from a standard that is supposed to be used globally is simply not acceptable.
I sincerely hope that the ACAP specs will improve towards the standard I expect from specifications. If not, I think there is every reason to just ignore them and to conclude that the whole ACAP project was just a publicity stunt. Even Francis Cave admitted that it was wrong to call these specs 1.0. I personally would rather put them in the 0.1 to 0.3 range.
IMHO there are at least two areas where, besides adding examples and working on the details, basic research has to be done:
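As a rough illustration of how little it takes to use an existing format, here is a Python sketch that checks a time limit given as an RFC 3339 style timestamp with an explicit UTC offset. The function name `is_expired` and the idea of expressing the limit as an absolute instant are my assumptions for the example; the ACAP spec itself only talks about days.

```python
from datetime import datetime, timezone


def is_expired(expires_at: str, now: datetime) -> bool:
    """Check a time limit given as a timestamp with an explicit offset.

    'expires_at' is assumed to look like '2008-01-15T12:30:00+01:00'.
    Timestamps without an offset are rejected instead of guessed at.
    """
    deadline = datetime.fromisoformat(expires_at)
    if deadline.tzinfo is None:
        raise ValueError("timestamp must carry a UTC offset")
    # Aware datetimes compare correctly across different offsets.
    return now >= deadline


if __name__ == "__main__":
    # 12:30 at +01:00 is 11:30 UTC.
    limit = "2008-01-15T12:30:00+01:00"
    before = datetime(2008, 1, 15, 11, 0, tzinfo=timezone.utc)
    after = datetime(2008, 1, 15, 11, 45, tzinfo=timezone.utc)
    print(is_expired(limit, before))
    print(is_expired(limit, after))
```

With an offset in the timestamp, a publisher in one time zone and a crawler in another agree on the same instant without any extra convention.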
Conflict Resolution: Algorithm and complexity
A very close look at the conflict resolution algorithm, which implements an essential part of the ACAP semantics, is definitely necessary in order to better understand whether ACAP’s approach is feasible. But section 2.4.5 is technically a mess, and I can’t help getting the feeling that the ACAP project took a “this is not our problem” approach to it. At least for me this attitude is expressed in the following sentence:
In the event that a crawler is unable to determine which ACAP permission or prohibition has the narrowest scope – regardless of whether or not such determination is theoretically possible – the usage is prohibited on that resource.
One could read that sentence the following way: “We don’t know if it is actually possible to implement the ACAP semantics, but you have to comply or take the risk of legal action if your interpretation of narrowest scope differs from our intended semantics”.
In order for a search engine to take this potentially huge risk, first the intended semantics must be expressed crystal clear, or better, formally defined, and second it must be possible to implement the semantics in a way that is computationally feasible on a large scale.
But the semantics are in no way expressed crystal clear or formally defined. For example, I couldn’t find a single explicit statement in the whole spec clarifying whether the underlying model of ACAP is permissive or restrictive. Hence I asked Francis Cave at the ACAP conference. His answer was that at the moment it is permissive, but that this may change in the future. But reading the above sentence, the opposite seems to be the case to me. I couldn’t find anything in section 2.4.5 clarifying what is supposed to happen when there is no matching permission rule for an actual URL.
Wrt. the implementation complexity of the intended ACAP semantics, my (educated) gut feeling is that the inclusion of the wildcard characters + and * completely changes the game. Prefix string matching (as in robots.txt) and subsumption testing for regular expressions are two completely different beasts wrt. computational complexity. AFAIR the latter is co-NP-complete for generalized regular expressions (which at first glance are needed because of the numerous allow and deny rules for the same pattern). And subsumption checking is the behavior I, at least, associate with the non-technical description of the conflict resolution algorithm:
If at least one permission field and at least one prohibition field apply to the same usage type and are applicable to the same resource, the permission or prohibition with the narrowest effective scope is applied and the others are ignored. Determining which permission or prohibition field has the narrowest effective scope depends upon a comparison of the resource path patterns of the conflicting fields.
Looking at regular expressions, I automatically, like most computer scientists, associate the phrase “scope” with the languages that these regular expressions accept. Hence the “narrowest scope” has to be determined by subsumption checking, in this case by testing set inclusion of the accepted languages, which means that there is only a partial order on the scopes. Two clauses “a” and “b” cannot be ordered wrt. scope; they are incomparable wrt. the partial order.
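The partial order can be made tangible with a small Python sketch. It is a brute-force approximation, not a real subsumption test: it enumerates all strings up to a fixed length over a tiny alphabet and compares the sets of strings each pattern accepts. The alphabet, the length bound, the pattern semantics (`*` as any run of characters, trailing `$` as end anchor, prefix match otherwise) and the names `language` and `compare` are all assumptions made for illustration.

```python
import re
from itertools import product

ALPHABET = "ab/"  # assumption: a deliberately tiny alphabet
MAX_LEN = 4       # assumption: only strings up to this length are checked


def to_regex(pattern: str) -> "re.Pattern[str]":
    """Assumed semantics: '*' is any run, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def language(pattern: str) -> set:
    """All strings up to MAX_LEN over ALPHABET that the pattern accepts."""
    rx = to_regex(pattern)
    accepted = set()
    for length in range(MAX_LEN + 1):
        for chars in product(ALPHABET, repeat=length):
            candidate = "".join(chars)
            if rx.match(candidate):
                accepted.add(candidate)
    return accepted


def compare(p: str, q: str) -> str:
    """Order two patterns by set inclusion of their (bounded) languages."""
    lp, lq = language(p), language(q)
    if lp == lq:
        return "equal"
    if lp < lq:
        return "narrower"
    if lp > lq:
        return "wider"
    return "incomparable"


if __name__ == "__main__":
    print(compare("a", "b"))
    print(compare("/a/b", "/a"))
```

Under these assumptions the clauses “a” and “b” come out as incomparable, while “/a/b” is narrower than “/a”. The enumeration grows exponentially with the length bound, so this only serves to illustrate the ordering, not to suggest how a crawler should implement it.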
Since the pseudo algorithm that follows in the spec claims to determine, for each pair of clauses, which one has the narrowest scope, it has to implement something different. But even as a description of an algorithm that operationally implements the official ACAP conflict resolution scheme (and may be implementing something completely different from what is said above), it misses essential information, e.g.:
- What is the underlying alphabet?
- How is an actual URL then matched against this ordering of the scopes?
- If there is an allow and a deny rule for the same pattern which one takes precedence?
- What happens when no matching rule is to be found?
- What happens when two matching rules are incomparable?
In order to clear up this mess, I think the following has to be done:
- ACAP has either to formally define the intended semantics in an unambiguous way, or operationally define the semantics by the definition of an algorithm that actually solves the problem at hand.
- In the second case ACAP has to provide a reference implementation
- In either case the complexity of the intended semantics has to be determined
- If the complexity proves to be too high (and I guess any algorithm that is more than quadratic in the number of rules will be), measures have to be taken to cut the complexity down, e.g. restricting the number of rules, restricting the expressiveness of the rules etc.
The present verb
The second area to look at is the present verb. In its current form the present verb is too complex, confusing and, I think, just unnecessary. It adds too much complexity and creates lock-in to current technology. Most if not all use cases may also be implemented by presenting to the crawler the version of the content it should index. It’s just another user agent to detect and to optimize the content for. Same game as with desktop browsers or mobile browsers, so every publisher seriously thinking about implementing ACAP should already have the necessary technology in place.
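A minimal Python sketch of what I mean by treating the crawler as just another user agent. The token list and the variant names are purely illustrative assumptions, and real user-agent detection is considerably messier than a substring test.

```python
# Assumption: illustrative substrings only, not a complete crawler list.
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")


def select_variant(user_agent: str) -> str:
    """Pick which version of a page to serve for a given User-Agent header.

    Returns one of the hypothetical variant names
    'crawler', 'mobile' or 'desktop'.
    """
    ua = user_agent.lower()
    if any(token in ua for token in CRAWLER_TOKENS):
        return "crawler"
    if "mobile" in ua:
        return "mobile"
    return "desktop"


if __name__ == "__main__":
    bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(select_variant(bot))
```

The publisher then decides on the server side what the “crawler” variant contains, instead of describing the desired presentation to the search engine in a separate policy language.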
My personal consequence
With this article I’ll stop looking at ACAP in my spare time until the problems above are tackled / solved. I might return to ACAP in an official function representing dpa (and I’m looking forward to doing so).