DocumentCloud and OpenCalais: Some Questions


Recently another Knight News Challenge winner announced the availability of an open source version of its code. This time it is DocumentCloud, a joint effort of the NYT and ProPublica (not sure why these bigshots need grant money to do these things, but that is another story). It is open-sourcing CloudCrowd, which claims: "Parallel processing for the rest of us."

Yesterday it announced another two dozen high-profile content partners (Nieman Lab’s view on this) as well as a partnership with Thomson Reuters’ OpenCalais (DocumentCloud Blog Post):

This morning we’re excited to announce a partnership with Thomson Reuters, which is contributing its OpenCalais service to DocumentCloud. OpenCalais uses natural language processing to extract information from documents, instantly identifying and tagging the relevant people, places, companies, facts and events. This will make it easy for readers and journalists to explore connections between documents and across the full collection of source materials.

I’m very excited to use DocumentCloud / CloudCrowd, but I have a couple of questions regarding the OpenCalais Terms of Service. Since I’m not sure when they’ll make it through moderation, I’m reposting them here:

Can you (and Thomson Reuters) please clarify whether you are using the public free version of OpenCalais? If so, it would be very helpful to get your reading of the terms of service. Until now the terms have been the reason I’m very hesitant to use OpenCalais for tasks at my news org.

Since I would very much love to use OpenCalais and DocumentCloud, it would be very helpful for me to get more information on your interpretation of the following parts of the terms:

1. As far as I understand, the terms of service of the public version allow Reuters to keep and use the metadata (and there is some rumour that the full text counts as part of the metadata):


“You understand that Thomson Reuters will retain a copy of the metadata submitted by you or that generated by the Calais service. By submitting or generating metadata through the Calais service, you grant Thomson Reuters a non-exclusive perpetual, sublicensable, royalty-free license to that metadata.”

Since Reuters somehow has to fund the operation of OpenCalais, I’m basically fine with that clause, but it would be interesting to know which types of services and products they are sublicensing the metadata to.

2. IMHO the terms make it at least difficult to use other means of metadata extraction (e.g. homegrown NLTK / GATE jobs, the MetaCarta API, Inxight, Empolis, etc.) or to offer this metadata as part of your own API, e.g. the NYT API.

“# If you syndicate, publish or otherwise transmit any content containing, enhanced by or derived from Calais-generated metadata you will use your best efforts to incorporate the correct Calais-provided Globally Unique Identifier (GUID) in that content. You specifically agree not to attach incorrect GUIDs to your content with any intent to mislead, spam, spoof, phish or otherwise deceive downstream consumers of your content.

# You will not use any metadata or GUIDs produced by Calais to create a metadata retrieval service similar to Calais. To ensure the quality of metadata for all Calais users we want to maintain a single verifiable metadata storage location.”

I read these clauses to mean that e.g. the NYT API (which is, among other things, a metadata retrieval service for people and places) is not allowed to use the public OpenCalais service as part of its processing. Is my interpretation too strict? I’m basically talking about OpenCalais as a preprocessing step whose results would be curated by human beings.

BTW: The last sentence of that quote looks very strange to me given the “Linked Open Data” initiative, including Freebase, DBpedia etc., which all provide their own GUIDs.

3. To me it looks like DocumentCloud is in direct violation of one clause of the terms:

“You will not do bulk processing where you are adding minimal value beyond adding Calais metadata to the content. For example – if you are a webcrawler you should not send everything to Calais before sending it to your users.”

Since DocumentCloud is all about bulk processing: was this clause waived for DocumentCloud, including all uses of DocumentCloud outside the original partners’ installation (e.g. systems derived from DocumentCloud, GitHub clones, …)? Or does it mean that I cannot do only metadata annotation in a DocumentCloud job but have to do some other things in the same job too?

I hope that most if not all of these questions have already been asked and answered by the various content partners, and that it’s easy for you to answer them.