DocumentCloud and OpenCalais: Some Questions

Recently another Knight News Challenge winner announced the availability of an open source version of its code. This time it is DocumentCloud, a joint effort of the NYT and ProPublica (not sure why these big shots need grant money to do these things, but that is another story). It is open-sourcing CloudCrowd, which claims to offer “parallel processing for the rest of us.”

Yesterday it announced another two dozen high-profile content partners (Nieman Lab’s view on this) as well as a partnership with Thomson Reuters’ OpenCalais (DocumentCloud blog post):

This morning we’re excited to announce a partnership with Thomson Reuters, which is contributing its OpenCalais service to DocumentCloud. OpenCalais uses natural language processing to extract information from documents, instantly identifying and tagging the relevant people, places, companies, facts and events. This will make it easy for readers and journalists to explore connections between documents and across the full collection of source materials.

I’m very excited to use DocumentCloud / CloudCrowd, but I have a couple of questions regarding the OpenCalais Terms of Service. Since I’m not sure when they’ll make it through moderation, I’m reposting them here:

Can you (and Thomson Reuters) please clarify whether you are using the free public version of OpenCalais? If so, it would be very helpful to get your reading of the terms of service. Until now, the terms have been the reason I’ve been very hesitant to use OpenCalais for tasks at my news org.

Since I would very much love to use OpenCalais and DocumentCloud, it would be very helpful for me to get more information on your interpretation of the following parts of the terms:

1. As far as I understand, the terms of service of the public version allow Reuters to keep and use the metadata (and there is some rumour that the full text is part of the metadata).

“You understand that Thomson Reuters will retain a copy of the metadata submitted by you or that generated by the Calais service. By submitting or generating metadata through the Calais service, you grant Thomson Reuters a non-exclusive perpetual, sublicensable, royalty-free license to that metadata.”

Since Reuters somehow has to fund the operation of OpenCalais, I’m basically fine with that clause, but it would be interesting to know which types of services and products they are sublicensing the metadata to.

2. IMHO the terms make it difficult, at the very least, to use other means of metadata extraction (e.g. homegrown NLTK / GATE jobs, the MetaCarta API, Inxight, Empolis, etc.) or to offer this metadata as part of your own API, e.g. the NYT API.

“# If you syndicate, publish or otherwise transmit any content containing, enhanced by or derived from Calais-generated metadata you will use your best efforts to incorporate the correct Calais-provided Globally Unique Identifier (GUID) in that content. You specifically agree not to attach incorrect GUIDs to your content with any intent to mislead, spam, spoof, phish or otherwise deceive downstream consumers of your content.

# You will not use any metadata or GUIDs produced by Calais to create a metadata retrieval service similar to Calais. To ensure the quality of metadata for all Calais users we want to maintain a single verifiable metadata storage location.”

I read these clauses to mean that, for example, the NYT API (which is, among other things, a metadata retrieval service for people and places) is not allowed to use the public OpenCalais service as part of its processing. Is my interpretation too strict? I’m basically talking about OpenCalais as a preprocessing step where the results would be curated by human beings.

BTW: The last sentence of that quote looks very strange to me given the “Linked Open Data” initiative, including Freebase, DBpedia etc., which all provide their own GUIDs.

3. To me, it looks like DocumentCloud is in direct violation of one clause of the terms:

“You will not do bulk processing where you are adding minimal value beyond adding Calais metadata to the content. For example – if you are a webcrawler you should not send everything to Calais before sending it to your users.”

Since DocumentCloud is all about bulk processing: was this clause waived for DocumentCloud, including all uses of DocumentCloud outside the original partners’ installation (e.g. systems derived from DocumentCloud, GitHub clones, …)? Or does it mean that I cannot do only metadata annotation in a DocumentCloud job but have to do some other things in the same job too?

I hope that most if not all of these questions have already been asked and answered by the various content partners and that it’s easy for you to answer them.

2 thoughts on “DocumentCloud and OpenCalais: Some Questions”

  1. Gerd:

    I’m not going to respond to every point you make – but I will make a few general comments and observations. Of course you should take these as my interpretations – unless revised the Calais TOS document itself is the correct reference.

    DocumentCloud is a big deal for us. It not only leverages the technology we’ve deployed – but it serves a greater social good of supporting journalism and the free and open exchange of ideas and information.

    It’s also unique. Issues of integrity of information, confidentiality and transparency will be integral to the success of the DocumentCloud project. That being said – I’m certain that OpenCalais, the DocumentCloud team and the many publishers involved will be having many discussions about how to accomplish these goals. Those discussions may well lead to a relationship model that is unique to the DocumentCloud project.

    A few specific points.

    Yes, OpenCalais does retain the metadata. Contrary to any rumor you might have heard we do not and never have retained any original content or claimed any rights to it. It’s your content. Period. No exceptions. Ever.

    When you talk about our use of the metadata it’s important to make a basic distinction. OpenCalais retains metadata at two levels: the document and what I’ll refer to as “atomized” metadata – they’re two very different things.

    Document level metadata is – obviously – all of the metadata associated with a specific document. We consider this metadata to be particularly confidential and never expose it to any other OpenCalais user. The only mechanism for another user to gain access to this metadata is by the content submitter sharing a secret key – specifically a GUID – with someone else. If they don’t share it their metadata is never exposed to other OpenCalais users.

    You ask a fair question regarding what we’re doing with that metadata and the honest answer is – not much. At some point we’ll probably conduct some experiments such as looking for trends in co-occurrence of mentions of companies and other statistical examination – but that’s all that’s on the horizon at this point. When we reach some to-be-determined size threshold, there may be some interesting statistical insights we can glean that will be of value to us.

    Atomic metadata on the other hand is widely shared. Let’s talk about what it is and how it’s shared. If you send us an article about, for example, mining – we’ll create the document level RDF (which will identify some companies and probably a lot of other things) and store it away. We’ll then break the RDF up and extract entities from it – for example “Metalline Mining Company”. Those specific entities are then published in our Linked Data ecosystem at a unique URL. Here’s an example: As anyone familiar with the Linked Data standard knows, this is the first step toward enhancing the value of your content assets using Linked Data resources.

    As far as metadata retrieval – yes your interpretation is too strict. This isn’t about metadata extraction – all we ask is basically that you leave the OpenCalais GUIDs as is so that users are pointed back to our Linked Data store rather than some copy. That’s the only way we can ensure that the Linked Data references generated from OpenCalais are of high quality. Linked Data is a great thing – but we may end up in a situation with lots of dead or outdated links lying around – and we’d like to avoid that for OpenCalais users.

    As far as bulk processing – DocumentCloud is clearly – to an enormous extent – adding value beyond scraping, tagging and republishing. They’re empowering more effective journalism. We’re absolutely good with their use of the service.

    While I know I haven’t addressed each and every point you made I hope I’ve conveyed our general intentions and approach. OpenCalais has always striven to be transparent in our motivations, terms of service and privacy policies. With DocumentCloud we’ll continue with that transparent position while ensuring the OpenCalais service supports the unique needs of a large journalistic consortium.


    Tom Tague
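
The distinction Tom draws between document-level and “atomized” metadata can be made concrete with a small sketch. This is only an illustration of the idea as described in his comment, not how OpenCalais is actually implemented: the `MetadataStore` class, the `Entity` type and the `BASE_URL` address are all hypothetical names invented for this example.

```python
import uuid
from dataclasses import dataclass

# Hypothetical address, NOT the real OpenCalais Linked Data location.
BASE_URL = "https://linked-data.example.org/entity/"
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, BASE_URL)


@dataclass(frozen=True)
class Entity:
    kind: str   # e.g. "Company"
    name: str   # e.g. "Metalline Mining Company"

    def url(self) -> str:
        # The same (kind, name) pair always maps to the same URL,
        # so every document mentioning the entity points at one place.
        return BASE_URL + str(uuid.uuid5(NAMESPACE, f"{self.kind}:{self.name}"))


class MetadataStore:
    def __init__(self):
        self._documents = {}  # secret document GUID -> entities (kept private)
        self.published = {}   # entity URL -> Entity (shared with everyone)

    def submit(self, entities):
        """Store document-level metadata and publish its atomic entities."""
        found = list(entities)
        doc_guid = str(uuid.uuid4())  # acts as the submitter's secret key
        self._documents[doc_guid] = found
        for entity in found:
            self.published[entity.url()] = entity
        return doc_guid

    def document_metadata(self, doc_guid):
        """Return a document's metadata only to someone who knows its GUID."""
        return self._documents.get(doc_guid)
```

In this sketch, two submitted documents that both mention the same company produce two separate private document records but only one shared entity entry, which is the split between confidential document metadata and widely shared atomic metadata described above.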
