Ed. note: Another autumn-cleaning action. This time the post has been sitting here in draft state for at least 15 months. Time to get it out the door.
It’s part of a mini-series called “Going places” about geocoding news at the source. Prior installments of this series can be found here. They explain, for example, what places of news, places within news and scopes are.
Current status of geocoded news at dpa-infocom
For approximately 15 months we have been geocoding both places of news and places within news in our regional online wires. Right now we are geocoding:
- scopes as places of news and
- (generalized) addresses as places within news.
Representing geocodes within NITF
Since we are a news agency, IPTC formats are more or less the de facto standard for delivering news to our customers :-( Being the online and mobile subsidiary, we deliver our wires as NITF. Hence we had to find a way to fit this information into this format.
In order to minimize the hassle for us as well as our customers we had to stay within the bounds of the NITF format as much as possible.
Since the main delivery model in the news industry is still push delivery (mostly via FTP :-( ), there is also a need to include as much information about the scopes of news as possible in the document itself. Offering only a pointer (e.g. to a RESTful API) to additional information would be of use only to our most advanced customers :-(
Hence we chose to use the language constructs for describing locations already provided in the format as much as possible, and resorted to other means only where the format offered no way of describing this information at all.
The NITF format (see NITF Documentation) provides at least two ways of representing location information:
- evloc : Event location. Where an event took place (as opposed to where the story was written).
- location : Significant place mentioned in an article. Used to normalize locations.
The first question to ask is why there are two different tags for geocoding locations (I suspect the standardisation process is responsible for that). Looking at the DTD definitions for both evloc and location, one can see that they both try to describe the same information, but the location tag is actually the better and more detailed way of doing so.
Since we already used the evloc tag for denoting the country where the event primarily took place (i.e. some inverse-locus-like information), we had every reason to use only the location tag for the places of news as well as the places within news.
One might also note by looking at the DTD that apparently the NITF standardisation body didn’t consider the scope of the news to be a location and didn’t include any means to include any actual geographic data (e.g. points, lines, polygons, …).
Luckily, an arbitrary number of locations can be included via the location tag, although unfortunately only in the head section of the document. Within each location, the NITF DTD then allows an arbitrary number of country, state, region, city and sublocation tags.
In order to be able to represent the hierarchy unambiguously, we restricted this to a single occurrence each of the tags country, state, region and city, as well as up to two sublocation tags.
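A consumer of the wire could check this restriction before relying on the hierarchy. The following is only a minimal sketch, assuming the locations are parsed with Python's standard `xml.etree.ElementTree`; the function name `check_location` and the two inline samples are made up for illustration and are not part of our delivery format.

```python
import xml.etree.ElementTree as ET

# Limits as described in the text: one each of country/state/region/city,
# and at most two sublocation tags per <location>.
MAX_OCCURRENCES = {"country": 1, "state": 1, "region": 1, "city": 1, "sublocation": 2}


def check_location(location):
    """Return a list of human-readable violations for one <location> element."""
    problems = []
    for tag, limit in MAX_OCCURRENCES.items():
        count = len(location.findall(tag))
        if count > limit:
            problems.append(f"{tag}: {count} occurrences, at most {limit} allowed")
    return problems


# Hypothetical samples: one that respects the restriction, one that does not.
ok = ET.fromstring(
    '<location><city>München</city><state>Bayern</state>'
    '<country iso-cc="DEU">Deutschland</country></location>'
)
bad = ET.fromstring("<location><city>München</city><city>Karlsruhe</city></location>")

print(check_location(ok))
print(check_location(bad))
```

The DTD itself would accept both samples; the check only enforces the narrower convention described above.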
Before going into detail, below is an example of a news story that has both scopes and addresses. I guess it is the best way to explain our approach and to describe some of the problems we had (and still have) to navigate. (Location-relevant part highlighted.)
<?xml version="1.0" encoding="UTF-8"?>
<!-- DOCTYPE nitf PUBLIC "-//IPTC-NAA//DTD NITF-XML 3.0//EN" "nitf.dtd" -->
<nitf xmlns:georss="http://www.georss.org/georss">
  <head>
    <title>Bayern München II schlägt Karlsruhe 3:1</title>
    ...
    <identified-content>
      <location class="scope">
        <region region-code="09184000" code-source="AGS">München
          <georss:point>11.5725580365 48.1379548096</georss:point>
        </region>
        <state state-code="09000000" code-source="AGS">Bayern
          <georss:point>11.5725580365 48.1379548096</georss:point>
        </state>
        <country iso-cc="DEU">Deutschland</country>
      </location>
      <location class="scope">
        <city city-code="09162000" code-source="AGS">München
          <georss:point>11.5725580365 48.1379548096</georss:point>
        </city>
        <state state-code="09000000" code-source="AGS">Bayern
          <georss:point>11.5725580365 48.1379548096</georss:point>
        </state>
        <country iso-cc="DEU">Deutschland</country>
      </location>
      <location class="scope">
        <city city-code="08212000" code-source="AGS">Karlsruhe
          <georss:point>8.40437796821 49.0092142029</georss:point>
        </city>
        <state state-code="08000000" code-source="AGS">Baden-Württemberg
          <georss:point>9.17871582656 48.7750805322</georss:point>
        </state>
        <country iso-cc="DEU">Deutschland</country>
      </location>
      <location class="address">
        Grünwalder Stadion, Grünwalder Straße, München, Germany
        <georss:point>11.566936 48.101078</georss:point>
        <city>München</city>
        <region>München</region>
        <state>Bayern</state>
        <country iso-cc="DEU">Deutschland</country>
      </location>
    </identified-content>
    </docdata>
  </head>
  <body>
    ...
  </body>
</nitf>
I’ve chosen this story because it is about a soccer game, an example scenario I used in my last post. We encoded three scopes and one address.
Since it is a third-league game, the editors chose to select only administrative regions covering the cities at county level. One city (Munich) is actually divided into two counties, hence the total of three counties.
Had it been a premier-league game, there would most likely have been only a single scope, the whole of Germany, whereas a second-league game would presumably be encoded with some states.
The address is that of the stadium where the soccer game took place.
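To make the split into scopes and addresses concrete, here is a rough sketch of how a customer might read the location entries, again assuming Python's standard `xml.etree.ElementTree`. The `SAMPLE` document is a shortened stand-in for the story above; wrapping identified-content in docdata is an assumption, and the helper names `locations_by_class` and `most_specific_code` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the story above (one scope, one address).
SAMPLE = """<nitf xmlns:georss="http://www.georss.org/georss">
  <head>
    <docdata>
      <identified-content>
        <location class="scope">
          <city city-code="08212000" code-source="AGS">Karlsruhe</city>
          <state state-code="08000000" code-source="AGS">Baden-Württemberg</state>
          <country iso-cc="DEU">Deutschland</country>
        </location>
        <location class="address">
          Grünwalder Stadion, Grünwalder Straße, München, Germany
          <city>München</city>
          <country iso-cc="DEU">Deutschland</country>
        </location>
      </identified-content>
    </docdata>
  </head>
</nitf>"""


def locations_by_class(root):
    """Group <location> elements by their class attribute ("scope" or "address")."""
    grouped = {}
    for loc in root.iter("location"):
        grouped.setdefault(loc.get("class", "unknown"), []).append(loc)
    return grouped


def most_specific_code(scope):
    """Return the AGS code of the most specific level present in a scope, if any."""
    for tag in ("city", "region", "state"):
        el = scope.find(tag)
        if el is not None and el.get("code-source") == "AGS":
            return el.get(f"{tag}-code")
    return None


grouped = locations_by_class(ET.fromstring(SAMPLE))
print([most_specific_code(s) for s in grouped["scope"]])
print(len(grouped["address"]))
```

Treating city as more specific than region, and region as more specific than state, mirrors the way the example uses these tags; it is a reading of the sample, not something the DTD prescribes.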
So let’s have a closer look at the example’s representation.
<location>
  <region region-code="09184000" code-source="AGS">München
    <georss:point>11.5725580365 48.1379548096</georss:point>
  </region>
  <state state-code="09000000" code-source="AGS">Bayern
    <georss:point>11.5725580365 48.1379548096</georss:point>
  </state>
  <country iso-cc="DEU">Deutschland</country>
</location>
- NITF already provides attributes called xxxx-code and code-source for all possible subtags of location. Since we primarily use the official German coding scheme for administrative regions, the “Amtlicher Gemeindeschlüssel” (AGS for short), it is natural to encode it the way we do.
- AGS is actually a hierarchical coding scheme (2 digits: state, 1 digit: sub-state level (“Regierungsbezirk”), 2 digits: county, 3 digits: city/town), hence we could do away with the state tag, but we decided to be as explicit as we could.
- Since we were using the three-letter variant of ISO 3166 for the evloc tag, we decided to use this variant for the iso-cc attribute of the country tag as well.
- We introduced scopes first, and some customers who wanted to place markers on their maps asked for coordinates. Hence we chose to include the “official” coordinates of the admin region, as given in another GIS dataset we bought, using a simple georss:point tag.
- We were not allowed to distribute the geometries of the admin regions under our licensing deal (yes, you have to buy this data in Germany), and the geometries would in any case have used far too much bandwidth to send within the wire.
- In hindsight I would like to remove the coordinates from scope items, since they are a) highly redundant, b) not always available and c) a constant source of discussion about what an appropriate representative coordinate for a geographic extent might be.
- We currently use other coding schemes at the sub-city level for some cities (also official coding schemes, issued by the city government). But since these are hard to come by at a national level, we are currently considering alternatives.
- We are also considering extending the geocoding to our non-regional, i.e. national and international, wires. Here we are looking into using the ISO 3166-2 coding scheme and the NUTS 3 coding scheme for the European Union.
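The hierarchical nature of the AGS codes can be illustrated with a few lines of Python. This is a sketch under the assumption of the usual 8-digit layout (2 + 1 + 2 + 3 digits), which matches the 8-digit codes in the example; `AgsParts`, `split_ags` and `state_code` are hypothetical names chosen for this illustration.

```python
from typing import NamedTuple


class AgsParts(NamedTuple):
    state: str         # 2 digits, e.g. "09" for Bayern
    district: str      # 1 digit, the "Regierungsbezirk"
    county: str        # 2 digits
    municipality: str  # 3 digits, "000" when the code denotes a whole county


def split_ags(code: str) -> AgsParts:
    """Split an 8-digit AGS code into its hierarchical parts."""
    if len(code) != 8 or not code.isdigit():
        raise ValueError(f"not an 8-digit AGS code: {code!r}")
    return AgsParts(code[:2], code[2], code[3:5], code[5:])


def state_code(code: str) -> str:
    """Derive the zero-padded state code, as used in the state-code attribute above."""
    return split_ags(code).state + "000000"


print(split_ags("09162000"))   # city-code of München from the example
print(state_code("08212000"))  # state of Karlsruhe
```

This is also why the state tag is strictly redundant: the state code can always be derived from a county or city code.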
<location>
  Grünwalder Stadion, Grünwalder Straße, München, Germany
  <georss:point>11.566936 48.101078</georss:point>
  <city>München</city>
  <region>München</region>
  <state>Bayern</state>
  <country iso-cc="DEU">Deutschland</country>
</location>
Remarks:
- Addresses are provided by the editor.
- The level of detail presented (exact address, street level, district or city) is an editorial decision based on data protection regulations.
- The address is then geocoded by Google Maps Premiere, and the resulting coordinates and the address returned by the geocoder are shown to the editor, who validates them.
- This information, together with a label for the address, is encoded into an NITF location tag in the form: label, address
- The returned coordinates are also encoded into a georss:point tag.
- region, state and country are taken from the respective fields of the structured response of the Google geocoder. Hence their spelling might differ from the respective official names. We chose not to harmonize them via point-in-polygon queries because this would have meant running a spatially enabled database, e.g. PostgreSQL/PostGIS.
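Reading such an address entry could look roughly like the sketch below, which assumes Python's standard `xml.etree.ElementTree`. Two assumptions are worth spelling out: the label is taken to be everything before the first comma (labels containing commas would break this), and the coordinates are read as longitude followed by latitude, because that is how the sample values read, even though GeoRSS Simple specifies latitude first. `parse_address` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

GEORSS = "{http://www.georss.org/georss}"

# The address entry from above, with the namespace declared locally
# so that the fragment parses on its own.
SAMPLE = """<location xmlns:georss="http://www.georss.org/georss">
  Grünwalder Stadion, Grünwalder Straße, München, Germany
  <georss:point>11.566936 48.101078</georss:point>
  <city>München</city>
  <region>München</region>
  <state>Bayern</state>
  <country iso-cc="DEU">Deutschland</country>
</location>"""


def parse_address(location):
    """Extract label, address and coordinates from an address <location>."""
    text = (location.text or "").strip()
    # Assumption: "label, address" is split at the first comma.
    label, _, address = text.partition(",")
    point = location.find(GEORSS + "point")
    # Assumption: the wire stores "lon lat", as the sample values suggest.
    lon, lat = (float(value) for value in point.text.split())
    return {"label": label.strip(), "address": address.strip(), "lon": lon, "lat": lat}


info = parse_address(ET.fromstring(SAMPLE))
print(info["label"], info["lat"], info["lon"])
```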
I just want to give some quick examples of how our customers use the geocodes in the wire.
First, an iPhone app that uses the address coordinates to put the news on the map. A typical news map:
At the other end of the range is the way Germany’s biggest tabloid, Bild, uses the scope information to sort the news automatically into its different regional portals. The following screenshots show how content is sorted into three different regional portals within the state of North Rhine-Westphalia. News with a scope of the whole state shows up in all three portals, whereas news with a scope of one or more counties is sorted into the regional portals that contain these counties (better: the AGS codes of these counties).
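The sorting rule just described can be sketched in a few lines of Python. This is not the customer's actual implementation, only an illustration of the rule: the portal names and the county codes assigned to them in `PORTALS` are made-up values, and `portals_for` is a hypothetical helper. State-level scopes are assumed to be recognizable by their six trailing zeros, as in the state-code attributes of the example.

```python
# Hypothetical portals, each defined by the AGS codes of the counties it covers.
PORTALS = {
    "portal_a": {"05111000", "05112000"},
    "portal_b": {"05315000", "05316000"},
    "portal_c": {"05913000", "05911000"},
}


def portals_for(scope_codes, portals):
    """Return the sorted names of all portals a story with these scopes belongs to."""
    hits = set()
    for code in scope_codes:
        if code.endswith("000000"):
            # State-level scope: matches every portal with a county in that state.
            state = code[:2]
            hits |= {
                name
                for name, counties in portals.items()
                if any(county.startswith(state) for county in counties)
            }
        else:
            # County-level scope: matches only portals that list this county.
            hits |= {name for name, counties in portals.items() if code in counties}
    return sorted(hits)


print(portals_for(["05000000"], PORTALS))
print(portals_for(["05315000"], PORTALS))
print(portals_for(["05111000", "05913000"], PORTALS))
```

Because the rule works purely on codes, the customer needs neither geometries nor coordinates for this use case.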
I’m planning to catch up on other aspects of geocoding at dpa in the next few days so that I’m finally able to start writing about new ideas :-)