≈ Relations

Random Rants and Ramblings about Media and/or Technology

Archive for the ‘goingplaces’ tag

Going places – Status and example

one comment

D-Ticker Screenshot

Ed. note: Another autumn-cleaning action. This post has been sitting here as a draft for at least 15 months. Time to get it out of the door.

It’s part of a mini series called “Going places” about geocoding news at the source. Prior installments of this series can be found here. They explain, for example, what places of news, places within news and scopes are.

Current status of geocoded news at dpa-infocom

For approximately 15 months we have been geocoding both places of news and places within news in our regional online wires. Right now we are geocoding:

  • scopes as places of news and
  • (generalized) addresses as places within news.

Representing geocodes within NITF

Being a news agency, IPTC formats are more or less the de facto standard for delivering news to our customers :-( Being the online and mobile subsidiary, we deliver our wires as NITF. Hence we had to find a way to fit this information into this format.

In order to minimize the hassle for us as well as our customers we had to stay within the bounds of the NITF format as much as possible.

Since the main delivery model in the news industry is still push delivery (mostly via FTP :-( ), there is also a need to include as much information about the scopes of news as possible. A mere pointer (e.g. to a RESTful API) to additional information would only be used by our most advanced customers :-( .

Hence we chose to use the language constructs for describing locations already provided in that format as much as possible and only resort to other means when there was no means for describing this information at all.

The NITF format (see NITF Documentation) provides at least two ways of representing location information:

  • evloc : Event location. Where an event took place (as opposed to where the story was written).
  • location : Significant place mentioned in an article. Used to normalize locations.

The first question to ask is why there are two different tags for geocoding locations (i suspect the standardisation process is responsible for that). Looking at the DTD definitions for both evloc and location, one notices that both try to describe the same information, but the location tag is actually the better and more detailed way of doing so.

Since we already used the evloc tag for denoting the country where the event primarily took place (i.e. some inverse, locus-like information), we had every reason to use only the location tag for the locations of the news as well as the locations in the news.

One might also note by looking at the DTD that apparently the NITF standardisation body didn’t consider the scope of the news to be a location and didn’t include any means to include any actual geographic data (e.g. points, lines, polygons, …).

Luckily an arbitrary number of locations can be included via the location tag; unfortunately it is only allowed in the head section of the document. Within each location, the NITF DTD then allows an arbitrary number of country, state, region, city and sublocation tags.

In order to be able to represent the hierarchy unambiguously, we restricted this to a single occurrence each of the tags country, state, region and city, as well as up to two sublocation tags.

Before going into detail, below is an example of a news story that has both scopes and addresses. I guess it is the best way to explain our approach and to describe some of the problems we had / have to navigate. (Location-relevant part highlighted.)

Example

<?xml version="1.0" encoding="UTF-8"?>
<!-- DOCTYPE nitf PUBLIC "-//IPTC-NAA//DTD NITF-XML 3.0//EN" "nitf.dtd" -->
<nitf xmlns:georss="http://www.georss.org/georss">
<head>
<title>Bayern München II schlägt Karlsruhe 3:1</title>
...
<identified-content>
<location class="scope">
<region region-code="09184000" code-source="AGS">München
	<georss:point>11.5725580365 48.1379548096</georss:point>
</region>
<state state-code="09000000" code-source="AGS">Bayern
	<georss:point>11.5725580365 48.1379548096</georss:point>
</state>
<country iso-cc="DEU">Deutschland</country>
</location>
<location class="scope">
<city city-code="09162000" code-source="AGS">München
	<georss:point>11.5725580365 48.1379548096</georss:point>
</city>
<state state-code="09000000" code-source="AGS">Bayern
	<georss:point>11.5725580365 48.1379548096</georss:point>
</state>
<country iso-cc="DEU">Deutschland</country>
</location>
<location class="scope">
<city city-code="08212000" code-source="AGS">Karlsruhe
	<georss:point>8.40437796821 49.0092142029</georss:point>
</city>
<state state-code="08000000" code-source="AGS">Baden-Württemberg
	<georss:point>9.17871582656 48.7750805322</georss:point>
</state>
<country iso-cc="DEU">Deutschland</country>
</location>
<location class="address">
Grünwalder Stadion, Grünwalder Straße, München, Germany
	<georss:point>11.566936 48.101078</georss:point>
<city>München</city>
<region>München</region>
<state>Bayern</state>
<country iso-cc="DEU">Deutschland</country>
</location>

</identified-content>
</docdata>
</head>
<body>
...
</body>
</nitf>
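For customers consuming the wire, reading these entries back out is fairly simple. The following is a minimal Python sketch, not our production code. It assumes the structure shown in the example above (location elements with a class attribute, xxxx-code / code-source attributes and nested georss:point elements). The function name read_locations and the shortened SAMPLE document are made up for illustration.

```python
import xml.etree.ElementTree as ET

GEORSS_POINT = "{http://www.georss.org/georss}point"

# Shortened, hypothetical variant of the example document above.
SAMPLE = """<nitf xmlns:georss="http://www.georss.org/georss">
<head><docdata><identified-content>
<location class="scope">
<city city-code="08212000" code-source="AGS">Karlsruhe
<georss:point>8.40437796821 49.0092142029</georss:point>
</city>
<country iso-cc="DEU">Deutschland</country>
</location>
<location class="address">
Grünwalder Stadion, Grünwalder Straße, München, Germany
<georss:point>11.566936 48.101078</georss:point>
<city>München</city>
<country iso-cc="DEU">Deutschland</country>
</location>
</identified-content></docdata></head>
</nitf>"""


def read_locations(nitf_xml):
    """Return one dict per <location> element found in the document."""
    locations = []
    for loc in ET.fromstring(nitf_xml).iter("location"):
        entry = {
            "class": loc.get("class"),
            "label": (loc.text or "").strip(),  # free text, used for addresses
            "codes": {},   # tag name -> (code-source, code)
            "points": [],  # coordinate pairs, in the order given in the wire
        }
        for child in loc.iter():
            if child.tag == GEORSS_POINT:
                entry["points"].append(tuple(float(v) for v in child.text.split()))
            elif child.get("code-source"):
                code = child.get(child.tag + "-code")
                entry["codes"][child.tag] = (child.get("code-source"), code)
        locations.append(entry)
    return locations


for location in read_locations(SAMPLE):
    print(location["class"], location["label"], location["codes"], location["points"])
```

The sketch deliberately keeps the coordinate pairs in the order in which they appear in the wire and does not interpret them any further.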

I’ve chosen this story because it is about a soccer game, an example scenario i used in my last post. We encoded three scopes and one address.

Since it is a third-league game, the editors chose to select only the administrative regions covering the cities on a county level. One city (Munich) is actually divided into two counties, hence the total of three counties.

Had it been a premier-league game, there would most likely have been only a single scope, the whole of Germany, whereas a second-league game would presumably be encoded with some states.

The address is that of the stadium where the soccer game took place.

So let’s have a closer look at the example’s representation.

Representing scopes

<location>
<region region-code="09184000" code-source="AGS">München
	<georss:point>11.5725580365 48.1379548096</georss:point>
</region>
<state state-code="09000000" code-source="AGS">Bayern
	<georss:point>11.5725580365 48.1379548096</georss:point>
</state>
<country iso-cc="DEU">Deutschland</country>
</location>

Remarks:

  • NITF already provides attributes called xxxx-code and code-source for all possible subtags of location. Since we primarily use the official German coding scheme for administrative regions, the “Amtlicher Gemeinde Schlüssel” (short: AGS), it is natural to encode it the way we do.
  • The AGS is actually a hierarchical coding scheme (2 digits: state, 1 digit: sub-state level (“Regierungsbezirk”), 2 digits: county, 3 digits: city/town), hence we could do away with the state tag, but we decided to be as explicit as we could be.
  • Since we were using the three-letter variant of ISO 3166 for the evloc tag, we decided to use this variant for the iso-cc attribute of the country tag as well.
  • Since we introduced scopes first and some customers wanted to put markers on their maps, they asked for some coordinates. Hence we chose to include the “official” coordinates of the admin region, as given in another GIS dataset we bought, and to use a simple georss:point tag for doing so.
    • We were not allowed to distribute the geometries of the admin regions as part of our licensing deal (yes, you have to buy this data in Germany), and the geometries would have used far too much bandwidth for sending them within the wire.
    • In hindsight i would like to remove the coordinates from scope items since they are a) highly redundant, b) not always available and c) a constant source of discussion about what an appropriate representative coordinate for a geographic extent might be.
  • We currently use other coding schemes on a sub-city level for some cities (also official coding schemes, issued by the city government). But since these are hard to come by on a national level, we are currently considering alternatives.
  • We are also considering extending the geocoding to our non-regional, i.e. national and international, wires. Here we are looking into using the ISO 3166-2 coding scheme and the NUTS 3 coding scheme for the European Union.
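Because the AGS is hierarchical, a customer can derive the higher levels from a single code. Here is a small Python sketch of that idea. It assumes the eight-digit layout seen in the example codes above (2 digits state, 1 digit Regierungsbezirk, 2 digits county, 3 digits city/town). The function names are made up for illustration.

```python
def split_ags(ags):
    """Split an 8-digit AGS string into its hierarchical parts."""
    if len(ags) != 8 or not ags.isdigit():
        raise ValueError(f"not an 8-digit AGS: {ags!r}")
    return {
        "state": ags[0:2],
        "regierungsbezirk": ags[2:3],
        "county": ags[3:5],
        "municipality": ags[5:8],
    }


def state_code(ags):
    """Derive the zero-padded state code, as used in the state tag."""
    return split_ags(ags)["state"] + "000000"


print(split_ags("09184000"))
print(state_code("08212000"))
```

This is why the state tag is technically redundant: its code is just the first two digits of the county code, padded with zeros.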

Representing Addresses

<location>
Grünwalder Stadion, Grünwalder Straße, München, Germany
	<georss:point>11.566936 48.101078</georss:point>
<city>München</city>
<region>München</region>
<state>Bayern</state>
<country iso-cc="DEU">Deutschland</country>
</location>

Remarks:
  • Addresses are provided by the editor.
  • The level of detail (exact address, street level, district or city) presented is an editorial decision based on data protection regulations.
  • The address is then geocoded by Google Maps Premiere, and the resulting coordinates and the address returned by the geocoder are shown to the editor, who validates them.
  • This information, together with a label for the address, is encoded into an NITF location tag in the form: label, address
  • The returned coordinates are also encoded into a georss:point tag.
  • region, state and country are taken from the respective fields of the structured response of the Google geocoder. Hence their spelling might differ from the respective official names. We chose not to do point-in-polygon queries to harmonize them because this would have required running a spatially enabled database, e.g. PostgreSQL/PostGIS.
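The encoding steps above can be sketched in a few lines of Python. This is only an illustration under assumptions, not our actual tooling: the dictionary keys of the geocoded argument are hypothetical and do not mirror the real response format of any geocoder, and the function name address_location is made up.

```python
import xml.etree.ElementTree as ET

GEORSS_NS = "http://www.georss.org/georss"
ET.register_namespace("georss", GEORSS_NS)


def address_location(label, geocoded):
    """Build a <location class="address"> element in the form 'label, address'."""
    loc = ET.Element("location", {"class": "address"})
    loc.text = f"{label}, {geocoded['address']}"
    # Coordinates are written in the same order as in the example above.
    point = ET.SubElement(loc, f"{{{GEORSS_NS}}}point")
    point.text = " ".join(str(value) for value in geocoded["point"])
    for tag in ("city", "region", "state"):
        if geocoded.get(tag):
            ET.SubElement(loc, tag).text = geocoded[tag]
    country = ET.SubElement(loc, "country", {"iso-cc": geocoded["iso_cc"]})
    country.text = geocoded["country"]
    return loc


# Hypothetical, already validated geocoder result for the stadium.
result = {
    "address": "Grünwalder Straße, München, Germany",
    "point": (11.566936, 48.101078),
    "city": "München",
    "region": "München",
    "state": "Bayern",
    "country": "Deutschland",
    "iso_cc": "DEU",
}
print(ET.tostring(address_location("Grünwalder Stadion", result), encoding="unicode"))
```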

Customer Uses

I just wanted to give some quick examples of how our customers use the geocodes in the wire.

First an iPhone App that uses the address coordinates to put the news on the map. A typical news map:

D-Ticker screenshots

At the other end of the range is the way Germany’s biggest tabloid, Bild, uses the scope information to automatically sort the news into its different regional portals. The following screenshots show how content is sorted into three different regional portals within the state of North Rhine-Westphalia. News items whose scope is the whole state show up in all three portals, whereas news items whose scope is one or more counties are sorted into the regional portals that contain these counties (better: the AGS codes of these counties).
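The sorting rule just described boils down to a set intersection. The following Python sketch shows the idea. It is not Bild’s implementation: the assignment of counties to portals in PORTALS is purely hypothetical, and the AGS codes are only meant as illustrative values.

```python
STATE_NRW = "05000000"  # AGS code of the whole state

# Hypothetical mapping of portals to the county codes they cover.
PORTALS = {
    "ruhrgebiet": {"05911000", "05913000"},
    "koeln": {"05315000"},
    "duesseldorf": {"05111000"},
}


def portals_for(scopes, portals=PORTALS, state=STATE_NRW):
    """Return the names of the portals a story with these scopes belongs to."""
    scopes = set(scopes)
    if state in scopes:
        # A state-wide story shows up in every portal of that state.
        return sorted(portals)
    return sorted(name for name, counties in portals.items() if counties & scopes)


print(portals_for(["05000000"]))
print(portals_for(["05315000"]))
print(portals_for(["05911000", "05111000"]))
```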

Bild Regional Ruhrgebiet Bild Regional Köln Bild Regional Düsseldorf

Next steps

I’m planning to catch up with other aspects of geocoding at dpa in the coming days so that i’m finally able to start writing about new ideas :-)

Written by gkamp

October 21st, 2009 at 5:39 pm

Posted in Noteworthy


Going Places – Wherecamp2008 lightning talk slides


After talking to lots of people about our approach to geocoding news at Where2.0 and Wherecamp2008 and getting positive feedback, i decided to give a lightning talk on the second day of Wherecamp2008 to present the ideas to a larger audience.

Since i put the slides together in a hurry they might not adhere to all standards. Nevertheless, they might be interesting to some readers. Hence i decided to upload them to Slideshare. (Download as PDF only, since Slideshare does not handle Keynote. Anybody who is interested in the Keynote file, please contact me via email.)

Written by gkamp

June 2nd, 2008 at 8:34 am

Going Places – Scopes and other locations


Rem: First i have to apologize that i didn’t write this post earlier. But i was busy getting phase 2 (geocoding of places within the story) out of the door, and then i had some holiday, then went to Where2.0 and Wherecamp etc.

But expect a number of articles tagged goingplaces this week.


Earlier Parts of this miniseries: Part I: Adding geographic metadata to news at the source, Part II: Great news from Adrian Holovaty, Part III: Early Experiments, Part IV: Places in News stories vs. Places of News stories


Before showing some examples of what our geocoded stories look like, i’ll first introduce the current definitions of the various terms we use and the reasons why we use these definitions. This post particularly builds on top of my last post, and you might want to read that first.

Warning: This may be boring stuff for some of you, and you might ask yourself why this guy is trying so hard to find kinda formal definitions. Actually, doing this is fascinating stuff for me (i can’t help it, given my former life in AI and ontology building).

In addition i think it is absolutely essential to have an idea of how and why you are adding geocodes to your content before you do so, especially when you are a professional news organization and reseller. So here we go:

Geocoded News Stories

The first thing to do is to somehow define what a geocoded news story is. Please note that the following definitions do not attempt to fulfill all the criteria a mathematician has in mind when hearing the term definition, but rather to create a common understanding of what is meant, i.e. more like a glossary / dictionary entry.

A geocoded news story
is a news story that has at least one location attached as accompanying meta-data. Locations include locations of the news story as well as locations in the news story.
In case the story is a complex news story composed of a number of parts (e.g. multiple texts and / or multiple images, or a multimedia news story consisting of text, image, video, audio, etc.), the locations of the story as a whole are the (multi-)set of the locations of the parts of the story.

The most important thing to note is that a news story can have multiple locations and that these locations can have different roles wrt. the story.

Locations of news stories, Locations in news stories

First a kinda formal distinction between locations of news stories and locations in news stories:

A location of a (news) story
is a location assigned to the news story as a whole. This location is not necessarily mentioned in the story itself.

There are different types of locations of a news story, e.g. scope, locus and place of production (see below). Whereas scope and locus are generally geographic names (in our case mostly administrative divisions like states, counties, cities but also other geographic features that have a geographic name), a place of production usually is more specific, e.g. an address.

A location (with)in a (news) story
is a location directly or indirectly mentioned in the news story itself.

Most of the locations in news stories do not fulfill the criteria of a geographic name (see below). They are more likely to be business and private addresses, street segments, blocks, or points of interest. Not surprisingly, locations within news stories are typically more specific than locations of news stories (in the sense of having a smaller geographic extent).

You might wonder why i added “indirectly mentioned” to the definition. This takes care of the fact that typically not the location itself but a business, a governmental institution like a court, etc. is mentioned in the story. The actual locations, i.e. the addresses of the relevant branches / offices of a business, are not part of the story and often have to be inferred from the text of the story.

Actually, if you are writing the story you definitely know what locations you mean. Unfortunately, today this kind of meta-data is almost never attached to the story at the source. I’m currently working hard to

  • raise the awareness within our and other news organizations that this meta-data is important information and actually a USP
  • look for ways to include this kind of information into our legacy systems without breaking them
  • build tools enabling the writers to easily add the locations to their stories (e.g. looking up addresses in internal and external directories etc.)

Geographic Names

As you can see from the above remarks, the term geographic name is of special importance in this context, hence it should be defined. We use it according to the definition of the U.S. Board on Geographic Names (from Principles, Policies, and Procedures for Domestic Geographic Names) (emphasis by me):

A geographic name
is a name applied to a geographic feature. It is the proper name, specific term, or expression by which a particular geographic entity is, or was, known. A geographic entity is any relatively permanent part of the natural or manmade landscape or seascape that has recognizable identity within a particular cultural context. A geographic name, then, may refer to any place, feature, or area on the Earth’s surface, or to a related group of similar places, features, or areas.

Scopes

As said above, locations of news stories can be of different types. There are at least the following types: scope, locus and place of production. There are surely additional types of places of news stories i don’t know about. The following are the current definitions of Scope, Locus and Place of production we use. First the definition of scopes (Warning: this is going to be a long section)

A scope of a story
is a geographic name that is part of a hierarchical partition (e.g. the official administrative divisions) of a defined geographic extent, representing a largest area wrt. this hierarchy in which the story is deemed relevant by an editor

Rem.:

A variant of scopes are legal scopes. These do not describe the area of relevance but the geographic extent in which it is legally allowed to use this content. This is especially often the case with images and other non-textual content.

The basic intention of scopes is to be able to describe a geographic extent as unambiguously as possible without actually describing the geometry, while at the same time using a terminology that is known to ordinary people.

On the one hand, using scopes instead of the actual geometries frees our customers from needing their own GIS infrastructure. Most often the communicated identifiers are sufficient for their needs. If they need real geographic inferences, they are still able to buy the underlying data itself and / or use a web service to retrieve the geometries.

On the other hand, using identifiers instead of geometries is often the only way to communicate geographic extents to our customers. At least in Germany it is either not possible or very expensive to buy the redistribution rights to the underlying geometries. It also saves a lot of bandwidth.

Please note that a news story can have multiple scopes and that not all scopes have to be in the same hierarchy. It is only required that every hierarchy in itself is a hierarchical partition of a clearly defined geographic extent.

It is perfectly fine to have one scope contained in a hierarchy denoting e.g. the administrative divisions of Germany as defined in the so-called “Amtlicher Gemeinde Schlüssel (AGS)” (Bundesland ~ state, Regierungsbezirk (no equivalent in the US), Kreis ~ county, Stadt/Gemeinde ~ city/town/village) and another scope belonging to a second hierarchy denoting the administrative subdivisions of a certain city, e.g. the boroughs and districts of Hamburg, the neighbourhoods of Hamburg as defined by some company or community, the zip codes of Germany etc.

I also think that being able to add metadata describing the geographic extent in which content is deemed relevant would benefit all kinds of content, ranging from tweets (e.g. please notify only my friends in the City of San Francisco that i’m coming to town, because i’m only there for 2 hours and other friends in California wouldn’t make it in time) to blog posts (since they are basically news stories) to Wikipedia entries.

Another way of looking at scopes is as hints of what to expose at what zoom level on a map. For doing so you don’t need complex calculations: some information about, or access to, the bounding box of the scope is enough.
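To illustrate how little calculation the zoom-level hint needs, here is a rough Python sketch. It rests on assumptions that are not part of our wire format: a Web-Mercator-style tiled map whose world width is tile_px * 2^zoom pixels, and a bounding box for the scope that is available from somewhere. Only the longitude span is considered, and all parameter names and default values are hypothetical.

```python
import math


def zoom_for_bbox(west, east, map_width_px=512, tile_px=256, max_zoom=18):
    """Largest zoom level at which the longitude span still fits the map width."""
    span = east - west
    if span <= 0:
        raise ValueError("east must be greater than west")
    # At zoom z the whole world (360 degrees) is tile_px * 2**z pixels wide.
    zoom = math.floor(math.log2(map_width_px * 360.0 / (tile_px * span)))
    return max(0, min(max_zoom, zoom))


# The whole world on a single tile, and a 90 degree wide extent on a 512px map.
print(zoom_for_bbox(-180.0, 180.0, map_width_px=256))
print(zoom_for_bbox(0.0, 90.0))
```

A map client could use such a value in reverse as well: show a story only once the user has zoomed in at least as far as the zoom level derived from its scope.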

Hierarchical partitions

Since my early experiences i don’t believe in grand unified theories / ontologies that try to be the model of a domain. I rather believe in sets of small, very domain specific ontologies. The notion of a hierarchical partition for a certain extent originates from this belief. It encapsulates the partonomy relationships for localities of a coherent set of types.

From an engineering point of view the notion of a hierarchical partition also allows us to loosely couple the different hierarchies.

So what is a hierarchical partition of a defined geographic extent? And why do we care? To answer the second question first:

  • We have somehow to explain what we are doing to our customers (and ourselves)
  • If we happen to come up with a definition that has a nice set of properties we might be able to use algorithms that take advantage of these properties.

The following shows how our understanding / definition of what a hierarchical partition is evolved over time. I first interpreted hierarchical partition in the pure mathematical sense, i.e.:

  • for any given point in the plane within the defined geographic extent there is exactly one corresponding scope on each level of the hierarchy
  • for any given scope there is exactly one predecessor wrt. this hierarchy.

But looking at the administrative divisions of Germany i recognized that this is actually not the case, and the first criterion has to be relaxed. This stems from the following facts:

  • There are counties denoting cities (so-called Stadtkreise) that are not represented in the “city/town” level of the AGS hierarchy, i.e. there are “holes” at this level. This fact may be worked around, though, by adding these counties to the city level of the hierarchy.
  • Some states do not have so-called “Regierungsbezirke”; they eliminated this level at some time in the past. Hence there are also holes at this level.

The following changes of the rules would take care of these facts:

  • for any given point in the plane within the defined geographic extent there is at least one corresponding scope in some level of the hierarchy
  • for any given point in the plane within the defined geographic extent there is at most one corresponding scope on every level of the hierarchy
  • for any given scope there is exactly one predecessor wrt. this hierarchy.

So i thought that this definition was sufficient to also cover the administrative subdivisions of other countries. But when validating this definition against the administrative divisions of the United States i learned that New York City is an aggregate of 5 counties of the state of New York, each county being coterminous with a borough of New York City. Taking care of that, and hopefully preparing ourselves for other “strange” cases, we end up with the following definition of a hierarchical partition:

A hierarchical partition p of scopes of a geographic extent e
is a directed acyclic graph (DAG) with the following properties:

  1. There is a single source s_top (the top level scope) whose geographic extent is coterminous with the geographic extent e (interpreting coterminous as having matching boundaries)
  2. every scope has a property denoting its level in the hierarchy, with the top level scope having level 1
  3. for any given point x in e there is at least one corresponding scope s(x) at some level in the DAG
  4. for every scope that has more than one successor, the geographic extent of the set of successors is coterminous with the geographic extent of this scope
  5. for every scope that has more than one predecessor, the geographic extent of the set of predecessors is coterminous with the geographic extent of this scope
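To make the five properties a bit more tangible, here is a small Python sketch that checks them on a toy version of the New York case. It is only an illustration under strong assumptions: geographic extents are modelled as sets of grid cells instead of real geometries, “coterminous” becomes plain set equality, and the requirement that levels increase along every edge (which also rules out cycles) is my own addition. All names and the toy data are made up.

```python
def check_partition(extents, levels, edges, whole):
    """Return the list of violated properties (an empty list means all checks pass).

    extents: scope name -> frozenset of grid cells (stand-in for a geometry)
    levels:  scope name -> level, the top level scope has level 1
    edges:   (predecessor, successor) pairs of the DAG
    whole:   frozenset of grid cells making up the geographic extent e
    """
    edges = list(edges)
    succ = {scope: set() for scope in extents}
    pred = {scope: set() for scope in extents}
    for parent, child in edges:
        succ[parent].add(child)
        pred[child].add(parent)

    def union(scopes):
        return frozenset().union(*(extents[s] for s in scopes))

    problems = []
    sources = [s for s in extents if not pred[s]]
    if len(sources) != 1 or extents[sources[0]] != whole:
        problems.append("1: single source coterminous with e")
    # Assumption: levels strictly increase along edges, which implies acyclicity.
    if any(levels[s] != 1 for s in sources) or any(
        levels[child] <= levels[parent] for parent, child in edges
    ):
        problems.append("2: levels")
    if not whole <= union(extents):
        problems.append("3: every point is covered")
    for scope in extents:
        if len(succ[scope]) > 1 and union(succ[scope]) != extents[scope]:
            problems.append(f"4: {scope}")
        if len(pred[scope]) > 1 and union(pred[scope]) != extents[scope]:
            problems.append(f"5: {scope}")
    return problems


# Toy data: cells 1-5 stand for the five boroughs, cell 6 for the rest of the state.
BOROUGH_COUNTIES = ["Bronx", "Kings", "New York", "Queens", "Richmond"]
WHOLE = frozenset(range(1, 7))
EXTENTS = {"New York State": WHOLE, "Other County": frozenset({6}),
           "New York City": frozenset(range(1, 6))}
EXTENTS.update({f"{name} County": frozenset({i})
                for i, name in enumerate(BOROUGH_COUNTIES, start=1)})
LEVELS = {scope: 2 for scope in EXTENTS}
LEVELS.update({"New York State": 1, "New York City": 3})
EDGES = [("New York State", s) for s in EXTENTS if s.endswith("County")]
EDGES += [(f"{name} County", "New York City") for name in BOROUGH_COUNTIES]

print(check_partition(EXTENTS, LEVELS, EDGES, WHOLE))
```

In the toy data New York City has five predecessors whose combined extent equals its own, so property 5 holds even though the strict tree-like definition would be violated.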

Rem:

  • This definition is definitely not perfect in its formulation, but some of you might help me improve it. It might also be better to start with a poset-based definition and switch to a graph-based definition when introducing additional relations, e.g. topological relations describing adjacency etc.
  • It might be helpful to distinguish between hierarchical partitions and leveled hierarchical partitions, the difference between the two being whether the scopes are assigned levels or not.
  • Why is it important to have the level information, you might ask? It is necessary in order to transcribe the semantics
  • In order not to lose the two stricter definitions, the first one is defined as a strict partition hierarchy, whereas the second is a partition hierarchy.
  • I haven’t found the time to look deeper into this, but it seems likely that a (leveled) hierarchical partition has already been given a name somewhere in mathematics. If someone out there happens to know where to look (computational geometry?), i would love to know about it.
  • It also seems to be the case that if you add an additional level with a single node s_bottom that is the successor of every leaf node in the hierarchical partition, you get a lattice. Maybe some of the lattice properties and knowledge / algorithms for lattices might prove helpful.

After this very extensive coverage of scopes i will just briefly introduce the current definitions of loci and places of production, mostly because right now there are only some ideas of what a definition of these should look like.

A locus of a story
  • is a geographic name contained in a set of geonames of a defined geographic extent,
  • representing the / a smallest area wrt. the above hierarchy where the events of this story are happening / have happened / are going to happen

Rem.:
Initially a locus was also defined as being part of a hierarchical partition. This has the advantage of being able to unambiguously describe the locus within that hierarchy (at least at each level). But while this property is important for scopes (in fact it is their main purpose), for loci being able to use names that are commonly used, e.g. natural features like mountain ranges, is more important than being unambiguous.

A place of production of a story
  • is a location where the news story (or parts of it) was produced (e.g. written by the author, edited by the editor, …)
  • describing as exactly as possible the geographic position of the production (e.g. using geographic coordinates, addresses, …)

Locations within news stories

A location (with)in a (news) story is a location directly or indirectly mentioned in the news story itself. These locations are typically not geographic names but rather addresses, street segments, blocks, or POIs. Not surprisingly, locations within news stories are typically more specific (in the sense of having a smaller geographic extent) than locations of news stories.

What’s next?

In the next post i’m sketching the current status and am finally giving you some examples.

Written by gkamp

June 2nd, 2008 at 8:23 am

Going Places – Places in News stories vs. Places of News stories

4 comments

4 levels of the administrative hierarchy of North Rhine-Westphalia (state, county, city/town, districts)

What’s the scope?

This part of the miniseries is actually the heart of the whole series. It describes why and how our approach differs, AFAIK, from all other approaches to geocoding news, including the approaches taken recently by Google News with its local news extension (see here and here) and Yahoo with its Newsglobe.

It is an approach that augments these approaches by assigning not only the places in the news but also places of the news, especially one kind of place of a news story that we call the scope of a news story and that describes the geographic area(s) of relevance of a news story.

Scopes are in our opinion at least as important as the places in a news story. As the examples below show, determining the geographic scope of a story is a task that reporters and editors have done for ages.

Since determining the geographic scope of a news story is much more difficult than recognizing places mentioned in a story, it is also much less amenable to automatic identification via named entity recognition and the like. (BTW: Holovaty calls EveryBlock’s variant of named entity recognition “geoparsing” in his recent OJR interview; Google describes their variant as “but instead we analyze every word in every story to understand what location the news is about and where the source is located.”)

Hence determining the scopes of news stories is something where the news provider at the source has a real advantage / USP.

In the remainder of this post i’m trying to convince you that distinguishing between places in stories and places of stories makes sense and that adding metadata about the scopes of news stories is useful and important for news providers. If you disagree or have comments i’m eager to hear these either as comments or via mail to relations at ka2 dot de.


Earlier Parts of this miniseries: Part I: Adding geographic metadata to news at the source, Part II: Great news from Adrian Holovaty, Part III: Early Experiments, …)

Use-cases for geocoded news

In order to motivate this argument it makes a lot of sense to step back for a moment and ask: “What are the various use-cases for geocoded news that should be supported?”

Use case 1: Putting the places in news stories on a map

Ever since the advent of Google Maps the one use case that is typically cited in conjunction with geocoding news is putting the news on a map, i.e. for every news story a pin or a number of pins is virtually pushed into a map. Every pin in the map denotes a location IN the news story. This is the use case i mostly talked about until now and that everybody knows from all the various approaches to geocoding the news.

There is a variant of this use case where not an overview map of a number of stories is shown, but a map only detailing the locations in a single news story. This use case is very common in TV and broadcast news, but rarely seen online.

Use case 2: Defining the geographic area of relevance

But actually there is another use case that in our opinion is at least as important as the first one. This use case is about defining the geographic area(s) of relevance of a news story. IMHO this is a use case that editors and reporters have been used to for ages. This second use case is important even in cases where putting news stories on a map is not the focus.

And it is a use case that providers of purely technical solutions for geocoding news stories cannot easily automate.

What do i mean by “this is a use case that editors have been used to for ages”? Well, ever since the inception of the newsroom, editors have thought of news stories in terms of:

  • Is this story newsworthy nationwide or only on a regional or maybe just a local level?
  • Do i print this story in the first section of the newspaper or do i print it in the local section?
  • Do i dispatch this story into the national newswire or only a regional newswire?

By attaching metadata describing the geographical area(s) of relevance to the news stories, our approach takes this kind of thinking to the logical next level (and into the digital age).

In order to describe why this use case is important evene without mapping applications i want to give you an example from our own operations:

Right now dpa-infocom is delivering 12 regional news wires. Their coverage corresponds to the 16 german states (“Bundesländer”) where 4 news wires cover the area of two german states. A newspaper customer that wants to add local news from our regional news wires to his web site had until recently only the possibility to add these news items on a wire level.

This means that a newspaper located in a city on the border of two states. had to add two news wires. But within theses news wires there were not only stories deemed covering this city but also cities in the opposite corner of the respective states, cities sometime a couple of hundred of kilometers away. News stories definitely not suitable to the local pages of his website.

With the addition of the geographic area(s) of relevance to the news, our customers now can identify the news stories that are deemed relevant on a national, state-wide, county-wide or city/town/locality-wide level (for select bigger cities even on a borough or district-level). This enables the customer to filter the wires accordingly, not only ignoring stories attached to counties that are irrelevant to him but also sorting the relevant stories e.g. into it’s county-wide local editions and websites.

It also allows us to realize specialized wires defined purely by a filter on the metadata of the news items, aggregating only the news stories a certain customer is interested in.
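
As a rough illustration of such a metadata filter, here is a minimal sketch in Python. The `NewsItem` class, the `filter_wire` function and the region identifiers such as `state:A` or `city:X` are all made up for this example; they are not the format we actually deliver.

```python
from dataclasses import dataclass


@dataclass
class NewsItem:
    headline: str
    # Scope: the set of regions the story is deemed relevant for.
    # The identifiers are hypothetical placeholders, not real codes.
    scope: frozenset = frozenset()


def filter_wire(items, regions_of_interest):
    """Keep only the items whose scope shares at least one region with the customer's regions."""
    wanted = frozenset(regions_of_interest)
    return [item for item in items if item.scope & wanted]


wire = [
    NewsItem("New state law passed", frozenset({"state:A"})),
    NewsItem("Car accident in city X", frozenset({"city:X", "city:Y"})),
    NewsItem("Festival in far-away city Q", frozenset({"city:Q"})),
]

# A customer interested in city X and in state-wide news from state A.
local_edition = filter_wire(wire, {"city:X", "state:A"})
for item in local_edition:
    print(item.headline)
```

A specialized wire in this sense would be nothing more than a stored set of regions applied to every incoming item.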

What is a scope of a news item?

This question is best illustrated with the series of screenshots below. They show the hierarchical administrative partition of Germany.

On the first level, Germany is divided into 16 states (“Länder”). These are themselves divided into a total of forty-something second-level units called “Regierungsbezirke” and a total of 439 third-level regions called “Kreise” (roughly equivalent to counties). On the fourth level there are some 12,000 localities (cities, towns, etc.). Depending on the size of the city, there might be one or two further administrative levels within the city (boroughs and districts).

[Screenshots: udig-bundesrepublik.png; first-level administrative regions of Germany (“Bundesländer”); second-level administrative regions of Germany (“Regierungsbezirke”); third-level administrative regions of Germany (“Kreise”); 4th and 5th level administrative regions of Germany (“Gemeinden”); districts of select cities]

The scope of a news story in our regional news wires is a set of administrative regions from this selection. I’m going to explain why we’ve chosen the administrative regions as the basis for defining the scope of a news item in the next post, when I’m going to formalize our approach.
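
To make the idea of a scope as a set of regions within a hierarchy a bit more tangible, here is a small Python sketch. The region identifiers, the `PARENT` table and the rule that a scope covers everything below its regions are assumptions made for this illustration only; the formalization is left to the next post.

```python
# Hypothetical region identifiers; each region points to the region one level up.
PARENT = {
    "district:A": "city:X",
    "city:X": "county:K",
    "county:K": "bezirk:R",
    "bezirk:R": "state:S",
    "state:S": "country:DE",
    "country:DE": None,
}


def ancestors(region):
    """Yield the region itself and then every enclosing region up to the country."""
    while region is not None:
        yield region
        region = PARENT[region]


def scope_covers(scope, region):
    """True if the region, or one of the regions enclosing it, is part of the scope.

    This "a scope covers everything below it" rule is an assumption of this sketch.
    """
    return any(r in scope for r in ancestors(region))


story_scope = {"county:K"}
print(scope_covers(story_scope, "district:A"))  # the district lies inside county K
print(scope_covers({"city:X"}, "county:K"))     # a city-wide scope does not cover the whole county
```

Whether a county-wide story really belongs on a district page is of course an editorial decision; the sketch only shows that the hierarchy makes such a rule easy to state.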

Hence adding geographic areas of relevance (either districts, boroughs, cities, counties or states) is IMHO an important way of “designating granular locations” of news stories.

Examples

Unfortunately our regional news wires are in German, hence including concrete examples in this post wouldn’t help too much. So I tried to abstract from the concrete examples to some typical news stories regularly found on our wires. I’m sure you will come up with examples of your own in no time.

I also try to illustrate why I think that automating the designation of scopes of stories is difficult by giving alternative scopes, depending on some twists within the story.

  • A story about new legislation in a certain state will be assigned a state-wide area of relevance, although the dateline typically shows the state’s capital.

A news story having a state-wide geographic area of relevance

  • A story about a car accident in city X with the driver of the car coming from town Y will most likely be assigned two city/town-wide areas of relevance, X and Y. If it is a serious car accident with multiple fatalities, this might be changed to a single county- or state-wide area of relevance.

A news story having two city-wide geographic areas of relevance

  • A story about a soccer game where teams from district A and district B are playing against each other will most likely be assigned two geographic areas of relevance, district A and district B. If the game is the final of a city-wide cup, this will most likely be changed to the whole city. If the two teams happen to play in a professional sports league, this news story will most likely get a nation/country-wide area of relevance, unless they are playing a friendly or benefit game against each other.

A news story having two district-wide geographic areas of relevance

  • A story about a series of burglaries in districts A, B, and C in town X and districts D and E in town Y that is on trial at a court in city Z is most likely to be assigned the geographic areas of X and Y (maybe also Z), unless the burglaries themselves were exceptional, e.g. burglaries of famous artworks. That would most likely result in a state-wide or nation-wide geographic area of relevance.

What next?

In the next post of this miniseries I’m going to present a formalization of what I’ve described informally in this post. I’m also going to present how we are representing the metadata within our wire feeds, and some ideas about representing them within “standard” feed formats like Atom and RSS 2.0.

In another blog post I’m then going to cover our approach to the first use case mentioned above: designating places in news stories.

I’m then probably going to write about the problems with getting access to the geometries of the administrative regions and preparing the data. This will most likely end up in a rant about government hindering innovation by not making public information public. I might also include some ideas on how to work around these issues.

Written by gkamp

February 26th, 2008 at 9:33 am

Posted in IMHO

Going Places – Early Experiments

Before writing about our approach to geocoding news stories and elaborating on our key findings, I just wanted to do a quick post on some of our early experiments.

Geocoding based on the dateline: The easy route (to nowhere)

[Image: geonewsshowcase2008.png]

We could have taken the easy route and used the Google geocoder (or some other geocoder) and the city location(s) mentioned in the dateline of our stories to geocode the news. Generate some GeoRSS and/or KML feed and you’re done, I hear you say.

Well, besides all kinds of reasons that are specific to the news agency business (e.g. that our customers are often still not used to RSS feeds, the main mode of delivery of news agency news is still satellite and/or FTP push/pull, …), this also wouldn’t have been sufficient even if we were a startup completely free in its decisions.

It would have been sufficient if our goal had been to generate some eye candy (e.g. more or less arbitrarily sticking pins into a map).

In fact I did exactly this back in early August 2006 in order to motivate and showcase the possibilities of geocoded news. You can still find the remains of that showcase over here. All in all it took me a couple of hours to do this showcase (adding the generation of flyovers with Google Earth, testing various other mapping libraries etc., this may add up to a couple of days).

But when asked about the semantics of the chosen locations, all we would be able to say is:

Well, somebody decided that these cities should be part of the dateline.

Let me explain why.

Dateline semantics

Originally the cities on the dateline stated the place where a story was written. Hence they had clearly defined semantics. Unfortunately these semantics may not be the ones that are of utmost importance to the majority of the readers of the story.

But since in the early days of the news agencies the reporter had to be at the scene in order to be able to report on it, the dateline locations most often coincided with the locations where the news actually happened.

Hence it was a relatively good heuristic to associate the place of the event with the locations mentioned in the dateline. But with the advent of telemedia this heuristic is no longer valid. Unfortunately this led to the common usage where the dateline contains a number of locations mixing both semantics without clearly indicating which location carries which semantics.

E.g. an article reporting on Macworld might carry a dateline of “San Francisco”, “Hamburg / San Francisco”, “Hamburg”, or “San Francisco / Hamburg”.

So, geocoding news purely based on the dateline is a suboptimal idea. A better idea might be to use the geographic entities mentioned within the story.
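
To illustrate the dateline problem described above, here is a tiny Python sketch. The function `dateline_places` and the assumption that the places in a dateline are separated by “/” are made up for this example and are not our production parser.

```python
def dateline_places(dateline):
    """Split a dateline on '/' and return the place names it mentions."""
    return [part.strip() for part in dateline.split("/") if part.strip()]


datelines = [
    "San Francisco",
    "Hamburg / San Francisco",
    "Hamburg",
    "San Francisco / Hamburg",
]

for dateline in datelines:
    # We get place names we could hand to a geocoder, but nothing tells us
    # which place is where the story was written and which is where it happened.
    print(dateline, "->", dateline_places(dateline))
```

The parser happily returns place names for every variant, yet the roles of the places are simply not encoded in the dateline, so no amount of geocoding can recover them.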

Named entity recognition

But then how do you identify geographic entities in natural language text without too much manual work? This is typically a task solved by so-called named entity recognition (NER) systems, which come out of a subdiscipline of computational linguistics.

Unfortunately most NER software systems can’t hide their provenance: they were originally implemented for “governmental uses” and are built by companies residing in Virginia. Typically this means that they are:

  • (very) expensive
  • available for English, Arabic, Persian, … with Western European languages other than English coming in as a distant n-th

Since we are publishing our regional news wires in German and we do not have that much money to spend, the automatic extraction of named entities using named entity recognition software was postponed.

Actually, given my research background in AI, a number of my friends are working in computational linguistics and information extraction. Hence I try to keep current with the state of the art in these systems in my spare time, and will likely give an overview of this in another article in this mini series. This article will also include an overview of the freely available web services for named entity recognition I’m currently aware of.

In this context it would also be very interesting to learn from Adrian which kind of algorithms EveryBlock uses when he says (in his Poynter interview):

Currently, we do that by crawling news sites and applying algorithms and human editing efforts.

This is especially interesting since EveryBlock is using OpenLayers and TileCache, which are developed by Christopher Schmidt (and others) and sponsored by MetaCarta, a company specialising in identifying locations in text and providing fast access via specialised index structures. (Needless to say, MetaCarta has an office in Vienna, VA and In-Q-Tel is one of their investors.)

Unfortunately the company did not reply to the various requests for information I placed via the web in the last year. I would be very interested to learn more about their technology.

Written by gkamp

February 1st, 2008 at 6:51 pm