≈ Relations

Random Rants and Ramblings about Media and/or Technology

Archive for the ‘geo’ tag

Google launches reverse geocoder


Google just added a feature to its Maps API that I had been waiting for for quite some time: a reverse geocoder. When our editorial staff geocodes the news, the initial address is often slightly wrong (or approximate). Hence they have to move the marker a bit and get the new coordinates and address back. While getting the coordinates was easy until now, the new address (even an approximate one) had to be entered manually using some guesswork.

At a newswire every second counts, hence it is important to do as much as possible semi-automatically. So the availability of a reverse geocoder is good news, because it translates coordinates into an approximate address that we can then use to pre-populate the address fields. The editors have to type less metadata (they just have to delete some information that is too detailed).
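A minimal sketch of that pre-population step, assuming the reverse geocoder hands back the address parts as a plain dictionary. The `AddressForm` class, the field names and the `prefill_address` helper are hypothetical and do not mirror the actual Google Maps API response; the address values are made up.

```python
from dataclasses import dataclass, fields


@dataclass
class AddressForm:
    """Address fields of a (hypothetical) editor form; all empty by default."""
    street: str = ""
    postal_code: str = ""
    city: str = ""
    state: str = ""
    country: str = ""


def prefill_address(form, geocoded):
    """Copy every address part the reverse geocoder returned into the form."""
    for f in fields(form):
        value = geocoded.get(f.name)
        if value:
            setattr(form, f.name, value)
    return form


# Assumed shape of a reverse geocoding result (illustrative values only).
geocoded = {
    "street": "Example Street 1",
    "postal_code": "00000",
    "city": "Hamburg",
    "country": "Germany",
}

form = prefill_address(AddressForm(), geocoded)

# The editor then only deletes what is too detailed for the story.
form.street = ""
form.postal_code = ""
print(form)
```

The point of the design is that the editor starts from a filled form and removes detail, instead of starting from an empty form and typing.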

I’ve quickly thrown together a prototype that illustrates the concept of the editor interface over here (code ugly as hell, mainly cut n’ paste from Google sample code; major functionality missing, no styling etc.). It has to be refined and integrated into our Java-based CMS. And yes, we have a Google Maps Premier license for using Google Maps in an intranet application. And no, there were no affordable reverse geocoding alternatives for Germany until now (at least none affordable for us).

Why did we have to wait so long

So why did we have to wait so long for a reverse geocoder from Google? My guess is that this is closely coupled to the switch to TeleAtlas data for the geocoding API a couple of weeks ago. For years maps.google.com and the API were using different datasets for geocoding. The Google apps were using TeleAtlas whereas the API was using Navteq. I was once told by a Google engineer that this was due to the unwillingness of TeleAtlas to license its data for API use. I don’t know if that is correct, but now that the API is using the TeleAtlas data they might also have struck a deal to use reverse geocoding data.

P.S.: I’m attending the Google Maps Premier workshop in London on Nov. 7th. Anybody else? And does anybody happen to know a Swing component that integrates a decent browser engine (e.g. WebKit, Gecko) into Java apps running under Windows XP? Right now we are using an IE component.

Written by gkamp

October 23rd, 2008 at 8:28 am

Posted in Noteworthy


Going Places – Places in News stories vs. Places of News stories


Four levels of the administrative hierarchy of North Rhine-Westphalia (state, county, city/town, districts)

What’s the scope?

This part of the miniseries is actually the heart of the whole series. It describes why and how our approach differs, AFAIK, from all other approaches to geocoding news, including the approaches taken recently by Google News with its local news extension (see here and here) and by Yahoo with its Newsglobe.

It augments these approaches by assigning not only the places in the news but also the places of the news, especially one kind of place of a news story that we call the scope of a news story: the geographic area(s) of relevance of that story. Scopes are, in our opinion, at least as important as the places in a news story. As the examples below show, determining the geographic scope of a story is a task that reporters and editors have done for ages.

Since determining the geographic scope of a news story is much more difficult than recognizing places mentioned in a story, it is also much less amenable to automatic identification via named entity recognition and the like. (BTW, Holovaty calls everyblock’s variant of named entity recognition “geoparsing” in his recent OJR interview; Google describes its variant as “but instead we analyze every word in every story to understand what location the news is about and where the source is located.”) Hence determining the scopes of news stories is something where the news provider at the source has a real advantage / USP.

In the remainder of this post I’m trying to convince you that distinguishing between places in stories and places of stories makes sense, and that adding metadata about the scopes of news stories is useful and important for news providers. If you disagree or have comments, I’m eager to hear them, either as comments or via mail to relations at ka2 dot de.


(Earlier parts of this miniseries: Part I: Adding geographic metadata to news at the source, Part II: Great news from Adrian Holovaty, Part III: Early Experiments, …)

Use-cases for geocoded news

In order to motivate this argument it makes sense to step back for a moment and ask: “What are the various use-cases for geocoded news that should be supported?”

Use case 1: Putting the places in news stories on a map

Ever since the advent of Google Maps, the one use case that is typically cited in conjunction with geocoding news is putting the news on a map, i.e. for every news story a pin or a number of pins is virtually pushed into a map. Every pin in the map denotes a location IN the news story. This is the use case I have mostly talked about until now and that everybody knows from all the various approaches to geocoding the news.

There is a variant of this use case that shows not an overview map of a number of stories but a map detailing only the locations in a single news story. This variant is very common in TV and broadcast news, but rarely seen online.

Use case 2: Defining the geographic area of relevance

But actually there is one other use case that in our opinion is at least as important as the first one. This use case is about defining the geographic area(s) of relevance of a news story. IMHO this is a use case that editors and reporters have been used to for ages. This second use case is important even in cases where putting news stories on a map is not the focus.

And it is a use case that providers of purely technical solutions for geocoding news stories cannot easily automate.

What do I mean by “a use case that editors have been used to for ages”? Well, ever since the inception of the newsroom editors have thought of news stories in terms of:

  • Is this story newsworthy nationwide or only on a regional or maybe just a local level?
  • Do I print this story in the first section of the newspaper or do I print it in the local section?
  • Do I dispatch this story into the national newswire or only a regional newswire?

By attaching metadata describing the geographical area(s) of relevance to the news stories, our approach takes this kind of thinking to the logical next level (and into the digital age).

In order to describe why this use case is important even without mapping applications, I want to give you an example from our own operations:

Right now dpa-infocom is delivering 12 regional news wires. Their coverage corresponds to the 16 German states (“Bundesländer”), where 4 news wires each cover the area of two German states. A newspaper customer that wanted to add local news from our regional news wires to its web site could until recently add these news items only at the wire level.

This means that a newspaper located in a city on the border of two states had to add two news wires. But these news wires contained not only stories covering this city but also stories about cities in the opposite corner of the respective states, cities sometimes a couple of hundred kilometers away. Such news stories are definitely not suitable for the local pages of its website.

With the addition of the geographic area(s) of relevance to the news, our customers can now identify the news stories that are deemed relevant on a national, state-wide, county-wide or city/town/locality-wide level (for select bigger cities even on a borough or district level). This enables the customer to filter the wires accordingly, not only ignoring stories attached to counties that are irrelevant to it but also sorting the relevant stories e.g. into its county-wide local editions and websites.

It also allows us to realize specialized wires defined purely by a filter on the metadata of the news items, aggregating only the news stories a certain customer is interested in.
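Such a metadata filter can be sketched in a few lines. This is only an illustration under assumptions: the `NewsItem` class, the region identifiers such as `city:X` and the `filter_by_scope` helper are hypothetical and not our actual wire format.

```python
from dataclasses import dataclass, field


@dataclass
class NewsItem:
    headline: str
    # Geographic areas of relevance, as hypothetical region identifiers.
    scopes: set = field(default_factory=set)


def filter_by_scope(items, regions_of_interest):
    """Keep the items whose scopes overlap the regions a customer cares about."""
    wanted = set(regions_of_interest)
    return [item for item in items if item.scopes & wanted]


wire = [
    NewsItem("New state budget passed", {"state:A"}),
    NewsItem("Car accident in city X", {"city:X", "city:Y"}),
    NewsItem("Town fair in city Z", {"city:Z"}),
]

# A customer in city X that also wants everything of state-wide relevance.
specialized_wire = filter_by_scope(wire, {"city:X", "state:A"})
headlines = [item.headline for item in specialized_wire]
print(headlines)
```

A specialized wire is then nothing more than the original wire plus a set of regions per customer.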

What is a scope of a news item?

This question is best illustrated with the series of screenshots below. They show the hierarchical administrative partition of Germany.

Germany is on the first level divided into 16 states (“Länder”). These are themselves divided into a total of forty-something second-level units called “Regierungsbezirke” and a total of 439 third-level regions called “Kreise” (roughly equivalent to counties). On the fourth level there are some 12000 localities (cities, towns, etc.). Depending on the size of the city, there might be one or two administrative levels within the city (boroughs and districts).

Screenshots: first-level administrative regions of Germany (“Bundesländer”); second-level administrative regions (“Regierungsbezirke”); third-level administrative regions (“Kreise”); 4th- and 5th-level administrative regions (“Gemeinden”); districts of select cities.

The scope of a news story in our regional news wires is a set of administrative regions from this selection. I’m going to explain why we’ve chosen the administrative regions as the basis for defining the scope of a news item in the next post, when I’m going to formalize our approach.

Hence adding geographic areas of relevance (either districts, boroughs, cities, counties or states) is IMHO an important way of “designating granular locations” of news stories.
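To make “a set of administrative regions” a bit more tangible before the formal treatment, here is a rough sketch. The region identifiers and the `PARENT` table are hypothetical, the hierarchy is abbreviated (the “Regierungsbezirke” level is left out), and the rule that a wider scope also covers the regions it contains is an assumption of this sketch, not a description of our production system.

```python
# Child region -> parent region (abbreviated, hypothetical hierarchy).
PARENT = {
    "district:A": "city:X",
    "district:B": "city:X",
    "city:X": "county:K",
    "county:K": "state:NRW",
    "state:NRW": "country:DE",
}


def ancestors(region):
    """Return the region itself followed by all regions that contain it."""
    chain = [region]
    while region in PARENT:
        region = PARENT[region]
        chain.append(region)
    return chain


def is_relevant(scopes, customer_region):
    """A story is relevant if one of its scopes is the customer's region
    or any region containing it (assumption of this sketch)."""
    return any(region in scopes for region in ancestors(customer_region))


state_story = {"state:NRW"}                  # e.g. new state legislation
district_story = {"district:A", "district:B"}  # e.g. a local soccer game

print(is_relevant(state_story, "city:X"))      # state-wide scope covers city X
print(is_relevant(district_story, "city:X"))   # district scopes do not
```

With this rule a state-wide story reaches every city in the state, while a district-level story stays out of the city-wide edition unless the editor widens its scope.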

Examples

Unfortunately our regional news wires are in German, hence including concrete examples in this post wouldn’t help too much. Hence I tried to abstract from the concrete examples to some typical news stories regularly found on our wires. I’m sure you will come up with examples of your own in no time.

I also try to illustrate why I think that automating the designation of scopes is difficult by giving alternative scopes, depending on some twists within the story.

  • A story about new legislation in a certain state will be assigned a state-wide area of relevance, although the dateline typically shows the state’s capital.

A news story having a state-wide geographic area of relevance

  • A story about a car accident in city X with the driver of the car coming from town Y will most likely be assigned two city/town-wide areas of relevance, X and Y. If it is a serious car accident with multiple fatalities, this might be changed to a single county- or state-wide area of relevance.

A news story having two city-wide geographic areas of relevance

  • A story about a soccer game where teams from district A and district B are playing against each other will most likely be assigned two geographic areas of relevance, districts A and B. If the game is the final of a city-wide cup, this will most likely be changed to the whole city. If the two teams happen to play in a professional sports league, this news story will most likely get a nation/country-wide area of relevance, unless they are playing a friendly or benefit game against each other.

A news story having two district-wide geographic areas of relevance

  • A story about a series of burglaries in districts A, B and C in town X and districts D and E in town Y, that is on trial at a court in city Z, is most likely to be assigned the geographic areas of X and Y (maybe also Z), unless the burglaries themselves were exceptional, e.g. burglaries of famous artworks. In that case it most likely gets a state-wide or nation-wide geographic area of relevance.

What next?

In the next post of this miniseries I’m going to present a formalization of what I’ve described informally in this post. I’m also going to present how we are representing the metadata within our wire feeds, and some ideas about representing them within “standard” feed formats like Atom and RSS 2.0.

In another blog post I’m then going to cover our approach to the first use case mentioned above: designating places in news stories.

I’m probably then going to write about the problems with getting access to the geometries of the administrative regions and preparing the data. This will most likely end up in a rant about government hindering innovation by not making public information public. I might also include some ideas on how to work around these issues.

Written by gkamp

February 26th, 2008 at 9:33 am

Posted in IMHO


Going Places – Early Experiments


Before writing about our approach to geocoding news stories and elaborating on our key findings, I just wanted to do a quick post on some of our early experiments.

Geocoding based on the dateline: The easy route (to nowhere)

We could have taken the easy route and used the Google geocoder (or some other geocoder) and the city location(s) mentioned in the dateline of our stories to geocode the news. Generate some GeoRSS and/or KML feed and you’re done, I hear you say.

Well, besides all kinds of reasons that are specific to the news agency business (e.g. that our customers are often still not used to RSS feeds, and that the main mode of delivery of news agency news is still satellite and/or FTP push/pull, …), this wouldn’t have been sufficient even if we were a startup completely free in its decisions.

It would have been sufficient if our goal had been to generate some eye-candy (e.g. to stick pins more or less arbitrarily into a map).

In fact I did exactly this back in early August 2006 in order to motivate and showcase the possibilities of geocoded news. You can still find the remains of that showcase over here. All in all it took me a couple of hours to do this showcase (including the generation of flyovers with Google Earth, testing various other mapping libraries etc., this may add up to a couple of days).

But when asked about the semantics of the chosen locations, all we would be able to say is:

Well, somebody decided that these cities should be part of the dateline.

Let me explain why.

Dateline semantics

Originally the cities on the dateline stated the place where a story was written. Hence they had clearly defined semantics. Unfortunately these semantics may not be the ones that matter most to the majority of the readers of the story.

But since in the early days of the news agencies the reporter had to be at the scene in order to be able to report on it, the dateline locations most often coincided with the locations where the news actually happened.

Hence it was a relatively good heuristic to associate the semantics “place where the news happened” with the locations mentioned in the dateline. But with the advent of telemedia this heuristic is no longer valid. Unfortunately this led to the common usage where the dateline contains a number of locations mixing both semantics without clearly indicating which location carries which semantics.

E.g. an article reporting on Macworld might carry a dateline of “San Francisco”, “Hamburg / San Francisco”, “Hamburg”, or “San Francisco / Hamburg”.

So, geocoding news purely based on the dateline is a suboptimal idea. A better idea might be to use the geographic entities mentioned within the story.
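The problem shows up as soon as one tries to parse a dateline. The following sketch assumes that multiple dateline locations are separated by a slash, as in the Macworld example; `dateline_locations` is a hypothetical helper, not part of any of our systems.

```python
def dateline_locations(dateline):
    """Split a dateline into its location names (slash-separated by assumption)."""
    return [part.strip() for part in dateline.split("/") if part.strip()]


variants = [
    "San Francisco",
    "Hamburg / San Francisco",
    "Hamburg",
    "San Francisco / Hamburg",
]

for dateline in variants:
    # The result is just a list of names: nothing tells us which city is
    # where the story was written and which is where the news happened.
    print(dateline, "->", dateline_locations(dateline))
```

All four variants may belong to the same story, yet a geocoder fed with these lists would put the pins in different places each time.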

Named entity recognition

But then how do you identify geographic entities in natural language text without too much manual work? This is typically a task solved by so-called named entity recognition (NER for short) systems; NER is a subdiscipline of computational linguistics.

Unfortunately most of the NER software systems cannot hide the fact that they were originally implemented for “governmental uses” and are built by companies residing in Virginia. Typically this means that they are:

  • (very) expensive
  • available for English, Arabic, Persian, … with western European languages other than English coming in as a distant n-th

Since we are publishing our regional news wires in German and we do not have that much money to spend, automatic extraction of named entities by using named entity recognition software was postponed.

Actually, given my research background in AI, a number of my friends are working in computational linguistics and information extraction. Hence I have tried to keep current with the state of the art in these systems in my spare time, and will likely give an overview of this in another article in this miniseries. This article will also include an overview of the freely available web services for named entity recognition I’m currently aware of.

In this context it would also be very interesting to learn from Adrian which kind of algorithms everyblock uses when he says (in his Poynter interview):

Currently, we do that by crawling news sites and applying algorithms and human editing efforts.

This is especially interesting since everyblock is using OpenLayers and TileCache, which are developed by Christopher Schmidt (and others) and sponsored by MetaCarta, a company specialising in identifying locations in text and providing fast access via specialised index structures. (Needless to say, MetaCarta has an office in Vienna, VA, and In-Q-Tel is one of their investors.)

Unfortunately the company did not reply to the various requests for information I placed via the web in the last year. I would be very interested to learn more about their technology.

Written by gkamp

February 1st, 2008 at 6:51 pm

Going Places – Great news from Adrian Holovaty


In the first installment of this miniseries I announced that dpa-infocom is geocoding its regional newswires and that I’m going to report on the rationale behind doing so, as well as on the solution we ended up using today and our roadmap, in forthcoming posts.

One of the main reasons for doing this miniseries is getting the discussion about a/the semantics for geocoding news stories started and working towards a standardisation of the syntax and semantics.

Three days later Adrian Holovaty announced everyblock.com. And in the Poynter interview following the launch of everyblock.com Adrian was asked:

Tompkins: How do you hope newsrooms will adapt your ideas and even your code to their own work?

Holovaty: We’re interested in spreading the concept of “geocoding” news — that is, classifying news articles by location. Currently, we do that by crawling news sites and applying algorithms and human editing efforts, but it’d be best for everybody if news organizations did this on their own. We’re interested in developing some sort of specification/standard for designating granular locations in news stories (Emphasized by me) — look for more about that from us soon.

This is great news! Obviously I agree that news organizations should geocode their news stories on their own. If not, we wouldn’t do so. And we are very interested in working towards a spec / standard. It is for this reason that I’m evangelizing the need for geocoding news not only within dpa (one of the world’s largest news agencies) but am also communicating our approach on various occasions when meeting other news agencies.

For example, last September I presented our approach to geocoding news at the inaugural meeting of MINDS International, an association of currently 11 news agencies focusing on the exchange of ideas and solutions between agencies in the online and mobile area.

The key: “Designating granular locations”

Not surprisingly, Adrian hit the nail on the head by stating that a spec/standard for designating granular locations is what is needed most in order to geocode news stories.

Actually, identifying that granular location representations are key to geocoding news stories was one of the key learnings on our own road to geocoding news. Other key learnings were:

  1. It is essential to distinguish between locations of news stories and locations in news stories.
  2. Locations of news stories are at least as important as locations in news stories (at least for news agencies)
  3. The most important type of location of a news story is the scope of a news story, i.e. its geographic area of relevance

In the next couple of posts I’m going to elaborate on these key learnings and present our approach.

Hopefully this and the following posts will get a discussion as well as a joint effort for a common spec started.

Written by gkamp

February 1st, 2008 at 6:19 pm

Posted in IMHO


Going Places – Adding geographic metadata to news at the source


Over one and a half years ago I embarked on the mission to bring geographic metadata to the wires of dpa. After quite some convincing, in April I started defining and redefining the roadmap, the semantics and the syntax of the metadata, and designing and building the support process into our editorial systems. Since mid-December we have been geocoding each and every news item in our RegioLine wire.

Now I’m proud to say that our first customer is using this metadata for visualising the news on a newsmap.


With this post I’m starting a miniseries about my experiences with geographic metadata during the last one and a half years. In particular I want to start a discussion about the semantics of the geographic metadata I’ve defined and report on the obstacles I encountered.

Written by gkamp

January 21st, 2008 at 10:11 am

Posted in Noteworthy
