Why feeds matter (or: media and syndication – get it or lose it)

Preface

In earlier posts (see here and here) i have argued that lightweight syndication via RSS and Atom opened up a completely new ecosystem, while the syndication standards preferred by “main stream media (MSM)”, like ICE, didn’t take off and most likely won’t do so in the future (unless the upcoming ACAP standard is radically different from the glimpses i’ve seen so far; more on that in another post). So in this article i’m trying to follow up on this issue and its relation to the Google News debate.

So here we go, it’s going to be a long and winding road. Since i’ve learned that terms like RSS/Atom, personalized homepage and feedreader are not recognized by everybody in the audience (at least when you’re speaking to a general audience), i’ll go a little deeper than usual in explaining them, especially when describing personalized homepages and feedreaders.

Lightweight Syndication and Newspapers: Status

It is not surprising that the only party that is largely missing from the lightweight syndication field is the “traditional media”. A Bivings study showed that 76 of 100 american newspapers provide RSS/Atom feeds, all of them partial, all of them without ads. A quick count i did sometime in late 2006 with the help of the then current results of Wortfeld’s survey showed that only 35 out of 83 german newspapers had any RSS feeds. I don’t think that the numbers have shifted dramatically in the last 6 months.

Given the ACAP announcement one could guess that the newspapers have held back their content feeds because they feared that their content would be abused if they used RSS/Atom, and waited for a “secure” alternative. I don’t think that this is the only reason. IMHO, it is also likely that they have seen neither the (business) opportunities of lightweight feeds (e.g. extending reach, including ads in the feeds) nor an easy way to integrate them into their existing (not so state of the art) content management systems.

The german newspapers that do lightweight syndication only use it as a teaser medium for their traditional websites. No newspaper provides either a full-text feed or ads within its feeds. A lot of newspaper feeds consist only of headlines deep linking to the article page on the website. This is likely due to the same omnipresent fear that somebody is going to steal their valuable (or, as it is called in the ACAP press release, high-value) content, the fear that keeps them suing Google over Google News. Interestingly, at DLD07 Tariq Krim (the CEO of netvibes) reported that netvibes is observing that the more content a feed has, the more often the reader of the feed goes to the actual website. So empirically, there is no reason for this fear.

AFAIK, the position of the newspapers is only very slowly shifting towards providing RSS/Atom feeds for all major sections of a newspaper’s website. I attribute this move more to pressure from the readers and from people with more of a web 2.0 background coming into main stream media companies than to an original understanding of the need for and benefits of doing so.

2007: The Year of Content Decoupling, Personalization and Remixing

There are a lot more ways to consume news (or more generally content) than the (home) page of a website: on the mobile phone, on the big tv screen, in a desktop app, in a widget/gadget/whateveryoucallit in a personalized homepage, in the screensaver, on devices like the chumby (very interesting – another post to be written), in 3d globe visualizations like Google Earth, on maps, …

And there is a unified mechanism for delivering the news/content: RSS/Atom feeds. There is an ever-growing bunch of browser-based or standalone feedreaders available, and integrating feeds into a custom application is more likely a matter of hours than a matter of weeks (to a large extent due to Mark Pilgrim’s excellent Python-based Universal Feed Parser).
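To give an idea of how little code is involved, here is a minimal sketch in Python. It deliberately uses only the standard library instead of the Universal Feed Parser, it only handles plain RSS 2.0 (a real parser copes with Atom, broken markup, encodings and much more), and the sample feed and the function name parse_rss are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real application would download this document.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
      <description>Short teaser for the first story.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
      <description>Short teaser for the second story.</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return the channel title and a list of item dicts from RSS 2.0 text."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    items = []
    for item in channel.findall("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return channel.findtext("title", default=""), items


if __name__ == "__main__":
    feed_title, entries = parse_rss(SAMPLE_RSS)
    print(feed_title)
    for entry in entries:
        print("-", entry["title"], entry["link"])
```

Everything a widget, a screensaver or a mobile client needs (headline, link, teaser) is available after these few lines, which is the whole point of the decoupling argument.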

Hence there is a lot of reason to believe that 2007 is the year when feeds are finally taking off big time and the year will be remembered as the year of the decoupling of the content from the web page. At least this is the technical term i used over the last year to describe my observations in this area to colleagues and customers.

Not very surprisingly, one of the most important revelations for me during this year’s DLD Conference was that people who are better known than i am have made the same observations. Caterina Fake coined the term “delamination” while Tariq Krim used the term “deportalization” for describing the same phenomenon. A couple of days ago Jeff Jarvis brought this phenomenon to the attention of a greater audience with a great article called “After the page”, and today Thomas Knüwer discusses the topic to some extent in the context of yet another relaunch of a german newspaper.

Personalized homepages, feedreaders and the like not only give the user the freedom to read wherever he wants to read. They also give the reader the freedom to select, group, aggregate, process and share the news that is relevant / interesting to him. In this article i’ll focus on the reading, sharing and aggregating aspects and leave the expert topics like processing, real remixing applications like Yahoo! Pipes, Microsoft Popfly (since when is Microsoft using the .ms toplevel domain for its products?) etc. and applications based on the Atom Publishing Protocol (APP) like the Google GData APIs for another article.

Feed reading, aggregating, sharing: My Personal experience

In order to motivate the capabilities that feeds offer i’ll start with my own personal experience using them. I think it will be a very typical experience for most of the people trying feed based applications. Readers that already are experienced can easily skip this section.

I started using feeds as my main information source on the internet in Q3/Q4 2006. Initially i switched from the traditional homepage of a german IT magazine to a personalized homepage provided by a startup called netvibes. Google Reader wasn’t available/usable back then (i.e. 6 months ago), and desktop feedreading applications weren’t useful for me because i use a lot of different computers (and devices) for feedreading and wanted to have a consistent state of what i’ve already read across all of them. Hence only web-based bookkeeping of what i’ve read would do.

I tried Google Reader sometime in Oct./Nov. 2006 and used netvibes and Google Reader in parallel for some time. Today my browser homepage is Google Reader, and i guess i switched from netvibes sometime around New Year’s Eve.

I can imagine switching my browser homepage to some other feed-based aggregator (maybe something built using the new Google Feed API – yet another post), but i don’t think my homepage will ever be a traditional web page again.

Personalized homepages: netvibes, pageflakes, iGoogle

Personalized homepages are actually the most mainstream showcase for the decoupling, delamination, deportalization of the content i described in the section above. The protagonists in the personalized homepage area are startups like netvibes and pageflakes, but Google followed suit with its personalized homepage product called iGoogle.

Currently iGoogle is an option to the classical google homepage, but exposing it to all google search users by making it the default is just flicking a switch (ok, and adding a couple of thousand servers to the computing grid).

Basically a personalized homepage is a toolbox with the following characteristics:

A user can customize his personal information space by adding/deleting so-called widgets (or gadgets) on the homepage.
Widgets/gadgets are tiny applications. The most important widget is a generic feedreader that is able to display an arbitrary RSS/Atom feed. Other important applications are widgets for e.g. search, mail, video, date & time, sudokus etc.
There is a directory of predefined and user-supplied widgets from which the user can choose.
The set of widgets can be divided into logical pages by using tabs, and the layout within a page/tab can be controlled by the user.
Users are able not only to share widgets via the directory but also to send them to friends via mail.
Sharing and sending is possible not only for single widgets but also for aggregations like tabs (called pagecasts in pageflakes) or tab collections (called universe in netvibes). (iGoogle is currently missing the ability to share and send aggregations of widgets.)
There is an API/SDK that allows everybody to build new widgets and add them to the directory.
The appearance of the homepage can be styled to a varying degree.

In summary, personalized homepages are very flexible applications that allow for flexible aggregation and sharing of all kinds of content. But they still use the classical page metaphor for presenting the content. In that sense they are very good for giving an overview, but they still require one to click on a link in order to read the content. The screenshot of the WeltOnline Universe on netvibes is for me the most striking example of the possibilities discussed in Jeff Jarvis’ “After the page” post.

Feedreader

Being able to read large amounts of news more efficiently was the main reason to switch to a dedicated feed reader.

I currently subscribe to around 175 feeds in Google Reader (tagged with 11 tags: apple, design, eclipse, gadgets, geo, google, mediablogs, mobile, news, spiegel-online, technology). Of these tags I regularly (i.e. if possible once a day) read apple, design, geo, google, mediablogs and technology, which amounts to around 100 feeds.

My current statistics page shows that i’ve read/skimmed about 7,500 feed entries in the last 30 days and shared about 475 of them. This would not have been even remotely possible if i had to go to the homepages of the different news providers. The river of news metaphor (together with the endless page implementation of Google Reader) allows me to be one order of magnitude more productive reading news than any other interface i’ve used prior to that.

Furthermore, Google Reader (and a number of other readers) makes subscribing to a feed as easy as bookmarking a web page. So if i follow a link in a feed entry, chances are high that the linked page also advertises one or more feeds of the author/publisher of this page. If i’m interested in the content on that page and think that it might be interesting to read more from this author/publisher, all i have to do is to:

1. click on a bookmarklet link in my browser,
2. skim/read through the other articles of the feed,
3. decide if i want to subscribe, and if so which labels/tags i want to attach to the feed.

All in all steps 1 and 3 take about 5 seconds. The skimming/reading takes a bit longer in order to make an informed decision. But not very much longer, because the original link to the page was in a feed which i trust/find interesting and not on some random web resource.
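What makes the bookmarklet step so cheap is that pages usually advertise their feeds in the HTML head via link elements with rel="alternate". The following is a minimal sketch of that discovery step, not the code any particular reader uses. The sample page, the class name FeedLinkFinder and the function name discover_feeds are assumptions made up for illustration.

```python
from html.parser import HTMLParser

# MIME types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect <link rel="alternate"> elements that advertise a feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attr.get("href"):
            self.feeds.append((attr.get("title") or "", attr["href"]))


def discover_feeds(html_text):
    """Return a list of (title, url) pairs for the feeds a page advertises."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    return finder.feeds


# Hypothetical page used only for demonstration.
PAGE = """<html><head>
<title>Some blog</title>
<link rel="alternate" type="application/atom+xml" title="Posts" href="http://example.com/atom.xml">
<link rel="stylesheet" type="text/css" href="/style.css">
</head><body>...</body></html>"""

if __name__ == "__main__":
    for title, url in discover_feeds(PAGE):
        print(title, url)
```

A publisher who leaves out these link elements makes step 1 impossible, no matter how good the feed itself is.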

I guess that at most 25 percent of the 100 feeds i read regularly are from main stream media. This is not due to the fact that i dislike MSM journalism (i’m working for it), but more to the fact that, besides general news, competent information in the areas i really care about is seldom provided by MSM. And that’s not their fault. This is because i’m an expert, or at least i’m more knowledgeable than most journalists in the niches i really care about. So naturally i’m more interested in things that other experts in that niche are writing/thinking. But i’m also interested in the main stream media’s interpretation and i want to have it side by side. Feedreaders allow me to do so.

Another observation is that most of the feeds i’m reading are full-text feeds, and most of the feeds that aren’t are from MSM. In general, a non-full-text feed must overcome a really high hurdle in terms of relevance before i’m willing to subscribe to it and accept the additional hurdle of clicking on a link just to read the remainder of the article. The chances improve if the articles are written in a news agency style (i.e. all major information is in the teaser text and the link provides me with additional details).

I think that there is only one feed left in my regular reading list that is headline-only. It is the feed provided by my former browser homepage. Whereas 1 year ago i was reading all their articles via their homepage, today i’m reading a tenth of their articles at most. The additional hassle of clicking on the headline and waiting for the article page just to notice that most often it is old news to me just isn’t worth it. So they more or less lost me as a reader/ad consumer/customer by providing only headlines in their feed. I constantly think about writing a simple screen scraper for building my own full-text feed, but would rather see it from the publisher itself.

Feeds and the content used by Google News

After presenting the two most mainstream feed applications it is time to come back to the relation between feeds and the content used by Google News.

As far as i can tell, the following two points lie at the core of the MSM complaints that led to the ACAP initiative as well as to the various legal activities wrt. Google and other search engines, which recently got another jolt from the new owner of the Tribune Company, Sam Zell:

Google News uses newspaper content without paying for it (i.e. it’s stealing it)
Google Search caches the content providers’ pages

In the following I’m going to focus on the first point. And i’m not going to argue that the newspapers should be happy that Google News is driving traffic to them etc. That has been said often enough. Instead i want to elaborate on a

Why MSM has to provide feeds containing the information Google News uses

So let’s have a look at the content actually used within Google News. It’s the headlines plus a lead of approx. 200 characters for the first news item of a cluster. To my knowledge this is more or less along the lines of what is considered fair use, and that’s also Google’s standard response to the media companies complaining.

I also guess that google can easily fine-tune these parameters to the requirements of the various local markets and content providers. At least that’s what i would have done. Rem.: This has IMHO the direct consequence that the likelihood for a content provider to be picked as the representative for a news cluster is directly correlated to the amount of text he is willing to expose.

The headline is also the absolute minimum of what a content provider has to deliver in order to provide an RSS/Atom feed. But, generalizing from my personal experiences, in order to attract readers in the long term, a successful feed at least has to contain a lead of the length google is using within google news. Hence main stream media has no chance NOT to expose the content that Google News is using. In order to fulfill the needs of their readers they have to provide feeds of their content, making it also easier for google to spider the content. At the moment there may be only a few readers raising their voices, but with every reader being exposed to the benefits of feeds the call for feeds is getting louder.
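To make concrete how small this amount of content is, here is a minimal sketch that builds a feed item from a headline and a lead cut to roughly 200 characters. The 200 is only the approximate figure mentioned above, not a documented Google parameter, and the function names make_lead and make_item are made up for illustration.

```python
import xml.etree.ElementTree as ET


def make_lead(text, limit=200):
    """Cut text to at most `limit` characters, preferring a word boundary.

    Assumes limit is comfortably larger than the three-character ellipsis.
    """
    text = " ".join(text.split())  # normalise whitespace
    if len(text) <= limit:
        return text
    cut = text[:limit - 3]  # leave room for the ellipsis
    if " " in cut:
        cut = cut[:cut.rfind(" ")]
    return cut + "..."


def make_item(headline, url, body, limit=200):
    """Build an RSS <item> element carrying headline, link and lead only."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = headline
    ET.SubElement(item, "link").text = url
    ET.SubElement(item, "description").text = make_lead(body, limit)
    return item


if __name__ == "__main__":
    body = "A longer article body would go here. " * 20
    element = make_item("Example headline", "http://example.com/story", body)
    print(ET.tostring(element, encoding="unicode"))
```

A feed built this way exposes exactly the kind of headline-plus-lead material that a news aggregator shows, which is why providing a useful feed and withholding that material are hard to combine.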

Even if they don’t, no problem for Google News

So MSM might have the idea to restrict the usage via the T&Cs of the feed (or robots.txt or sitemaps). I already noted that Google has good chances to argue that the use of that amount of information is fair use and hence may ignore the restrictions provided.
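For readers who have never seen robots.txt in action: it is a plain text file with per-crawler rules, and a well-behaved crawler checks it before fetching anything. The sketch below uses the robots.txt parser from the Python standard library. The rules, the paths and the second user agent name are invented for illustration and do not describe any real publisher’s file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep Googlebot out of /feeds/, allow everybody else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /feeds/

User-agent: *
Disallow:
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeFeedReader"):
        for url in ("http://example.com/feeds/news.xml",
                    "http://example.com/index.html"):
            print(agent, url, allowed(ROBOTS_TXT, agent, url))
```

Note that this is purely a convention: nothing technical stops a crawler from skipping the check, which is why the question of whether Google complies matters at all.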

Even if Google complies and does not spider the feeds where it is told not to do so AND the majority of the main stream media joins in and stands firm in its decision not to allow spidering, that would only hinder Google from harvesting THESE feeds and adding them to the news clusters.

But Google has already reached agreements with AP, AFP and presumably some other newswires and content providers. Having two of the world’s four biggest newswires on board alone provides enough information for a sufficient news flow to be able to have at least one story in every news cluster that has some relevance to a broader audience.

In fact both AFP and AP, in their statements related to the agreement, emphasize the fact that the agreement is not related to Google News but to a new undisclosed Google product. IMHO, this should leave media disturbed rather than calmed, because i have very rarely seen a Google product successor that was worse than its predecessor :-) I’ll give you some hints about the direction in which i think this new product is headed in the next section.

But what if the newswires also step out of the deal? AP for example is owned by the mainstream media (same for the company that (indirectly) employs me: dpa, the german newswire).

Two answers to that:

Newswires are wholesalers, selling not only to main stream media but also to new media portals like Yahoo!, T-Online etc. It is very unlikely that these companies will deny spidering of their news feeds.
Newswires are most often monopolists. Hence even if Google is not able to spider any news wire content (directly or indirectly), and the news wires are not willing to sell their content to Google, as a last resort i guess that some legal action by google would force the news wire to sell their content. And i would guess that money wouldn’t be a problem.

In addition, a lot of companies from the new media are more than willing to fill the “void” opened up by the missing main stream media. Local news especially is going to be provided by blogs and hyperlocal news outlets like placeblogger and Adrian Holovaty’s new venture EveryBlock. And for all the main stream media people telling me things about quality journalism that’s not going to be provided by companies like these, i only have the following answer: There is a reason why both won a Knight 21st Century News Challenge Award.

So even if all the main stream media boycott Google News, the world-wide, national and regional news will be covered by the news agencies and/or the ISP and new media portals, and the local news will be covered by new media. So if somebody in main stream media still has the idea that they may force Google to shut down Google News, they should stop thinking about it.

Why Google Reader and iGoogle are the better Google news

We have already seen that especially iGoogle and other personalized homepages can be used to build webpages that resemble homepages of newspapers. In this section i’ll have a closer look at further possibilities for personalization especially collaborative filtering and the possibilities to do so based on Google News vs. the possibilities using feed based approaches.

On personalization and collaborative filtering

I just finished reading a scientific paper presented at this year’s WWW conference called: Google News Personalization: Scalable Online Collaborative Filtering. It’s definitely tech hard-core (some friends from my AI times definitely understand PLSI (Probabilistic Latent Semantic Indexing) and the like better than i do, but i think i got the idea pretty well).

So i’ll spare you the details and just give you the abstract and my translation/interpretation of the remainder:

Several approaches to collaborative filtering have been studied but seldom have the studies been reported for large (several millions of users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptible for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News.

I.e. Google has invented some new algorithms (or cleverly combined some existing algorithms, doesn’t matter) that are able to do the same for news that Amazon does for products: providing recommendations of what might be interesting. In order to do so they use the click history of the Google News user, i.e. a list of the news items a user has clicked on in order to read them. They then assume that a user only clicks on news items that he finds at least (superficially) interesting, and use the click information to compute the basic data structures from which the recommendations can be built.
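Of the three approaches named in the abstract, covisitation counts are the easiest to illustrate. The following is a strongly simplified sketch and not the paper’s implementation: it ignores timing and weighting and simply counts how many users clicked both items of a pair. The user names, item ids and function names are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def covisitation_counts(click_histories):
    """Count, for each pair of items, how many users clicked both."""
    counts = defaultdict(lambda: defaultdict(int))
    for clicked in click_histories.values():
        for a, b in combinations(sorted(set(clicked)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, click_histories, counts, top_n=3):
    """Rank unseen items by how often they were co-clicked with the user's items."""
    seen = set(click_histories[user])
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical click histories: user -> list of clicked news item ids.
HISTORIES = {
    "alice": ["a", "b", "c"],
    "bob": ["a", "b"],
    "carol": ["b", "c", "d"],
    "dave": ["a"],
}

if __name__ == "__main__":
    counts = covisitation_counts(HISTORIES)
    print(recommend("dave", HISTORIES, counts))
```

The only input is a list of item ids per user, which is exactly the click history described above.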

I’m not sure if the algorithms need the actual news text to work. From the first reading i don’t think so. At least they don’t have to expose the news item content to the end user. In any case it should be sufficient to just use the actual news content once in order to populate and update the basic data structures used to compute the recommendations. (Sure, it’s better to have the original content around in order to be able to rebuild the indices.)
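The “content agnostic” claim of the abstract can be illustrated with MinHash, one of the techniques it names. In the sketch below users are compared purely by the ids of the items they clicked, and no article text is involved. This is a textbook MinHash under simplifying assumptions (seeded SHA-1 hashes, 64 hash functions, non-empty click sets), not the setup used in the paper, and all names are made up for illustration.

```python
import hashlib


def _hash(seed, item_id):
    """Deterministic 64-bit hash of an item id under a given seed."""
    digest = hashlib.sha1(f"{seed}:{item_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(clicked_ids, num_hashes=64):
    """One minimum per seeded hash function over a non-empty set of item ids."""
    return [min(_hash(seed, item) for item in clicked_ids)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions; approximates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    alice = {"n1", "n2", "n3", "n4"}
    bob = {"n1", "n2", "n3", "n5"}
    carol = {"x1", "x2"}
    sig = {name: minhash_signature(ids)
           for name, ids in (("alice", alice), ("bob", bob), ("carol", carol))}
    print("alice/bob:", estimated_similarity(sig["alice"], sig["bob"]))
    print("alice/carol:", estimated_similarity(sig["alice"], sig["carol"]))
```

Users whose signatures agree in many positions have overlapping click sets and can be put into the same cluster, all without ever looking at what the articles say.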

For both computing the basic data structures and generating the recommendations in a timely manner, they use Google’s massive computing infrastructure. My guess would be that they used a couple of hundred to a couple of thousand computers to calculate the indices and generate the recommendations.

Personalization and Google Reader

As far as i understand, it is trivial to apply the same algorithms to the data gathered with Google Reader, with the difference that both the amount and the quality of the data for computing the recommendations are at least an order of magnitude better than with Google News.

As you can see from the statistics in the personal experience section above, instead of a simple click history as the base data, Google Reader provides Google with accurate data on which articles have been read (or at least marked as read) by the user, and when. In addition, if the user uses tags for sorting his feeds like i do, this information is also readily available for clustering.

If the user is using the river of news frontend, Google is also able to get (approximately) the time the user spent reading the story, can check from this parameter whether he actually read it or just skipped it, and, taking into account the (average) reading speed, can also make an educated guess as to how much of an article the user actually read.
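The educated guess is simple arithmetic. Here is a minimal sketch, assuming an average reading speed of 200 words per minute and a skip threshold of 10 percent. Both numbers, like the function names, are assumptions chosen for illustration and not values known to be used by Google.

```python
def estimated_fraction_read(word_count, seconds_on_item, words_per_minute=200):
    """Estimate the share of an article read, capped at 1.0."""
    if word_count <= 0 or seconds_on_item <= 0:
        return 0.0
    words_read = seconds_on_item * words_per_minute / 60.0
    return min(1.0, words_read / word_count)


def was_skipped(word_count, seconds_on_item, threshold=0.1, words_per_minute=200):
    """Treat an item as skipped if less than `threshold` of it could have been read."""
    fraction = estimated_fraction_read(word_count, seconds_on_item, words_per_minute)
    return fraction < threshold


if __name__ == "__main__":
    # A 400-word article that stayed on screen for one minute.
    print(estimated_fraction_read(400, 60))
    # A 1000-word article scrolled past in two seconds.
    print(was_skipped(1000, 2))
```

Crude as it is, an estimate like this is available for every item and every reader, continuously.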

AFAIK this is information all publishers would be very happy to have (and are desperately and expensively trying to gather for their printed publications). Whereas they may at most do a so-called reader scan once every couple of months (with a selected audience of a couple of hundred readers), a feed reader gives them the opportunity to have this data in (near) real time. But instead of seeing this opportunity they view feed readers as a threat they have to protect their content from.

This is where the real value of feed readers lies. But so far i’ve only seen one sign that a main stream media house has to some extent seen this opportunity and is providing its own feed reader to the public: Eldono, a side project of Axel Springer’s bild.t-online.de started by Georg Pagenstedt.

But this information is just the start. Google Reader also gives the readers the ability to star and share the news items they read. Whereas i personally don’t do starring, sharing was the main reason to choose Google Reader over other browser-based feed readers that use the river of news metaphor.

Starring is the ability to vote on the quality/relevance/whatever of an individual article. Typically a user can rate the article with 0 to 5 stars. Sharing is basically the same (but on a binary scale): either you share an article or you don’t. Google automatically generates a feed of the shared items, so people can easily subscribe to the articles that i find interesting. Yet another great use of feeds.

Both starring and sharing are the basis of the voting mechanisms used to drive digg, newsvine and other user-generated news portals. So having this information from Google Reader enables Google to compete with these outlets if it wishes to. My guess would be that Google right now accumulates at least the same amount of votes and shares that digg gets.

In addition the missing ability to star is the major differentiator (besides the churn rate of the items to vote on) that is mentioned in the Google News Research paper, and its major shortcoming.

Treating clicks as a positive vote is more noisy than accepting explicit 1-5 star ratings or treating a purchase as a positive vote, as can be done in a setting like amazon.com. While different mechanisms can be adopted to track the authenticity of a user’s vote, given that the focus of this paper is on collaborative filtering and not on how user votes are collected, for the purposes of this paper we will assume that clicks indeed represent user interest.

While clicks can be used to capture positive user interest, they don’t say anything about a user’s negative interest. This is in contrast to Netflix, eachmovie etc. where users give a rating on a scale of 1-5.

Personalization and iGoogle

Like a “classical feedreader”, the personalized homepage also tracks which feed items are read and thus is at least as capable of feed and/or content recommendation as the Google News personalization presented above. Further, it is only a minor addition to the generic feedreader gadget to expose the capability to star and share feeds and items. Hence it is easy for Google to also harvest this information and use it.

Additional information can be gained from the sharing and sending of widgets and widget aggregations, which provide for easy user clustering in addition to the complex user clustering processes described in the research paper, while simultaneously providing a cluster hierarchy (e.g. widget, tab, tab collection).

Outlook: Data Integration and external APIs, further feed based apps, content integration at the client side

As far as i can tell, right now the subscription and read items databases are not shared between Google Reader and iGoogle, but it should be possible for Google to do so without too much hassle. This would make it possible to seamlessly use the information gathered in one app, e.g. Google Reader, in other apps, e.g. iGoogle. This is true on an individual account level as well as on an aggregated level generalizing over user clusters. For example it is trivial to generate tab pages in a user’s iGoogle homepage based on the tags used in Google Reader. That makes it possible to get an overview of the recent news using iGoogle and then have a detailed reading of the interesting articles using Google Reader.
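The tags-to-tabs example really is little more than a regrouping of the subscription list. The sketch below assumes the subscriptions are available as a mapping from feed URL to tags. That data shape, the “untagged” fallback tab and the function name are assumptions made for illustration, not a description of how Google stores this data.

```python
from collections import defaultdict


def tabs_from_tags(subscriptions):
    """Map each tag to the sorted list of feed URLs carrying that tag."""
    tabs = defaultdict(list)
    for feed_url, tags in subscriptions.items():
        # Feeds without any tag end up on a catch-all tab.
        for tag in tags or ["untagged"]:
            tabs[tag].append(feed_url)
    return {tag: sorted(urls) for tag, urls in sorted(tabs.items())}


# Hypothetical subscription list: feed URL -> tags assigned in the reader.
SUBSCRIPTIONS = {
    "http://a.example/feed": ["google", "technology"],
    "http://b.example/feed": ["google"],
    "http://c.example/feed": [],
}

if __name__ == "__main__":
    for tab, feeds in tabs_from_tags(SUBSCRIPTIONS).items():
        print(tab, feeds)
```

Each resulting tab would then simply be filled with one generic feedreader widget per feed.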

Another way to foster interapp integration is to expose an API that allows users to synchronize the parts of the feed reading history they want to synchronize not only between Google Apps but also with other applications. There are signs that Google is following that road with the Google Feed API but i couldn’t find the easy way for 3rd Party synchronisation with this API as Niall Kennedy suggested. But for the time being his deconstruction of the api used by google can be used.

The Google Feed API can definitely be used to build pages that resemble traditional webpages but are built using the Feed API, Javascript and some HTML and CSS glue code, as can be seen in the Feed API Playground. So Google enables its users to come up with new ideas for how to use feeds by providing the appropriate building blocks, its classical scheme. If something interesting happens, they either employ the people, build something similar or buy the company.

But i think that Google is using its infrastructure not only to gain an advantage in the traditional web. Good recommendation engines etc. are even more important in a time-, screenspace- and bandwidth-restricted space such as the mobile web. So i especially expect Google to launch a mobile version of iGoogle as the default homepage for the mobile web. This is also the place where i see the agreements with the newswires kicking in, either for directly using the content in this kind of context or for using it as the basis and training set for the news clustering and the recommendation engines.

Last but not least, using Javascript and Ajax in the feed-based applications gives Google a good escape route should the content providers argue that google is stealing their content. As far as i can tell, Google is right now tunneling the feed requests through its systems and caching the results on Google’s premises. But besides a speed gain there is no real algorithmic requirement to have the news content going through Google’s systems in order to provide recommendations etc. The Javascript security model might be a hindrance, but as i understand it, dynamic script tags or Flash-based client-side proxies may help with cross-domain request problems.

But what do these technical terms mean in a business sense? They mean that if Google decides not to cache the feed request results (and AFAIK they don’t need to), they can argue that they do not hold any actual MSM content on their systems but just enable the end user to read the content that he wants to read.

The only information stored in Google’s systems is the URLs of the content the user has clicked on, starred, shared, whatever. The integration of the actual content is done completely on the client side at the end user’s request.

Summary

By not opening up their content to lightweight syndication, traditional content publishers will in the short and long term cut deeply into their own flesh. They will lose not only money by not doing so, but something of much higher value: the high degree of trust that their existing readership still has in their newspaper as a cornerstone and beacon for interesting, valuable and verified information.

If they don’t open up their content for syndication, others will do so or, better, enable the end users to do so themselves (e.g. Yahoo! Pipes). So in the end the content will be available and the newspaper publishers will be in the same position as the music and film business. They can decide to join the MPAA and RIAA, sue the end customers and lose the trust the end users ~~still~~ have in them, or they can try to embrace feeds and lightweight syndication and find a way to make money out of it. And to make it clear: Trust in brands is much more important in the newspaper and information business than in the music and film business.

And even if they refuse, they will cause no real harm to Google and the other feed-based aggregators, because these will be able to get the necessary information anyway.

Using feeds it is possible to devise a whole range of compelling news applications, and Google is going to exploit this big time while at the same time overcoming the legal issues with the content providers by using AJAX. And Google is not alone; there are a lot of other players willing to jump in if needed.

The information provided by the users of feed-based applications enables the provider of the application to easily gather all kinds of reading statistics and other information that the traditional media misses big time. But not providing these kinds of applications

Based on the provided information, Google and others are able to use sophisticated algorithms for user and story clustering and to do story recommendation etc. based on these clusters (more or less in realtime). These clustering and recommendation algorithms can then be used to build other, more sophisticated web apps or apps in more constrained scenarios like the mobile web.

Moreover, AJAX-based solutions that integrate the content on the client side make it possible to use these algorithms without any need to store and process the actual content; the links to the content are in principle sufficient for the algorithms and platforms to work.

By neglecting feeds and all the possibilities that they bring with them, traditional news companies will fall even further behind Google and others in the race for the eyeballs of the customers. With the solutions becoming more and more technically complex, it will become ever more difficult for the traditional media companies to catch up by simply buying the technology.

So every day on which a traditional media company does not think about how to embrace feeds is a lost day. I hope that i provided some kind of wake-up call with this rather lengthy article.

What’s next

Since i still see a lot of question marks in the eyes of people when i’m talking about the multitude of use cases that RSS/Atom feeds can be used for (or even when mentioning the terms feeds, RSS and Atom), i decided to start a series in this blog showcasing them, a feedology so to say.

So shortly i will start a series of shorter posts showcasing the different possible uses of feeds, starting with personalized homepages and browser-based feed readers.

If you want to see another use case even before i start this series, just move over to the MyGrazr page on this blog. It is an embeddable feedviewer written in javascript and provided as a free service by a company called grazr. This feedviewer uses my OPML file (i.e. a structured, machine-readable list of the feeds that i’m reading) to let you have a look at the feeds that i’m reading.
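For those who have never looked inside an OPML file: it is just nested outline elements, where folders (or tags) contain entries that carry the feed address in an xmlUrl attribute. The following minimal sketch reads such a file with the Python standard library. The sample file and the function name feeds_from_opml are made up for illustration and are not my actual subscription list.

```python
import xml.etree.ElementTree as ET

# Hypothetical subscription list with one folder and one top-level feed.
SAMPLE_OPML = """<opml version="1.1">
  <head><title>My subscriptions</title></head>
  <body>
    <outline text="google">
      <outline text="Example Blog" type="rss" xmlUrl="http://example.com/feed.xml"/>
    </outline>
    <outline text="Untagged Feed" type="rss" xmlUrl="http://example.org/rss"/>
  </body>
</opml>"""


def feeds_from_opml(opml_text):
    """Return (folder, title, feed_url) tuples for every feed in the outline."""
    root = ET.fromstring(opml_text)
    feeds = []

    def walk(node, folder):
        for outline in node.findall("outline"):
            url = outline.get("xmlUrl")
            if url:
                feeds.append((folder, outline.get("text", ""), url))
            else:
                # An outline without xmlUrl is treated as a folder.
                walk(outline, outline.get("text", ""))

    walk(root.find("body"), "")
    return feeds


if __name__ == "__main__":
    for folder, title, url in feeds_from_opml(SAMPLE_OPML):
        print(folder or "(top level)", title, url)
```

Because the format is this simple, a reading list can be moved between readers, viewers and personalized homepages with very little effort.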

Another use case that you might find interesting is my shared item feed. Here i’m using the sharing feature of the Google Reader to simultaneously mark the feed entries i find most interesting and share these findings with the outside world. So if you like to have a kind of mechanical turk sifting through a lot of blog posts, just have a look at the resulting feed, or if you’re not ready for feeds yet, you can also have a look at the plain old website generated from that feed.

Rem: This post has been sitting in a draft state since the end of 2006. I always wanted it to include some more information etc. But with the belgian appeals court ruling and the out-of-court settlements, and the introduction of remixing tools like Yahoo! Pipes (as el jobso would say: “Insanely great!” – this software deserves at least one separate post) and Microsoft’s Popfly, it HAS to be finished now, or it is definitely going to be either out of date or the length of a diploma thesis.