Could not resolve host: www.ftd.de


I just noticed that one year after ceasing operation on Dec. 7th 2012, the website of the German newspaper “Financial Times Deutschland” has also been shut down:

 

curl http://www.ftd.de
curl: (6) Could not resolve host: www.ftd.de

The domain itself is still registered to Gruner und Jahr.
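
For illustration, the same check can be done from Python. This is only a minimal sketch: the function name host_resolves and its resolver parameter are made up for this example, and it checks name resolution only, which is the step curl reports as error (6).

```python
import socket


def host_resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves to at least one address."""
    try:
        # Port 80 matches the plain-HTTP URL used in the curl call.
        return len(resolver(hostname, 80)) > 0
    except socket.gaierror:
        # Name resolution failed, the case curl reports as error (6).
        return False
```

Calling host_resolves("www.ftd.de") performs a real DNS lookup; the resolver parameter exists only so the function can be tried out without network access.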

After the announcement of the shutdown, I decided to use some spare time to scrape and archive the site, but only found the time in January and February of this year. Luckily, most of the site was still there and ready to scrape, including the publication's print archive, which was only available behind the paywall.

Instead of just throwing a recursive wget at the site, I decided to (re-)develop my Python skills and learn a little about the web archive format WARC.

Hence I wrote some tailored scrapers and can report that I think I have a fairly complete print archive (72186 PDF pages and 201234 text articles, approx. 21.4 GB) as well as most of the website content (HTML, images and audio) archived.
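
To give an idea of what a WARC file looks like on the inside, here is a minimal sketch that stores one downloaded document as a WARC/1.0 "resource" record using only the standard library. It is not the scraper code mentioned above: the function names build_warc_record and append_record are hypothetical, and a real archive would usually also keep the HTTP request and response records.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload, content_type):
    """Build one WARC/1.0 'resource' record as bytes (payload is bytes)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        "WARC-Record-ID: <urn:uuid:%s>" % uuid.uuid4(),
        "WARC-Date: %s" % now,
        "WARC-Target-URI: %s" % target_uri,
        "Content-Type: %s" % content_type,
        "Content-Length: %d" % len(payload),
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record is the header block, a blank line, the payload, then two CRLFs.
    return head + payload + b"\r\n\r\n"


def append_record(path, record):
    """Append the record to a .warc.gz file as its own gzip member."""
    with gzip.open(path, "ab") as archive:
        archive.write(record)
```

Writing each record as a separate gzip member keeps the file readable as one stream while still allowing a single record to be decompressed on its own.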

I plan to use it for some personal research and to report on that research on this very blog (and maybe a separate site).