ARD.ZDF Medienakademie BigData

Yesterday I gave a talk about the BMBF-funded project Newsstream at the symposium “Big-Data: Produktiver Mehrwert oder unberechenbare Datenflut?” (“Big data: productive added value or unpredictable flood of data?”).

Besides introducing the dpa-newslab and giving an overview of the project, I focused on why the dpa-newslab and dpa are taking part in it, and presented the first “epics” and demonstrators.

Even if the slides may not be entirely self-explanatory without the audio track, here they are:

I just noticed that one year after the German newspaper “Financial Times Deutschland” ceased operation on Dec. 7th 2012, its website has also been shut down:


The domain itself is still registered to Gruner und Jahr.

After the announcement of the shutdown, I decided to use some spare time to scrape and archive the site, but only found the time in January and February of this year. Luckily most of the site was still there and ready to scrape, including the publication's print archive, which was otherwise only available behind the paywall.

Instead of just throwing a recursive wget at the site, I decided to (re-)develop my Python skills and learn a little about the web archive format WARC.

Hence I wrote some tailored scrapers and can report that I think I have a fairly complete print archive (72186 PDF pages and 201234 text articles, approx. 21.4 GB) as well as most of the website content archived (HTML, images and audio).
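For readers who have not met WARC before, here is a minimal sketch of what a single record looks like when written by hand with nothing but the Python standard library. This is not the code of the scrapers mentioned above; the helper name `build_warc_resource_record` and the example URL are hypothetical, and a real archive would more likely use `response` records with the full HTTP headers (or a dedicated WARC library) rather than plain `resource` records.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri: str, payload: bytes,
                               content_type: str = "text/html") -> bytes:
    """Serialize one WARC/1.0 'resource' record (hypothetical helper)."""
    # WARC-Date is a UTC timestamp such as 2013-02-01T12:00:00Z.
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines are CRLF-terminated; an empty line separates them
    # from the content block.
    header_block = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Each record is closed by two CRLFs after the content block.
    return header_block + payload + b"\r\n\r\n"
```

Appending the bytes returned for each scraped page to one file opened in binary mode, e.g. `build_warc_resource_record("http://example.com/article.html", html_bytes)`, yields an uncompressed WARC file. In practice each record is usually gzip-compressed individually.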

I plan to use it for some personal research and to report on the results on this very blog (and maybe a separate site).

Howto: Mavericks ISO Install Image

In order to install Mavericks as a VMware Fusion 6 virtual machine, I needed to build an ISO install image of Mavericks.

I found this set of commands, which worked very well:

# Mount the installer image
hdiutil attach /Applications/Install\ OS\ X\ -noverify -nobrowse -mountpoint /Volumes/install_app
# Convert the boot image to a sparse image (UDSP)
hdiutil convert /Volumes/install_app/BaseSystem.dmg -format UDSP -o /tmp/Mavericks
# Increase the sparse image capacity to accommodate the packages
hdiutil resize -size 8g /tmp/Mavericks.sparseimage
# Mount the sparse image for package addition
hdiutil attach /tmp/Mavericks.sparseimage -noverify -nobrowse -mountpoint /Volumes/install_build
# Remove the Packages symlink and replace it with the actual files
rm /Volumes/install_build/System/Installation/Packages
cp -rp /Volumes/install_app/Packages /Volumes/install_build/System/Installation/
# Unmount the installer image
hdiutil detach /Volumes/install_app
# Unmount the sparse image
hdiutil detach /Volumes/install_build
# Shrink the sparse image to its minimum size (in 512-byte sectors) to remove any free space
hdiutil resize -size "$(hdiutil resize -limits /tmp/Mavericks.sparseimage | tail -n 1 | awk '{ print $1 }')b" /tmp/Mavericks.sparseimage
# Convert the sparse image to ISO/CD master (UDTO)
hdiutil convert /tmp/Mavericks.sparseimage -format UDTO -o /tmp/Mavericks
# Remove the sparse image
rm /tmp/Mavericks.sparseimage
# Rename the ISO and move it to the desktop
mv /tmp/Mavericks.cdr ~/Desktop/Mavericks.iso