links for 2009-03-06

  • Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. For more information about Tika, please see the list of supported document formats and the available documentation. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.
  • OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects. The site keeps a current list of OpenNLP projects, along with a fairly up-to-date list of useful links related to NLP software in general.

    OpenNLP also hosts a variety of Java-based NLP tools which perform sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference resolution using the OpenNLP Maxent machine learning package. To start using these tools, download the latest release and check out the OpenNLP Tools API. For the latest news about these tools and to participate in discussions, check out OpenNLP's SourceForge project page.

  • LT World is the most comprehensive WWW information service and knowledge source on the wide range of technologies that deal with human language. The service is provided by the German Language Technology Competence Center at DFKI. Contents will constantly be improved. Please send corrections and pointers to missing information to
  • ICE controls the flow of information in your company through adaptive AI components that enrich data with relevant additional information. The flexibly extensible client-server application covers a broad range of functions:

    * Fully automatic text categorization based on AI or on easily created rules
    * Extraction of proper names or relevant details
    * Detection of text clusters and duplicates for data cleansing
    * Language identification and other preprocessing steps

  • GATE is…

    * the Eclipse of Natural Language Engineering, the Lucene of Information Extraction, a leading toolkit for Text Mining
    * used worldwide by thousands of scientists, companies, teachers and students
    * composed of an architecture, a free open source framework (or SDK), and a graphical development environment
    * used for all sorts of language processing tasks, including Information Extraction in many languages
    * funded by the EPSRC, BBSRC, AHRC, the EU and commercial users
    * 100% Java reference implementation of ISO TC37/SC4 and used with XCES in the ANC
    * 10 years old in 2005, used in many research projects and compatible with IBM's UIMA
    * based on MVC, mobile code, continuous integration, and test-driven development, with code hosted on SourceForge

  • Ellogon is a multi-lingual, cross-platform, general-purpose language engineering environment, developed to aid both researchers in computational linguistics and companies that produce and deliver language engineering systems. As a language engineering platform, Ellogon offers an extensive set of facilities, including tools for processing and visualising textual/HTML/XML data and associated linguistic information, support for lexical resources (like creating and embedding lexicons), and tools for creating annotated corpora, accessing databases, comparing annotated data, or transforming linguistic information into vectors for use with various machine learning algorithms.