I’ve been accepted to Google Summer of Code 2014 with my proposal for Apache Stanbol. Apache Stanbol provides a set of reusable components for semantic content management. Its intended use is to extend traditional content management systems with semantic services. Other feasible use cases include direct usage from web applications (e.g. tag extraction/suggestion or text completion in search fields), and ‘smart’ content workflows or email routing based on extracted entities, topics, etc.
A topic classification engine is currently being developed under STANBOL-197. The plan described in that issue is to use MoreLikeThis queries on a SolrYard instance: each topic is indexed by aggregating the abstract text of all DBpedia entities categorized under a given SKOS topic, and a text is then classified by running a MoreLikeThis query against that index. However, that engine uses a Solr-based classifier, and no work has been done so far with OpenNLP or Apache Mahout. STANBOL-1294 aims to provide alternative implementations of the topic classifier, for example ones built on well-known libraries such as Apache Mahout or OpenNLP.
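To make the idea behind STANBOL-197 a bit more concrete, here is a minimal sketch in plain Python. It is not Stanbol or Solr code: the function names (`build_topic_index`, `classify`) and the sample abstracts are made up for illustration, and a simple cosine similarity over term counts stands in for a real MoreLikeThis query, which as far as I know relies on TF-IDF-style term selection in Lucene.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_topic_index(abstracts_by_topic):
    """Aggregate all abstracts of a topic into one term-count vector.

    This mirrors the 'one indexed document per SKOS topic' idea;
    a real setup would store these documents in a Solr index instead.
    """
    return {
        topic: Counter(tokenize(" ".join(abstracts)))
        for topic, abstracts in abstracts_by_topic.items()
    }


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, index, top_n=3):
    """Return up to top_n (topic, score) pairs, most similar topic first.

    Stand-in for a MoreLikeThis query: topics with no term overlap
    (score 0) are dropped.
    """
    query = Counter(tokenize(text))
    scored = [(topic, cosine(query, vector)) for topic, vector in index.items()]
    scored = [(topic, score) for topic, score in scored if score > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical training data: abstracts grouped by topic (invented examples).
abstracts_by_topic = {
    "Category:Music": [
        "A guitar is a string instrument.",
        "Jazz is a music genre with improvisation.",
    ],
    "Category:Astronomy": [
        "A planet orbits a star.",
        "A telescope observes distant galaxies and stars.",
    ],
}

index = build_topic_index(abstracts_by_topic)
suggestions = classify("The band played jazz guitar all night", index)
```

Here `suggestions` holds the ranked topics for the input sentence. Swapping the similarity function for a trained model is roughly where alternative back ends such as Mahout or OpenNLP would come in.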
My proposal is to develop a topic classification engine for Apache Stanbol that uses Solr, Mahout and OpenNLP. I’ll refactor the current API, adapt the current TopicClassification engine to work with the new APIs, and also work on different implementations for managing the TrainingSet.
You can check my proposal on the Google Melange website. I’ll try to share my experiences during the GSoC 2014 period.