Answer Type Detection for German Language

The second post of the NLPG project at cip-labs deals with data. It provides the data you need to train a German-language classifier based on the widely used taxonomy of Li and Roth [1]. The data was published as part of Alexander Bresk's master's thesis [2], which also introduces a novel classifier that is trained and evaluated on this data for answer type detection. This article presents the taxonomy of Li and Roth, the provided data, and the trained classifiers.

The taxonomy

Li and Roth introduced a taxonomy that was the standard in answer type detection for a long time. It comprises 6 master classes and 50 sub classes. The 6 master classes are:

  • Abbreviation (2)
  • Description (4)
  • Entity (22)
  • Human (4)
  • Location (5) and
  • Numeric (13).

The numbers in brackets give the number of sub classes for each master class. According to Li and Roth, these classes cover most answer types.
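As a quick sanity check, the master classes and their sub class counts can be written down as a small lookup table. This is only a sketch: the name `SUB_CLASS_COUNTS` is made up for illustration and is not part of the published data.

```python
# Number of sub classes per master class, copied from the list above.
# The dictionary name is an illustrative assumption, not part of the data set.
SUB_CLASS_COUNTS = {
    "Abbreviation": 2,
    "Description": 4,
    "Entity": 22,
    "Human": 4,
    "Location": 5,
    "Numeric": 13,
}

master_class_total = len(SUB_CLASS_COUNTS)
sub_class_total = sum(SUB_CLASS_COUNTS.values())

print(master_class_total, sub_class_total)
```

The two totals come out as 6 master classes and 50 sub classes, matching the figures given for the taxonomy.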

The data

The presented data consists of 1548 questions in the training set and 735 questions in the test/evaluation set. The samples are distributed roughly evenly across the sub classes, so the number of questions per master class scales with the number of sub classes it contains.

  • Master classes (Train #1548 / Test #735)
  • Abbreviation (Train #72 / Test #30)
  • Description (Train #141 / Test #80)
  • Entity (Train #678 / Test #313)
  • Human (Train #119 / Test #60)
  • Location (Train #150 / Test #75)
  • Numeric (Train #388 / Test #177)
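The counts above can be checked with a few lines of Python. The sketch below recomputes the totals and the average number of questions per sub class for each master class. The variable names are illustrative assumptions; the numbers are copied from this article.

```python
# (train, test) question counts per master class, copied from the list above.
SAMPLE_COUNTS = {
    "Abbreviation": (72, 30),
    "Description": (141, 80),
    "Entity": (678, 313),
    "Human": (119, 60),
    "Location": (150, 75),
    "Numeric": (388, 177),
}

# Sub classes per master class, from the taxonomy section.
SUB_CLASS_COUNTS = {
    "Abbreviation": 2,
    "Description": 4,
    "Entity": 22,
    "Human": 4,
    "Location": 5,
    "Numeric": 13,
}

train_total = sum(train for train, _ in SAMPLE_COUNTS.values())
test_total = sum(test for _, test in SAMPLE_COUNTS.values())
print(train_total, test_total)

# Average number of questions per sub class within each master class.
for name, (train, test) in SAMPLE_COUNTS.items():
    n = SUB_CLASS_COUNTS[name]
    print(f"{name:<13} train per sub class: {train / n:5.1f}  test per sub class: {test / n:5.1f}")
```

The per-class counts add up to the stated totals of 1548 training and 735 test questions.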

In the master classes file, each question is labeled as follows:

Was ist die Hauptstadt von Deutschland?#LOCATION

In the sub classes file, the same question is labeled as follows:

Was ist die Hauptstadt von Deutschland?#LOC_CITY

The annotation for a sub class consists of the first three letters of the master class, followed by an underscore and the name of the sub class.
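A minimal sketch of how such lines could be parsed is shown below. It assumes one question per line with the label after the last `#`, as in the two examples above. The function names `parse_line` and `master_prefix` are hypothetical helpers, not part of the published data or the thesis.

```python
def parse_line(line):
    """Split a labeled line into (question, label) at the last '#'."""
    question, sep, label = line.strip().rpartition("#")
    if not sep or not question or not label:
        raise ValueError(f"not a labeled question: {line!r}")
    return question, label


def master_prefix(sub_label):
    """Return the master class prefix of a sub class label, e.g. 'LOC'."""
    prefix, sep, _ = sub_label.partition("_")
    if not sep:
        raise ValueError(f"not a sub class label: {sub_label!r}")
    return prefix


master_example = parse_line("Was ist die Hauptstadt von Deutschland?#LOCATION")
sub_example = parse_line("Was ist die Hauptstadt von Deutschland?#LOC_CITY")

print(master_example)
print(sub_example)

# The sub class prefix is the first three letters of the master class label.
print(master_example[1].startswith(master_prefix(sub_example[1])))
```

Splitting at the last `#` rather than the first keeps a question intact even if its text happens to contain a `#` character.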

Current State of the Art

Currently, the state of the art in answer type detection uses models that look for correlations in semantic networks to determine the answer type; research from 2015 [3] takes this approach. The model of Li and Roth is still used by many researchers, while newer systems rely solely on semantic approaches.

Download

Here you can download the data to train your own answer type detector for the German language.

Sources

[1] Li, Xin, and Dan Roth. “Learning question classifiers: the role of semantic information.” Natural Language Engineering 12.03 (2006): 229-249.

[2] Alexander Bresk. “Question Answering using Unstructured Data”. HTW Dresden. 2015

[3] SUN, H., MA, H., YI, W., TSAI, C., LIU, J. and CHANG, M. Open domain question answering via semantic enrichment. In Proceedings of the companion publication of the 24th international conference on World Wide Web. ACM – Association for Computing Machinery, May 2015. URL http://research.microsoft.com/apps/pubs/default.aspx?id=241399.