A Note on Answer Type Detection using German Text

This short note, originally published on my private blog, is part of the preparation for my master's thesis. The thesis proposes a Question Answering system that pursues two goals:

  • Replace a static FAQ section with an input field to search in unstructured data
  • Act as a part of a dialog system for a service robot.

One step of this system (possibly the first in the pipeline) deals with detecting the correct answer type. This note collects my ideas and remarks on that topic: Answer Type Detection (ATD).


Answer Type Detection is about finding the type of an answer. „Type“ is a very vague term. Why? Because every QA system has a different task to solve. If you are searching for factoid answers, then the „type“ often means a data type (date, string, number, etc.). For a more complex system, Answer Type Detection has to be carried out very accurately. Imagine you have a set of documents and are searching them for the answer. Then ATD (assuming that your results are trustworthy) helps to:

  • Locate the answer in your documents
  • Choose the search & evaluation strategy


There are a few other studies that deal with this topic. Some of them are introduced below; other works include Zhang and Lee, 2003; Huang et al., 2008; Silva et al., 2011; Loni et al., 2011; Blunsom et al., 2006; Li and Roth, 2004; and Merkel and Klakow, 2007.


Davidescu et al.[1] evaluate some interesting features for Answer Type Detection. The team worked with a German corpus (the data was crawled from web sources) and 7 answer types (Definition, Location, Person, Yes/No, Explanation, Date, Number/Count). The approach is very similar to my work on Answer Type Detection, but I used a larger training/test set than they did. The paper discusses the following features:

  • Collocations
  • Trigger words
  • Punctuation marks (a very weak feature, note AB)
  • Question length (also a very weak feature, note AB)
  • Named entities (LingPipe)
  • Lemmas (TreeTagger)
  • POS tags (TreeTagger)
  • Bag of Words
  • Sentence patterns

They used three well-known classifiers for the tests (Naïve Bayes, kNN and Decision Trees). My takeaways from this paper were:

  • NER Features don’t improve the system significantly
  • Bag of Words clearly improves the results (see paper for more info)
  • Statistical Collocation also improved the results (see paper for more info)

The results fluctuate between 54% and 65% accuracy. The best result was reached with the Baseline+StatisticalColloc+BagOfWords feature set using a Naïve Bayes classifier.
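
To make the bag-of-words plus Naïve Bayes combination more concrete, here is a minimal sketch in plain Python. It is not the implementation from the paper: the toy training questions, the simple tokenizer and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(question):
    # Assumption: lowercase whitespace tokens with trailing punctuation stripped.
    stripped = (tok.strip("?.,!").lower() for tok in question.split())
    return [tok for tok in stripped if tok]

class NaiveBayesBoW:
    """Multinomial Naive Bayes over bag-of-words counts (illustrative only)."""

    def fit(self, questions, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for question, label in zip(questions, labels):
            for tok in tokenize(question):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, question):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(question):
                # Add-one smoothing so unseen words do not zero out a class.
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy data, not taken from the corpus used in the paper.
train = [
    ("Wo liegt Dresden?", "Location"),
    ("Wo ist der Bahnhof?", "Location"),
    ("Wer war Goethe?", "Person"),
    ("Wer ist der Bundeskanzler?", "Person"),
    ("Wann war die Wende?", "Date"),
    ("Wann beginnt der Winter?", "Date"),
]
model = NaiveBayesBoW().fit([q for q, _ in train], [y for _, y in train])
print(model.predict("Wo liegt Tübingen?"))
```

With so little data the question word dominates the decision, which is roughly what one would expect from a pure bag-of-words model on short questions.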


The approach of Kirk Roberts and Andrew Hickl[3] uses a hierarchical classifier for scenarios with up to 200 classes (including subclasses). In „Scaling Answer Type Detection to Large Hierarchies“ they define Coarse Answer Types and Fine Answer Types.


This approach (which is also explained in other papers by the same and by other authors; see the secondary papers and publications below) outperforms approaches that use flat classifiers in the same scenario by up to 7%. I had already arrived at the idea of using hierarchical (cascaded) classifiers in a previous text classification task. Some interesting numbers from the approach:

  • Reaches a precision of 85% with 200 classes
  • Test/training set containing 20,000 rows
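
The coarse-then-fine idea can be sketched as a small cascade. This is only an illustration of the control flow, not the classifier from the paper: the two keyword-based stand-in classifiers and the labels CITY, LOCATION_OTHER and OTHER are hypothetical.

```python
from typing import Callable, Dict, Tuple

Classifier = Callable[[str], str]

def make_hierarchical(
    coarse: Classifier, fine_by_coarse: Dict[str, Classifier]
) -> Callable[[str], Tuple[str, str]]:
    """Build a cascade: pick a coarse type first, then refine within it."""

    def classify(question: str) -> Tuple[str, str]:
        coarse_label = coarse(question)
        fine = fine_by_coarse.get(coarse_label)
        # Without a fine classifier for this branch, keep the coarse label.
        fine_label = fine(question) if fine else coarse_label
        return coarse_label, fine_label

    return classify

# Hypothetical stand-ins; a real system would use trained models here.
def coarse_stub(question: str) -> str:
    return "LOCATION" if question.lower().startswith("wo ") else "OTHER"

def location_fine_stub(question: str) -> str:
    return "CITY" if "stadt" in question.lower() else "LOCATION_OTHER"

classify = make_hierarchical(coarse_stub, {"LOCATION": location_fine_stub})
print(classify("Wo liegt die Stadt Dresden?"))
```

Each fine classifier only has to separate the subclasses of one coarse type, which is one plausible reason why a cascade can scale better than a flat classifier over all classes.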


Babak Loni's master's thesis[6] deals with the enhancement of feature sets and especially with combining features to improve performance. The approach uses neural networks and SVMs. My approach also uses neural networks and SVMs, with a fairly similar grouping of features: lexical, syntactic and semantic.
The lexical features are unigram, bigram, trigram, wh-word, word shapes and question length. My approach also uses all of these features. His syntactic features are tagged unigrams, POS tags, headword and head rule (my approach also uses all of these features, with different implementations). Finally, the semantic features are headword hypernyms, related words (using WordNet), question category (also using WordNet) and query expansion (also using WordNet). I'm using or plan to use similar features with the German equivalent GermaNet and with SALSA (which can be seen as the German FrameNet).
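
As a rough illustration of the lexical feature group, the following sketch extracts n-grams, the wh-word, word shapes and question length from a German question. The question word list and the shape alphabet (X for uppercase, x for lowercase, d for digits) are assumptions of mine, not definitions taken from [6].

```python
import re

# Assumption: a small, incomplete list of German question words.
WH_WORDS = {"wer", "was", "wo", "wann", "warum", "wie", "welche", "welcher", "welches"}

def word_shape(token):
    # Map every character to a coarse class: X upper, x lower, d digit.
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def lexical_features(question):
    raw = re.findall(r"\w+", question)
    lowered = [tok.lower() for tok in raw]
    return {
        "unigrams": lowered,
        "bigrams": ngrams(lowered, 2),
        "trigrams": ngrams(lowered, 3),
        "wh_word": next((tok for tok in lowered if tok in WH_WORDS), None),
        "word_shapes": [word_shape(tok) for tok in raw],
        "question_length": len(raw),
    }

features = lexical_features("Wann wurde Dresden 1206 gegründet?")
print(features["wh_word"], features["word_shapes"])
```

Word shapes are computed on the original casing, which matters for German because capitalized nouns and names produce shapes starting with X.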

The results vary, but the best recorded result, 93.8% (coarse classes), was reached with the back-propagation neural network and the „WH+WS+LB+H+R“ feature set. The data set consists of 6 coarse classes and 50 fine classes (TREC set). The key insight of this work (for me) is the reduction of the feature dimensions from up to 14,694 down to 400. (Note: This chapter needs further investigation.)


To treat this section exhaustively we have to consider the big picture. Besides QA (Question Answering) there is another topic: UQA (Universal Question Answering). The raison d’être for UQA comes from the view that QA is treated as a purely factoid task, although in most cases questions don’t expect a fact. Certainly, many ordinary questions ask for a person, a location, a point in time and so on. Junta Mizuno and Tomoyosi Akiba[2] from Toyohashi University investigated answer types that go beyond factoids (and they called this UQA). The classes considered are Fact, Method, Opinion, Experience, Reason, Definition, Possibility, Illegal, Factoid, Multiple Types (complex) and Other, listed in descending order of frequency. For their study they collected a set of 1,187,873 questions.

Another consideration is the subjectivity of a question. What should a system answer when the questioner asks „What is the prettiest place in Germany?“ or, to use an example in my target language, „Was ist der schönste Platz in Deutschland?“? Of course, the system could develop a taste of its own, but this requires an internal rater that uses an algorithm to rate the content. This isn’t part of ATD, so we disregard the topic at this point. My goal was just to raise awareness that a question can ask for an objective answer or a subjective impression.

Another problem is the „Explain vs. Boolean“ competition. While creating my answer type model I thought about yes/no questions. Every yes/no question expects a boolean value (e.g. „Kann man von Dresden nach Tübingen fahren, ohne umzusteigen?“, in English: „Can you travel from Dresden to Tübingen without changing trains?“), but some of these yes/no questions also need an explanation or a more detailed answer. This is the „explain“ part of the answer. It is hard to find out which question just needs a yes/no and which question wants an explanation. My current solution is that the fallback should be an explanation (when no hard yes/no decision was made) and the icing on the cake is a yes/no decision.
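
The „explanation as fallback, yes/no on top“ rule could look roughly like the sketch below. The confidence score, the threshold of 0.9 and the names YesNoAnswer and build_answer are hypothetical; the text above does not say how a hard yes/no decision is reached.

```python
from typing import NamedTuple, Optional

class YesNoAnswer(NamedTuple):
    explanation: str          # always present: the fallback
    verdict: Optional[bool]   # only set when a hard decision was possible

def build_answer(explanation: str, yes_confidence: float,
                 threshold: float = 0.9) -> YesNoAnswer:
    # Assumption: some upstream component estimates how likely "yes" is.
    if yes_confidence >= threshold:
        verdict = True
    elif yes_confidence <= 1 - threshold:
        verdict = False
    else:
        verdict = None  # no hard decision, the explanation has to carry the answer
    return YesNoAnswer(explanation, verdict)

answer = build_answer("Es gibt eine Verbindung mit Umstieg in Stuttgart.", 0.5)
print(answer.verdict)
```

The explanation is returned in every case, so a missing verdict never leaves the questioner without an answer.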


The dataset I’m currently working with contains 8,904 questions (collected from frag.wikia.de, labeled manually by me). I’ve split this data into 80% for training and 20% for testing, with cross-validation.
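
One way to combine an 80/20 split with cross-validation is 5-fold cross-validation, where every fold serves as the 20% test part once. The number of folds is not stated above, so k=5, the fixed seed and the function name k_fold_indices are assumptions in this sketch.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists; with k=5 each split is roughly 80/20."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(8904))
for train, test in splits:
    print(len(train), len(test))
```

Averaging the accuracy over all folds gives a more stable estimate than a single 80/20 split.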

For my current approach I’m using 7 classes: DATETIME, ENTITY, LOCATION, PERSON, NUMBER, EXPLAIN and NAME. These classes are not disjoint subsets, which reflects the ambiguity in understanding a question. Below I will try to sketch the classes from the most specific class up to the fallback class.





These classes are sufficient to solve the tasks of my thesis. The fallback (e.g. no location found, but entity found) allows the AT classifier to widen the search range and still find the right answer. Another reason to implement this fallback is the subsequent use of the system in a dialog system. To guide the user through a dialog, the system should also (in case of uncertainty) return sentences that contain the right answer (or a hint towards it).
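
The fallback idea can be sketched as a chain that is walked until a search returns something. Only the LOCATION to ENTITY link comes from the example above; the function names, the toy search function and any further links are hypothetical.

```python
# Only this link is taken from the text; other links would be added as needed.
FALLBACK = {"LOCATION": "ENTITY"}

def search_with_fallback(answer_type, search, fallback=FALLBACK):
    """Try the detected type first, then widen the search along the chain."""
    current = answer_type
    tried = []
    while current is not None and current not in tried:
        tried.append(current)
        hits = search(current)
        if hits:
            return current, hits
        current = fallback.get(current)
    return None, []

# Hypothetical search function: pretend only ENTITY candidates were found.
def toy_search(answer_type):
    if answer_type == "ENTITY":
        return ["Dresden ist eine Stadt in Sachsen."]
    return []

print(search_with_fallback("LOCATION", toy_search))
```

The `tried` list guards against cycles in the fallback map, which matters because the classes are not disjoint.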


With the setup explained above I’ve developed the V-perception approach, where the features are assigned to various perception levels. The idea for the name came from reading some books about neuroscience and the visual cortex. Feature values were set to {-1,1} for SVMs and {0,1} for NNs.


The first level is the basis of feature generation. Here I’ve used Known Entities (e.g. „Stadt“ (city) → location) and so on. In contrast to some other approaches, I’ve limited the dimensionality of this KE feature set. For each class I’ve reserved just 3 dimensions. When 3 Known Entities for a specific class occur in the question, all 3 of these dimensions are activated (if just 1 KE of class x is found, then just one dimension is activated). The reason for this decision is to avoid wasting dimensions and thereby to keep computation time down.
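
A minimal sketch of the capped Known Entity encoding: three slots per class, filled from the left with one active slot per matched entity. The tiny lexicon is hypothetical apart from „Stadt“, and the `inactive` parameter is my way of switching between the {0,1} and {-1,1} value ranges mentioned earlier.

```python
import re

# Hypothetical mini lexicon; only "stadt" for LOCATION is taken from the text.
KNOWN_ENTITIES = {
    "LOCATION": {"stadt", "land", "fluss"},
    "PERSON": {"präsident", "autor", "sänger"},
}
SLOTS_PER_CLASS = 3

def ke_features(question, lexicon=KNOWN_ENTITIES, inactive=0):
    """Return SLOTS_PER_CLASS values per class; use inactive=-1 for SVM input."""
    tokens = set(re.findall(r"\w+", question.lower()))
    vector = []
    for cls in sorted(lexicon):  # fixed class order keeps dimensions stable
        hits = min(len(tokens & lexicon[cls]), SLOTS_PER_CLASS)
        vector.extend([1] * hits + [inactive] * (SLOTS_PER_CLASS - hits))
    return vector

print(ke_features("In welchem Land liegt die Stadt Dresden?"))
print(ke_features("In welchem Land liegt die Stadt Dresden?", inactive=-1))
```

The vector length depends only on the number of classes, not on the size of the lexicon, which is the point of capping the slots.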

In previous work I found that it is a good strategy to tackle text classification tasks with „starts with word“, „contains word“ and „ends with word“ features. These features are also part of my V1 feature set. This includes the search for question words and for words that support question words.
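
The three positional checks can be sketched as one binary triple per cue word. The cue word list and the function name position_features are assumptions; the actual word lists are not given here.

```python
import re

def position_features(question, cue_words):
    """For each cue word emit [starts_with, contains, ends_with] as 0/1 values."""
    tokens = re.findall(r"\w+", question.lower())
    features = []
    for word in cue_words:
        features.append(int(bool(tokens) and tokens[0] == word))
        features.append(int(word in tokens))
        features.append(int(bool(tokens) and tokens[-1] == word))
    return features

# Hypothetical cue words: two German question words.
print(position_features("Wo liegt Dresden?", ["wo", "wann"]))
```

Punctuation is dropped by the tokenizer, so „ends with word“ refers to the last real word and not to the question mark.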


The grammar level tries to find grammar patterns that are typical of the various types of questions. The current approach uses the first 6 words of the question to find patterns. For annotation I’m using the Stanford POS Tagger, which uses the STTS tag set (Stuttgart-Tübingen-TagSet). In addition to that I’m searching for sentence patterns in the question.
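
As a sketch of the six-word window, the function below turns an already tagged question into a fixed-length tag pattern. It assumes the tagger output is available as (word, tag) pairs; the padding symbol and the decision to count punctuation as a token are my assumptions.

```python
WINDOW = 6
PAD = "<PAD>"  # hypothetical filler for questions shorter than the window

def pos_pattern(tagged_question, window=WINDOW):
    """Return the tags of the first `window` tokens as a fixed-length tuple."""
    tags = [tag for _, tag in tagged_question][:window]
    return tuple(tags + [PAD] * (window - len(tags)))

# Hand-tagged example using STTS tags (not produced by a real tagger run).
tagged = [("Wo", "PWAV"), ("liegt", "VVFIN"), ("Dresden", "NE"), ("?", "$.")]
print(pos_pattern(tagged))
```

A fixed length keeps the pattern usable as a feature block of constant size, regardless of how long the question is.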


Sources [4] and [5] show the role of semantic features for ATD tasks and for QA tasks in general. I’m currently evaluating GermaNet and SALSA (the German FrameNet equivalent). SALSA will then be used to annotate semantic frames in order to extract the intent of a specific question. GermaNet can be used to extend the Known Entities approach with synonyms of the nouns and verbs used in the question.


With the features and dataset introduced above I’ve used SVMs and NNs to test my feature sets. A short, non-exhaustive look at the results:

SVM-7,0-Poly-2,5-5,0-8,0: 96,856% acc.
NN-800-1-80: 94,554% acc.
Naïve Bayes-7: 87,76% acc.
SVM-7,0-RBF-0,05: 72,993% acc.

It seems that the feature set used for this approach outperforms other state-of-the-art approaches (especially using SVM). My assumption is that the features can be powerful, but the system needs an excellent preprocessing of all extracted features to perform well.


Some further work and to-dos are listed below:

  • Test word-shape features from [6]
  • Test Freebase as a source of possible features
  • Use a maps application as a source for location-based questions
  • GermaNet extension with labeling of classes (test with NN, NE, VV)
  • SALSA annotation of question (semantic role labeling)



[1] Davidescu et al.; Classifying German Questions According to Ontology-Based Answer Types; Spoken Language Systems, Saarland University

[2] Junta Mizuno and Tomoyosi Akiba; Non-factoid Question Answering Experiments at NTCIR-6: Towards Answer Type Detection for Real World Questions; Toyohashi University of Technology

[3] Kirk Roberts and Andrew Hickl; Scaling Answer Type Detection to Large Hierarchies; Language Computer Corporation

[4] Xin Li and Dan Roth; Learning Question Classifiers: The Role of Semantic Information; University of Illinois at Urbana-Champaign

[5] Dan Shen, Mirella Lapata; Using Semantic Roles to Improve Question Answering; Spoken Language Systems Saarland University, School of Informatics University of Edinburgh

[6] Babak Loni; Enhanced Question Classification with Optimal Combination of Features; Department of Media and Knowledge Engineering, Delft University of Technology; 2011


E. Brill, S. Dumais, M. Banko. 2002. An analysis of the askMSR question-answering system. In Proceedings of the EMNLP, 257–264, Philadelphia, PA.
X. Carreras, L. Màrquez, eds. 2005. Proceedings of the CoNLL shared task: Semantic role labelling, 2005.
T. Cormen, C. Leiserson, R. Rivest. 1990. Introduction to Algorithms. MIT Press.
H. Cui, R. X. Sun, K. Y. Li, M. Y. Kan, T. S. Chua. 2005. Question answering passage retrieval using dependency relations. In Proceedings of the ACM SIGIR, 400–407. ACM Press.
T. Eiter, H. Mannila. 1997. Distance measures for point sets and their computation. Acta Informatica, 34(2):109–133.
K. Erk, S. Padó. 2006. Shalmaneser – a flexible toolbox for semantic role assignment. In Proceedings of the LREC, 527–532, Genoa, Italy.
C. Fellbaum, ed. 1998. WordNet. An Electronic Lexical Database. MIT Press, Cambridge/Mass.
C. J. Fillmore, C. R. Johnson, M. R. Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16:235–250.
D. Gildea, D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.
T. Grenager, C. D. Manning. 2006. Unsupervised discovery of a statistical verb lexicon. In Proceedings of the EMNLP, 1–8, Sydney, Australia.
R. Jonker, A. Volgenant. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38:325–340.
M. Kaisser. 2006. Web question answering by exploiting wide-coverage lexical resources. In Proceedings of the 11th ESSLLI Student Session, 203–213.
J. Leidner, J. Bos, T. Dalmas, J. Curran, S. Clark, C. Bannard, B. Webber, M. Steedman. 2004. The qed open-domain answer retrieval system for TREC 2003. In Proceedings of the TREC, 595–599.
C. Leslie, E. Eskin, W. S. Noble. 2002. The spectrum kernel: a string kernel for SVM protein classification. In Proceedings of the Pacific Biocomputing Symposium, 564–575.
B. Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago.
X. Li, D. Roth. 2002. Learning question classifiers. In Proceedings of the 19th COLING, 556–562, Taipei, Taiwan.
D. K. Lin. 1994. PRINCIPAR–an efficient, broad-coverage, principle-based parser. In Proceedings of the 15th COLING, 482–488.
D. Moldovan, C. Clark, S. Harabagiu, S. Maiorano. 2003. COGEX: A logic prover for question answering. In Proceedings of the HLT/NAACL, 87–93, Edmonton, Canada.
S. Narayanan, S. Harabagiu. 2004. Question answering based on semantic structures. In Proceedings of the 19th COLING, 184–191.
S. Padó, M. Lapata. 2006. Optimal constituent alignment with edge covers for semantic projection. In Proceedings of the COLING/ACL, 1161–1168.
M. Palmer, D. Gildea, P. Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.
D. Paranjpe, G. Ramakrishnan, S. Srinivasa. 2003. Passage scoring for question answering via bayesian inference on lexical relations. In Proceedings of the TREC, 305–210.
S. Pradhan, W. Ward, K. Hacioglu, J. Martin, D. Jurafsky. 2004. Shallow semantic parsing using support vector machines. In Proceedings of the HLT/NAACL, 141–144, Boston, MA.
D. Shen, D. Klakow. 2006. Exploring correlation of dependency relation paths for answer extraction. In Proceedings of the COLING/ACL, 889–896.
R. X. Sun, J. J. Jiang, Y. F. Tan, H. Cui, T. S. Chua, M. Y. Kan. 2005. Using syntactic and semantic relation analysis in question answering. In Proceedings of the TREC.
B. Taskar, S. Lacoste-Julien, D. Klein. 2005. A discriminative matching approach to word alignment. In Proceedings of the HLT/EMNLP, 73–80, Vancouver, BC.
M. Wu, M. Y. Duan, S. Shaikh, S. Small, T. Strzalkowski. 2005. University at albany’s ilqua in trec 2005. In Proceedings of the TREC, 77–83.
T. Akiba, A. Fujii, and K. Itou. Question answering using “common sense” and utility maximization principle. In Proceedings of The Fourth NTCIR Workshop, 2004.
J. Fukumoto, T. Kato, and F. Masui. Question answering challenge (QAC-1) question answering evaluation at NTCIR workshop 3. In Proceedings of The third NTCIR Workshop, 2003.
J. Fukumoto, T. Kato, and F. Masui. Question answering challenge for five ranked answers and list answers – overview of NTCIR 4 QAC2 subtask 1 and 2 –. In Proceedings of The Fourth NTCIR Workshop, 2004.
GETA. Generic engine for transposable association (GETA). http://geta.ex.nii.ac.jp.
T. Kato, J. Fukumoto, and F. Masui. An overview of NTCIR-5 QAC3. In Proceedings of The Fifth NTCIR Workshop, 2005. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/QAC/NTCIR5-OV-QAC-KatoT.pdf.
A. Tamura, H. Takamura, and M. Okumura. Classification of multiple-sentence questions. In Proceedings of the 2nd International joint Conference on Natural Language Processing, 2005.
WAHLSTER, Wolfgang (2004): SmartWeb: Mobile applications of the Semantic Web. In: Adam, Peter / Reichert, Manfred, editors. INFORMATIK 2004 – Informatik verbindet, Band 1. Beiträge der 24. Jahrestagung der Gesellschaft für Informatik e.V. (GI), Ulm. (Web: www.smartweb-project.de)
CRAMER, Irene / LEIDNER, Jochen L. / KLAKOW, Dietrich (2006, to appear): Building an Evaluation Corpus for German Question Answering by Harvesting Wikipedia. Proceedings of The 5th International Conference on Language Resources and Evaluation, Genoa, Italy.
LI, Xin / ROTH, Dan (2002): Learning Question Classifiers. COLING’02.
LI, Xin / HUANG, Xuan-Jing / WU, Li-de (2005): Question Classification using Multiple Classifiers. Proceedings of the 5th Workshop on Asian Language Resources and First Symposium on Asian Language Resources Network.
ZHANG, Dell / LEE, Wee Sun (2003): Question classification using support vector machines. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, Toronto, Canada.
DAY, Min-Yuh / LEE, Cheng-Wei / WU, Shih-Hung / ONG, Chorng-Shyong / HSU, Wen-Lian (2005): An Integrated Knowledge-based and Machine Learning Approach for Chinese Question Classification. IEEE International Conference on Natural Language Processing and Knowledge Engineering.
SONNTAG, Daniel / ROMANELLI, Massimo (2006, to appear): A Multimodal Result Ontology for Integrated Semantic Web Dialogue Applications. Proceedings of the 5th international conference on Language Resources and Evaluation, Genoa, Italy.
VOORHEES, Ellen (2001): Overview of the TREC 2001 Question Answering Track. Proceedings of the 10th Text Retrieval Conference, NIST, Gaithersburg, USA.
WITTEN, Ian H. / FRANK, Eibe (2000): Data Mining – Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers. (Web: http://www.cs.waikato.ac.nz/ml/weka)
DUDA, Richard O. / HART, Peter E. / STORK, G. (2000): Pattern Classification. New York: John Wiley and Sons.
MITCHELL, Tom M. (1997): Machine Learning. Boston, MA: McGraw Hill.
Abney, S. P. 1991. Parsing by chunks. In S. P. Abney R. C. Berwick and C. Tenny, editors, Principle-based parsing: Computation and Psycholinguistics. Kluwer, Dordrecht, pages 257–278.
Carlson, A., C. Cumby, J. Rosen, and D. Roth. 1999. The SNoW learning architecture. Technical Report UIUCDCS-R-99-2101, UIUC Computer Science Department, May.
Even-Zohar, Y. and D. Roth. 2001. A sequential model for multi-class classification. In Proceedings of EMNLP-2001, the SIGDAT Conference on Empirical Methods in Natural Language Processing, pages 10–19.
Fellbaum, C., editor. 1998. WordNet: An Electronic Lexical Database. The MIT Press, May.
Hacioglu, K. and W. Ward. 2003. Question classification with support vector machines and error correcting codes. In Proceedings of HLT-NAACL.
Harabagiu, S., D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, and P. Morarescu. 2000. Falcon: Boosting knowledge for answer engines. In E. Voorhees, editor, Proceedings of the 9th Text Retrieval Conference, NIST, pages 479–488.
Hermjakob, U. 2001. Parsing and question classification for question answering. In ACL-2001 Workshop on Open-Domain Question Answering.
Hirschman, L., M. Light, E. Breck, and J. Burger. 1999. Deep read: A reading comprehension system. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 325–332.
Hovy, E., L. Gerber, U. Hermjakob, C. Lin, and D. Ravichandran. 2001. Toward semantics-based answer pinpointing. In Proceedings of the DARPA HLT conference.
Lee, L. 1999. Measures of distributional similarity. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 25–32.
Lehnert, W. G. 1986. A conceptual theory of question answering. In B. J. Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language Processing. Kaufmann, Los Altos, CA, pages 651–657.
Li, X. and D. Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), pages 556–562.
Li, X., K. Small, and D. Roth. 2004. The role of semantic information in learning question classifiers. In Proceedings of the First Joint International Conference on Natural Language Processing.
Littlestone, N. 1989. Mistake bounds and logarithmic linear-threshold learning algorithms. Ph.D. thesis, U. C. Santa Cruz, March.
Moldovan, D., M. Pasca, S. Harabagiu, and M. Surdeanu. 2002. Performance issues and error analysis in an open-domain question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 33–40.
Pantel, P. and D. Lin. 2002. Discovering word senses from text. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Pinto, D., M. Branstein, R. Coleman, M. King, W. Li, X. Wei, and W.B. Croft. 2002. Quasm: A system for question answering using semi-structured data. In Proceedings of the Joint Conference on Digital Libraries.
Punyakanok, V. and D. Roth. 2001. The use of classifiers in sequential inference. In Proceedings of the 13th Conference on Advances in Neural Information Processing Systems, pages 995–1001. MIT Press.
Radev, D. R., W. Fan, H. Qi, H. Wu, and A. Grewal. 2002. Probabilistic question answering from the web. In Proceedings of WWW-02, 11th International Conference on the World Wide Web.
Roth, D. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proceedings of 15th National Conference on Artificial Intelligence (AAAI).
Roth, D., C. Cumby, X. Li, P. Morie, R. Nagarajan, N. Rizzolo, K. Small, and W. Yih. 2002. Question answering via enhanced understanding of questions. In Proceedings of the 11th Text Retrieval Conference, NIST, pages 592–601.
Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.
Singhal, A., S. Abney, M. Bacchiani, M. Collins, D. Hindle, and F. Pereira. 2000. AT&T at TREC-8. In E. Voorhees, editor, Proceedings of the 8th Text Retrieval Conference, NIST.
Voorhees, E. 2002. Overview of the TREC-2002 question answering track. In Proceedings of the 11th Text Retrieval Conference, NIST, pages 115–123.
Zhang, D. and W. Lee. 2003. Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR conference, pages 26–32.
S. Harabagiu, D. Moldovan, M. Pasca, M. Surdeanu, R. Mihalcea, R. Girju, V. Rus, F. Lacatusu, P. Morarescu, and R. Bunescu. 2001. Answering Complex, List and Context Questions with LCC’s Question-Answering Server. In Proceedings of the Tenth Text REtrieval Conference.
S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, A. Hickl, and P. Wang. 2005. Employing Two Question Answering Systems in TREC 2005. In Proceedings of the Fourteenth Text REtrieval Conference.
Andrew Hickl, Patrick Wang, John Lehmann, and Sanda Harabagiu. 2006a. Ferret: Interactive Question-Answering for Real-World Research Environments. In Proceedings of the 2006 COLING-ACL Interactive Presentations Session.
Andrew Hickl, John Williams, Jeremy Bensley, Kirk Roberts, Ying Shi, and Bryan Rink. 2006b. Question Answering with LCC’s Chaucer at TREC 2006. In Proceedings of the Fifteenth Text REtrieval Conference.
Andrew Hickl, Kirk Roberts, Bryan Rink, Jeremy Bensley, Tobias Jungen, Ying Shi, and John Williams. 2007. Question Answering with LCC’s Chaucer-2 at TREC 2007. In Proceedings of the Sixteenth Text REtrieval Conference.
V. Krishnan, S. Das, and S. Chakrabarti. 2005. Enhanced answer type inference from questions using sequential models. In Proceedings of EMNLP.
X. Li and D. Roth. 2002. Learning question classifiers. In Proc. the International Conference on Computational Linguistics (COLING).
George A. Miller. 1995. WordNet: a lexical database for English. Communications of the Association for Computing Machinery, 38(11):39–41.
Christopher Pinchak and Dekang Lin. 2006. A Probabilistic Answer Type Model. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 393–400.
Sanda Harabagiu, Dan Moldovan, Marius Pasca, Rada Mihalcea, Mihai Surdeanu, Razvan Bunescu, Roxana Girju, Vasile Rus, and Paul Morarescu. 2000. FALCON: Boosting knowledge for answer engines. In Proceedings of the 9th Text REtrieval Conference.