SemMap – The semantic map for the German language

During my long absence from code is poetry labs, I worked on my master's thesis. Its topic was Question Answering using Unstructured Data, and it set out to solve the QA task for the German language. The thesis produced a lot of byproducts and ideas for continuing the work on this topic. All of these byproducts will be released over the next month on this blog, and as software projects on the cip-labs GitHub account.

What is SemMap

This post deals with the first semantic database (SEMantic MAP) that allows you to run calculations on semantic frames for the German language. A semantic frame is a cluster of words that share the same semantic meaning. The semantic map is stored as JSON, with each frame name mapping to the list of its words. An example:

"Commerce_collect": [
   "kassieren",
   "bitten",
   "verlangen",
   "berechnen",
   "nehmen",
   "essen",
   "futtern",
   "einkassieren"
 ],
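To find the frames of a given word, the frame-to-words structure can be inverted into a word-to-frames lookup. The following is a minimal sketch, not part of the released project: the function name buildWordIndex and the tiny two-frame map (including the frame Example_frame) are made up for illustration.

```php
<?php
// Hypothetical miniature SemMap; the real file holds many more frames.
$semmap = [
    'Commerce_collect' => ['kassieren', 'verlangen', 'berechnen', 'nehmen'],
    'Example_frame'    => ['geben', 'nehmen'], // invented frame, illustration only
];

// Invert the frame => words map into a word => frames lookup table.
function buildWordIndex(array $semmap): array
{
    $index = [];
    foreach ($semmap as $frame => $words) {
        foreach ($words as $word) {
            $index[$word][] = $frame;
        }
    }
    return $index;
}

$wordIndex = buildWordIndex($semmap);

// A word may belong to several frames; unknown words yield an empty list.
print_r($wordIndex['nehmen'] ?? []);
print_r($wordIndex['unbekannt'] ?? []);
```

Building the index once keeps each later word lookup cheap, which matters when a whole text is scanned word by word.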

SemMap also ships with precalculated correlations that allow you to run calculations on semantics. The representation can be seen as a matrix that stores the correlation of each frame to every other frame:

"Commerce": {
  "Memory": 0.00024714220074908,
  "Choosing": 0.0048732264936439,
  "Kinship": 0.016520237813453,
  "Assessing": 0.00043510950836106,
  "Deserving": 0.00019492905974576,
  "Destroying": 0.00018796730761198,
  "Part_ordered_segments": 0.016509795185252,
  "Part_orientational": 0.008117402987984,
  "Attention_getting": 0.0039403517077178,
  ...
  "Direction": 0.018431238774175
},

With these correlations you have a cluster representation of each frame. Before SemMap, each frame found in a text was just a count that could be accumulated, nothing more. Now each found frame has a vector representation with links to all other frames.
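Once every frame is a vector, two frames can be compared numerically. The sketch below uses cosine similarity, which is one common choice and an assumption here; the post does not prescribe a specific measure. The correlation values and the function name cosineSimilarity are invented for illustration.

```php
<?php
// Hypothetical excerpt of a correlation file (values invented for illustration).
$correlations = [
    'Commerce' => ['Kinship' => 0.02, 'Direction' => 0.01],
    'Choosing' => ['Kinship' => 0.04, 'Direction' => 0.02],
];

// Cosine similarity of two sparse vectors given as frame name => value.
// Frames missing from one vector are treated as 0.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    $keys = array_unique(array_merge(array_keys($a), array_keys($b)));
    foreach ($keys as $key) {
        $x = $a[$key] ?? 0.0;
        $y = $b[$key] ?? 0.0;
        $dot += $x * $y;
        $normA += $x * $x;
        $normB += $y * $y;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // an all-zero vector has no direction to compare
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

$similarity = cosineSimilarity($correlations['Commerce'], $correlations['Choosing']);
echo $similarity, "\n";
```

In this toy data the two rows point in the same direction, so the similarity comes out close to 1; rows that share no frames give 0.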

Data sources

The data sources of SemMap are legacy projects from a few universities. These projects can be seen as the German equivalent of FrameNet. Their frames were extracted and cleaned, and some of them were used in SemMap. Besides this, SemMap was extended with a lot of open source dictionaries.

Correlation of frames

The frames and the words they contain are just the beginning of SemMap. The heart of the project is the correlation calculation. Each word (and therefore each frame) has correlations to other semantic frames. To calculate these correlations, SemMap was trained on a narrowed-down document set from the German Wikipedia (1.3 million documents).

On these documents two correlations are calculated:

  • Weak Correlation: With the pivot on word index x, another frame occurs in the range of x-20 up to x+20
  • Strong Correlation: With the pivot on word index x, another frame occurs in the range of x-5 up to x+5
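The two window sizes can be illustrated with a small counting routine. This is a sketch under assumptions, not the code from train.php: the function name cooccurrenceProbabilities is hypothetical, the input is assumed to be a list with one entry per word position (the frame found there, or null), the pivot's own frame is skipped, and counts are divided by the number of pivot occurrences to get a probability.

```php
<?php
// $frames holds, per word position, the frame found there or null if none.
function cooccurrenceProbabilities(array $frames, int $window): array
{
    $frames = array_values($frames);
    $n = count($frames);
    $counts = []; // $counts[A][B]: pivots of frame A with frame B in their window
    $totals = []; // $totals[A]: number of pivots of frame A

    for ($x = 0; $x < $n; $x++) {
        $pivot = $frames[$x];
        if ($pivot === null) {
            continue;
        }
        $totals[$pivot] = ($totals[$pivot] ?? 0) + 1;

        // Collect each other frame once per pivot, within x-window .. x+window.
        $seen = [];
        for ($i = max(0, $x - $window); $i <= min($n - 1, $x + $window); $i++) {
            if ($frames[$i] !== null && $frames[$i] !== $pivot) {
                $seen[$frames[$i]] = true;
            }
        }
        foreach (array_keys($seen) as $other) {
            $counts[$pivot][$other] = ($counts[$pivot][$other] ?? 0) + 1;
        }
    }

    // Normalise: share of pivots of frame A that had frame B nearby.
    $probabilities = [];
    foreach ($counts as $pivot => $others) {
        foreach ($others as $other => $count) {
            $probabilities[$pivot][$other] = $count / $totals[$pivot];
        }
    }
    return $probabilities;
}

// Toy document with placeholder frame names A, B and C.
$frames = ['A', null, 'B', null, null, null, null, null, 'C', 'A'];
$strong = cooccurrenceProbabilities($frames, 5);  // x-5 .. x+5
$weak   = cooccurrenceProbabilities($frames, 20); // x-20 .. x+20
print_r($strong);
print_r($weak);
```

In the toy document, B and C are six positions apart, so they are linked only in the weak result and not in the strong one.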

The result of these calculations can be seen as a matrix holding the correlation probability from frame A to frame B, that is, the probability that frame B co-occurs given that frame A has already occurred. This representation solves the following problems:

  • The occurrence of a frame was nothing more than an accumulated number (now it is a vector representation in a semantic space)
  • You could not apply mathematical operations to sentences and words (now you can add, subtract, multiply and average textual components)
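As an example of such arithmetic, the vectors of all frames found in a sentence can be averaged into a single vector for that sentence. The following is a minimal sketch: the helper averageVectors and the correlation values are assumptions made for illustration.

```php
<?php
// Hypothetical correlation rows (values invented for illustration).
$correlations = [
    'Commerce' => ['Kinship' => 0.2, 'Direction' => 0.4],
    'Kinship'  => ['Commerce' => 0.4, 'Direction' => 0.2],
];

// Average several frame vectors (frame name => value) into one vector.
// A frame missing from a vector contributes 0 to that component.
function averageVectors(array $vectors): array
{
    $n = count($vectors);
    if ($n === 0) {
        return [];
    }
    $sum = [];
    foreach ($vectors as $vector) {
        foreach ($vector as $frame => $value) {
            $sum[$frame] = ($sum[$frame] ?? 0.0) + $value;
        }
    }
    $average = [];
    foreach ($sum as $frame => $value) {
        $average[$frame] = $value / $n;
    }
    return $average;
}

// Suppose the frames Commerce and Kinship were found in a sentence.
$sentenceVector = averageVectors([
    $correlations['Commerce'],
    $correlations['Kinship'],
]);
print_r($sentenceVector);
```

Adding or subtracting vectors works the same way, component by component over the frame names.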

Possible use cases

SemMap has a wide range of possible use cases. The obvious ones are listed below:

  • The SemMap can be used to find semantic frames of words in the text.
  • Correlation maps can be used to build a vector representation and therefore a feature space for text classification
  • Self-trained correlation maps can be used to cluster your input before running a classification on it
  • Correlation maps are useful to tackle intention analysis and deep semantic understanding

The presented use cases are just the obvious ones.

How to use it

This section explains how to use the map and all corresponding files.

Use SemMap and correlation files

# get content from semmap file into $content
$semmap = json_decode($content, true); # true => associative arrays instead of objects
print_r($semmap['Commerce_collect']); # prints the members of frame Commerce_collect

Working with correlation files:

# get content from correlation file into $content
$correlations = json_decode($content, true); # true => associative arrays instead of objects

# print all correlations for frame Commerce_collect
print_r($correlations['Commerce_collect']);

# print the correlation that frame 'Building' occurs near 'Commerce'
echo $correlations['Commerce']['Building'];

 

Train new correlation files

First, adjust the constants at the top of the PHP file train.php; their names are self-explanatory. You also need training documents. Let's assume your training documents are in the directory train_docs/. Then run the command:

php train.php train_docs/

Download & Get Involved

You can download the current project from GitHub. If you want to contribute code, ideas or content to the project, please contact us. Visit the NLPG project page on code is poetry laboratories.