The matching of brief sapiential statements is a complex task that ultimately falls to the expertise of paremiologists.
The tool we propose here should be regarded as a suggestion engine for specialists. Without interpretation and validation of its suggestions by the experts, the results have no value.
Distributional Semantics Hypothesis
The techniques used here are based on the distributional semantics hypothesis, which can be summarized by the following general principle:
"A word is characterized by the relationships it keeps with other words" [Firth 1957]
In other words, we consider that two words have close meanings if they are used in the same contexts.
Semantic Vector Space
A semantic vector space makes it possible to represent words in the form of vectors, thus enabling their comparison by using simple vector operations.
The basis of a semantic vector space is typically a co-occurrence matrix:
- One row for each distinct term of the training corpus.
- One column for each concept (for instance, each paragraph or article of the corpus).
- The value of a cell is the weighted frequency of the term in the concept.
Thus, two semantically close terms will have close vectors in the space of concepts. In other words, the smaller the distance between two term vectors, the more semantically close the associated terms are considered to be.
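To make the construction concrete, here is a minimal sketch in Python. The corpus and terms are purely illustrative, raw counts stand in for the weighted frequency, and cosine similarity serves as the closeness measure:

```python
import numpy as np

# Toy training corpus: each "concept" is one paragraph/article (here, one string).
concepts = [
    "the early bird catches the worm",
    "the bird sings in the morning",
    "a worm turns in the soil",
]

# Vocabulary: one row per distinct term of the corpus.
vocab = sorted({w for c in concepts for w in c.split()})
row = {w: i for i, w in enumerate(vocab)}

# Term-concept matrix: one column per concept; raw counts for simplicity
# (the system described above uses a weighted frequency instead).
M = np.zeros((len(vocab), len(concepts)))
for j, c in enumerate(concepts):
    for w in c.split():
        M[row[w], j] += 1

def cosine(u, v):
    """Cosine similarity: close to 1.0 for terms used in the same concepts."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "bird" and "worm" share concept 0, so their vectors are relatively close.
print(cosine(M[row["bird"]], M[row["worm"]]))
```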
In this system, a sentence can be seen as the weighted sum of the vectors of the terms that compose it.
The inter-sentence/phrase similarity is then calculated from the distance between the two vectors representing each sentence/phrase.
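A minimal sketch of this representation, assuming the term vectors are available as a dictionary mapping each term to its vector; the per-term weights (e.g. tf-idf) are an assumption, defaulting to 1.0:

```python
import numpy as np

def sentence_vector(sentence, term_vectors, weights=None):
    """A sentence/phrase vector is the weighted sum of its term vectors."""
    weights = weights or {}
    terms = [w for w in sentence.split() if w in term_vectors]
    return sum(weights.get(w, 1.0) * term_vectors[w] for w in terms)

def sentence_similarity(s1, s2, term_vectors):
    """Inter-sentence similarity from the cosine of the two sentence vectors."""
    u = sentence_vector(s1, term_vectors)
    v = sentence_vector(s2, term_vectors)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```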
The statements have been annotated in three languages. A model is therefore trained for each language (French, English, and Spanish).
Multilingual Vector Models for Aliento
Two different models are trained; their results are combined into a single score when the similarities are calculated:
WikiRI [Hai Hieu Vu] is an implementation of this technique that exploits the intrinsic organization of Wikipedia. The concepts are first represented by low-dimensional random vectors; the representative vector of each word is then computed by summing the vectors of the concepts with which it is associated. In our research, we use a version of Random Indexing (RI) proposed by Niladri Chatterjee and a weighted variant of Random Indexing that uses Wikipedia as a linguistic resource. The role of WikiRI is to express, for each term, the context at the document level.
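The sketch below illustrates the general Random Indexing idea just described, not the exact WikiRI implementation: each concept receives a sparse random index vector of low dimension, and a term's vector accumulates the index vectors of the concepts it occurs in. The dimension, sparsity, and tokenization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, NONZERO = 300, 10  # low dimension, a few nonzero entries per index vector

def random_index_vector():
    """Sparse ternary random vector: a handful of +1/-1 at random positions."""
    v = np.zeros(DIM)
    positions = rng.choice(DIM, size=NONZERO, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

def train_random_indexing(concepts):
    """Sum, for each term, the index vectors of the concepts containing it.
    WikiRI additionally weights each contribution; uniform weights are used here."""
    term_vectors = {}
    for concept in concepts:  # e.g. one Wikipedia article per concept
        index = random_index_vector()
        for term in set(concept.split()):
            term_vectors.setdefault(term, np.zeros(DIM))
            term_vectors[term] += index
    return term_vectors
```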
Word2Vec [Mikolov] is a predictive model that uses neural networks to learn vector representations of words from the training corpus. The induced vectors are dense and low-dimensional, and each dimension represents a latent feature of the word, intended to capture its syntactic and semantic properties. It is a simple and fast model, implemented by the word2vec tool recently introduced by Mikolov et al., which offers two predictive models based on single-layer neural networks: skip-gram and Continuous Bag of Words (CBOW). Given a window of n words around a word w, the skip-gram model predicts the neighboring words within that fixed window; the CBOW model, conversely, predicts the target word w given its neighbors in the window. The role of Word2Vec is to express, for each term, the context at the sentence level.
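Both architectures are available in off-the-shelf toolkits; the sketch below uses the gensim library (our illustration, not necessarily the tool used in the project), where the sg flag switches between skip-gram and CBOW. It assumes gensim ≥ 4.0:

```python
from gensim.models import Word2Vec

# Tokenized training corpus: one list of words per sentence.
sentences = [
    ["the", "early", "bird", "catches", "the", "worm"],
    ["the", "bird", "sings", "in", "the", "morning"],
]

# sg=1: skip-gram (predict the window from the word w);
# sg=0: CBOW (predict the word w from its window).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["bird"]                    # dense vector of a term
print(model.wv.similarity("bird", "worm"))   # cosine similarity of two terms
```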
Vector Representation of a Brief Sapiential Statement
A statement is represented by three vectors corresponding to its semantic annotations: literal sense, figurative sense, and lesson.
Each of these vectors has a representation for each of the two models and for each language.
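One possible, purely hypothetical way to organize these representations in code, with one set of three annotation vectors per (language, model) pair:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StatementVectors:
    """The three annotation vectors of one brief sapiential statement (BSS)."""
    literal: np.ndarray     # literal sense
    figurative: np.ndarray  # figurative sense
    lesson: np.ndarray      # lesson

def placeholder(dim=300):
    """Placeholder vectors; in practice they come from the trained models."""
    return StatementVectors(np.zeros(dim), np.zeros(dim), np.zeros(dim))

# statement[language][model] -> the three vectors for that pair.
statement = {
    lang: {model: placeholder() for model in ("wikiri", "word2vec")}
    for lang in ("fr", "en", "es")
}
```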
Similarity Calculation
The inter-statement score is an absolute measure of similarity between two statements. This score is calculated as the weighted sum of the similarity scores between the vectors of the same annotation type.
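A sketch of this weighted sum, with hypothetical per-type weights (the actual weights are a parameter of the system):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical weights for the three annotation types.
TYPE_WEIGHTS = {"literal": 0.4, "figurative": 0.3, "lesson": 0.3}

def statement_score(a, b):
    """Weighted sum of similarities between vectors of the same annotation type.
    `a` and `b` map an annotation type to its vector for one model and language."""
    return sum(w * cosine(a[t], b[t]) for t, w in TYPE_WEIGHTS.items())
```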
The system also makes it possible to compare a free sentence/phrase with the different components of the sapiential statements. The score of a comparison between a free sentence/phrase and a statement (BSS) is obtained by comparing the vector of the sentence/phrase with each of the three vectors of the statement.
It is also possible to choose which annotations the comparison is made against. For example, we might want to compare an input sentence/phrase with the literal-sense annotations of the BSS only.
This calculation is performed with both models (WikiRI and Word2Vec), and a final score is obtained by combining their scores with the formula p * WikiRI + (1 - p) * Word2Vec. We used p = 0.7 for the experiments on our annotated datasets.
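Expressed in code, the combination is a simple linear interpolation of the two model scores:

```python
def combined_score(wikiri_score, word2vec_score, p=0.7):
    """Final score: p * WikiRI + (1 - p) * Word2Vec, with p = 0.7
    as used in the experiments on our annotated datasets."""
    return p * wikiri_score + (1 - p) * word2vec_score
```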