The LX-WordSim-353 data set was created from WordSim-353 (Agirre et al., 2009). As the name suggests, it contains 353 pairs of words. The two words in a pair may belong to different morphosyntactic categories. The data set is made up of nouns, adjectives, verbs and named entities, and contains no multiword expressions.
In the original data set (Finkelstein et al., 2002), each pair of words received a human judgement on a scale from 0 (totally unrelated words) to 10 (very much related or identical words).
Agirre et al. (2009) observed that the numeric annotation did not distinguish between similar and related pairs. To determine the actual relation between the words of each pair, they took a different approach to annotating the data set: the annotators classified every pair as synonyms, antonyms, identical, hyperonym-hyponym, sibling terms (terms with a common hyperonym), meronym-holonym or none-of-the-above. With this annotation, they could separate the pairs whose words are similar from the pairs whose words are related. Pairs categorized as synonyms, antonyms, identical or hyperonym-hyponym were taken to hold a relation of similarity between the two words. Pairs categorized as sibling terms, meronym-holonym or none-of-the-above, and that had an average human similarity score greater than 5, were taken to hold a relation of relatedness between the two words.
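The partition rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the label strings, the function name `subset_for` and the example rows are hypothetical, and the sketch assumes that the threshold of 5 applies to all three relatedness categories and that pairs meeting neither condition fall into no subset.

```python
from typing import Optional

# Relation labels as listed in the text; the exact string spellings are assumptions.
SIMILARITY_LABELS = {"synonym", "antonym", "identical", "hyperonym-hyponym"}
RELATEDNESS_LABELS = {"sibling", "meronym-holonym", "none-of-the-above"}


def subset_for(label: str, mean_score: float, threshold: float = 5.0) -> Optional[str]:
    """Return 'similarity', 'relatedness', or None for a single word pair."""
    if label in SIMILARITY_LABELS:
        return "similarity"
    # Assumption: the score threshold applies to every relatedness label.
    if label in RELATEDNESS_LABELS and mean_score > threshold:
        return "relatedness"
    return None


# Hypothetical rows (word 1, word 2, label, mean human score); not taken from the data set.
pairs = [
    ("w1", "w2", "synonym", 8.1),
    ("w3", "w4", "sibling", 6.4),
    ("w5", "w6", "none-of-the-above", 2.3),
]

for first, second, label, score in pairs:
    print(first, second, subset_for(label, score))
```

Under these assumptions, a pair labelled none-of-the-above with a low score belongs to neither subset.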
The LX-WordSim-353 is the outcome of a) the translation of WordSim-353 into Portuguese and b) the annotation of the translated list with the classification established by Agirre et al. (2009). The translation followed the same procedure as for the data sets in the sections above: two translators independently translated the same data, and a third expert adjudicated wherever their translations did not match.
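The two-translator procedure with adjudication can be pictured with the following minimal sketch. It is not the tooling used for the data set: the function `merge_translations`, the `adjudicate` callback standing in for the third expert, and the example words are all hypothetical.

```python
from typing import Callable, List


def merge_translations(
    first: List[str],
    second: List[str],
    adjudicate: Callable[[str, str], str],
) -> List[str]:
    """Keep translations on which both translators agree; defer mismatches to the adjudicator."""
    if len(first) != len(second):
        raise ValueError("both translators must cover the same items")
    merged = []
    for a, b in zip(first, second):
        # Agreement is checked by exact string match (an assumption of this sketch).
        merged.append(a if a == b else adjudicate(a, b))
    return merged


# Hypothetical example: the translators disagree only on the second item.
translator_1 = ["tigre", "carro"]
translator_2 = ["tigre", "automóvel"]

# Stand-in for the third expert, who here simply prefers the first proposal.
final = merge_translations(translator_1, translator_2, lambda a, b: a)
print(final)
```

In practice the adjudicator would be a person rather than a function; the callback only marks the point in the process where that decision is made.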