Relatedness/Similarity of words and text using Google n-grams (2012-)

Aminul Islam
Evangelos Milios
Vlado Keselj

Word relatedness and text relatedness have many applications in natural language processing (NLP) and in related areas. Corpus-based word/text relatedness has advantages over knowledge-based supervised measures. Many corpus-based word relatedness measures in the literature cannot be compared with one another because they were computed over different corpora. We show how to evaluate different corpus-based measures of word relatedness by calculating them over a common corpus and then assessing their performance against gold-standard relatedness datasets. We also propose a word relatedness approach and a text relatedness approach based on tri-gram statistics.
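As a rough illustration of the evaluation idea described above, the sketch below scores word pairs with one corpus-based measure and compares the scores against human judgments. Everything in it is an assumption made for illustration: the counts and gold scores are made-up toy values, not Google n-gram statistics or a real dataset; pointwise mutual information (PMI) stands in for whichever measure is being evaluated and is not the tri-gram measure proposed in the papers below; and Spearman rank correlation is one common way to compare against a gold standard, not necessarily the one used in the publications.

```python
from math import log2

# Toy, made-up counts standing in for statistics from one shared corpus.
UNIGRAM = {"car": 1000, "automobile": 200, "banana": 500,
           "fruit": 800, "journey": 300}
COOCCUR = {
    frozenset(("car", "automobile")): 80,
    frozenset(("banana", "fruit")): 120,
    frozenset(("car", "journey")): 30,
    frozenset(("car", "banana")): 2,
}
TOTAL = 100_000  # hypothetical corpus size

# Made-up human relatedness judgments (word1, word2, score).
GOLD = [
    ("car", "automobile", 3.9),
    ("banana", "fruit", 3.2),
    ("car", "journey", 1.5),
    ("car", "banana", 0.2),
]


def pmi(w1, w2):
    """Pointwise mutual information; 0.0 if the pair never co-occurs."""
    joint = COOCCUR.get(frozenset((w1, w2)), 0)
    if joint == 0:
        return 0.0
    return log2(joint * TOTAL / (UNIGRAM[w1] * UNIGRAM[w2]))


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)


def evaluate(measure, gold):
    """Spearman correlation between a measure's scores and gold scores."""
    predicted = [measure(a, b) for a, b, _ in gold]
    human = [score for _, _, score in gold]
    return pearson(ranks(predicted), ranks(human))


if __name__ == "__main__":
    print(f"Spearman correlation for PMI: {evaluate(pmi, GOLD):.3f}")
```

Because every measure passed to `evaluate` reads the same count tables, the resulting correlations are directly comparable, which is the point of fixing a common corpus.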

Publications

Jessica Perrie, Aminul Islam, Evangelos Milios, “How Document Properties Affect Document Relatedness Measures”, in Proceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2014, Part II, LNCS 8404, Springer, pp. 392-403, Kathmandu, Nepal, April 2014. [available at Springer Link] [presentation slides]
Aminul Islam, Evangelos Milios, Vlado Keselj, “Comparing Word Relatedness Measures Based on Google n-grams”, in Proceedings of the 24th International Conference on Computational Linguistics, COLING 2012, pp. 495-506, Mumbai, India, December 2012. [available at aclweb] [poster]
Aminul Islam, Evangelos Milios, Vlado Keselj, “Text Similarity using Google Tri-grams”, in L. Kosseim and D. Inkpen (Eds.): Proceedings of the 25th Canadian Conference on Artificial Intelligence, Canadian AI 2012, LNAI 7310, Springer, pp. 312-317, Toronto, Canada, May 2012. [available at Springer Link] [presentation slides]

Submitted for Publication

Password-protected draft