Relatedness/Similarity of words and text using Google n-grams (2012-)

Aminul Islam
Evangelos Milios
Vlado Keselj

Word relatedness and text relatedness have many applications in natural language processing (NLP) and related areas. Corpus-based measures of word and text relatedness have advantages over knowledge-based supervised measures. Many corpus-based word relatedness measures in the literature cannot be compared directly with one another because each uses a different corpus. We show how to evaluate different corpus-based measures of word relatedness by calculating them over a common corpus and then assessing their performance against gold-standard relatedness datasets. We also propose a word relatedness approach and a text relatedness approach based on tri-gram statistics.
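The evaluation idea described above (compute a measure over one shared corpus, then compare its scores with human judgments) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the counts in `UNIGRAM` and `COOCCUR`, the corpus size `TOTAL`, and the scores in `GOLD` are invented toy values, pointwise mutual information (PMI) stands in for a generic corpus-based measure and is not the tri-gram measure proposed in the papers below, and Spearman rank correlation is just one common way to compare against a gold standard.

```python
import math

# Hypothetical toy counts; real values would come from an n-gram corpus.
UNIGRAM = {"car": 1000, "automobile": 200, "banana": 500, "fruit": 800, "journey": 300}
# Hypothetical co-occurrence counts for unordered word pairs.
COOCCUR = {
    frozenset(("car", "automobile")): 120,
    frozenset(("banana", "fruit")): 150,
    frozenset(("car", "journey")): 30,
    frozenset(("car", "banana")): 2,
}
TOTAL = 1_000_000  # hypothetical corpus size used to turn counts into probabilities

# Hypothetical gold-standard human relatedness judgments for the same pairs.
GOLD = {
    ("car", "automobile"): 3.9,
    ("banana", "fruit"): 3.5,
    ("car", "journey"): 1.2,
    ("car", "banana"): 0.3,
}


def pmi(w1, w2):
    """Pointwise mutual information (base 2); 0.0 if the pair never co-occurs."""
    joint = COOCCUR.get(frozenset((w1, w2)), 0)
    if joint == 0:
        return 0.0
    return math.log2(joint * TOTAL / (UNIGRAM[w1] * UNIGRAM[w2]))


def ranks(values):
    """1-based ranks in ascending order, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


def evaluate(measure, gold):
    """Correlate a measure's scores with gold-standard scores over the same pairs."""
    pairs = list(gold)
    predicted = [measure(a, b) for a, b in pairs]
    human = [gold[p] for p in pairs]
    return spearman(predicted, human)


if __name__ == "__main__":
    print(evaluate(pmi, GOLD))
```

Because every measure passed to `evaluate` reads the same counts and is scored against the same gold pairs, the resulting correlations are directly comparable across measures, which is the point of using a common corpus.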


    1. Jessica Perrie, Aminul Islam, Evangelos Milios, “How Document Properties Affect Document Relatedness Measures”, in Proceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2014, Part II, LNCS 8404, Springer, pp. 392-403, Kathmandu, Nepal, April 2014. [available at Springer Link] [presentation slides]
    2. Aminul Islam, Evangelos Milios, Vlado Keselj, “Comparing Word Relatedness Measures Based on Google n-grams”, in Proceedings of the 24th International Conference on Computational Linguistics, COLING 2012, pp. 495-506, Mumbai, India, December 2012. [available at aclweb] [poster]
    3. Aminul Islam, Evangelos Milios, Vlado Keselj, “Text Similarity using Google Tri-grams”, in L. Kosseim and D. Inkpen (Eds.): Proceedings of the 25th Canadian Conference on Artificial Intelligence, Canadian AI 2012, LNAI 7310, Springer, pp. 312-317, Toronto, Canada, May 2012. [available at Springer Link] [presentation slides]

Submitted for Publication
Password-protected draft