Non-uniform Language Detection

Technical documents are typically the joint effort of multiple authors. Different parts of the content may therefore differ noticeably in writing style or terminology, which is inconvenient and confusing for end users. This work aims to develop a system that automatically detects and corrects non-uniform language in technical content.

The first stage of this work is a non-uniform sentence filter based on the syntax, structure, and semantic meaning of the text at the sentence level. It uses cosine similarity, longest common subsequence, and the Google Tri-gram Method. This stage is implemented as a web application. It can be found here.
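As a rough illustration of two of the measures named above, the following sketch computes a bag-of-words cosine similarity and a normalized longest-common-subsequence score for a pair of sentences. The function names, the regular-expression tokenizer, and the normalization by the longer sentence are assumptions made for this example and are not taken from the project itself. The Google Tri-gram Method is omitted because it depends on an external n-gram corpus.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase and split into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words count vectors of two sentences."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def lcs_length(xs, ys):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(ys) + 1)
    for x in xs:
        curr = [0]
        for j, y in enumerate(ys, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length normalized by the longer sentence, giving a score in [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    longest = max(len(ta), len(tb))
    return lcs_length(ta, tb) / longest if longest else 0.0
```

A filter along these lines could flag sentence pairs whose scores are high but below 1, since those are near-duplicates that differ in wording. Any such threshold would need to be tuned on real data.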

The next stage of the work is an SVM regression model based on the following features:

  • Syntactic Analysis (Character n-gram).
  • Part of Speech Tagging of variations in text.
  • Semantic analysis using Google Tri-gram Method and Google 5-gram matching of context.
  • Flicker Based Concepts.
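To make the first feature in the list more concrete, here is a minimal sketch of a character n-gram profile and a simple distance between two profiles. The helper names, the default of n = 3, and the absolute-difference distance are all assumptions chosen for illustration. The project's actual feature extraction and the way the features feed the SVM regression model are not described in the text.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the text, lowercased."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text, n=3):
    """Relative frequency of each character n-gram in the text."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def profile_distance(p, q):
    """Sum of absolute frequency differences (0 = identical, 2 = disjoint)."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in p.keys() | q.keys())
```

Comparing the profile of one section against another gives a single number that could serve as one input feature to a regression model.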