Context-aware correction of spelling errors in Hungarian medical documents

Siklósi, Borbála [Novák, Borbála (Language technology), author] Interdiszciplináris Műszaki Tudományok Doktori ... (PPKE / ITK); Novák, Attila [Novák, Attila (Language technology), author] MTA-PPKE Magyar Nyelvtechnológiai Kutatócsoport (PPKE / ITK); Prószéky, Gábor [Prószéky, Gábor (Számítógépes nyel...), author] MTA-PPKE Magyar Nyelvtechnológiai Kutatócsoport (PPKE / ITK)

English-language scientific article (journal article)
  • Linguistics Committee (Nyelvtudományi Bizottság): INT2
  • SJR Scopus - Human-Computer Interaction: Q2

Abstract: Owing to the growing need to acquire medical data from clinical records, processing such documents is an important topic in natural language processing (NLP). However, general NLP methods require proper, normalized input; otherwise, the system is overwhelmed by the unusually high level of noise characteristic of this kind of text. The noise originates from non-standard language use: short fragments instead of proper sentences, Latin words, many acronyms, and very frequent misspellings. In this paper, a method is described for the automated correction of spelling errors in Hungarian clinical records. First, a word-based algorithm was implemented to generate a ranked list of correction candidates for word forms regarded as incorrect. Second, the problem of spelling correction was modelled as a translation task, where the source language is the erroneous text and the target language is the corrected one. A Statistical Machine Translation (SMT) decoder performed the task of error correction. Since no orthographically correct, proofread text from this domain is available, no such corpus could be used to train the system. Instead, the word-based system was used to create translation models. In addition, a 3-gram token-based language model was used to model lexical context. Because of the high number of abbreviations and acronyms in the texts, the behaviour of these abbreviated forms was further examined in both the context-unaware word-based implementation and the SMT-decoder-based implementation. The results show that the SMT-based method achieves higher first-candidate accuracy than the word-based ranking system. However, the normalization of abbreviations should be handled as a separate task.
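
The two-stage idea in the abstract (context-unaware candidate ranking, then a choice informed by a 3-gram context) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how candidates are scored, and the paper uses an SMT decoder rather than the simple trigram-count lookup shown here. The function names, the edit-distance-plus-frequency ranking, the `max_distance` threshold, and the toy lexicon and trigram counts are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ranked_candidates(word, lexicon_counts, max_distance=2):
    """Context-unaware ranking (assumed scoring): closer candidates first,
    more frequent candidates first among equally close ones."""
    scored = []
    for cand, freq in lexicon_counts.items():
        dist = edit_distance(word, cand)
        if dist <= max_distance:
            scored.append((dist, -freq, cand))
    return [cand for _, _, cand in sorted(scored)]


def pick_in_context(left_context, word, lexicon_counts, trigram_counts):
    """Choose the candidate whose trigram with the two preceding tokens is
    most frequent. A plain count lookup stands in for the SMT decoder and
    language model; ties fall back to the word-based ranking order."""
    candidates = ranked_candidates(word, lexicon_counts)
    if not candidates:
        return word  # nothing close enough: leave the token unchanged
    context = tuple(left_context[-2:])
    # max() returns the first maximal item, so the word-based order breaks ties.
    return max(candidates, key=lambda c: trigram_counts.get(context + (c,), 0))


# Toy data (hypothetical counts, not taken from the paper's corpus).
lexicon = {"beteg": 50, "betét": 10, "beteget": 5}
trigrams = {("a", "lázas", "beteg"): 3, ("banki", "lekötött", "betét"): 4}

word_based = ranked_candidates("betg", lexicon)
in_context = pick_in_context(["banki", "lekötött"], "betg", lexicon, trigrams)
```

In this toy setup the word-based ranking alone always prefers the closest, most frequent form, while the context-aware step can promote a lower-ranked candidate when the surrounding tokens support it, which is the kind of difference the first-candidate accuracy comparison in the abstract measures.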
