Abstract Owing to the growing need to acquire medical data from clinical records,
processing such documents is an important topic in natural language processing (NLP).
However, for general NLP methods to work, properly normalized input is required.
Otherwise, the system is overwhelmed by the unusually high amount of noise
characteristic of this kind of text. The various types of noise originate from
non-standard language use: short fragments instead of proper sentences, the use of Latin
words, many acronyms, and very frequent misspellings. In this paper, a method is described
for the automated correction of spelling errors in Hungarian clinical records. First,
a word-based algorithm was implemented to generate a ranked list of correction candidates
for word forms regarded as incorrect. Second, the problem of spelling correction was
modelled as a translation task, where the source language is the erroneous text and
the target language is the corrected one. A Statistical Machine Translation (SMT)
decoder performed the task of error correction. Since no orthographically correct,
proofread text from this domain is available, we could not use such a corpus for training
the system. Instead, the word-based system was used to create the translation models.
In addition, a 3-gram token-based language model was used to model lexical context.
Due to the high number of abbreviations and acronyms in the texts, the behaviour of
these abbreviated forms was further examined in both the context-unaware word-based
implementation and the SMT-decoder-based one. The results show that the SMT-based
method outperforms the word-based ranking system in terms of first-candidate accuracy.
However, the normalization of abbreviations should be handled as a separate task.
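
The abstract describes two components: a context-unaware, word-based candidate ranker and a context-aware correction step backed by a 3-gram language model. As a rough, self-contained illustration only (not the paper's implementation), the Python sketch below pairs a Levenshtein-based candidate ranker with a toy add-one-smoothed trigram score for context-aware reranking; the lexicon, the counts, and all names such as `candidates` and `correct_in_context` are invented for the example.

```python
from math import log

# Toy stand-ins: in the paper these would be a Hungarian medical lexicon
# and a 3-gram token-based language model trained on clinical text.
# All entries and counts here are invented for illustration.
LEXICON = {"vérnyomás": 120, "vizsgálat": 90, "terápia": 60, "normál": 40}
TRIGRAMS = {("a", "vérnyomás", "normál"): 5}   # (w1, w2, w3) -> count
BIGRAMS = {("a", "vérnyomás"): 8}              # (w1, w2) -> count

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def candidates(word: str, max_dist: int = 2) -> list[str]:
    """Word-based step: a ranked list of in-lexicon correction candidates.

    Lower edit distance ranks first; corpus frequency breaks ties. This is
    one plausible stand-in for the abstract's ranked candidate list."""
    scored = []
    for w, freq in LEXICON.items():
        d = edit_distance(word, w)
        if d <= max_dist:
            scored.append((d, -freq, w))
    return [w for _, _, w in sorted(scored)]

def trigram_logprob(w1: str, w2: str, w3: str) -> float:
    """Toy add-one-smoothed 3-gram score over the tiny lexicon."""
    vocab = len(LEXICON) + 3  # crude vocabulary-size estimate
    return log((TRIGRAMS.get((w1, w2, w3), 0) + 1)
               / (BIGRAMS.get((w1, w2), 0) + vocab))

def correct_in_context(prev2: str, prev1: str, word: str) -> str:
    """Context-aware step: rerank word-based candidates by how well they
    fit the two preceding tokens, loosely analogous to an SMT decoder
    combining a translation model with a language model."""
    cands = candidates(word) or [word]  # fall back to the original form
    return max(cands, key=lambda c: trigram_logprob(prev2, prev1, c))

print(candidates("vérnyomas"))                          # -> ['vérnyomás']
print(correct_in_context("a", "vérnyomás", "normel"))   # -> 'normál'
```

In the paper's actual setup, the translation model of the SMT decoder was derived from the word-based system's output rather than from proofread parallel text; the sketch mirrors only the two-stage ranking-plus-context idea, not that training procedure.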