In this paper we present a procedure for the restoration of diacritics in Serbian texts written using the degraded Latin alphabet. The procedure relies on the comprehensive lexical resources for Serbian: the morphological electronic dictionaries, the Corpus of Contemporary Serbian (SrpKor) and local grammars. Dictionaries are used to identify possible candidates for the restoration, while the data obtained from SrpKor and local grammars assist in deciding between several candidates in cases of ambiguity. The evaluation results reveal that, depending on the text, accuracy ranges from 95.03% to 99.36%, while the precision (average 98.93%) is always higher than the recall (average 94.94%).
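The two-stage idea described in the abstract (dictionary lookup to generate candidates, corpus data to choose among them) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon, its frequency counts, and the helper names `degrade`, `build_index` and `restore` are all hypothetical, the mapping of đ to "dj" is an assumption about how the degraded text was typed, and the local-grammar step is omitted entirely.

```python
from collections import defaultdict

# Serbian Latin letters with diacritics and their assumed degraded spellings.
DEGRADE = {"š": "s", "č": "c", "ć": "c", "ž": "z", "đ": "dj"}


def degrade(word):
    """Strip diacritics from a lower-case word form."""
    return "".join(DEGRADE.get(ch, ch) for ch in word)


# Hypothetical stand-in for a dictionary plus corpus counts:
# word form -> made-up frequency (not taken from SrpKor).
TOY_LEXICON = {"kuća": 120, "kuca": 15, "šuma": 80, "suma": 40,
               "reč": 200, "more": 90}


def build_index(lexicon):
    """Map each degraded spelling to all dictionary forms that produce it."""
    index = defaultdict(list)
    for form, freq in lexicon.items():
        index[degrade(form)].append((form, freq))
    return index


def restore(tokens, index):
    """Replace each token by its most frequent candidate, if any is known."""
    restored = []
    for token in tokens:
        candidates = index.get(token)
        if not candidates:
            # Unknown token: leave it unchanged rather than guess.
            restored.append(token)
        else:
            restored.append(max(candidates, key=lambda c: c[1])[0])
    return restored


if __name__ == "__main__":
    index = build_index(TOY_LEXICON)
    print(restore(["kuca", "suma", "rec", "more", "xyz"], index))
```

Leaving unknown tokens unchanged is a conservative design choice in this sketch: it avoids introducing wrong diacritics at the cost of missing some restorations.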
... related to
social, national and religious conflicts, extremism and terrorism, information security.
Table 1 contains quantitative characteristics of the above-mentioned resources.
Table 1: RuThes-like Thesauri
Thesaurus | Number of concepts | Number of text entries | Number of conceptual relations
RuThes ...
... convenient for text analytics
and information-analytical systems in specific domains.
Figs. 1 and 2 show the thesaurus development interface. The upper left form contains a list of concepts in
alphabetical order. Fig. 1 shows concepts from the Sociopolitical thesaurus: Import of weapons, Import
of information ...
Cvetana Krstev, Ranka Stanković, Duško Vitas. "Knowledge and Rule-Based Diacritic Restoration in Serbian" in Proceedings of the Third International Conference Computational Linguistics in Bulgaria (CLIB 2018), May 27-29, 2018, Sofia, Bulgaria, Sofia : The Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences (2018): 41-51