Search
66 items
-
Resource-based WordNet Augmentation and Enrichment
In this paper we present an approach to support production of synsets for Serbian WordNet (SerWN) by adjusting Princeton WordNet (PWN) synsets using several bilingual English-Serbian resources. PWN synset definitions were automatically translated and post-edited, if needed, while candidate literals for Serbian synsets were obtained automatically from a list of translational equivalents compiled from bilingual resources. Preliminary results obtained from a set of 1248 selected PWN synsets show that the produced Serbian synsets contain 4024 literals, out of which 2278 were offered by the system we present in this paper, whereas experts added the remaining 1746. Approximately one half of ...... strategies proposed in (Oliver et al., 2015) for automatic construction of the required corpora: by machine translation of sense-tagged corpora and by automatic sense-tagging of English-Serbian parallel corpora. POS tag annotation of bilingual en-sr parallel list is also envisaged, with the aim of ...
... of Language Translation API, which, unlike the official Google Language Translation API, produces text translated into Serbian in Latin script, instead of Cyrillic, and serializes it into a plain text file.3 An example of a list item is: ENG30-08331011-n | a court with jurisdiction in equity | chancery; ...
... use of other available resources for development and enrichment of wordnets have also been proposed. Thus, Oliver and Climent (2014) used parallel corpora for five European languages to produce aligned wordnets. The English part of each corpus was semantically tagged, after which the process of wordnet ...Ranka Stanković, Miljana Mladenović, Ivan Obradović, Marko Vitas, Cvetana Krstev. "Resource-based WordNet Augmentation and Enrichment" in Proceedings of the Third International Conference Computational Linguistics in Bulgaria (CLIB 2018), May 27-29, 2018, Sofia, Bulgaria, Sofia : The Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences (2018)
-
Towards translation of educational resources using GIZA++
... Integrated Environment for Development of Parallel Corpora (in Serbian). In: Die Unterschiede zwischen dem Bosnischen/Bosniakischen, Kroatischen und Serbischen (pp. 563-578), B. Tošović (Ed.). Berlin: LitVerlag 2008 [13] Digital library for parallel text Biblisha Online user manual, http://jerteh.r ...
... parallel corpora [17]. Volk et al. argue that automatic word alignment allows for major innovations in searching parallel corpora. Some online query systems already employ word alignment for sorting translation variants, but they describe the system for efficiently searching large parallel corpora with ...
... and insertion of the search results into the text being translated. 4. ENVIRONMENT FOR TEXT ALIGNMENT Preliminary phase for the text alignment (parallelization) consists of XML document (eXtensible Markup Language) preparation according to TEI (Text Encoding Initiative) consortium guidelines. ...Ivan Obradović, Dalibor Vorkapić, Ranka Stanković, Nikola Vulović, Miladin Kotorčević. "Towards translation of educational resources using GIZA++" in The Seventh International Conference on e-Learning (eLearning-2016), September 2016, Belgrade : Metropolitan University (2016)
-
OntoLex Publication Made Easy: A Dataset of Verbal Aspectual Pairs for Bosnian, Croatian and Serbian
This paper presents a new language resource for retrieving and researching verbal aspectual pairs in BCS (Bosnian, Croatian and Serbian), created using the principles of Linguistic Linked Open Data (LLOD). Since no resource exists to help learners of Bosnian, Croatian and Serbian as foreign languages recognize the aspect of a verb or its pairs, we created a new resource that provides users with information about aspect, as well as a link to the verb's aspectual pairs. The resource also contains external links to monolingual dictionaries, Wordnet and BabelNet. ...Ranka Stanković, Maxim Ionov, Medina Bajtarević, Lorena Ninčević. "OntoLex Publication Made Easy: A Dataset of Verbal Aspectual Pairs for Bosnian, Croatian and Serbian" in Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024, Turin, 20-25 May 2024, ELRA and ICCL (2024)
-
Two approaches to compilation of bilingual multi-word terminology lists from lexical resources
In this paper, we present two approaches and the implemented system for bilingual terminology extraction that rely on an aligned bilingual domain corpus, a terminology extractor for a target language, and a tool for chunk alignment. The two approaches differ in the way terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second one uses a term extraction tool. For both approaches, four experiments were performed with two parameters being ...Branislava Šandrih, Cvetana Krstev, Ranka Stanković. "Two approaches to compilation of bilingual multi-word terminology lists from lexical resources" in Natural Language Engineering, Cambridge University Press (CUP) (2020). https://doi.org/10.1017/S1351324919000615
-
A Tool for Enhanced Search of Multilingual Digital Libraries of E-journals
This paper outlines the main features of Bibliša, a tool that offers various possibilities of enhancing queries submitted to large collections of TMX documents generated from aligned parallel articles residing in multilingual digital libraries of e-journals. The queries initiated by a simple or multiword keyword, in Serbian or English, can be expanded by Bibliša, both semantically and morphologically, using different supporting monolingual and multilingual resources, such as wordnets and electronic dictionaries. The tool operates within a complex system composed ...... Education and Science under the grant #III 47003. References Gravano, L. Nezinger, M.H. (2006). Systems and Methods for Using Anchor Text as Parallel Corpora for Cross-Language Information Retrieval - US Patent 7,146,358 B1 - Google Patents. Kovačević, Lj., Injac, V., Begenišić, D. (2004) ...
... for each article, links are offered to the full text of the article in .pdf format (residing on the official site of the INFOtheca journal) as well as the entire aligned parallel text of the article in .html format. More powerful is the full-text search (Figure 5). The user initiates this search ...
... Thus, for example, the OPUS corpus offers freely available parallel corpora in many languages, as well as interfaces for querying the corpus data [Tiedemann, 2009]. Another example of a system that uses parallel corpora for information retrieval is given in [Gravano, 2006]. The HLT group ...Ranka Stanković, Cvetana Krstev, Ivan Obradović, Aleksandra Trtovac, Miloš Utvić. "A Tool for Enhanced Search of Multilingual Digital Libraries of E-journals" in Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, May 2012, Istanbul, Turkey, Istanbul, Turkey : European Language Resources Association (2012)
-
Towards Automatic Definition Extraction for Serbian
This paper presents preliminary results of the automatic extraction of candidates for dictionary definitions from unstructured texts in Serbian, with the aim of speeding up dictionary development. Definitions in the dictionary of the Serbian Academy of Sciences and Arts (SANU) were used to model different types of definitions (descriptive, grammatical, referential and synonymous) that have different syntactic and lexical characteristics. The research corpus consists of 61,213 noun definitions, which were analysed using morphological e-dictionaries and local grammars implemented as finite-state transducers in the corpus processing package of open ...... Pollak, S., Vavpetic, A., Kranjc, J., Lavrac, B. & Vintar, Š. (2012). NLP workflow for on-line definition extraction from English and Slovene text corpora. In: Proceedings of KONVENS 2012, Vienna, September 19, 2012, pp. 53–60. Ristić, S., Konjik Lazić, I. & Ivanović, N. (2018) Metajezik leksikografske ...
... A finite state transducer “passes” through the text it analyses to compare a text chunk with the model it represents. In the case of successful recognition, a final state transducer produces some result, which can be a modification of the source text by adding tags for types of recognized 1 Un ...
... year of publishing, subject, school level (primary, secondary) and school class. As a guest, a user can presently search several corpora under NoSketch Engine; more corpora will be available in the near future. https://noske.jerteh.rs/#dashboard?corpname=SkolKor domain scope recognized correct ...Ranka Stanković, Cvetana Krstev, Rada Stijović, Mirjana Gočanin, Mihailo Škorić. "Towards Automatic Definition Extraction for Serbian" in Proceedings of the XIX EURALEX Congress of the European Association for Lexicography: Lexicography for Inclusion (Volume 2). 7-9 September (virtual), Democritus University of Thrace (2021)
-
Towards ELTeC-LLOD: European Literary Text Collection Linguistic Linked Open Data
This paper describes a case study on generating linked data from annotated text corpora using the NLP Interchange Format (NIF). The research is based on a subset of the ELTeC corpus, which consists of 900 novels from the period 1840-1920 in 9 European languages. The annotated version of the novels, in the so-called TEI level-2 format, was transformed into NIF, an RDF/OWL-based format that aims to achieve interoperability between natural language processing tools, language resources and ...Ranka Stanković, Christian Chiarcos, Miloš Utvić, Olivera Kitanović. "Towards ELTeC-LLOD: European Literary Text Collection Linguistic Linked Open Data" in LDK 2023 – 4th Conference on Language, Data and Knowledge, 12-15 September in Vienna, Austria, Lisbon : NOVA FCSH - CLUNL (2023). https://doi.org/10.34619/srmk-injj
-
From DELA Based Dictionary to Leximirka Lexical Database
Biljana Lazić, Mihailo Škorić (2020) In this paper, we will present an approach in transforming Serbian language Morphological dictionaries from a DELA text format to a lexical database dubbed Leximirka. Considering the benefits of storing data within a database when compared to storing them in textual documents, we will outline some of the functionality that the database has made possible. We will also show how hand-made rules that use category labels lexical entries are marked with can be used to link lexical entries. ...... the Leximirka application: – data categories (option Categories), – dictionaries (option Lexicons), – lexical entries (option Entries), – corpora (option Corpora), 7 Ekavian dialect the reflection of the Old-Church Slavonic “Jat” is an “e”, while in Iekavian it can be “je”, “ije” or “i”. Infotheca ...
... dictionary to . . . ”, pp. 81–98 of terms, the extraction of time expressions and advanced search of text repositories and libraries. The morphological dictionaries were developed in the DELA text format (fr. Dictionnaires électroniques du LADL2 ) which will be discussed in Section 2.1. As the ...
... and to make them interoperable and reusable. Three standards for lexical information have been considered: Guidelines for Electronic Text Encoding and Interchange, Text Encoding Initiative (TEI)3, Lexical Markup Framework (LMF)4 and the Lemon model5. Although Chapter 9 of the TEI Guidelines addresses ...Biljana Lazić, Mihailo Škorić. "From DELA Based Dictionary to Leximirka Lexical Database" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.4
-
Softverski alati za korišćenje resursa za srpski jezik
Ivan Obradović, Ranka Stanković (2008) ... (Gale and Church, 1993). Figure 2 depicts an example of an aligned text represented in the WS4LR tool. It is a legal text in English and Serbian, aligned at the sentence level. Figure 2. Example of an aligned text Parallel corpora are very useful in the research pertaining to bilingual but also ...
... being used (Ohmori and Higashida, 1999). The procedure of transforming a parallel text into an aligned text consists of two basic steps. In the first step parallel texts are split into segments, that is, basic units of text. Usually, sentences are chosen for segments, but segments can be larger, such ...
... “highlighting”, namely by representing them in blue, in order to make them more easily recognizable in the text. The text in English is on the left hand side, and the corresponding text in Serbian on the right. Results obtained by searching aligned texts with bilingual queries can be used for ...Ivan Obradović, Ranka Stanković. "Softverski alati za korišćenje resursa za srpski jezik" in INFOteka: časopis za informatiku i bibliotekarstvo, Belgrade, Serbia : Zajednica biblioteka univerziteta u Srbiji (2008)
-
Rule-based Automatic Multi-word Term Extraction and Lemmatization
In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from ...... statistical corpus based term extraction algorithm used on English and Chinese corpora is described in (Pantel & Lin, 2001), while Chen and his associates present a MWT extraction system based on co-related text-segments within a set of documents (Chen et al., 2006). Statistical measures of ...
... place with very little human intervention, starting from the tokenization and lexical analysis of a raw text up to production of dictionary entries. The system relies on Unitex routines for text analysis and FST application, while one of the many functionalities of LeXimir is used to produce dictionary ...
... 2012). However, the two approaches are more and more often combined in a hybrid approach. An approach to extracting MWTs from Arabic specialized corpora that uses linguistic rules to parse documents and retrieve candidate terms and statistical measures to deal with ambiguities and rank candidate ...Ranka Stanković, Cvetana Krstev, Ivan Obradović, Biljana Lazić, Aleksandra Trtovac. "Rule-based Automatic Multi-word Term Extraction and Lemmatization" in Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, 23--28 May 2016, European Language Resources Association (2016)
-
FrameNet Lexical Database: Presenting a Few Frames Within the Risk Domain
The paper gives a brief overview of the theory of frame semantics, on which the FrameNet lexical database is based. The concept of this network is presented, along with the possibilities for its application. The lexical analysis applied in the FrameNet project is also presented, and the differences between frame-based and word-based analysis are pointed out. Several related frames evoked by words from the risk domain are then shown. The paper also presents the NLTK platform, which can be used to ...... Toolkit) is an easy-to-use natural language processing Python suite that accesses continually increasing number of corpora and lexical resources. NLTK offers different types of text processing, amongst which are: classification, tokenization, stemming, tagging, parsing and semantic reasoning. The ...
... roles and typical semantic-syntactic patterns of the most frequent verbs were presented for each of the corpora. The verb to be and the semantic role of patient were the most frequent in both corpora, while the second place went to the role of agent (95–96). In the paper, semantic roles were labeled in ...
... actually used, an analysis of corpus data proves to be a fairly complicated task, in view of the number of concordances proposed by contemporary corpora for certain key words. Frame semantics theory, as cited by the following authors (Atkins 1994; Gildea and Jurafsky 2002; Atkins, Fillmore, and Johnson ...Aleksandra Marković, Ranka Stanković, Natalija Tomić, Olivera Kitanović. "FrameNet Lexical Database: Presenting a Few Frames Within the Risk Domain" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.1.1
-
A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian
Abusive speech on social media, including profanity, derogatory speech and hate speech, has reached pandemic levels. A system able to detect such texts could help make the internet and social media a better and more respectful virtual space. Research and commercial applications in this area have so far focused mainly on English. This paper presents work on building AbCoSER, the first corpus of abusive speech in Serbian. The corpus consists of 6,436 manually annotated ...... High-quality corpora of hate speech, offensive speech, and abusive language are very important as a first step in building an automated system for the detection of these phenomena ([51, 52, 1, 6]). Warner and Hirschberg [44] presented their research on hate speech toward minority groups in online text, with ...
... the levels is clearer). The main advantage is that the same scheme can be used for general-purpose hate speech corpora, which includes several types of hate speech, and for specific corpora, which usually cover only one type of hate speech (racial hatred, misogyny, hatred of migrants, etc.). The first ...
... hate speech as described in [42]; 3) Classifiers trained on corpora containing general abusive speech, can be used to classify a domain hate speech corpus, while domain-specific classifiers perform poorly on the general data set and corpora from other hate speech domains ([46, 29]); therefore, instead ...Danka Jokić, Ranka Stanković, Cvetana Krstev, Branislava Šandrih. "A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian" in 3rd Conference on Language, Data and Knowledge (LDK 2021), MDPI AG (2021). https://doi.org/10.4230/OASIcs.LDK.2021.13
-
Using English Baits to Catch Serbian Multi-Word Terminology
In this paper we present the first results in bilingual terminology extraction. The hypothesis of our approach is that if for a source language domain terminology exists as well as a domain aligned corpus for a source and a target language, then it is possible to extract the terminology for a target language. Our approach relies on several resources and tools: aligned domain texts, domain terminology for a source language, a terminology extractor for a target language, and a ...aligned texts, word alignment, terminology extraction, electronic dictionaries, morphological inflection... parallel corpora. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING ’10, pages 1256–1264, Stroudsburg, PA, USA. Association for Computational Linguistics. Vintar, Š. and Fišer, D. (2008). Harvesting multi-word expressions from parallel corpora. In ...
... bilingual aligned terminological list. 2. Related Work In recent years extraction of bilingual MWTs, and MWEs in general, from bilingual aligned corpora has been exploited by many researchers. Although most of them rely on automatic word alignment they differ both in resources and techniques used ...
... morphological dictionaries. We will apply the same approach to other domains – mining, electro-distribution and management – since aligned domain corpora have already been prepared. At the same time the presented system will be improved with the user friendly interface for presentation of the results ...Cvetana Krstev, Branislava Šandrih, Ranka Stanković. "Using English Baits to Catch Serbian Multi-Word Terminology" in Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA) (2018)
-
Classification of Terms on a Positive-Negative Feelings Polarity Scale Based on Emoticons
Mihailo Škorić (2017) The goal of this paper is to draw attention to the possibility of using emoticon-riddled text on the web in language-neutral sentiment analysis. It introduces several innovations in the existing framework of research and tests their effectiveness. It also presents a software tool especially made for that purpose, explains how it builds a database with sentimental value of terms and offers the user manual. Finally, it presents a software tool that tests the new database and gives some examples ...... meaning of written text, but only the grammar of the language that text is written on, which enables wider application. – Software that has a deeper understanding of the meaning of the text, often limited to one or a small number of areas. This type of software is predominantly used for text classification ...
... message does not contain text, and its determiner must refer to previous message. 3. if the message contains both the determiner and the text, and the following message contains determiner but not text – determiners from both messages will refer to the message that contains text. Example: A: I missed the ...
... g and analysis: understanding of written text and text queries, analysis of moods in the text, processing of digital linguistic resources such as automatic parallelization and automation of any operation that requires a deep understanding of the written text. – Artificial intelligence: automated co ...Mihailo Škorić. "Classification of Terms on a Positive-Negative Feelings Polarity Scale Based on Emoticons" in Infotheca, Faculty of Philology, University of Belgrade (2017). https://doi.org/10.18485/infotheca.2017.17.1.4
-
Српски језик у дигиталном добу -- The Serbian Language in the Digital Age
Duško Vitas, Ljubomir Popović, Cvetana Krstev, Ivan Obradović, Gordana Pavlović-Lažetić, Mladen Stanojević (2012) ... analysing bilingual text corpora, parallel corpora, such as the Europarl parallel corpus, which contains the proceedings of the European Parliament in 21 European languages. Given enough data, statistical MT works well enough to derive an approximate meaning of a foreign language text by processing parallel ...
... generation 0 0 0 0 0 0 0 Machine translation 1 1 0 1 0 1 1 Language Resources (Resources, Data and Knowledge Bases) Text corpora 0,5 1 0,5 1 1 1 0,5 Speech corpora 1 2 4 4 3 3 3 Parallel corpora 3 3 3 2 2 2 3 Lexical resources 1 2 2 2 2 2 2,5 Grammars 1 1 0 1 0 1 1 11: State of language technology support ...
... available MT applications. Text Analysis: Quality and coverage of existing text analysis technologies (morphology, syntax, semantics), coverage of linguistic phenomena and domains, amount and variety of available applications, quality and size of existing (annotated) text corpora, quality and coverage ...Duško Vitas, Ljubomir Popović, Cvetana Krstev, Ivan Obradović, Gordana Pavlović-Lažetić, Mladen Stanojević. "Српски језик у дигиталном добу -- The Serbian Language in the Digital Age" in META-NET White Paper Series, G. Rehm, H. Uszkoreit (eds.), Springer (2012)
-
A Lexical Approach to Acronyms and their Definitions
In this paper we present a comprehensive approach to acronyms for Natural-Language Processing (NLP) of Serbian texts. The proposed procedure includes extraction of acronyms and their definitions that are usually Multi-Word Units (MWUs), shallow parsing of MWUs that enables MWU lemmatization and production of entries in morphological electronic dictionaries, both for MWU and acronyms, that are provided with grammatical, syntactic, semantic and domain information. This approach enables representation that reflects complex relations between acronyms and their definitions.... training corpora, while those based on lexical resources do not have them listed in lexicons. However, their adequate treatment is crucial for many applications, e.g. text-to-speech systems (Taylor, 2009), machine translation (Wolinski et al., 1995), indexing for information retrieval and text cl ...
... biomedical text. In Pacific Symposium on Biocomputing, volume 8. World Scientific. Spasic, I., S. Ananiadou, J. McNaught, and A. Kumar, 2005. Text mining and ontologies in biomedicine: making sense of raw text. Briefings in bioinformatics, 6(3):239–251. Taylor, Paul, 2009. Text-to-speech synthesis ...
... tual Incompletness. In Proc. of the Corpus Linguistics Conference, Birmingham. Liberman, Mark Y and Kenneth W Church, 1992. Text analysis and word pronunciation in text-to-speech synthesis. Advances in speech signal processing:791–831. Moon, S., S. Pakhomov, and G. B. Melton, 2012. Automated ...Cvetana Krstev, Duško Vitas, Ranka Stanković. "A Lexical Approach to Acronyms and their Definitions" in Proceedings of the 7th Language & Technology Conference, November 27-29, 2015, Poznań, Poland, Springer (2015)
-
EUROLAN 2021: Introduction to Linked Data for Linguistics Online Training School
The first training school organized by the COST action NexusLinguarum was held from 8 to 12 February 2021, with the aim of teaching students, researchers and practitioners the basics of linguistic data science. During the training, participants were introduced to a wide range of topics: from the Semantic Web, RDF and ontologies, to modelling and querying language data with state-of-the-art ontological models and tools. The school was held as part of the EUROLAN series of summer schools and was organized virtually (online) by several institutes; ...linguistic data science, linked data in linguistics, language data, EUROLAN, NexusLinguarum, COST action, training school... (McCrae et al. 2017; Declerck, Tiberius, and Wandl-Vogt 2017; Stanković et al. 2018) – Linguistic linked data generation; (Cimiano et al. 2020) – Corpora and linked data; (Chiarcos 2012) – Linguistic annotations; (Fäth et al. 2020) – NLP Interchange Format; (Hellmann et al. 2013) – Tools and applications ...
... Elena Montiel-Ponsoda. 2017. “Towards a Module for Lexicography in OntoLex.” In LDK Workshops, 74–84. Chiarcos, Christian. 2012. “Interoperability of corpora and annotations.” In Linked Data in Linguistics, 161–179. Springer. Chiarcos, Christian, Maxim Ionov, Jesse de Does, Katrien Depuydt, Fahad Khan, ...Milan Dojchinovski, Julia Bosque Gil, Jorge Gracia, Ranka Stanković. "EUROLAN 2021: Introduction to Linked Data for Linguistics Online Training School" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.1.7
-
Old or New, We Repair, Adjust and Alter (Texts)
Cvetana Krstev, Ranka Stanković (2020) In this paper we present how e-dictionaries and cascades of finite-state transducers implemented in the Unitex tool can be used to solve three text transformation problems: correcting texts after OCR, restoring diacritics, and switching between different language variants. ...text correction, OCR errors, diacritic restoration, language variants, electronic dictionary, finite-state transducers... Mining and Geology ranka.stankovic@rgf.bg.ac.rs Belgrade, Serbia 1 Text mending – introduction to problems Text mending is one of the simplest text transformation problems, when compared to speech recognition and generation, text summarization and machine translation. It is also one of the first problems ...
... character recognition (OCR) is applied. A text that fully corresponds to the original is rarely obtained since OCR is prone to errors. The quality of the resulting text depends on various factors: the software used, quality of the paper and print of the original text, and its language and alphabet. OCR software ...
... to a clean text.5 A text after OCR - Е. *нпjе него броћ! Тебе ће неко *еад *пптатн шта ти хоћеш, а *пгга нећеш! Него. кажи ти мени. jе ли теби *бнла позната моjа наредба, коjом се забрањуjе тумарање по турским кућама? — *Нпjе. — Jа где си ти *бно за ово месец дана — У *болннци. A text after automatic ...Cvetana Krstev, Ranka Stanković. "Old or New, We Repair, Adjust and Alter (Texts)" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.3
-
Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian
The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of a gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on TreeTagger and spaCy taggers, and the annotation schema alignment ...... texts used in this research are shown in Table 2. The text 1984, Serbian translation of Orwell’s novel, was annotated according to the MULTEXT-East specification and included in MULTEXT-East resources (version 3) (Krstev et al., 2004). The text Verne, Serbian translation of the novel Around the ...
... on four different manually annotated set of texts. Test set was compiled of 10% of each text used for training, and it can give a rough idea on how models perform when tagging similar, already familiar text. Verne, History and Novels represent texts previously unknown to the taggers and show their ...
... result when tagging unfamiliar text. Although TreeTagger TT19 seems to have better overall results, the performance of both taggers drops significantly when tagging unknown text. Figure 1: Part-of-Speech tagging accuracy per token on test sets, for each of trained models. Figure 2: nPoS-tagging accuracy ...Ranka Stanković, Branislava Šandrih, Cvetana Krstev, Miloš Utvić, Mihailo Škorić. "Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian" in Proceedings of the 12th Language Resources and Evaluation Conference, May 2020, Marseille, France, European Language Resources Association (2020)
-
Integracija heterogenih tekstualnih resursa
Ranka Stanković, Ivan Obradović (2007) The paper describes an approach to integrating heterogeneous textual resources for Serbian with the help of a complex software tool developed specifically for this purpose. The structure and basic components of the developed system are described. The possibilities for improving the resources through mutual exchange of information, offered by the developed integrated environment, are also presented. Finally, the paper describes how the integrated heterogeneous resources can be applied to query expansion, and to text search in general, and indicates some directions for further development.... components of the system we developed under the name of WS4LR (WorkStation for Lexical Resources), which synchronously handles corpora of Serbian, multilingual aligned corpora, a system of morphological dictionaries for Serbian, the Serbian wordnet and the multilingual ontology of proper names Prolex ...
... where part of the functions of WS4LR would be accessible via the internet, and which would at the same time provide for integration of WS4LR and the corpora of Serbian that are also partially accessible via the internet. A related public web service for query expansion is also planned, as well as a ...Ranka Stanković, Ivan Obradović. "Integracija heterogenih tekstualnih resursa" in Zbornik radova međunarodnog simpozijuma Razlike između bosanskog/bošnjačkog, hrvatskog i srpskog jezika, Graz, Austria, April 2007, - (2007)