Search
88 items
-
Development of Open Educational Resources (OER) for Natural Language Processing
In this paper we present the development of an online course at the edX BAEKTEL platform named “Lexical Recognition in the Natural Language Processing (NLP)”. It is based on the course of the same name for PhD studies at the University of Belgrade, Faculty of Philology. There are not many courses in Computational Linguistics (CL) on OER platforms, and there is none in Serbian either for CL or NLP. We have developed this course in order to improve this ...... improve this situation as it can prove useful both for linguists working in corpus linguistics and computer scientists developing NLP applications. The participant will become familiar with the use of Unitex, the corpus processing system for which many valuable resources for Serbian were already ...
... oriented processing of texts in human languages. As an illustration, a string oriented web interface to the Corpus of Contemporary Serbian 13 is presented. [13][14] 2. Unitex corpus processing system is presented from the practical point of view: how to install it and start working with it ...
... within BAEKTEL project of its OER version within the edX BAEKTEL platform. 2 The main features of Unitex, an open access and open source corpus processing system, are presented in Section 4. Section 5 presents course content with didactic criteria and specific formats used in the OER course ... Cvetana Krstev, Biljana Lazić, Ranka Stanković, Giovanni Schiuma, Miladin Kotorčević. "Development of Open Educational Resources (OER) for Natural Language Processing" in The Sixth International Conference on e-Learning (eLearning-2015), September 2015, Belgrade, Serbia, Belgrade : Belgrade Metropolitan University (2015)
-
Old or New, We Repair, Adjust and Alter (Texts)
Cvetana Krstev, Ranka Stanković (2020) In this paper we present how e-dictionaries and cascades of finite-state transducers implemented in the Unitex tool can be used to solve three text transformation problems: correcting texts after OCR, restoring diacritics, and switching between different language variants. Keywords: text correction, OCR errors, diacritic restoration, language variants, electronic dictionary, finite-state transducers ... Serbian morphological electronic dictionaries (SMD) (Krstev, 2008); 2 This corpus is a part of the European Literary Text Collection corpus (ElTEC) developed in the scope of the COST action 16204 Distant Reading for European Literary History (d-reading). 64 Infotheca Vol. 19, No. 2, December 2019 Scientific ...
... present the solution for detecting and correcting OCR errors developed for the compilation of the corpus of Serbian novels written and published in the period 1840–1920.2 The novels selected for this corpus were mainly printed in Cyrillic script (only a few of them were in Latin script). When scanning ...
... performed by finite-state transducers (FST) implemented in Unitex.3 A separate FST is written for each replacement 3 UnitexGramLab, a lexicon-based corpus processing suite. Infotheca Vol. 19, No. 2, December 2019 65 Krstev C., Stanković R., “Old or new, we repair . . . ”, pp. 61–80 a non-valid ...Cvetana Krstev, Ranka Stanković. "Old or New, We Repair, Adjust and Alter (Texts)" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.3
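The diacritic restoration task named in this entry can be illustrated with a plain dictionary lookup. The sketch below is not the Unitex finite-state implementation the authors describe. It assumes a hypothetical word list `LEXICON`, handles lowercase Latin-script input only, and restores a form only when exactly one dictionary entry matches the diacritic-free spelling.

```python
from collections import defaultdict

# Hypothetical mini-lexicon of correctly spelled Serbian (Latin script) forms.
LEXICON = ["čaša", "kuća", "šuma", "suma", "reč", "grad"]

# Map each Serbian Latin diacritic letter to its plain ASCII counterpart.
STRIP = str.maketrans({"č": "c", "ć": "c", "š": "s", "ž": "z", "đ": "dj"})


def build_index(lexicon):
    """Group dictionary forms by their diacritic-free spelling."""
    index = defaultdict(set)
    for form in lexicon:
        index[form.translate(STRIP)].add(form)
    return index


def restore(word, index):
    """Return the single matching dictionary form, else the word unchanged."""
    candidates = index.get(word, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return word  # unknown or ambiguous: leave for a later disambiguation step


INDEX = build_index(LEXICON)
```

Ambiguous inputs such as `suma` (which could be `suma` or `šuma`) are deliberately left untouched here; resolving them needs context, which is beyond this sketch.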
-
Classification of Terms on a Positive-Negative Feelings Polarity Scale Based on Emoticons
Mihailo Škorić (2017)The goal of this paper is to draw attention to the possibility of using emoticon-riddled text on the web in language-neutral sentiment analysis. It introduces several innovations in the existing framework of research and tests their effectiveness. It also presents a software tool especially made for that purpose, explains how it builds a database with sentimental value of terms and offers the user manual. Finally, it presents a software tool that tests the new database and gives some examples ...... databases for this study were created, from the collection of the corpus to the export of completed database, which can then be used in several ways. 2.1 Collecting textual corpus The basic idea was for the database to be based on a corpus of texts containing determiners which express positive or negative ...
... pendent, the system would be language-independent as well. If it turns out to be valid, this method could allow machine learning the usage of huge corpus of texts that are pre-labeled with determiners. 1.1 Review of their former similar studies In 2005 a series of experiments with the classification ...
... and language-neutral determiner strings will be used. Goal is to create a fully language-independent system that would greatly broaden the possible corpus. 1 Users of Twitter platform have an option to additionally mark their posts with tags so that posts that talk about a certain topic can be found ...Mihailo Škorić. "Classification of Terms on a Positive-Negative Feelings Polarity Scale Based on Emoticons" in Infotheca, Faculty of Philology, University of Belgrade (2017). https://doi.org/10.18485/infotheca.2017.17.1.4
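The idea of deriving term polarity from emoticon-labeled text can be sketched as follows. This is an illustrative assumption, not the tool described in the paper: the emoticon sets, the whitespace tokenization, and the scoring formula `(pos - neg) / (pos + neg)` are all hypothetical choices made for the example.

```python
from collections import Counter

# Assumed emoticon sets acting as positive/negative determiners.
POSITIVE = {":)", ":-)", ":D"}
NEGATIVE = {":(", ":-(", ":'("}


def term_polarity(texts):
    """Score each term in [-1, 1] from the emoticons of the texts it occurs in."""
    pos, neg = Counter(), Counter()
    for text in texts:
        tokens = text.split()
        has_pos = any(t in POSITIVE for t in tokens)
        has_neg = any(t in NEGATIVE for t in tokens)
        if has_pos == has_neg:
            continue  # no determiner, or conflicting ones: skip the text
        target = pos if has_pos else neg
        # Count each term once per text, ignoring the emoticons themselves.
        terms = {t.lower() for t in tokens if t not in POSITIVE and t not in NEGATIVE}
        for term in terms:
            target[term] += 1
    return {
        term: (pos[term] - neg[term]) / (pos[term] + neg[term])
        for term in pos.keys() | neg.keys()
    }
```

Because the only language-specific step is whitespace tokenization, the same counting scheme applies to any language whose texts carry such emoticons, which is the language-neutral property the abstract emphasizes.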
-
Using English Baits to Catch Serbian Multi-Word Terminology
In this paper we present the first results in bilingual terminology extraction. The hypothesis of our approach is that if for a source language domain terminology exists as well as a domain aligned corpus for a source and a target language, then it is possible to extract the terminology for a target language. Our approach relies on several resources and tools: aligned domain texts, domain terminology for a source language, a terminology extractor for a target language, and a ... Keywords: aligned texts, word alignment, terminology extraction, electronic dictionaries, morphological inflection ... overall design of our system (Figure 1) is as follows: 1. Input: • A sentence-aligned domain-specific corpus involving a source and a target language. We will denote an entry in this corpus with S(text.align) ↔ T (text.align); • A list of terms from the same domain in a source language (both s ...
... extracted from the target part of the aligned corpus having some expected syntactic structure. We will denote an entry from this list with T (term.extract). 2. Processing: • Aligning bilingual chunks (possible translation equivalents) from the aligned corpus. We will denote aligned chunks with S(align ...
... contained 491,990 translation pair candidates. We decided to enrich corpus with additional parallel lists (described in Subsection 4.4.) since we observed certain improvement in evaluations of translation quality. First we split the corpus of aligned sentences into three disjoint parts: training (80%), ... Cvetana Krstev, Branislava Šandrih, Ranka Stanković. "Using English Baits to Catch Serbian Multi-Word Terminology" in Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA) (2018)
-
Terminology Acquisition and Description Using Lexical Resources and Local Grammars
Acquisition of new terminology from specific domains and its adequate description within terminological dictionaries is a complex task, especially for languages that are morphologically complex such as Serbian. In this paper we present an approach to solving this task semi-automatically on basis of lexical resources and local grammars developed for Serbian. Special attention is given to automatic inflectional class prediction for simple adjectives and nouns and the use of syntactic graphs for extraction of Multi-Word Unit (MWU) candidates for ...... transducers using CasSys tool incorporated in Unitex1 corpus processing platform, as well as the use of TMF standard for the representation of terms is proposed in (Ammar et al., 2015) and applied on Arabic scientific and technical corpus. In (Savary et al., 2012) terminology extraction in the ...
... ported that modern statistical Natural Language Processing (NLP) is in great need of better language models and linguistic tools must come to 1 Corpus processing System Unitex: http://www-igm.univ-mlv.fr/~unitex/ Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada ...
... extraction In order to evaluate our approach, we applied it to a collection of 74 papers in Serbian from the journal Infotheca. 6 The size of the corpus is 6 Infotheca - Journal for Digital Humanities (http://infoteka.bg.ac.rs/index.php/en/infoteca) Proceedings of the conference Terminology and ...Cvetana Krstev, Ranka Stanković, Ivan Obradović, Biljana Lazić. "Terminology Acquisition and Description Using Lexical Resources and Local Grammars" in Proceedings of the 11th Conference on Terminology and Artificial Intelligence, Granada, Spain, 2015, Granada : LexiCon (Universidad de Granada) (2015)
-
Multi-word Expressions for Abusive Speech Detection in Serbian
This paper presents research on refining and improving the Serbian version of Hurtlex, a multilingual lexicon of offensive words. We pay special attention to adding multi-word expressions (multi-word units) that can be considered offensive, since such lexical entries are very important for achieving good results in many abusive language detection tasks. Serbian morphological dictionaries are used as the basis for data cleaning and dictionary creation. The connection with other lexical and semantic resources for Serbian is highlighted, and the building of a system for ...... the domain corpus of hateful content and Subjectivity lexicon of Therese Wilson in combination with the SentiWordNet (Esuli and Sebastiani, 2006). For classification, they leveraged rules and achieved a result of F1 = 0.783 for strongly hateful sentences on a manually annotated domain corpus. Razavi ...
... Processing Paradigm for Balkan Languages, pages 15–22. Cvetana Krstev, Jelena Jaćimović, and Duško Vitas. 2020. Analysis of similes in serbian literary texts (1840- 1920) using computational methods. In Svetla Koeva, editor, Proceedings of the Fourth International Conference Computational Linguistics ...
... hyperbole, litotes etc. Initial work on detecting some of these figures has been presented in (Mladenović et al., 2017; Krstev et al., 2020). Using a corpus of newspaper articles from 2006, Krstev et al. (2007) presented the results of an information search experiment in search of attacks which are the ... Ranka Stanković, Jelena Mitrović, Danka Jokić, Cvetana Krstev. "Multi-word Expressions for Abusive Speech Detection in Serbian" in Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, Association for Computational Linguistics (2020)
-
EUROLAN 2021: Introduction to Linked Data for Linguistics Online Training School
The first training school organized by the COST Action NexusLinguarum was held from 8 to 12 February 2021, with the aim of teaching students, researchers and professionals the basics of linguistic data science. During the training, participants were introduced to a wide range of topics: from the Semantic Web, RDF and ontologies to modeling and querying language data using state-of-the-art ontology models and tools. The school was held as part of the EUROLAN series of summer schools and was organized virtually (online) by several institutes; ... Keywords: linguistic data science, linked data in linguistics, language data, EUROLAN, NexusLinguarum, COST Action, training school ... September 2021 115 Dojchinovski M. et al., eurolan 2021: . . . Linked Data. . . , pp. 113–120 Ponsoda 2017), FrAC 12 – frequency, attestation and corpus Information (Chiarcos et al. 2020). Finally, the training school ended with a closing session where an ontology of participants, lecturers and ...
... and building on to present more specific topics in a detailed fashion on the last day, the participants had 12. FrAC – Frequency, Attestation and Corpus Information - Ontology-Lexica Community Group 116 Infotheca Vol. 21, No. 1, September 2021 Professional paper a chance to acquire a solid foundation ...
... Lex Frac module was used for representation of the entries from the lexicon used for abusive speech detection with attestations from the Twitter corpus with annotation of abusive spans (Jokić et al. 2021). 3 Organization Due to the COVID-19 pandemic and current travel restrictions in Europe and beyond ... Milan Dojchinovski, Julia Bosque Gil, Jorge Gracia, Ranka Stanković. "EUROLAN 2021: Introduction to Linked Data for Linguistics Online Training School" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.1.7
-
From DELA Based Dictionary to Leximirka Lexical Database
Biljana Lazić, Mihailo Škorić (2020)In this paper, we will present an approach in transforming Serbian language Morphological dictionaries from a DELA text format to a lexical database dubbed Leximirka. Considering the benefits of storing data within a database when compared to storing them in textual documents, we will outline some of the functionality that the database has made possible. We will also show how hand-made rules that use category labels lexical entries are marked with can be used to link lexical entries. ...... 000 most frequent words in the Serbian Corpus of the Serbian Language SrbCorp (version of 122 million words by Duško Vitas and Miloš Utvić)6. Information about the Corpus is stored in the KorpusMeta table. The LexicalRelation table stores information 6 Corpus of the Serbian Language – SrbCorp 86 ...
... that match the specified search criteria appear as rows in the table. The registered user has access to multiple corpus searches (in the MatKorp and SrpKorpRGF corpora). The Mining Corpus (RudKorp) (Tomašević et al., 2018) that can be searched by some predefined queries that retrieve a word searched ...
... their main importance is their reusability. They were used for the basic tasks of word processing, automatic recognition 1 Unitex is cross-platform Corpus Processing Suite to retrieve data. Infotheca Vol. 19, No. 2, December 2019 81 Lazić B., Škorić M., “From DELA based dictionary to . . . ”, pp ...Biljana Lazić, Mihailo Škorić. "From DELA Based Dictionary to Leximirka Lexical Database" in Infotheca, Faculty of Philology, University of Belgrade (2020). https://doi.org/10.18485/infotheca.2019.19.2.4
-
Infotheca (Q25460443) in Wikidata
Ranka Stanković, Lazar Davidović (2021) Wikidata is a knowledge base of the Wikimedia Foundation that serves as a common source of various kinds of data used not only by other Wikipedia projects, but increasingly also by numerous Semantic Web applications. In this paper we present an example of integrating Wikidata with digital libraries and external systems, as well as the possibility of speeding up data preparation and entry, using papers from the digital humanities journal Infotheca as an example. ... open data network was used by Andonovski (Андоновски 2020) to describe language resources, namely, novels forming part of the Serbian-German literary corpus (Andonovski, Šandrih, and Kitanović 2019). For a number of years now, students at the Faculty of Mining and Geology have been undergoing training ...
... of open data. As part of the “Distant Reading for European Literary History”12 COST Action CA16204 (2017-2021) metadata about Serbian novels included in the srpELTEC corpus is being entered into the knowledge base (Krstev et al. 2019) ...
... 10. Wikimedia 11. Input data to Wikidata and their use 12. One of the most important aims of this action is preparing a multilingual corpus (titled European Literary Text Collection - ELTeC) which, when fully complete, will feature a hundred novels from each participating country first published in ... Ranka Stanković, Lazar Davidović. "Infotheca (Q25460443) in Wikidata" in Infotheca, Faculty of Philology, University of Belgrade (2021). https://doi.org/10.18485/infotheca.2021.21.1.5
-
Resource-based WordNet Augmentation and Enrichment
In this paper we present an approach to support production of synsets for Serbian WordNet (SerWN) by adjusting Princeton WordNet (PWN) synsets using several bilingual English-Serbian resources. PWN synset definitions were automatically translated and post-edited, if needed, while candidate literals for Serbian synsets were obtained automatically from a list of translational equivalents compiled from bilingual resources. Preliminary results obtained from a set of 1248 selected PWN synsets show that the produced Serbian synsets contain 4024 literals, out of which 2278 were offered by the system we present in this paper, whereas experts added the remaining 1746. Approximately one half of ...... wordnets. The English part of each corpus was semantically tagged, after which the process of wordnet creation was transformed into a word alignment problem, where wordnet synsets in the English part of the corpus were aligned with in the target language part of the corpus. The obtained precision was s ...
... with domain-specific single and multi-word expressions. They used a large monolingual Slovene corpus of texts to extract terminology from the domain of informatics, and a parallel English-Slovene corpus and an online dictionary as bilingual resources to facilitate the addition of new terms to sloWNet ...
... parallel resources, and search for new pairs of aligned literals for synsets, which will then be manually post-edited. We also plan to use parallel corpus based methodologies relying on two strategies proposed in (Oliver et al., 2015) for automatic construction of the required corpora: by machine ... Ranka Stanković, Miljana Mladenović, Ivan Obradović, Marko Vitas, Cvetana Krstev. "Resource-based WordNet Augmentation and Enrichment" in Proceedings of the Third International Conference Computational Linguistics in Bulgaria (CLIB 2018), May 27-29, 2018, Sofia, Bulgaria, Sofia : The Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences (2018)
-
Transformer-Based Composite Language Models for Text Evaluation and Classification
Parallel natural language processing systems were previously successfully tested on the tasks of part-of-speech tagging and authorship attribution through mini-language modeling, for which they achieved significantly better results than independent methods in the cases of seven European languages. The aim of this paper is to present the advantages of using composite language models in the processing and evaluation of texts written in arbitrary highly inflective and morphology-rich natural language, particularly Serbian. A perplexity-based dataset, the main asset for the ...Mihailo Škorić, Miloš Utvić, Ranka Stanković. "Transformer-Based Composite Language Models for Text Evaluation and Classification" in Mathematics, MDPI AG (2023). https://doi.org/10.3390/math11224660
-
Advancing Sentiment Analysis in Serbian Literature: A Zero and Few-Shot Learning Approach Using the Mistral Model
This study presents a sentiment analysis of old Serbian novels from the period 1840-1920, using the Mistral large language model (LLM) with zero-shot and few-shot learning. The main approach innovates by designing research prompts that include text with classification instructions, either without examples or with a few examples, enabling the language model to classify sentiment into positive, negative, or objective categories. This methodology aims to simplify sentiment analysis by constraining the responses, thereby increasing precision ... Milica Ikonić Nešić, Saša Petalinkar, Mihailo Škorić, Ranka Stanković, Biljana Rujević. "Advancing Sentiment Analysis in Serbian Literature: A Zero and Few-Shot Learning Approach Using the Mistral Model" in Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024), BAS (2024)
-
The Nooj System as Module within an Integrated Language Processing Environment
... lex-resources to texts”) then syntactic resources should not be chosen, and if the last option is on (“Apply query to corpus”), then the user selects only a query and a corpus. Figure 12 presents results in the form of concordances for the query: kompjuter, which was automatically expanded with ...
... retrieval and related areas. If query is further combined with ILI, a multilingual wordnet pivot, the possibility of searching text resources (web, corpus, text) in different languages with a single query is opened. NooJ supports morphological query expansion and expansion of queries by graphs and ...
... in information retrieval and related areas. Combined with the wordnet ILI, this approach opens the possibility of searching text resources (web, corpus, text) in different languages with a single query. Powerful linguistic tools such as NooJ, though inherently multilingual since resources for ...Ranka Stanković, Duško Vitas, Cvetana Krstev. "The Nooj System as Module within an Integrated Language Processing Environment" in Proceedings of the 2007 International Nooj Conference, Cambridge Scholars Publishing (2008)
-
Frequency and Length of Syllables in Serbian
Marija Radojičić, Biljana Lazić, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Ján Mačutek, Lívia Leššová (2019)Basic analyses of several properties of syllables (the rank-frequency distribution, the distribution of length, and the relation between length and frequency) in Serbian is presented. The syllabification algorithm used combines the maximum onset principle and the sonority hierarchy. Results indicate that syllables behave similarly to words as far as mathematical models are concerned, but values of parameters in models for syllables are quite different from those for words.... onsets and codas. If one follows his modification, a large enough corpus is needed to perform statistical tests, based on which a decision on the (non-) marginality of a particular consonant cluster is made. Finding or creating such a corpus can be problematic for minor languages (such as e.g. Lower and ...
... socialist realist novel “Kak zakalyalas’ stal’” (How the Steel Was Tempered) by N. Ostrovsky. The choice is motivated by the fact that a parallel corpus consisting of the first ten chapters of the novel and their translations to all standard Slavic languages (except for Lower Sorbian) is available ...
... for Croatian), or using the approach suggested by Pulgram (1970) and modified by Lehfeldt (1971), with its drawback of needing a sufficiently large corpus (Kelih & Mačutek, 2013, for Russian and Slovene), or not at all (because the mean syllable length in words was sufficient for the purposes of the ...Marija Radojičić, Biljana Lazić, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Ján Mačutek, Lívia Leššová. "Frequency and Length of Syllables in Serbian" in Glottometrics (2019)
-
Part of Speech Tagging for Serbian language using Natural Language Toolkit
Ranka Stanković, Boro Milovanović (2020) While complex algorithms for NLP (natural language processing) are being developed, basic tasks such as tagging remain very important and still challenging. NLTK (Natural Language Toolkit) is a powerful Python library for developing NLP-based programs. We try to use this library to create a PoS (part-of-speech) tagger for contemporary Serbian. Eleven different models were created using the NLTK tagging API. The best models are transformed with the Brill tagger to improve accuracy. We trained the models on an annotated ...... language data Repository Area) is a project that produced multilingual corpus on law, health and education [10]. Around the world in 80 days is a novel by Jules Verne annotated during SEE-ERA.net project [11]. ELTeC (European Literary Text Collection) is a multilingual collection of the novels written ...
... HLT Group and Jerteh, Lexical resource, 2.0, 2015 [15] A. Balvet, D. Stošić, and A. Miletić, (2014). TALC-Sef a Manually-revised POS-Tagged Literary Corpus in Serbian, English and French. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 4105-4110 ...
... on South Slavic and Balkan Languages”. Scientific results of the SEE-ERA.NET pilot joint call, pp 5, Oct. 2009 [12] Distant Reading for European Literary History, a COST Action funded by the Horizon 2020 Framework. https://www.distant-reading.net/, Mar. 2020 [13] M. d. Marneffe, T. Dozat, N. Silveira ...Ranka Stanković, Boro Milovanović. "Part of Speech Tagging for Serbian language using Natural Language Toolkit" in 7th International Conference on Electrical, Electronic and Computing Engineering IcETRAN 2020, Academic Mind, Belgrade (2020)
-
E-Connecting Balkan Languages
In this paper we present a versatile language processing tool that can be successfully used for many Balkan languages. This tool relies for its work on several sophisticated textual and lexical resources that were developed for most of Balkan languages. These resources are based on several de facto standards in natural language processing. ... 2005-02 de l’Institut Gaspard-Monge, CNRS, 2005. [4] T. Erjavec and N. Ide. The MULTEXT-East Corpus. In LREC’98, Granada, pp. 971-974, 1998. [5] A. Gelbukh, G. Sidorov, J.-A. Vera-Félix. A Bilingual Corpus of Novels Aligned at Paragraph Level. In proc. FinTAL-2006. Lecture Notes in Artificial ...
... compiled, from large corpora usually fully automatically prepared comprising from texts in some limited technical domain [18], to more versatile literary corpora [5] that are often more modest in size but minutely prepared. The main textual resource used to explore WS4LR is Jules Verne’s novel ...
... relation The results obtained by this query are very interesting and show by themselves the potential this tool offers for various linguistic and literary researches. This query retrieved 129 aligned segments, each of which contained at least one of the keywords from the produced query set in at ... Cvetana Krstev, Ranka Stanković, Duško Vitas, Svetla Koeva. "E-Connecting Balkan Languages" in Proceedings of the Workshop on Multilingual resources, technologies and evaluation for Central and Eastern European Languages, 17 September 2009, eds. C. Vertan, S. Piperidis, E. Paskaleva and Milena Slavcheva, Borovets, Bulgaria : Association for Computational Linguistics Stroudsburg, PA, USA (2009)
-
Indexing of textual databases based on lexical resources: A case study for Serbian
In this paper we describe an approach to improvement of information retrieval results for large textual databases by pre-indexing documents using bag-of-words and Named Entity Recognition. The approach was applied on a database of geological projects financed by the Republic of Serbia in the last half century. Each document within this database is described by metadata, consisting of several fields such as title, domain, keywords, abstract, geographical location and the like. A bag of words was produced from these ...... for which we could have used the TreeTagger trained for Serbian that was used for the lemmatization of the Corpus of Contemporary Serbian [16]. However, this lemmatizer was trained on a corpus that differs significantly from our collection, and additionally it does not take into account MWUs. The approach ...
... much as possible [7]. These local grammars were organized in cascades that further resolve ambiguities [10]. NER system was evaluated on a newspaper corpus and results reported in [7] showed that F-measure of recognition was 0.96 for types and 0.92 for tokens. For the purpose of indexing, we applied ...
... Nikolić, V.: The Development of the GeolISSTerm Terminological Dictionary. INFOtheca 12(1), 49a–63a (August 2011) 16. Utvić, M.: Annotating the Corpus of contemporary Serbian. INFOtheca – Journal of Informatics & Librarianship 12(2), 36a–47a (2011) 17. Vossen, P.: EuroWordNet: a multilingual database ... Ranka Stanković, Cvetana Krstev, Ivan Obradović, Olivera Kitanović. "Indexing of textual databases based on lexical resources: A case study for Serbian" in Semantic Keyword-based Search on Structured Data Sources : First COST Action IC1302 International KEYSTONE Conference, IKC 2015, Coimbra, Portugal, September 8-9, 2015. Revised Selected Papers, Springer (2015). https://doi.org/10.1007/978-3-319-27932-9_15
-
Quantitative analysis of syllable properties in Croatian, Serbian, Russian, and Ukrainian
Biljana Rujević, Marija Kaplar, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Jan Mačutek (2021)Biljana Rujević, Marija Kaplar, Sebastijan Kaplar, Ranka Stanković, Ivan Obradović, Jan Mačutek. "Quantitative analysis of syllable properties in Croatian, Serbian, Russian, and Ukrainian" in Language and Text: Data, models, information and applications, John Benjamins Publishing Company (2021). https://doi.org/10.1075/cilt.356.04ruj
-
Речници у дигиталном добу - информатичка подршка за српски језик [Dictionaries in the Digital Age: IT Support for the Serbian Language]
Biljana Rujević (2022) Morphological dictionaries of the Serbian language are an electronic language resource with a significant history of development and use in natural language processing. Because they were stored as files whose number kept growing, which made managing the dictionaries harder, a need arose to store the dictionary information in a lexicographic database. To enable several users to work on dictionary development simultaneously, a need arose for a web application based on the lexicographic database. In order to consider ... Biljana Rujević. Речници у дигиталном добу - информатичка подршка за српски језик, Belgrade : [B. Rujević], 2022
-
Keyword Extraction from Parallel Abstracts of Scientific Publications
... extraction method. The method is based on the structural and statistical properties of text represented as a complex network. The constructed parallel corpus of scientific abstracts with annotated keywords allows a better comparison of the performance of the method across languages since we have the con- ...
... relations as edges (links). The weight of the link is proportional to the overall co-occurrence frequencies of the corresponding word pairs within a corpus. We will focus on the network construction around co-occurrence relations of adjacent words within sentences, since it requires no semantic or syn- ...
... relies on lexical resources for modeling various syntactic structures of multi-word terms. It is applied in several domains, also among them is the corpus of Serbian texts from the geology and mining domain containing more than 600,000 simple word forms. Part of this approach was the automatic elimination ... Slobodan Beliga, Olivera Kitanović, Ranka Stanković, Sanda Martinčić-Ipšić. "Keyword Extraction from Parallel Abstracts of Scientific Publications" in Semantic Keyword-Based Search on Structured Data Sources - Third International KEYSTONE Conference, IKC 2017 Gdańsk, Poland, September 11–12, 2017 Revised Selected Papers and COST Action IC1302 Reports, Springer (2017)
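The network construction described in this entry's snippets (words as nodes, links between adjacent words within a sentence, link weight given by co-occurrence frequency) can be sketched in a few lines. The sketch makes several assumptions that the snippets do not confirm: edges are treated as undirected, tokenization is plain whitespace splitting, and `node_strength` is only a simple stand-in for ranking candidates, not the measure used in the paper.

```python
from collections import Counter


def cooccurrence_network(sentences):
    """Return {(word_a, word_b): weight} for adjacent words within sentences."""
    edges = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for left, right in zip(words, words[1:]):
            if left != right:
                # Assumption: undirected edge, so the pair is stored sorted.
                edges[tuple(sorted((left, right)))] += 1
    return edges


def node_strength(edges):
    """Sum of incident edge weights per word (illustrative ranking signal)."""
    strength = Counter()
    for (a, b), weight in edges.items():
        strength[a] += weight
        strength[b] += weight
    return strength
```

Since only adjacency within sentences is used, the construction needs no lemmatizer or parser, which is what makes it easy to apply to both sides of a parallel corpus.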