Miloš Jakubíček

Lexical Computing

Biography

Miloš Jakubíček is the Chief Executive Officer (CEO) of Lexical Computing, a research company working in the area of language technologies, primarily at the intersection of corpus linguistics, computational linguistics and computer lexicography. By profession, he is an NLP researcher and software engineer. His research focuses mainly on two fields: effective processing of very large text corpora and the parsing of morphologically rich languages. Since 2008, Miloš has been involved in the development of Lexical Computing’s flagship product, the Sketch Engine corpus management suite. Since 2011, he has been Director of the Czech branch of Lexical Computing, leading the local Sketch Engine development team, and he became CEO of Lexical Computing in 2014. Miloš is also a fellow of the NLP Centre at Masaryk University, where his interests lie mainly in morphosyntactic analysis and its practical applications.

OneClick Terms: extracting terminology out of the Sketch Engine box

Terminology extraction has been part of Sketch Engine since 2014 (see Kilgarriff et al., 2014). It was based on a contrastive approach and implemented as a corpus function. While the corpus-based approach was a big advantage in terms of performance, it also required users to understand the concept of corpora and corpus building. To ease adoption of the technology, we developed terminology extraction as a standalone product, OneClick Terms, which works out of the box (in this case, out of the Sketch Engine box) and spares users the need to build a corpus. In addition, the new product offers bilingual terminology extraction from unaligned documents (i.e. mere translations), which is not part of Sketch Engine, and a set of term grammars for selected languages that have been revised to increase their coverage.
In the talk, I will explain the motivation behind OneClick Terms, the methodology and NLP techniques used for both mono- and bilingual term extraction and alignment, and finally discuss the evidence-based development of the language-specific term grammars used by the system.
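
To illustrate the contrastive approach mentioned in the abstract, here is a minimal Python sketch that ranks candidate terms by how much more frequent they are in a domain (focus) text than in a general reference text. It assumes a smoothed ratio of per-million frequencies as the contrast measure. The function names, the smoothing value and the toy counts are hypothetical and are not taken from Sketch Engine or OneClick Terms.

```python
def per_million(count, total):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / total


def contrastive_scores(focus_counts, focus_size, ref_counts, ref_size, smoothing=1.0):
    """Score each focus candidate by a smoothed frequency ratio (assumed measure).

    The smoothing constant avoids division by zero for candidates that are
    absent from the reference text and dampens very rare items.
    """
    scores = {}
    for term, count in focus_counts.items():
        focus_fpm = per_million(count, focus_size)
        ref_fpm = per_million(ref_counts.get(term, 0), ref_size)
        scores[term] = (focus_fpm + smoothing) / (ref_fpm + smoothing)
    return scores


def top_terms(scores, n=3):
    """Return the n highest-scoring candidates, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


# Toy, made-up counts: a 10,000-token domain text vs. a 1,000,000-token reference.
focus_counts = {"term extraction": 40, "language": 50}
ref_counts = {"term extraction": 2, "language": 5000}

scores = contrastive_scores(focus_counts, 10_000, ref_counts, 1_000_000)
best = top_terms(scores, n=1)
```

In this toy setting, "language" is equally frequent per million tokens in both texts and so scores 1.0, while "term extraction" is far more frequent in the domain text and ranks first. In a real system the candidates would first be restricted to phrases matching a language-specific term grammar.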