Andrew Joscelyne talked to Ilan Kernerman, CEO of K Dictionaries, winner of the LT-Innovate Award 2019, about Lexicala, KD’s multi-layer lexicographic resources for multiple languages.

How long has Lexicala existed? Where does it come from? What exactly is your key product/service? And who are your customers? Who are your major competitors?

Lexicala was conceived in 2017 and registered as a trademark on March 1, 2019.
K Dictionaries (KD) introduced this new brand name with the aim of extending beyond dictionaries to lexical data and addressing the Language Technology (LT) community at large.

Lexicala consists of KD’s multi-layer lexicographic resources for multiple languages, in XML, JSON and JSON-LD (RDF) formats, available via a REST API at https://api.lexicala.com/ or delivered directly as data batches.
We target three new user groups: the first is researchers in computational linguistics; the second is developers of mobile apps and word games; the third, and most intriguing, is SMEs and large corporations seeking multilingual lexical data for AI, Semantic Web and Big Data applications.
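
For developers, interaction with the service might be as simple as an HTTP lookup. The sketch below shows how such a call could look from Python; the endpoint path, query parameters and API-key header are illustrative assumptions rather than the documented Lexicala interface.

```python
# Minimal sketch of querying the Lexicala REST API from Python.
# The endpoint path, parameter names and authentication header are
# assumptions for illustration; consult the API documentation for the
# actual interface.
import requests

API_BASE = "https://api.lexicala.com"   # base URL mentioned in the interview
API_KEY = "YOUR_API_KEY"                # hypothetical credential

def search(headword: str, language: str = "en") -> dict:
    """Look up a headword in one language and return the JSON response."""
    response = requests.get(
        f"{API_BASE}/search",                            # assumed endpoint
        params={"text": headword, "language": language}, # assumed parameters
        headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth scheme
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(search("network", language="en"))
```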

Currently our competitors are mainly dictionary APIs, the most prominent of which is Oxford University Press, which has one of the oldest and strongest brands in the dictionary world and whose language selection is nearly as extensive as ours. However, we expect the habit of looking for a Dictionary API to evolve and are designing the Lexicala service as a Lexical Data API that can cater to much broader LT needs besides looking up conventional dictionary content.

Dictionary content could theoretically find a place in any language-based technology application, from translation through NLP to speech tech. Which are the most important application fields today for you and which new domains are likely to open up in the near future?

The most important application fields so far are in Machine Translation, localization, and language learning. We see growing interest in a wide range of Natural Language Processing and Understanding implementations, mainly for information retrieval and knowledge extraction, word sense induction and disambiguation, sentiment analysis, text-to-speech and speech-to-text conversion, and morphological analysis.

Your clients work in the following domains: can you explain how you aid these sectors?
- software and technology companies
- mobile app developers
- digital and print publishers
- online dictionary websites
- language learning providers
- natural language processing integrators

To explain this, I should first go into some detail about our background and methodologies.
Over the years we have shifted our strategy from compiling individual dictionary products to developing a graded linguistic network per language (composed of monolingual, bilingual and multilingual dimensions), enabling cross-linking to other language networks and interoperability with external data sources.

Practically, this means each language is treated as if it were the center of the universe, by deciphering and mapping its DNA, identifying and classifying components, adding relevant semantic and syntactic information, and representing it all comprehensively in smart data structures. All the language sets adhere to the same technical infrastructure and the same editorial framework. As a result, rather than being English-dominated, each language “speaks its own language” under a single coherent global umbrella, and its components can be utilized in diverse ways and adapted to suit targeted audiences, including the addition of translation equivalents in any other language and to as many languages as desired.

This approach applies primarily to our Global series, which was launched in 2005 and so far includes the cores of 25 European and Asian languages, with about a hundred bilingual pairs and numerous multilingual combinations. In addition, we have the Password English learner’s dictionary at the heart of an English multilingual set covering nearly fifty languages, and the Random House Webster’s College Dictionary, which is a legacy source of American English.
The data is originally developed in a fine-grained XML format. In 2014 we began to convert it to RDF for use with linked data technologies, and it is now also compatible with the new state-of-the-art lexicog module of the W3C de facto standard, Ontolex-lemon, in JSON-LD format.
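
To give a rough idea of what the linked-data serialization looks like, here is a toy Python sketch that builds a heavily simplified entry along OntoLex-Lemon lines and prints it as JSON-LD. The structure and identifiers are invented for illustration and do not reproduce KD’s actual data model or the full lexicog module.

```python
# Toy sketch of a heavily simplified lexical entry serialized as JSON-LD
# along OntoLex-Lemon lines (illustrative only; not KD's actual data model).
import json

entry = {
    "@context": {
        "ontolex": "http://www.w3.org/ns/lemon/ontolex#",
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "ex": "http://example.org/",            # hypothetical namespace
    },
    "@id": "ex:network-n",
    "@type": "ontolex:LexicalEntry",
    "ontolex:canonicalForm": {
        "ontolex:writtenRep": {"@value": "network", "@language": "en"}
    },
    "ontolex:sense": [
        {
            "@id": "ex:network-n-sense1",
            "@type": "ontolex:LexicalSense",
            "skos:definition": {
                "@value": "a group of interconnected people or things",
                "@language": "en",
            },
        }
    ],
}

print(json.dumps(entry, indent=2, ensure_ascii=False))
```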

Our customers have traditionally consisted of leading dictionary publishers worldwide, such as, most recently, PONS in Germany, for whom we produced a modern German/Arabic dictionary along with an online version, or Cambridge University Press, which includes dozens of our English bilingual learner’s dictionaries on its website to enhance its offering to foreign learners. Others might incorporate the dictionaries in language learning software, e.g. Vitec MV in the Scandinavian market; in translation services, e.g. Ordbogen.com; as mobile apps, e.g. Paragon; or online, e.g. Reverso, particularly with French bilingual dictionaries, or Naver, for whom we are creating new Korean trilingual dictionaries.
Now, with Lexicala, it is easier for a new generation of software and technology companies to access our content for multiple languages in diverse forms and formats, whether as individual lexical elements – such as alternative scripts, phonetic transcription, inflected forms, sense disambiguators, definitions, examples of usage, multiword units, synonyms, antonyms, subject domain, register, geographical and biographical names, grammatical gender and number, and of course translation equivalents – or by obtaining a comprehensive perspective of a single language, a language pair or a multi-language network. The Lexicala API is also being integrated into two H2020 projects in which we participate, Lynx and Elexis.

Lexicala API is unique in allowing unlimited access to all available resources – the only such service that grants access to multilingual data in a single API call – letting users either select one language or use all available languages for their purpose, without having to perform extra calls or extend user permissions. It also provides flexible search criteria and relies on robust morphological lists as well as automatic stemming tools, which enable searching for inflected forms in addition to the lemma.
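
As a rough illustration of inflection-aware lookup (not the actual Lexicala morphology or API logic), the sketch below first consults a small curated table of inflected forms and falls back to a naive suffix-stripping stemmer:

```python
# Toy illustration of inflection-aware lookup: map an inflected query form
# to its lemma via a curated morphological table, with a naive fallback
# stemmer. Illustrative only; not the actual Lexicala implementation.

MORPH_TABLE = {        # inflected form -> lemma (tiny English sample)
    "networks": "network",
    "ran": "run",
    "running": "run",
}

def naive_stem(form: str) -> str:
    """Very crude fallback: strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if form.endswith(suffix) and len(form) > len(suffix) + 2:
            return form[: -len(suffix)]
    return form

def lemma_for(form: str) -> str:
    """Prefer the curated morphological list, fall back to stemming."""
    return MORPH_TABLE.get(form, naive_stem(form))

print(lemma_for("networks"))  # -> "network" (from the table)
print(lemma_for("walking"))   # -> "walk" (from the fallback stemmer)
```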

How does machine learning aid you, and how do you see its future usefulness?

We use deep learning particularly for word sense disambiguation / entity linking and through neural machine translation services – with the results refined manually by our relevant experts – and plan to extend this to many more tasks concerned with data generation, such as identifying multiword units, detecting neologisms, and finding and correcting punctuation (e.g., how are you >> How are you?).
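
As a toy stand-in for the kind of word sense disambiguation mentioned above (using the classical dictionary-overlap Lesk algorithm from NLTK rather than KD’s deep learning pipeline), the following snippet picks a WordNet sense for an ambiguous word in context:

```python
# Toy word sense disambiguation with NLTK's Lesk algorithm over WordNet.
# A classical dictionary-overlap method, shown only as a stand-in for the
# deep-learning-based disambiguation described in the interview.
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)  # WordNet data is required by lesk()

context = "I went to the bank to deposit my salary".split()
sense = lesk(context, "bank", pos="n")  # returns a WordNet Synset (or None)
print(sense, "-", sense.definition() if sense else "no sense found")
```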

How do you begin on a completely new language? Which languages are you planning next, and how do you see the future of language coverage?

We usually start with two parallel tasks: creating the editorial style guide and the headword list.
The style guide spells out in minute detail the language-specific editorial concept and regulations for entry compilation, which must also adhere to generic linguistic rules. The language-specific guidelines are compiled by the editor-in-chief and specify the entry structure, the types of elements to be included for words of different grammatical categories, the approach to multiword units (MWUs), the type of usage examples, etc.; the general guidelines are compiled in-house and concern two main topics:

- sense indicators: a set of semantic elements whose role is to point out the intended meaning in the most concise way. These are intended for use in dictionaries for native/proficient speakers of the source language, who do not need definitions and can make do with just ‘hints’ at the appropriate meaning.
- treatment of problematic terms: this group covers both informal terms and any term whose very use might cause negative feelings. These guidelines aim to confine the use of such terms by describing the specific contexts in which they may be used and stating the rules for labeling them as such.

The wordlist is compiled by auto-generating a frequency-based list of candidate headwords (with their parts of speech). For that list to include the most commonly used and important words in a given language, it is necessary to rely on large text corpora, so we have developed an in-house tool that connects to a corpus query system to extract such words. We are currently looking into developing new capabilities to help with the automatic detection of MWUs in corpora, for inclusion in the corpus-derived frequency list.
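
A minimal sketch of such frequency-based candidate extraction is shown below; it simply counts word forms in a raw text sample, whereas the in-house tool queries a full corpus system, and POS assignment and lemmatization happen in the editorial review described next.

```python
# Minimal sketch of building a frequency-based headword candidate list
# from raw text (an illustrative stand-in for KD's corpus-query tooling;
# POS tagging and lemmatization are handled later in the workflow).
import re
from collections import Counter

def candidate_headwords(text: str, top_n: int = 1000) -> list[tuple[str, int]]:
    """Return the top_n most frequent word forms with their counts."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # letter sequences only
    return Counter(tokens).most_common(top_n)

sample = "The network connects networks of people. The people run the network."
for form, count in candidate_headwords(sample, top_n=5):
    print(f"{form}\t{count}")
```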

The candidate list is reviewed by the editor-in-chief, who lemmatizes the items, amends the POS and spots redundancies (typos, words from other languages, etc.). The selected headwords and POS are then turned into ‘skeleton’ entries, meaning we add the necessary empty tags for the editors to fill in. Then the compilation of entries begins from scratch.
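
A hedged sketch of that skeleton-generation step follows; the XML element names are invented for illustration and do not reflect KD’s actual schema.

```python
# Sketch of generating empty "skeleton" entries for editors to fill in.
# Element names are invented for illustration, not KD's actual schema.
import xml.etree.ElementTree as ET

def skeleton_entry(headword: str, pos: str) -> ET.Element:
    """Build one entry with headword and POS filled in and empty slots."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "pos").text = pos
    sense = ET.SubElement(entry, "sense")
    ET.SubElement(sense, "indicator")   # sense disambiguator, left empty
    ET.SubElement(sense, "definition")  # to be compiled by the editor
    ET.SubElement(sense, "example")     # usage example, left empty
    return entry

root = ET.Element("entries")
for hw, pos in [("network", "noun"), ("run", "verb")]:
    root.append(skeleton_entry(hw, pos))

ET.indent(root)  # pretty-print (Python 3.9+)
print(ET.tostring(root, encoding="unicode"))
```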

Our current resources already cover a good number of languages, so at present we focus rather on updating and upgrading these existing language cores and their multilingual sets. Initiating work on a new language usually depends on specific projects with our partners, although occasionally we might take on a new language through a student internship as part of our cooperation with some universities – at the moment it seems that the next language in this context will be Valencian.

We put an emphasis on reinforcing our automatic data generation processes and minimizing manual editing and crafting by human experts, in order to widen and deepen the scope and coverage of our resources and to expand the cross-linking of different language networks in-house and with external resources.

Can you tell us what benefits you expect from membership in LT-Innovate?

LTI has played a major role in the evolution of the LT industry over the last decade, particularly in Europe, and we expect membership to improve our understanding of this domain, help us to network and establish useful contacts with other stakeholders, and enhance our awareness of current trends and our preparation for upcoming ones, such as the emphasis on AI.