It’s January again, named after the two-faced god that looks both backwards and forwards. Rather than simply reviewing a year’s list of industry events (we have collected them all here), let’s look more closely at the challenges ahead. We’ll need to address at least three if we want to grasp technology innovation and market opportunities in 2018 and beyond.

1. NLP as a service

The key challenge facing the still-young LT industry is that traditional baseline language technology (aka NLP) has now been outsourced, especially over the past few months, to cloud platforms owned by the major web corporates (GAFAM) or to open-source distributors. This has commoditised NLP (or more broadly NLU – natural language understanding covering speech, text and multilingual knowledge). As a result, in-domain start-ups in almost any field, from news production through eDiscovery to language teaching, can now build innovative solutions using NLP far more quickly, cheaply and easily than in the recent past.

Google (Parsey McParseface), Amazon (Comprehend and Lex), and Microsoft (Cognitive Services) have all packaged various NLP components or chatbot kits into paid SaaS offerings, drawing on their huge hardware resources to process data for users as needed. At the same time, GitHub and similar organisations offer a wide range of NL software that anyone can download to equip a start-up in legal- or fin-tech, marketing, healthcare, or consumer gadgets using chatbot, sentiment analysis, smart search, summarization, generation or translation resources.

In due course, we can expect similar offerings from Chinese web majors or a Russian company such as Yandex, possibly promoting specific language expertise. And Indian firms will soon corner the market in multilingual business and consumer applications for the subcontinent.

In the case of translation, the massive increase in research over the last 12 months and the availability of various neural machine-translation software packages have led to feverish conversations in the translation industry on the “unreasonable effectiveness” of engines driven by machine learning, and on the future role of human translators. More on this later in another blog.

For the LT industry, this commoditisation could be a benefit for some, since agile (new) players can rapidly make use of new software advances to innovate around a new product, but a threat to many: more domain-specific start-ups will tend to compete for the same pool of end-customers, and may fail to evolve into robust companies. Moreover, these commoditised packages will need to stay open enough to scale to the vastness of language practices around the world, with their many languages and different modalities.

Overall, then, the challenge will be to step up industry innovation to a much richer level than basic NLP packages. This will enable more suppliers to compete more effectively on less generic, higher-value use-cases and domains. And in the B2B marketplace as a whole, there is plenty of interest in “AI” solutions involving scaled-up NLP in a broad range of new business intelligence and robotic process automation applications.

2. Don’t forget data bias!

The new wave of open-source software at a minimal cost only serves to underline the critical role played by data in the new machine-learning paradigm. Democratise your software by all means, but you’ll end up creating a new market greedy for the data needed to build products and solutions.

The whole AI trend is of course largely inspired by the fairly sudden realization that big data could be a strategic and competitive asset for alert businesses. This explains the rapid expansion over the last five years of machine-learning technology, whose roots in fact go back decades. In a nutshell, data enables the software to learn from patterns, and what it learns is how to statistically predict certain outcomes. But there’s a constant risk of “biased data in, biased learning out” affecting all ML outcomes. Does this matter in the LT field?

For language applications, data tends to be formal features: phonemes, letters, words and phrases. For the software, these items do not constitute meanings with semantic values that can be used for identification, inference and argument. So in this sense, any bias in language data will have little critical or moral impact on the predicted outcomes in NLP applications.

There is, however, a subtler but deeper bias in data collection in the LT industry: it can be found in the fact that the available data may be incapable of covering the inherent complexity of language structures in use.

This refers to the different levels (literary, formal, slang, etc.), discourse types (long form, technical, journalism, social media, conversational, etc.) and modalities (spoken, code-switched, phone, radio chat, signing, subtitles, etc.) of any natural language relevant to a digital use-case. To misquote Wittgenstein, the limits of my language data are the limits of my conversational/analytic/translation app. This gap will need rapid repair if we want to build the rich, adaptable and intelligent products and services that people need.
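One rough way to make this coverage gap visible is to count how many words in a new text were never seen in the data an application was built from. The sketch below is purely illustrative: the tokenizer is deliberately crude, and the function names and toy sentences are invented for the example rather than taken from any real product or corpus.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def oov_rate(training_texts, new_text):
    """Share of tokens in new_text that never occur in training_texts."""
    vocabulary = {tok for text in training_texts for tok in tokenize(text)}
    tokens = tokenize(new_text)
    if not tokens:
        return 0.0
    unseen = [tok for tok in tokens if tok not in vocabulary]
    return len(unseen) / len(tokens)

# Hypothetical toy data: the "training" material covers only a formal register.
formal_corpus = [
    "We would like to confirm the delivery of your order.",
    "Please contact our support team if you have a question.",
]
formal_query = "Please confirm the delivery of your order."
casual_query = "yo where's my stuff lol"

formal_gap = oov_rate(formal_corpus, formal_query)  # every word already seen
casual_gap = oov_rate(formal_corpus, casual_query)  # every word is new
print(formal_gap, casual_gap)
```

On this toy data the formal request is fully covered while the casual one is not covered at all, even though both ask about the same order. Real systems would of course need far more than a word count to measure coverage of registers and modalities.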

So chasing both the long tail of languages and the complex spread of individual linguistic “styles” will have to be the next major project for the industry. This will be all the more necessary as target content becomes increasingly personalised, and also because user language apps will need to be better able to interpret conversations socially, chronologically (scaling from very old to very young speakers) and psychologically (emotion, sentiment, body state, communication strategy).

By using information about these kinds of data, applications should be able to reach out, learn and predict actions and reactions from a much fuller range of human expression, understand what is being said or intended in a given communication event, and then be able to adapt language-based services more closely to speakers’ and listeners’ needs. This kind of intelligence will require far more than standard machine-learning techniques using word vectors!

3. Can we parse semantics?

Indeed, the elephant in the room when we talk about language and machines is always going to be how we characterise or process meaning. So far, data science has tried to avoid the question of how language components mean things; instead it has focused on word patterns and their distributional frequencies rather than the intensional properties of words and expressions.
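To illustrate what “word patterns and their distributional frequencies” amounts to in practice, here is a minimal sketch that describes each word only by the words appearing near it, then compares words by how similar those neighbourhoods are. The three sentences, the window size and the function names are assumptions made up for this example; production word-vector systems are far more elaborate.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words found within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy corpus, invented for illustration only.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the bank approves loans",
]
vectors = cooccurrence_vectors(toy_corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_bank = cosine(vectors["cat"], vectors["bank"])
print(sim_cat_dog, sim_cat_bank)
```

Here “cat” comes out closer to “dog” than to “bank” simply because the first two share more neighbouring words. Nothing in the counts says what any of these words mean, which is exactly the limitation at issue.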

There is, of course, a range of tools in the web’s “language layer”, from WordNet to BabelNet, that can be used to disambiguate words and assign appropriate semantic tags or links to items to help in search, translation, summarisation and related applications. There are also several companies devoted to developing knowledge graphs and other semantic aids to improve business (or even military) intelligence. But so far, no one has given any useful insight into how the machine learning of data properties can systematically interpret or help express nuances of meaning, or into how we can draw on digital semantic knowledge to improve machine-learnt language output.

There are some fascinating recent experiments which might point to getting a network to learn enough about related bodies of data to talk about the emergence of “concepts” underlying them. But as a rule, such networks cannot tell us anything about sentence or utterance meanings, or whether a pun could make a snappy advertising slogan.

So we need more research into the data/semantics interface, and how machine learning can help innovate around new models of meaning production. The aim will be for machines to somehow learn (what humans understand about) the meanings of language elements and use this to predict potential responses to, and other implications of, what is said or written. That sounds like a full agenda for 2018 and no doubt many more years to come!

PS. In a second blog, we shall look at two specific markets for language technology and how they could evolve.

by Andrew Joscelyne, Senior Advisor, LT-Innovate