Speech and language are central to human intelligence, communication, and cognitive processes. Understanding natural language is often viewed as the greatest AI challenge—one that, if solved, could take machines much closer to human intelligence. 

In 2019, Microsoft and Alibaba announced that they had built enhancements to a Google technology that beat humans in a natural language processing (NLP) task called reading comprehension. This news was somewhat obscure, but I considered it a major breakthrough because I remembered what had happened four years earlier.

In 2015, researchers from Microsoft and Google developed systems based on Geoff Hinton’s and Yann LeCun’s inventions that beat humans in image recognition. I predicted at the time that computer vision applications would blossom, and my firm made investments in about a dozen companies building computer-vision applications or products. Today, these products are being deployed in retail, manufacturing, logistics, health care, and transportation. Those investments are now worth over $20 billion.

So in 2019, when I saw the same eclipse of human capabilities in NLP, I anticipated that NLP algorithms would give rise to incredibly accurate speech recognition and machine translation that would one day power a “universal translator” like the one depicted in Star Trek. NLP will also enable brand-new applications, such as a precise question-answering search engine (Larry Page’s grand vision for Google) and targeted content synthesis (making today’s targeted advertising look like child’s play). These could be used in financial, health care, marketing, and consumer applications. Since then, we’ve been busy investing in NLP companies. I believe we may see a greater impact from NLP than from computer vision.

What is the nature of this NLP breakthrough? It’s a technology called self-supervised learning. Prior NLP algorithms required gathering data and painstaking tuning for each domain (like Amazon Alexa, or a customer service chatbot for a bank), which was costly and error-prone. But self-supervised training works on essentially all the data in the world, creating a giant model that may have up to several trillion parameters.

This giant model is trained without human supervision—an AI “self-trains” by figuring out the structure of the language all by itself. Then, when you have some data for a particular domain, you can fine-tune the giant model to that domain and use it for things like machine translation, question answering, and natural dialog. The fine-tuning will selectively take parts of the giant model, and it requires very little adjustment.  This is somewhat akin to how humans first learn a language and then, on that basis, learn specific knowledge or courses. 
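To make the two stages concrete, here is a toy sketch in Python. It is only an illustration under heavy assumptions: it uses a simple word-pair counter rather than a neural network with trillions of parameters, and the names (make_training_pairs, BigramModel, the two tiny corpora, the weight of 5) are invented for this example. What it does share with the real thing is the central idea: the training signal comes from the text itself (each word serves as the "label" for the word before it), and a small amount of domain text then shifts what the general model predicts.

```python
from collections import Counter, defaultdict


def make_training_pairs(text):
    """Turn raw text into (word, next word) pairs.

    No human labeling is involved: the "answer" for each word
    is simply the word that follows it in the text.
    """
    words = text.lower().split()
    return list(zip(words, words[1:]))


class BigramModel:
    """Toy stand-in for a giant language model (hypothetical, for illustration)."""

    def __init__(self):
        # counts[word][next_word] = how often next_word followed word
        self.counts = defaultdict(Counter)

    def train(self, text, weight=1):
        # Self-supervised step: learn from pairs extracted from the raw text.
        for prev, nxt in make_training_pairs(text):
            self.counts[prev][nxt] += weight

    def predict(self, word):
        # Return the most frequently seen follower, or None if the word is unknown.
        followers = self.counts.get(word.lower())
        if not followers:
            return None
        return followers.most_common(1)[0][0]


# Stage 1: "pretraining" on general text (assumed toy corpus).
general_corpus = (
    "the bank of the river was muddy . "
    "the river was wide . the river was cold ."
)
# Stage 2: a small amount of domain text, here from banking (assumed toy corpus).
domain_corpus = "the bank approved the loan . the bank approved the transfer ."

model = BigramModel()
model.train(general_corpus)
before = model.predict("bank")   # prediction from general text only

# "Fine-tuning": continue training on domain text, weighted more heavily.
model.train(domain_corpus, weight=5)
after = model.predict("bank")    # prediction after domain adaptation

print(before, "->", after)
```

In this sketch the general model completes "bank" with a river-related word, and after the domain step it completes it with a banking word. Real fine-tuning adjusts the parameters of a neural network rather than adding weighted counts, so the weighting here is only a rough stand-in for that process.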

Since the 2019 breakthrough, we have seen giant NLP models increase rapidly in size (about 10 times per year), with corresponding performance improvements. We have also seen amazing demonstrations, such as GPT-3, which could write in anybody’s style (such as that of Dr. Seuss); Google LaMDA, which converses naturally in human speech; and a Chinese startup called Langboat, which generates different marketing collateral for each person.

Are we about to crack the natural language problem? Skeptics say these algorithms are merely memorizing the whole world’s data, and are recalling subsets in a clever way, but have no understanding and are not truly intelligent. Central to human intelligence are the abilities to reason, plan, and be creative.