Word Vectors: The Foundation of Natural Language Processing (NLP)

Deep learning has myriad applications. In the context of technical communication, however, natural language processing (NLP) is the most relevant, since it powers the newest generations of machine translation, sentiment analysis, and voice synthesis. I was first introduced to the topic during a fascinating lecture by Dr François Massion at the University of Strasbourg in 2017.

Word vectors – the foundation of NLP

Making linguistic data available to deep neural networks is the first challenge in any application of NLP. Here, NLP practitioners often turn to word vectors. A blog article entitled “The amazing power of word vectors” helped me make sense of some of the more opaquely mathematical concepts that cannot be skipped if you want to understand the field.
The following image represents a highly simplified example of a typical word vector.
[Image: a simplified eight-variable word vector for the word cat]
The word vector representing the word cat consists of eight variables, each carrying some aspect of the word’s total meaning. This hints at how complex the concept of meaning really is. NLP excels at sidestepping the sticky philosophical debates by quantifying meaning and dealing with similarity and difference in terms of geometrical relationships.
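To make this concrete, here is a minimal sketch of what such a vector looks like as data. The dimension labels and values below are invented for illustration; real word vectors are learned automatically, and their individual dimensions are not interpretable in this tidy way.

```python
import numpy as np

# Hypothetical eight-dimensional word vector for "cat".
# The labelled "dimensions" are illustrative only; learned vectors
# do not come with human-readable feature names.
dimensions = ["animal", "furry", "domestic", "size",
              "feline", "aquatic", "vehicle", "abstract"]
cat = np.array([0.9, 0.8, 0.7, 0.3, 0.95, 0.05, 0.0, 0.1])

for name, value in zip(dimensions, cat):
    print(f"{name:>10}: {value}")
```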

Leveraging high-dimensional vector spaces in NLP

In practice, most words can be transformed into word vectors consisting of anything between twenty-five and three hundred variables. Vectors of this type are often referred to as “multidimensional”, with each variable representing one additional dimension of meaning. A vector consisting of three hundred variables is therefore a three-hundred-dimensional vector, and for high-level NLP applications, word vectors can have up to a thousand dimensions.
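For readers who want to poke at real three-hundred-dimensional vectors, pretrained sets such as GloVe can be loaded with the gensim library. This is a hedged sketch rather than a recipe: the model name below is taken from gensim’s downloader catalogue and involves a sizeable download, so treat those specifics as assumptions.

```python
import gensim.downloader as api

# Download and load 300-dimensional GloVe vectors trained on Wikipedia + Gigaword.
# (Model name assumed from gensim's downloader catalogue; large download.)
glove = api.load("glove-wiki-gigaword-300")

vector = glove["cat"]                      # a NumPy array of 300 float values
print(vector.shape)                        # (300,)
print(glove.most_similar("cat", topn=5))   # nearest neighbours in the vector space
```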

To keep large vector calculations tractable, individual variables are typically given small floating-point, or decimal, values, which conserves processing resources and simplifies probability calculations. In practice, the word vector representing cat may look something like the following image.
[Image: the cat word vector expressed as small floating-point values]
Some machine learning operations specify initial conditions, that is, predefined values for the variables. However, some of the most effective unsupervised deep learning structures, most notably deep neural networks, start their iterations with random variable values and gradually narrow them down to within an acceptable range.
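A minimal sketch of that idea, using nothing but NumPy and an invented toy vocabulary: an embedding matrix is initialised with small random floats, and training would then nudge the values until related words end up close together.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

vocabulary = ["cat", "dog", "car", "tree"]   # toy vocabulary for illustration
embedding_dim = 300

# One row of small random floats per word; training would iteratively
# adjust these values towards an acceptable range.
embeddings = rng.normal(loc=0.0, scale=0.1, size=(len(vocabulary), embedding_dim))

print(embeddings.shape)    # (4, 300)
print(embeddings[0][:5])   # first five values of the "cat" row
```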

In general, the more dimensions a word vector occupies, the easier it becomes to find similarities between words. This is why many applications of NLP involve statistical vector operations in high-dimensional vector spaces. This is especially true for older support-vector techniques; however, the same principle still holds for deep neural networks.
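The workhorse of these vector operations is cosine similarity, which compares the angle between two vectors. The sketch below uses invented three-dimensional vectors purely to show the mechanics; real comparisons run over hundreds of dimensions in exactly the same way.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented three-dimensional vectors, just for illustration.
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(cat, dog))  # high: related meanings point the same way
print(cosine_similarity(cat, car))  # low: unrelated meanings diverge
```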

NLP is basically stats… on steroids

Given that human beings have great difficulty visualising objects occupying more than three dimensions, it is not surprising that similarities between vectors are often depicted in graphs of no more than two or three dimensions. When machine learning techniques are applied to sets of word vectors over successive iterations, the distances between related vectors decrease, and words start congregating in clusters. Advanced statistical techniques can then be used to establish relationships between specific word vectors based on the geometrical relationships, such as distances and angles, between them.
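To draw those two- or three-dimensional graphs, the high-dimensional vectors first have to be projected down, for example with principal component analysis (PCA). The following is a rough sketch assuming scikit-learn and two invented clusters of vectors; PCA is just one of several possible projection methods, not necessarily the one used in any particular tool.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=0)

# Two invented clusters of 50-dimensional vectors, standing in for
# "animal" words and "vehicle" words after training.
animals = rng.normal(loc=1.0, scale=0.1, size=(5, 50))
vehicles = rng.normal(loc=-1.0, scale=0.1, size=(5, 50))
vectors = np.vstack([animals, vehicles])

# Project onto the two directions of greatest variance so the vectors
# can be plotted on an ordinary 2-D graph.
projected = PCA(n_components=2).fit_transform(vectors)
print(projected.shape)   # (10, 2)
```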

It’s exactly these statistical techniques that DeepL and Google Translate use to find similarities between words from different languages, making machine translation possible. This is where the real magic happens, and the exact techniques are often closely guarded; however, the principles behind them are in the public domain. I found Stanford University’s Deep Learning for Natural Language Processing course particularly insightful. Please note that the course is very technical in nature, so I would only recommend it to those who are serious about NLP.

It is worth pointing out that word vectors can be extended to incorporate whole clauses, sentences, or even paragraphs. The larger the unit of text, the more complex the behaviours that can be captured, including sentence-level grammatical relationships.
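As a simple (and admittedly crude) illustration of going beyond single words, a sentence vector can be built by averaging the vectors of the words it contains. The gensim model name is again an assumption, and averaging is only the most basic of the available techniques.

```python
import numpy as np
import gensim.downloader as api

# Smaller 50-dimensional GloVe model to keep the example lightweight
# (model name assumed from gensim's downloader catalogue).
glove = api.load("glove-wiki-gigaword-50")

def sentence_vector(sentence: str) -> np.ndarray:
    """Average the vectors of all in-vocabulary words in the sentence."""
    words = [w for w in sentence.lower().split() if w in glove]
    return np.mean([glove[w] for w in words], axis=0)

print(sentence_vector("the cat sat on the mat").shape)   # (50,)
```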
The topics of machine learning and NLP are vast and far, far beyond the scope of a single blog post. More posts on machine learning are set to appear in the months ahead, so stay tuned.

Written by Willem Beckmann
