Simply put, a word vector is the vector representation of a word, i.e. the representation or embedding of a word as an n-dimensional vector.
A human can very easily understand that the words “pen” and “pencil” are used in the same context, or that they fall into the same category. This is because humans absorb a great deal of knowledge about language through reading and talking, and so understand that pen and pencil belong to the same context. Our objective is to teach a machine that both words are semantically similar. This is where word vectors come in.
In 2013, Mikolov et al. published the paper Efficient Estimation of Word Representations in Vector Space, in which they describe a set of elegant algorithms for obtaining vector representations of words that reflect semantic knowledge. These methods, collectively called word2vec, changed the way NLP researchers deal with text.
The data: word2vec is an unsupervised algorithm (no manual labelling of data is required). The training data for word2vec is nothing but a large amount of text. Word2vec models are often trained on Wikipedia data, as it is a free and open source of large amounts of well-written text.
The training, TL;DR: In layman's terms, word2vec looks at the n words surrounding each word in our data and learns its context.
At the beginning of training, word2vec assigns a random vector to each word in the data, and in each iteration it “pushes” the vector of a word closer to the vectors of the surrounding words.
Consider the two sentences “Batman is a superhero” and “Superman is the best superhero in the world.” Since the vector of each word is “pushed” closer to the vectors of the words surrounding it, even though “Batman” and “Superman” do not occur near each other in our data, their vector representations will have high similarity because they occur in similar contexts across our large corpus of text.
Note: Since I want to focus on the more practical aspects of dealing with text, the above is a very generalized, simple explanation of the word2vec algorithm that does not cover its different variants such as CBOW and Skip-Gram. For an in-depth explanation, please take a look at the following resources: https://www.knime.com/blog/word-embedding-word2vec-explained, http://www.1-4-5.net/~dmm/ml/how_does_word2vec_work.pdf, https://rare-technologies.com/making-sense-of-word2vec/, and https://projector.tensorflow.org/ for visualizing word similarities and the word map generated by word2vec.
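To make the “surrounding n words” idea concrete, here is a toy sketch (not part of the actual word2vec implementation) of how (center, context) training pairs could be generated with a window of 2 words:

```python
# Toy sketch: generate (center, context) pairs with a context window of 2.
# This only illustrates the idea of "surrounding words"; real word2vec then
# learns the word vectors from such pairs with a shallow neural network.
sentence = "batman is a superhero".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs)
# [('batman', 'is'), ('batman', 'a'), ('is', 'batman'), ('is', 'a'), ...]
```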
The vectors generated by word2vec have some very interesting properties. The most obvious property is that the embeddings of semantically similar words have a higher similarity value. Another interesting property is that the vectors of the “important” words in the corpus have a higher magnitude than the vectors of unimportant words or stop words. This is very useful when analysing unstructured text, as we can easily pick out the significant words and their semantic equivalents. See the most significant words of the Harry Potter books here.
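To make “similarity value” and “magnitude” concrete, here is a toy sketch with made-up 3-dimensional vectors (real word vectors typically have 100–300 dimensions and come from a trained model):

```python
import numpy as np

# Made-up toy vectors purely for illustration.
pen    = np.array([0.9, 0.1, 0.3])
pencil = np.array([0.8, 0.2, 0.4])
the    = np.array([0.05, 0.15, -0.05])   # a stop word with a small vector

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(pen, pencil))            # high: semantically similar
print(cosine_similarity(pen, the))               # much lower
print(np.linalg.norm(pen), np.linalg.norm(the))  # "important" words tend to
                                                 # have larger magnitudes
```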
Gensim is a very useful Python library for dealing with word vectors. The following steps show how to install gensim and then train and experiment with a Word2Vec model:
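A minimal sketch of these steps, assuming the raw text is stored in a local file (the filename harry_potter.txt is only an illustration):

```python
# Install the dependencies first: pip install gensim nltk

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec

nltk.download("punkt")  # tokenizer data used by nltk

# Read the raw text (the filename is just an example).
with open("harry_potter.txt", encoding="utf-8") as f:
    raw_text = f.read()

# Split the raw text into sentences, then each sentence into word tokens,
# because gensim's Word2Vec expects an iterable of lists of tokens.
sentences = [word_tokenize(sentence) for sentence in sent_tokenize(raw_text)]

# Train a Word2Vec model with gensim's default parameters.
model = Word2Vec(sentences)

# Experiment with the trained model.
print(model.wv.most_similar("Harry", topn=5))
```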
Here, we used the sentence tokenization function of nltk to split the raw text into sentences and passed the data to the Word2Vec model.
We use the sentence tokenization function instead of Python’s readlines() function because a single English sentence is sometimes spread across multiple lines in a text file. In practice, we need to follow some data cleaning and preprocessing steps to get the best results, which I will cover in my next post.
Here, we used the default parameters provided by the gensim library. See https://radimrehurek.com/gensim/models/word2vec.html for the complete list of parameters, which must be tuned carefully to get the best results for any model.
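For example, a few of the commonly tuned parameters look like this (the values are illustrative, not recommendations, and the parameter names are those of gensim 4.x):

```python
from gensim.models import Word2Vec

# Continuing from the sketch above, where `sentences` was built.
model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality of the word vectors
    window=5,         # the "surrounding n words" used as context
    min_count=5,      # ignore words that appear fewer than 5 times
    sg=1,             # 1 = Skip-Gram, 0 = CBOW (the default)
    workers=4,        # number of training threads
)
```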
After playing with the trained model for some time, you will observe that it does not return a vector for the word “harry”, even though it has a vector for the word “Harry”. Here lies one of the major shortcomings of word2vec: since it is a word-level model, it cannot generate vectors for out-of-vocabulary words, and even though a human can very easily understand that “Harry” and “harry” are the same, this model cannot. In my next post, I will discuss current state-of-the-art models that overcome this, as well as the practical aspects of preparing data for training.
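Continuing with the model trained above, the behaviour looks roughly like this (a sketch; the exact error message may vary across gensim versions):

```python
print(model.wv["Harry"][:5])   # works: "Harry" is in the training vocabulary

try:
    print(model.wv["harry"])   # the lowercase form was never seen in training
except KeyError:
    print("'harry' is out of vocabulary for this model")
```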
Note: To play with word vectors, download the pre-trained word vectors published by Google from here. These are trained on the Google One Billion word corpus and will provide significantly better results than the model we trained above.
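A sketch of loading the downloaded vectors with gensim (the filename below is an assumption; use whatever file you downloaded):

```python
from gensim.models import KeyedVectors

# Load the pre-trained, binary-format vectors (filename is an assumption).
google_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

print(google_vectors.most_similar("pen", topn=5))
print(google_vectors.similarity("pen", "pencil"))
```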