101 Practical Natural Language Processing

#NLP #tech

Machine Learning, Published On: 27 August 2018

Natural Language Processing is the art of manipulating unstructured, free-form text. To work on almost all well-known NLP tasks such as Sentiment Analysis, Semantic Search, Named Entity Recognition, etc., a basic understanding of word vectors is required. This post focuses on the practical aspects of word vectors, i.e.:

  1. What is a word vector?
  2. Why are word vectors significant?
  3. How do we get word vectors?

What is a word vector?

Simply put, a word vector is the vector representation of a word, i.e. the representation or embedding of a word as an n-dimensional vector of numbers.
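For instance, a toy illustration (the numbers are made up and not the output of any real model):

```python
# A word embedded as a 5-dimensional vector (values invented for illustration only).
pen = [0.12, -0.48, 0.91, 0.03, -0.27]
```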

Why are they significant?

A human can very easily understand that the words “pen” and “pencil” are used in the same context, or that they fall in the same category. This is because humans absorb a lot of knowledge of the language through reading and talking, and so understand that both the words pen and pencil belong in the same context. Our objective now is to teach a machine that both words are semantically similar. This is where word vectors come in.

In 2013, Mikolov et al. published a paper, Efficient Estimation of Word Representations in Vector Space, in which they describe a set of elegant algorithms to obtain vector representations of words that reflect semantic knowledge. These methods are collectively called word2vec, and they changed the way NLP researchers deal with text.

How does word2vec work?

The data: word2vec is an unsupervised algorithm (as in, no manual labelling of data is required). The training data for word2vec is nothing but a large amount of text. In general, word2vec models are trained on Wikipedia data, as it is a free and open source of well-written text in large quantities.

The training, TL;DR: To explain in layman’s terms, word2vec looks at the n words surrounding each word present in our data and learns its context.

At the beginning of training, word2vec assigns a random vector to each word in the data, and in each iteration it “pushes” the vector of a word closer to the vectors of the surrounding words.

Consider the two sentences “Batman is a superhero” and “Superman is the best superhero in the world.” Since the vector of each word is “pushed” closer to the vectors of the words surrounding it, even though “Batman” and “Superman” don’t occur near each other in our data, their vector representations will have higher similarity, because they occur in similar contexts across our large corpus of text.
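To make the idea of “surrounding n words” concrete, here is a toy sketch (plain Python, not part of gensim) that prints the context window of each word in the second sentence, assuming a window size of 2:

```python
# Toy illustration of the context windows word2vec learns from (window size 2 is an assumption).
sentence = "Superman is the best superhero in the world".lower().split()
window = 2

for i, word in enumerate(sentence):
    # Words up to `window` positions to the left and right form the context.
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(f"{word:>10} -> {context}")
```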

Note: Since I want to focus on the more practical aspects of dealing with text, the above is a very generalized, simple explanation of the word2vec algorithm that does not cover the different types of word2vec models such as CBOW and Skip-Gram. For an in-depth explanation, please take a look at the following resources: https://www.knime.com/blog/word-embedding-word2vec-explained, http://www.1-4-5.net/~dmm/ml/how_does_word2vec_work.pdf, and https://rare-technologies.com/making-sense-of-word2vec/. See https://projector.tensorflow.org/ for visualizing the word similarities and the word map generated by word2vec.

Properties of word embeddings generated by word2vec:

The vectors generated by word2vec have some very interesting properties. The most obvious is that the embeddings of semantically similar words have a higher similarity value. Another interesting property is that the vectors corresponding to the “important” words in the corpus have a high magnitude compared to the vectors of unimportant or stop words. This is very useful while analysing unstructured text, as we can easily pick out the significant words and their semantic equivalents. See the most significant words of the Harry Potter books here.
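A concrete way to check both properties, as a minimal sketch: it assumes model is a gensim Word2Vec model (trained as in the tutorial below) and that the example words appear in its vocabulary.

```python
import numpy as np

# Property 1: semantically similar words get a higher similarity score.
print(model.wv.similarity('wand', 'broomstick'))

# Property 2: "important" corpus words tend to have vectors with a larger
# magnitude (norm) than stop words.
print(np.linalg.norm(model.wv['wand']), np.linalg.norm(model.wv['the']))
```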

Experimenting with word2vec:

Gensim is a very useful Python library for dealing with word vectors. The following steps show how to install gensim and then train and experiment with a Word2Vec model:

  1. First, we need some text data to train the model on. Let’s use the text from Order of the Phoenix. You can download it from here. Note that even though this is an entire book, it is still considered a very small amount of text, and the results will not be as good as those of pre-trained word2vec models trained on a large amount of text, such as the English Wikipedia.
  2. Now, let’s start by downloading the data.

    This small tutorial needs the gensim and nltk packages, so make sure you install them before starting.

    After installing nltk, please run its download step (shown in the sketch after this list) in a Python terminal to fetch nltk’s required data files.
  3. After this is done, run the remaining lines of the sketch one by one to train a basic Word2Vec model on the Harry Potter book.
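A minimal sketch of these steps (both packages can be installed with pip, e.g. pip install gensim nltk; the file name order_of_phoenix.txt below is just a placeholder for wherever you saved the book text):

```python
import nltk
from gensim.models import Word2Vec

# Download the data files needed by nltk's sentence and word tokenizers.
nltk.download('punkt')

# Read the raw book text (the file name is a placeholder -- point it at your download).
with open('order_of_phoenix.txt', encoding='utf-8') as f:
    raw_text = f.read()

# Split the raw text into sentences, then split each sentence into word tokens.
sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(raw_text)]

# Train a Word2Vec model with gensim's default parameters.
model = Word2Vec(sentences)

# Words that occur in similar contexts should now have similar vectors.
print(model.wv.most_similar('Harry'))
```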

Here, we used nltk’s sentence tokenization function to split the raw text into sentences and passed the tokenized sentences to the Word2Vec model.

We use the sentence tokenization function instead of Python’s readlines() because a single English sentence is sometimes spread across multiple lines of a text file. In practice, we also need to follow some data cleaning and preprocessing steps to get the best results, which I will be covering in my next post.
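A small illustration of the difference (the example text is made up):

```python
from nltk import sent_tokenize

# One sentence broken across two lines in the file.
text = "Harry stared at the\nletter again. He could not believe it."

print(text.split("\n"))     # a line-based split cuts the first sentence in half
print(sent_tokenize(text))  # sentence tokenization keeps each sentence whole
```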

Here, we used the default parameters provided by the gensim library. See https://radimrehurek.com/gensim/models/word2vec.html for the complete list of parameters, which must be tuned carefully to get the best results for any model.
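For instance, a few of the commonly tuned parameters can be set explicitly. This is a sketch only; the parameter names below are from gensim 3.x, where size and iter correspond to vector_size and epochs in gensim 4.x.

```python
model = Word2Vec(
    sentences,
    size=100,     # dimensionality of the word vectors
    window=5,     # number of surrounding words treated as context
    min_count=2,  # ignore words that appear fewer than 2 times
    sg=1,         # 1 = Skip-Gram, 0 = CBOW
    iter=10,      # number of passes over the corpus
)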

After playing with the trained model for some time, you will observe that it doesn’t give out any vector for the word “harry”, even though it has a vector for the word “Harry”. Here lies one of the major shortcomings of word2vec: since it is a word-level model, it can’t generate vectors for out-of-vocabulary words, and even though a human can very easily understand that ‘Harry’ and ‘harry’ are the same word, this model can’t. In my next post, I will discuss current state-of-the-art models that overcome this, as well as the practical aspects of preparing the data for training.
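You can see the shortcoming directly (a sketch, assuming model is the model trained above and that ‘harry’ never appears in lowercase in the book text):

```python
print(model.wv['Harry'][:5])   # first few dimensions of the vector for 'Harry'

try:
    model.wv['harry']
except KeyError:
    print("'harry' is out of vocabulary for this model")
```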

Note: For playing with word vectors, download the pre-trained word vectors released by Google from here. These are trained on the Google One Billion word corpus and will provide significantly better results than the model we trained above.
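Since the original download link is not preserved here, the sketch below assumes the commonly distributed GoogleNews-vectors-negative300.bin file and loads it with gensim’s KeyedVectors (the query words are illustrative and assumed to be in the vocabulary):

```python
from gensim.models import KeyedVectors

# Load the pre-trained vectors (binary word2vec format; adjust the path to your download).
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

print(vectors.most_similar('pencil'))
print(vectors.similarity('pen', 'pencil'))
```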

Dinesh Murali

Lead-Technology

Software engineer by job and adventure seeker by nature. I thrive on developing awesome applications. When not working, I love being in nature and exploring the great outdoors.
