Word Tokenization in Python

Last updated on Dec 13 2021
Sankalp Agarwal

Table of Contents

Word Tokenization in Python

Word tokenization is that the process of splitting an outsized sample of text into words. this is often a requirement in tongue processing tasks where each word must be captured and subjected to further analysis like classifying and counting them for a specific sentiment etc. The tongue Tool kit(NLTK) may be a library wont to achieve this. Install NLTK before proceeding with the python program for word tokenization.

conda install -c anaconda nltk

Next we use the word_tokenize method to separate the paragraph into individual words.

import nltk
word_data = "It originated from the thought that there are readers preferring learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print (nltk_tokens)

When we execute the above code, it produces the subsequent result.

[‘It’, ‘originated’, ‘from’, ‘the’, ‘idea’, ‘that’, ‘there’, ‘are’, ‘readers’,

‘who’, ‘prefer’, ‘learning’, ‘new’, ‘skills’, ‘from’, ‘the’,

‘comforts’, ‘of’, ‘their’, ‘drawing’, ‘rooms’]

Tokenizing Sentences

We can also tokenize the sentences during a paragraph like we tokenized the words. We use the tactic sent_tokenize to realize this. Below is an example.

import nltk
sentence_data = "Sun rises within the east. Sun sets within the west."
nltk_tokens = nltk.sent_tokenize(sentence_data)
print (nltk_tokens)

When we execute the above code, it produces the subsequent result.

[‘Sun rises within the east.’, ‘Sun sets within the west.’]

Python – Stemming and Lemmatization

In the areas of tongue Processing we encounter situation where two or more words have a standard root. for instance , the three words – agreed, agreeing and agreeable have an equivalent root agree. an enquiry involving any of those words should treat them because the same word which is that the root . So it becomes essential to link all the words into their root . The NLTK library has methods to try to to this linking and provides the output showing the basis word.

The below program uses the Porter stemmer for stemming.

import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
word_data = "It originated from the thought that there are readers preferring learning new skills from the comforts of their drawing rooms"
# First Word tokenization
nltk_tokens = nltk.word_tokenize(word_data)
#Next find the roots of the word
for w in nltk_tokens:
print "Actual: %s Stem: %s" % (w,porter_stemmer.stem(w))
When we execute the above code, it produces the subsequent result.
Actual: It Stem: It
Actual: originated Stem: origin
Actual: from Stem: from
Actual: the Stem: the
Actual: idea Stem: idea
Actual: that Stem: that
Actual: there Stem: there
Actual: are Stem: are
Actual: readers Stem: reader
Actual: who Stem: who
Actual: prefer Stem: prefer
Actual: learning Stem: learn
Actual: new Stem: new
Actual: skills Stem: skill
Actual: from Stem: from
Actual: the Stem: the
Actual: comforts Stem: comfort
Actual: of Stem: of
Actual: their Stem: their
Actual: drawing Stem: draw
Actual: rooms Stem: room

Lemmatization is analogous ti stemming but it brings context to the words.So it goes a steps further by linking words with similar aiming to one word. for instance if a paragraph has words like cars, trains and automobile, then it’ll link all of them to automobile. within the below program we use the WordNet electronic database for lemmatization.

import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
word_data = "It originated from the thought that there are readers preferring learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
for w in nltk_tokens:
print "Actual: %s Lemma: %s" % (w,wordnet_lemmatizer.lemmatize(w))
When we execute the above code, it produces the subsequent result.
Actual: It Lemma: It
Actual: originated Lemma: originated
Actual: from Lemma: from
Actual: the Lemma: the
Actual: idea Lemma: idea
Actual: that Lemma: that
Actual: there Lemma: there
Actual: are Lemma: are
Actual: readers Lemma: reader
Actual: who Lemma: who
Actual: prefer Lemma: prefer
Actual: learning Lemma: learning
Actual: new Lemma: new
Actual: skills Lemma: skill
Actual: from Lemma: from
Actual: the Lemma: the
Actual: comforts Lemma: comfort
Actual: of Lemma: of
Actual: their Lemma: their
Actual: drawing Lemma: drawing
Actual: rooms Lemma: room

So, this brings us to the end of blog. This Tecklearn ‘Word Tokenization in Python’ blog helps you with commonly asked questions if you are looking out for a job in Python Programming. If you wish to learn Python and build a career in Data Science domain, then check out our interactive, Python with Data Science Training, that comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

https://www.tecklearn.com/course/python-with-data-science/

Python with Data Science Training

About the Course

Python with Data Science training lets you master the concepts of the widely used and powerful programming language, Python. This Python Course will also help you master important Python programming concepts such as data operations, file operations, object-oriented programming and various Python libraries such as Pandas, NumPy, Matplotlib which are essential for Data Science. You will work on real-world projects in the domain of Python and apply it for various domains of Big Data, Data Science and Machine Learning.

Why Should you take Python with Data Science Training?

  • Python is the preferred language for new technologies such as Data Science and Machine Learning.
  • Average salary of Python Certified Developer is $123,656 per annum – Indeed.com
  • Python is by far the most popular language for data science. Python held 65.6% of the data science market.

What you will Learn in this Course?

Introduction to Python

  • Define Python
  • Understand the need for Programming
  • Know why to choose Python over other languages
  • Setup Python environment
  • Understand Various Python concepts – Variables, Data Types Operators, Conditional Statements and Loops
  • Illustrate String formatting
  • Understand Command Line Parameters and Flow control

Python Environment Setup and Essentials

  • Python installation
  • Windows, Mac & Linux distribution for Anaconda Python
  • Deploying Python IDE
  • Basic Python commands, data types, variables, keywords and more

Python language Basic Constructs

  • Looping in Python
  • Data Structures: List, Tuple, Dictionary, Set
  • First Python program
  • Write a Python Function (with and without parameters)
  • Create a member function and a variable
  • Tuple
  • Dictionary
  • Set and Frozen Set
  • Lambda function

OOP (Object Oriented Programming) in Python

  • Object-Oriented Concepts

Working with Modules, Handling Exceptions and File Handling

  • Standard Libraries
  • Modules Used in Python (OS, Sys, Date and Time etc.)
  • The Import statements
  • Module search path
  • Package installation ways
  • Errors and Exception Handling
  • Handling multiple exceptions

Introduction to NumPy

  • Introduction to arrays and matrices
  • Indexing of array, datatypes, broadcasting of array math
  • Standard deviation, Conditional probability
  • Correlation and covariance
  • NumPy Exercise Solution

Introduction to Pandas

  • Pandas for data analysis and machine learning
  • Pandas for data analysis and machine learning Continued
  • Time series analysis
  • Linear regression
  • Logistic Regression
  • ROC Curve
  • Neural Network Implementation
  • K Means Clustering Method

Data Visualisation

  • Matplotlib library
  • Grids, axes, plots
  • Markers, colours, fonts and styling
  • Types of plots – bar graphs, pie charts, histograms
  • Contour plots

Data Manipulation

  • Perform function manipulations on Data objects
  • Perform Concatenation, Merging and Joining on DataFrames
  • Iterate through DataFrames
  • Explore Datasets and extract insights from it

Scikit-Learn for Natural Language Processing

  • What is natural language processing, working with NLP on text data
  • Scikit-Learn for Natural Language Processing
  • The Scikit-Learn machine learning algorithms
  • Sentimental Analysis – Twitter

Introduction to Python for Hadoop

  • Deploying Python coding for MapReduce jobs on Hadoop framework.
  • Python for Apache Spark coding
  • Deploying Spark code with Python
  • Machine learning library of Spark MLlib
  • Deploying Spark MLlib for Classification, Clustering and Regression

Got a question for us? Please mention it in the comments section and we will get back to you.

0 responses on "Word Tokenization in Python"

Leave a Message

Your email address will not be published. Required fields are marked *