Processing CSV, JSON and XLS Data in Python

Last updated on Jan 22 2023
Prabhas Ramanathan

Reading data from CSV(comma separated values) is a fundamental necessity in Data Science. Often, we get data from various sources which can get exported to CSV format so that they can be used by other systems. The Panadas library provides features using which we can read the CSV file in full as well as in parts for only a selected group of columns and rows.

Table of Contents

Input as CSV File

The csv file is a text file in which the values in the columns are separated by a comma. Let’s consider the following data present in the file named input.csv.
You can create this file using windows notepad by copying and pasting this data. Save the file as input.csv using the save As All files(*.*) option in notepad.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Tusar,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578,2013-05-21,IT
7,Pranab,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance

Reading a CSV File

The read_csv function of the pandas library is used read the content of a CSV file into the python environment as a pandas DataFrame. The function can read the files from the OS by using proper path to the file.
import pandas as pd
data = pd.read_csv(‘path/input.csv’)
print (data)
When we execute the above code, it produces the following result. Please note how an additional column starting with zero as a index has been created by the function.
id name salary start_date dept
0 1 Rick 623.30 2012-01-01 IT
1 2 Dan 515.20 2013-09-23 Operations
2 3 Tusar 611.00 2014-11-15 IT
3 4 Ryan 729.00 2014-05-11 HR
4 5 Gary 843.25 2015-03-27 Finance
5 6 Rasmi 578.00 2013-05-21 IT
6 7 Pranab 632.80 2013-07-30 Operations
7 8 Guru 722.50 2014-06-17 Finance

Reading Specific Rows

The read_csv function of the pandas library can also be used to read some specific rows for a given column. We slice the result from the read_csv function using the code shown below for first 5 rows for the column named salary.
import pandas as pd
data = pd.read_csv(‘path/input.csv’)

# Slice the result for first 5 rows
print (data[0:5][‘salary’])
When we execute the above code, it produces the following result.
0 623.30
1 515.20
2 611.00
3 729.00
4 843.25
Name: salary, dtype: float64

Reading Specific Columns

The read_csv function of the pandas library can also be used to read some specific columns. We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for all the rows.
import pandas as pd
data = pd.read_csv(‘path/input.csv’)

# Use the multi-axes indexing funtion
print (data.loc[:,[‘salary’,’name’]])
When we execute the above code, it produces the following result.
salary name
0 623.30 Rick
1 515.20 Dan
2 611.00 Tusar
3 729.00 Ryan
4 843.25 Gary
5 578.00 Rasmi
6 632.80 Pranab
7 722.50 Guru

Reading Specific Columns and Rows

The read_csv function of the pandas library can also be used to read some specific columns and specific rows. We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for some of the rows.
import pandas as pd
data = pd.read_csv(‘path/input.csv’)

# Use the multi-axes indexing funtion
print (data.loc[[1,3,5],[‘salary’,’name’]])
When we execute the above code, it produces the following result.
salary name
1 515.2 Dan
3 729.0 Ryan
5 578.0 Rasmi

Reading Specific Columns for a Range of Rows

The read_csv function of the pandas library can also be used to read some specific columns and a range of rows. We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for some of the rows.
import pandas as pd
data = pd.read_csv(‘path/input.csv’)

# Use the multi-axes indexing funtion
print (data.loc[2:6,[‘salary’,’name’]])
When we execute the above code, it produces the following result.
salary name
2 611.00 Tusar
3 729.00 Ryan
4 843.25 Gary
5 578.00 Rasmi
6 632.80 Pranab

Python – Processing JSON Data

JSON file stores data as text in human-readable format. JSON stands for JavaScript Object Notation. Pandas can read JSON files using the read_json function.
Input Data
Create a JSON file by copying the below data into a text editor like notepad. Save the file with .json extension and choosing the file type as all files(*.*).

{ 
"ID":["1","2","3","4","5","6","7","8" ],
"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ]
"Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],

"StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}

Read the JSON File

The read_json function of the pandas library can be used to read the JSON file into a pandas DataFrame.
import pandas as pd

data = pd.read_json(‘path/input.json’)
print (data)
When we execute the above code, it produces the following result.
Dept ID Name Salary StartDate
0 IT 1 Rick 623.30 1/1/2012
1 Operations 2 Dan 515.20 9/23/2013
2 IT 3 Tusar 611.00 11/15/2014
3 HR 4 Ryan 729.00 5/11/2014
4 Finance 5 Gary 843.25 3/27/2015
5 IT 6 Rasmi 578.00 5/21/2013
6 Operations 7 Pranab 632.80 7/30/2013
7 Finance 8 Guru 722.50 6/17/2014

Reading Specific Columns and Rows

Similar to what we have already seen in the previous chapter to read the CSV file, the read_json function of the pandas library can also be used to read some specific columns and specific rows after the JSON file is read to a DataFrame. We use the multi-axes indexing method called .loc() for this purpose. We choose to display the Salary and Name column for some of the rows.
import pandas as pd
data = pd.read_json(‘path/input.xlsx’)

# Use the multi-axes indexing funtion
print (data.loc[[1,3,5],[‘salary’,’name’]])
When we execute the above code, it produces the following result.
salary name
1 515.2 Dan
3 729.0 Ryan
5 578.0 Rasmi

Reading JSON file as Records

We can also apply the to_json function along with parameters to read the JSON file content into individual records.
import pandas as pd
data = pd.read_json(‘path/input.xlsx’)

 

print(data.to_json(orient='records', lines=True))
When we execute the above code, it produces the following result.
{"Dept":"IT","ID":1,"Name":"Rick","Salary":623.3,"StartDate":"1\/1\/2012"}
{"Dept":"Operations","ID":2,"Name":"Dan","Salary":515.2,"StartDate":"9\/23\/2013"}
{"Dept":"IT","ID":3,"Name":"Tusar","Salary":611.0,"StartDate":"11\/15\/2014"}
{"Dept":"HR","ID":4,"Name":"Ryan","Salary":729.0,"StartDate":"5\/11\/2014"}
{"Dept":"Finance","ID":5,"Name":"Gary","Salary":843.25,"StartDate":"3\/27\/2015"}
{"Dept":"IT","ID":6,"Name":"Rasmi","Salary":578.0,"StartDate":"5\/21\/2013"}
{"Dept":"Operations","ID":7,"Name":"Pranab","Salary":632.8,"StartDate":"7\/30\/2013"}
{"Dept":"Finance","ID":8,"Name":"Guru","Salary":722.5,"StartDate":"6\/17\/2014"}

 

Python – Processing XLS Data

Microsoft Excel is a very widely used spread sheet program. Its user friendliness and appealing features makes it a very frequently used tool in Data Science. The Panadas library provides features using which we can read the Excel file in full as well as in parts for only a selected group of Data. We can also read an Excel file with multiple sheets in it. We use the read_excel function to read the data from it.

Input as Excel File

We Create an excel file with multiple sheets in the windows OS. The Data in the different sheets is as shown below.
You can create this file using the Excel Program in windows OS. Save the file as input.xlsx.
# Data in Sheet1

id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Tusar,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578,2013-05-21,IT
7,Pranab,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance

# Data in Sheet2

id name zipcode
1 Rick 301224
2 Dan 341255
3 Tusar 297704
4 Ryan 216650
5 Gary 438700
6 Rasmi 665100
7 Pranab 341211
8 Guru 347480

Reading an Excel File

The read_excel function of the pandas library is used read the content of an Excel file into the python environment as a pandas DataFrame. The function can read the files from the OS by using proper path to the file. By default, the function will read Sheet1.
import pandas as pd
data = pd.read_excel(‘path/input.xlsx’)
print (data)
When we execute the above code, it produces the following result. Please note how an additional column starting with zero as a index has been created by the function.
id name salary start_date dept
0 1 Rick 623.30 2012-01-01 IT
1 2 Dan 515.20 2013-09-23 Operations
2 3 Tusar 611.00 2014-11-15 IT
3 4 Ryan 729.00 2014-05-11 HR
4 5 Gary 843.25 2015-03-27 Finance
5 6 Rasmi 578.00 2013-05-21 IT
6 7 Pranab 632.80 2013-07-30 Operations
7 8 Guru 722.50 2014-06-17 Finance

Reading Specific Columns and Rows

Similar to what we have already seen in the previous chapter to read the CSV file, the read_excel function of the pandas library can also be used to read some specific columns and specific rows. We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for some of the rows.
import pandas as pd
data = pd.read_excel(‘path/input.xlsx’)

# Use the multi-axes indexing funtion
print (data.loc[[1,3,5],[‘salary’,’name’]])
When we execute the above code, it produces the following result.
salary name
1 515.2 Dan
3 729.0 Ryan
5 578.0 Rasmi

Reading Multiple Excel Sheets

Multiple sheets with different Data formats can also be read by using read_excel function with help of a wrapper class named ExcelFile. It will read the multiple sheets into memory only once. In the below example we read sheet1 and sheet2 into two data frames and print them out individually.
import pandas as pd
with pd.ExcelFile(‘C:/Users/Rasmi/Documents/pydatasci/input.xlsx’) as xls:
df1 = pd.read_excel(xls, ‘Sheet1’)
df2 = pd.read_excel(xls, ‘Sheet2’)

print(“****Result Sheet 1****”)
print (df1[0:5][‘salary’])
print(“”)
print(“***Result Sheet 2****”)
print (df2[0:5][‘zipcode’])
When we execute the above code, it produces the following result.
****Result Sheet 1****
0 623.30
1 515.20
2 611.00
3 729.00
4 843.25
Name: salary, dtype: float64

***Result Sheet 2****
0 301224
1 341255
2 297704
3 216650
4 438700
Name: zipcode, dtype: int64

So, this brings us to the end of blog. This Tecklearn ‘Processing CSV, JSON and XLS Data in Python’ blog helps you with commonly asked questions if you are looking out for a job in Python Programming. If you wish to learn Python and build a career in Python Programming domain, then check out our interactive, Python with Data Science Training, that comes with 24*7 support to guide you throughout your learning period.

Python with Data Science Training

About the Course

Python with Data Science training lets you master the concepts of the widely used and powerful programming language, Python. This Python Course will also help you master important Python programming concepts such as data operations, file operations, object-oriented programming and various Python libraries such as Pandas, NumPy, Matplotlib which are essential for Data Science. You will work on real-world projects in the domain of Python and apply it for various domains of Big Data, Data Science and Machine Learning.

Why Should you take Python with Data Science Training?

• Python is the preferred language for new technologies such as Data Science and Machine Learning.
• Average salary of Python Certified Developer is $123,656 per annum – Indeed.com
• Python is by far the most popular language for data science. Python held 65.6% of the data science market.

What you will Learn in this Course?

Introduction to Python

• Define Python
• Understand the need for Programming
• Know why to choose Python over other languages
• Setup Python environment
• Understand Various Python concepts – Variables, Data Types Operators, Conditional Statements and Loops
• Illustrate String formatting
• Understand Command Line Parameters and Flow control

 

Python Environment Setup and Essentials

• Python installation
• Windows, Mac & Linux distribution for Anaconda Python
• Deploying Python IDE
• Basic Python commands, data types, variables, keywords and more

Python language Basic Constructs

• Looping in Python
• Data Structures: List, Tuple, Dictionary, Set
• First Python program
• Write a Python Function (with and without parameters)
• Create a member function and a variable
• Tuple
• Dictionary
• Set and Frozen Set
• Lambda function

OOP (Object Oriented Programming) in Python

• Object-Oriented Concepts

Working with Modules, Handling Exceptions and File Handling

• Standard Libraries
• Modules Used in Python (OS, Sys, Date and Time etc.)
• The Import statements
• Module search path
• Package installation ways
• Errors and Exception Handling
• Handling multiple exceptions

Introduction to NumPy

• Introduction to arrays and matrices
• Indexing of array, datatypes, broadcasting of array math
• Standard deviation, Conditional probability
• Correlation and covariance
• NumPy Exercise Solution

Introduction to Pandas

• Pandas for data analysis and machine learning
• Pandas for data analysis and machine learning Continued
• Time series analysis
• Linear regression
• Logistic Regression
• ROC Curve
• Neural Network Implementation
• K Means Clustering Method

Data Visualisation

• Matplotlib library
• Grids, axes, plots
• Markers, colours, fonts and styling
• Types of plots – bar graphs, pie charts, histograms
• Contour plots

Data Manipulation

• Perform function manipulations on Data objects
• Perform Concatenation, Merging and Joining on DataFrames
• Iterate through DataFrames
• Explore Datasets and extract insights from it

Scikit-Learn for Natural Language Processing

• What is natural language processing, working with NLP on text data
• Scikit-Learn for Natural Language Processing
• The Scikit-Learn machine learning algorithms
• Sentimental Analysis – Twitter

Introduction to Python for Hadoop

• Deploying Python coding for MapReduce jobs on Hadoop framework.
• Python for Apache Spark coding
• Deploying Spark code with Python
• Machine learning library of Spark MLlib
• Deploying Spark MLlib for Classification, Clustering and Regression

Got a question for us? Please mention it in the comments section and we will get back to you.

0 responses on "Processing CSV, JSON and XLS Data in Python"

Leave a Message

Your email address will not be published. Required fields are marked *