Handling Excel and Binary Files in R

Last updated on Dec 13 2021
Abhinav Prakash

Table of Contents

Handling Excel and Binary Files in R

Microsoft Excel is the most widely used spreadsheet program which stores data in the .xls or .xlsx format. R can read directly from these files using some excel specific packages. Few such packages are – XLConnect, xlsx, gdata etc. We will be using xlsx package. R can also write into excel file using this package.

Install xlsx Package

You can use the following command in the R console to install the “xlsx” package. It may ask to install some additional packages on which this package is dependent. Follow the same command with required package name to install the additional packages.

install.packages(“xlsx”)

Verify and Load the “xlsx” Package

Use the following command to verify and load the “xlsx” package.

# Verify the package is installed.

any(grepl("xlsx",installed.packages()))

# Load the library into R workspace.

library("xlsx")

When the script is run we get the following output.

[1] TRUE

Loading required package: rJava

Loading required package: methods

Loading required package: xlsxjars

Input as xlsx File

Open Microsoft excel. Copy and paste the following data in the work sheet named as sheet1.

id            name      salary    start_date               dept

1             Rick             623.3    1/1/2012                IT

2             Dan       515.2     9/23/2013    Operations

3             Michelle  611             11/15/2014      IT

4             Ryan           729          5/11/2014           HR

5             Gary            43.25     3/27/2015           Finance

6             Nina            578       5/21/2013               IT

7             Simon         632.8    7/30/2013              Operations

8             Guru           722.5    6/17/2014              Finance

Also copy and paste the following data to another worksheet and rename this worksheet to “city”.

name          city

Rick             Seattle

Dan       Tampa

Michelle  Chicago

Ryan           Seattle

Gary           Houston

Nina           Boston

Simon         Mumbai

Guru           Dallas

Save the Excel file as “input.xlsx”. You should save it in the current working directory of the R workspace.

Reading the Excel File

The input.xlsx is read by using the read.xlsx() function as shown below. The result is stored as a data frame in the R environment.

# Read the first worksheet in the file input.xlsx.

data <- read.xlsx("input.xlsx", sheetIndex = 1)

print(data)

When we execute the above code, it produces the following result −

id,   name,     salary,   start_date,   dept

1      1    Rick      623.30    2012-01-01    IT

2      2    Dan       515.20    2013-09-23    Operations

3      3    Michelle  611.00    2014-11-15    IT

4      4    Ryan      729.00    2014-05-11    HR

5     NA    Gary      843.25    2015-03-27    Finance

6      6    Nina      578.00    2013-05-21    IT

7      7    Simon     632.80    2013-07-30    Operations

8      8    Guru      722.50    2014-06-17    Finance

  R – Binary Files

A binary file is a file that contains information stored only in form of bits and bytes.(0’s and 1’s). They are not human readable as the bytes in it translate to characters and symbols which contain many other non-printable characters. Attempting to read a binary file using any text editor will show characters like Ø and ð.

The binary file has to be read by specific programs to be useable. For example, the binary file of a Microsoft Word program can be read to a human readable form only by the Word program. Which indicates that, besides the human readable text, there is a lot more information like formatting of characters and page numbers etc., which are also stored along with alphanumeric characters. And finally a binary file is a continuous sequence of bytes. The line break we see in a text file is a character joining first line to the next.

Sometimes, the data generated by other programs are required to be processed by R as a binary file. Also R is required to create binary files which can be shared with other programs.

R has two functions WriteBin() and readBin() to create and read binary files.

Syntax

writeBin(object, con)

readBin(con, what, n )

Following is the description of the parameters used −

  • con is the connection object to read or write the binary file.
  • object is the binary file which to be written.
  • what is the mode like character, integer etc. representing the bytes to be read.
  • n is the number of bytes to read from the binary file.

Example

We consider the R inbuilt data “mtcars”. First we create a csv file from it and convert it to a binary file and store it as a OS file. Next we read this binary file created into R.

Writing the Binary File

We read the data frame “mtcars” as a csv file and then write it as a binary file to the OS.

# Read the "mtcars" data frame as a csv file and store only the columns

"cyl", "am" and "gear".

write.table(mtcars, file = "mtcars.csv",row.names = FALSE, na = "",

col.names = TRUE, sep = ",")




# Store 5 records from the csv file as a new data frame.

new.mtcars <- read.table("mtcars.csv",sep = ",",header = TRUE,nrows = 5)

# Create a connection object to write the binary file using mode "wb".

write.filename = file("/web/com/binmtcars.dat", "wb")

# Write the column names of the data frame to the connection object.

writeBin(colnames(new.mtcars), write.filename)

# Write the records in each of the column to the file.

writeBin(c(new.mtcars$cyl,new.mtcars$am,new.mtcars$gear), write.filename)

# Close the file for writing so that it can be read by other program.

close(write.filename)

Reading the Binary File

The binary file created above stores all the data as continuous bytes. So we will read it by choosing appropriate values of column names as well as the column values.

# Create a connection object to read the file in binary mode using "rb".
read.filename <- file("/web/com/binmtcars.dat", "rb")
# First read the column names. n = 3 as we have 3 columns.
column.names <- readBin(read.filename, character(),  n = 3)

# Next read the column values. n = 18 as we have 3 column names and 15 values.
read.filename <- file("/web/com/binmtcars.dat", "rb")
bindata <- readBin(read.filename, integer(),  n = 18)

# Print the data.
print(bindata)

# Read the values from 4th byte to 8th byte which represents "cyl".
cyldata = bindata[4:8]
print(cyldata)

# Read the values form 9th byte to 13th byte which represents "am".
amdata = bindata[9:13]
print(amdata)

# Read the values form 9th byte to 13th byte which represents "gear".
geardata = bindata[14:18]
print(geardata)

# Combine all the read values to a dat frame.
finaldata = cbind(cyldata, amdata, geardata)
colnames(finaldata) = column.names
print(finaldata)
When we execute the above code, it produces the following result and chart −

[1]    7108963 1728081249    7496037          6          6          4
[7]          6          8          1          1          1          0
[13]          0          4          4          4          3          3


[1] 6 6 4 6 8
[1] 1 1 1 0 0
[1] 4 4 4 3 3

cyl am gear
[1,]   6  1    4
[2,]   6  1    4
[3,]   4  1    4
[4,]   6  0    3
[5,]   8  0    3

As we can see, we got the original data back by reading the binary file in R.

So, this brings us to the end of blog. This Tecklearn ‘Handling Excel and Binary Files in R’ blog helps you with commonly asked questions if you are looking out for a job in Data Science. If you wish to learn R Language and build a career in Data Science domain, then check out our interactive, Data Science using R Language Training, that comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

https://www.tecklearn.com/course/data-science-training-using-r-language/

Data Science using R Language Training

About the Course

Tecklearn’s Data Science using R Language Training develops knowledge and skills to visualize, transform, and model data in R language. It helps you to master the Data Science with R concepts such as data visualization, data manipulation, machine learning algorithms, charts, hypothesis testing, etc. through industry use cases, and real-time examples. Data Science course certification training lets you master data analysis, R statistical computing, connecting R with Hadoop framework, Machine Learning algorithms, time-series analysis, K-Means Clustering, Naïve Bayes, business analytics and more. This course will help you gain hands-on experience in deploying Recommender using R, Evaluation, Data Transformation etc.

Why Should you take Data Science Using R Training?

  • The Average salary of a Data Scientist in R is $123k per annum – Glassdoor.com
  • A recent market study shows that the Data Analytics Market is expected to grow at a CAGR of 30.08% from 2020 to 2023, which would equate to $77.6 billion.
  • IBM, Amazon, Apple, Google, Facebook, Microsoft, Oracle & other MNCs worldwide are using data science for their Data analysis.

What you will Learn in this Course?

Introduction to Data Science

  • Need for Data Science
  • What is Data Science
  • Life Cycle of Data Science
  • Applications of Data Science
  • Introduction to Big Data
  • Introduction to Machine Learning
  • Introduction to Deep Learning
  • Introduction to R&R-Studio
  • Project Based Data Science

Introduction to R

  • Introduction to R
  • Data Exploration
  • Operators in R
  • Inbuilt Functions in R
  • Flow Control Statements & User Defined Functions
  • Data Structures in R

Data Manipulation

  • Need for Data Manipulation
  • Introduction to dplyr package
  • Select (), filter(), mutate(), sample_n(), sample_frac() & count() functions
  • Getting summarized results with the summarise() function,
  • Combining different functions with the pipe operator
  • Implementing sql like operations with sqldf()

Visualization of Data

  • Loading different types of datasets in R
  • Arranging the data
  • Plotting the graphs

Introduction to Statistics

  • Types of Data
  • Probability
  • Correlation and Co-variance
  • Hypothesis Testing
  • Standardization and Normalization

Introduction to Machine Learning

  • What is Machine Learning?
  • Machine Learning Use-Cases
  • Machine Learning Process Flow
  • Machine Learning Categories
  • Supervised Learning algorithm: Linear Regression and Logistic Regression

Logistic Regression

  • Intro to Logistic Regression
  • Simple Logistic Regression in R
  • Multiple Logistic Regression in R
  • Confusion Matrix
  • ROC Curve

Classification Techniques

  • What are classification and its use cases?
  • What is Decision Tree?
  • Algorithm for Decision Tree Induction
  • Creating a Perfect Decision Tree
  • Confusion Matrix
  • What is Random Forest?
  • What is Naive Bayes?
  • Support Vector Machine: Classification

Decision Tree

  • Decision Tree in R
  • Information Gain
  • Gini Index
  • Pruning

Recommender Engines

  • What is Association Rules & its use cases?
  • What is Recommendation Engine & it’s working?
  • Types of Recommendations
  • User-Based Recommendation
  • Item-Based Recommendation
  • Difference: User-Based and Item-Based Recommendation
  • Recommendation use cases

Time Series Analysis

  • What is Time Series data?
  • Time Series variables
  • Different components of Time Series data
  • Visualize the data to identify Time Series Components
  • Implement ARIMA model for forecasting
  • Exponential smoothing models
  • Identifying different time series scenario based on which different Exponential Smoothing model can be applied

Got a question for us? Please mention it in the comments section and we will get back to you.

 

0 responses on "Handling Excel and Binary Files in R"

Leave a Message

Your email address will not be published. Required fields are marked *