Data Manipulation Using Python Pandas On Bash Command Line

CSV file has become most important format. Lot of the times, as a bash user we come across data in CSV format. Of course with sed,awk and other Linux utilities, we can parse any kind of data. But when it comes to CSV files, Python Pandas is a great library for data manipulation.

In this post, I will talk about a Python utility "Pandashells" to run Pandas data manipulation commands on Bash command line. Let us see how we can do this.

Make sure you you have pip and pandas installed before trying this tutorial.

Install pandashells using pip as shown below.

pip install pandashells

To check if pandashells is installed correctly, Try following command in your bash shell.

p.df -h

For this tutorial, I have downloaded following covid-19 data from the github.

wget https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv

ls -lrt us-states.csv
-rw-r--r-- 1 root root 269230 Jul 31 02:55 us-states.csv

Let us look at the first few lines of the csv file.

Read csv file in Bash shell

cat us-states.csv | p.df 'df.head(2)'
"date","state","fips","cases","deaths"
"2020-01-21","Washington",53,1,0
"2020-01-22","Washington",53,1,0

OK, let me explain what happened in the above command. We piped the output of "cat" command to p.df tool. p.df prepares the dataframe. If you don't know about dataframes, I would suggest, do Google and read little about Pandas dataframe. 'df.head(2)' is printing the first two rows of our data.

We can print the above output in prettier tabular form like this...

Print csv file in tabular form in Bash shell

cat us-states.csv | p.df 'df.head(2)' -o table
       date       state  fips  cases  deaths
 2020-01-21  Washington    53      1       0
 2020-01-22  Washington    53      1       0

Let us try summarizing the data using Pandas df.describe() data on bash command prompt.

cat us-states.csv | p.df 'df.describe().T' -o table index
         count          mean           std  min    25%     50%      75%       max
fips    8264.0     31.768030     18.563556  1.0   17.0    31.0     46.0      78.0
cases   8264.0  28859.714303  61065.037822  1.0  736.5  6323.5  29468.5  494175.0
deaths  8264.0   1374.518030   3754.525731  0.0   12.0   180.0   1006.0   32362.0

Group csv data by columns in Bash Shell

We can also group the data by different columns and print on bash command line. Let us try to find out the total covid cases by state.

cat us-states.csv | p.df 'df.groupby("state")["cases"].apply(sum)' 'df.reset_index()' -o table|head -4
                    state     cases
                  Alabama   3134228
                   Alaska    108299
                  Arizona   5780908

See how we are mixing and matching the Pandas and Linux commands. Check out the head -4 at the end, which is a Linux command, reading the output from the Pandashells command.

Sort csv data in Bash Shell

We can also sort the data of CSV file like this.

cat us-states.csv | p.df 'df.sort_values(by="cases",ascending=False)' -o table | head -4
       date                     state  fips   cases  deaths
 2020-07-30                California     6  494175    9008
 2020-07-29                California     6  487855    8908
 2020-07-28                California     6  474951    8716

We can also save the output by piping to file as we usually do in Linux.

cat us-states.csv | p.df -o table | head -4 > test.csv

cat test.csv
       date                     state  fips   cases  deaths
 2020-01-21                Washington    53       1       0
 2020-01-22                Washington    53       1       0
 2020-01-23                Washington    53       1       0

Wrap Up!

In this short tutorial, I have just scratched the surface of what we can do using Pandas commands in Bash shell. There is a lot more than this that one can do to analyze the CSV data using Pandashells.