Data Manipulation Using Python Pandas On Bash Command Line
CSV file has become most important format. Lot of the times, as a bash user we come across data in CSV format. Of course with sed,awk and other Linux utilities, we can parse any kind of data. But when it comes to CSV files, Python Pandas is a great library for data manipulation.
In this post, I will talk about a Python utility "Pandashells" to run Pandas data manipulation commands on Bash command line. Let us see how we can do this.
Make sure you you have pip and pandas installed before trying this tutorial.
Install pandashells using pip as shown below.
pip install pandashells
To check if pandashells is installed correctly, Try following command in your bash shell.
p.df -h
For this tutorial, I have downloaded following covid-19 data from the github.
wget https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv
ls -lrt us-states.csv -rw-r--r-- 1 root root 269230 Jul 31 02:55 us-states.csv
Let us look at the first few lines of the csv file.
Read csv file in Bash shell
cat us-states.csv | p.df 'df.head(2)'
"date","state","fips","cases","deaths"
"2020-01-21","Washington",53,1,0
"2020-01-22","Washington",53,1,0
OK, let me explain what happened in the above command. We piped the output of "cat" command to p.df tool. p.df prepares the dataframe. If you don't know about dataframes, I would suggest, do Google and read little about Pandas dataframe. 'df.head(2)' is printing the first two rows of our data.
We can print the above output in prettier tabular form like this...
Print csv file in tabular form in Bash shell
cat us-states.csv | p.df 'df.head(2)' -o table
date state fips cases deaths
2020-01-21 Washington 53 1 0
2020-01-22 Washington 53 1 0
Let us try summarizing the data using Pandas df.describe() data on bash command prompt.
cat us-states.csv | p.df 'df.describe().T' -o table index
count mean std min 25% 50% 75% max
fips 8264.0 31.768030 18.563556 1.0 17.0 31.0 46.0 78.0
cases 8264.0 28859.714303 61065.037822 1.0 736.5 6323.5 29468.5 494175.0
deaths 8264.0 1374.518030 3754.525731 0.0 12.0 180.0 1006.0 32362.0
Group csv data by columns in Bash Shell
We can also group the data by different columns and print on bash command line. Let us try to find out the total covid cases by state.
cat us-states.csv | p.df 'df.groupby("state")["cases"].apply(sum)' 'df.reset_index()' -o table|head -4
state cases
Alabama 3134228
Alaska 108299
Arizona 5780908
See how we are mixing and matching the Pandas and Linux commands. Check out the head -4 at the end, which is a Linux command, reading the output from the Pandashells command.
Sort csv data in Bash Shell
We can also sort the data of CSV file like this.
cat us-states.csv | p.df 'df.sort_values(by="cases",ascending=False)' -o table | head -4
date state fips cases deaths
2020-07-30 California 6 494175 9008
2020-07-29 California 6 487855 8908
2020-07-28 California 6 474951 8716
We can also save the output by piping to file as we usually do in Linux.
cat us-states.csv | p.df -o table | head -4 > test.csv
cat test.csv
date state fips cases deaths
2020-01-21 Washington 53 1 0
2020-01-22 Washington 53 1 0
2020-01-23 Washington 53 1 0
Wrap Up!
In this short tutorial, I have just scratched the surface of what we can do using Pandas commands in Bash shell. There is a lot more than this that one can do to analyze the CSV data using Pandashells.