How to Remove NaN Values from Data in Pandas
Let's import the library first...
import pandas as pd
I have the movies dataset, which I downloaded from Kaggle, for this exercise. Let's read the data and look at the first few rows using head(), which returns the first 5 rows by default...
df = pd.read_csv("movies_metadata.csv")
df.head()
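Note that head() shows only the first 5 rows by default; if you want to see more, pass the number of rows you want, for example:
# Show the first 10 rows instead of the default 5
df.head(10)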
Let's find out the names of the columns we have in the data by using df.columns
df.columns
Out[12]:
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
'imdb_id', 'original_language', 'original_title', 'overview',
'popularity', 'poster_path', 'production_companies',
'production_countries', 'release_date', 'revenue', 'runtime',
'spoken_languages', 'status', 'tagline', 'title', 'video',
'vote_average', 'vote_count'],
dtype='object')
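Since this post is about NaN values, it is also worth checking how many missing values each of these columns has. A quick sketch (the exact counts depend on the copy of the dataset you downloaded):
# Count the NaN values in each column
df.isna().sum()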
Let's see how big the data is...
df.size
Out[47]: 1091184
Note that df.size is the total number of cells (rows × columns), not the number of rows. With 24 columns, 1,091,184 cells works out to 45,466 rows.
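If we want the row count directly, instead of deriving it from df.size, here is a small sketch:
len(df)        # number of rows
df.shape       # (rows, columns) tuple
df.shape[0]    # row count again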
Let's do a simple query on the data and find all the rows whose title contains "Toy Story". Here is the query to do that...
df[df.title.str.contains('Toy Story', case=False)]
But I got the following error...
ValueError: cannot index with vector containing NA / NaN values
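The error happens because str.contains() returns NaN, not True or False, wherever title itself is NaN, and pandas refuses to index with a boolean mask that contains NaN. A quick check to confirm this (a sketch; the count depends on your copy of the data):
# Build the mask on its own and count the NaN entries in it
mask = df.title.str.contains('Toy Story', case=False)
mask.isna().sum()   # one NaN per missing title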
How To Fix The Error "cannot index with vector containing NA"
To fix the above error, we can either ignore the NA/NaN values and then run the above command, or remove the NA/NaN values altogether. Let's try the first approach, i.e. ignore the NaN values. The command to do that is the following...
df[df.title.str.contains('Toy Story', case=False) & df.title.notna()]
To find out how many records we get, we can call Python's len() on the resulting DataFrame, which returns its number of rows.
len(df[df.title.str.contains('Toy Story', case=False) & df.title.notna()])
Out[52]:
5
We got 5 rows.
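A slightly shorter way to get the same result is to let str.contains() treat missing titles as non-matches through its na parameter, so the extra notna() check is not needed:
# na=False makes missing titles count as "no match" instead of NaN
df[df.title.str.contains('Toy Story', case=False, na=False)]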
The above method ignores the NaN values in the title column. Alternatively, we can remove all the rows that have NaN values...
How To Drop NA Values Using Pandas dropna()
df1 = df.dropna()
df1.size
Out[46]: 16632
As we can see above, dropna() removes every row that contains at least one NA/NaN value. The size has dropped to 16,632 cells, which with 24 columns is only 693 rows.
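Dropping every row that has any NaN at all is often too aggressive; here it discards rows whose title is perfectly fine just because some other column such as homepage or tagline happens to be empty. If we only care about missing titles, dropna() accepts a subset argument (a sketch, assuming title is the only column we need to be non-null):
# Drop only the rows where 'title' is NaN and keep everything else
df2 = df.dropna(subset=['title'])
len(df2)   # number of rows left after dropping rows with a missing title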