Data Science and Data Processing

Pandas DataFrame (Python): 10 useful tricks

10 basic tricks to make your pandas life a bit easier


Pandas is a powerful open-source data analysis and manipulation tool built on top of the Python programming language. In this article, I will show 10 pandas DataFrame tricks that make common programming tasks a bit easier.

Of course, before we can use pandas, we have to import it by using the following command:

import pandas as pd

1. Select multiple rows and columns using .loc

countries = pd.DataFrame({
    'country': ['United States', 'The Netherlands', 'Spain', 'Mexico', 'Australia'],
    'capital': ['Washington D.C.', 'Amsterdam', 'Madrid', 'Mexico City', 'Canberra'],
    'continent': ['North America', 'Europe', 'Europe', 'North America', 'Australia'],
    'language': ['English', 'Dutch', 'Spanish', 'Spanish', 'English']})

By using the loc operator, we can select subsets of rows and columns on the basis of their index labels and column names. Note that loc slices are label-based and include both endpoints, so 0:2 selects rows 0, 1 and 2. Below are some examples of how to use the loc operator on the ‘countries’ DataFrame:

countries.loc[:, 'country':'continent']
countries.loc[0:2, 'country':'continent']
countries.loc[[0, 4], ['country', 'language']]
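
The difference between label-based and position-based selection is easy to miss, so here is a minimal sketch that contrasts loc with iloc on the same data. The variable names by_label and by_position are only illustrative, and iloc is not used elsewhere in this article:

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['United States', 'The Netherlands', 'Spain', 'Mexico', 'Australia'],
    'capital': ['Washington D.C.', 'Amsterdam', 'Madrid', 'Mexico City', 'Canberra'],
    'continent': ['North America', 'Europe', 'Europe', 'North America', 'Australia'],
    'language': ['English', 'Dutch', 'Spanish', 'Spanish', 'English']})

# loc slices by label and includes both endpoints: rows 0, 1 and 2
by_label = countries.loc[0:2, 'country':'continent']

# iloc slices by position and excludes the stop: rows 0 and 1, columns 0 to 2
by_position = countries.iloc[0:2, 0:3]

print(by_label.shape)     # (3, 3)
print(by_position.shape)  # (2, 3)
```

With the default integer index the labels and positions happen to coincide, which is why the same slice 0:2 returns a different number of rows for the two operators.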

2. Filter DataFrames by category

In many cases, we may want to consider only the data points that belong to one particular category, or to a selection of categories. For a single category, we can do this with the == operator. For multiple categories, we make use of the isin function:

countries[countries.continent == 'Europe']
countries[countries.language.isin(['Dutch', 'English'])]

3. Filter DataFrames by excluding categories

As opposed to filtering by category, we may want to filter our DataFrame by excluding certain categories. We do this with the ~ (tilde) sign, which negates a boolean mask element by element. Example usage:

countries[~countries.continent.isin(['Europe'])]
countries[~countries.language.isin(['Dutch', 'English'])]
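
To see why the tilde works, it can help to store the filter condition in a variable first. The sketch below uses a shortened version of the countries DataFrame, and the names mask, included and excluded are only illustrative:

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['United States', 'The Netherlands', 'Spain', 'Mexico', 'Australia'],
    'language': ['English', 'Dutch', 'Spanish', 'Spanish', 'English']})

# isin returns a boolean Series with one True/False value per row
mask = countries['language'].isin(['Dutch', 'English'])
print(mask.tolist())  # [True, True, False, False, True]

# The mask keeps the matching rows, its negation keeps all the others
included = countries[mask]
excluded = countries[~mask]
print(excluded['country'].tolist())  # ['Spain', 'Mexico']
```

Every row ends up in exactly one of the two results, so included and excluded together contain the whole DataFrame.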

4. Rename columns

You might want to change the name of certain columns because, for example, the name is incorrect or incomplete. Here, we change the ‘capital’ column name to ‘capital_city’ and ‘language’ to ‘most_spoken_language’. We can do this in the following way:

countries.rename({'capital': 'capital_city', 'language': 'most_spoken_language'}, axis='columns')

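Note that rename returns a new DataFrame and leaves ‘countries’ itself unchanged. Alternatively, we can overwrite all column names in place by assigning a complete list: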

countries.columns = ['country', 'capital_city', 'continent', 'most_spoken_language']
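
A common pitfall is to call rename and then expect the original DataFrame to have changed. The following sketch, on a shortened two-column DataFrame and with the illustrative name renamed, shows that the result has to be assigned to a variable:

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Spain', 'Mexico'],
    'capital': ['Madrid', 'Mexico City']})

# rename returns a new DataFrame; the original keeps its column names
renamed = countries.rename(columns={'capital': 'capital_city'})

print(list(countries.columns))  # ['country', 'capital']
print(list(renamed.columns))    # ['country', 'capital_city']
```

Passing the mapping through the columns keyword is equivalent to passing it together with axis='columns'.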

5. Reverse row order

To reverse the row order, we use the loc operator with a slice step of -1. This works in the following way:

countries.loc[::-1]

However, note that the index still follows the previous ordering. We have to use the reset_index function to renumber the rows:

countries.loc[::-1].reset_index(drop=True)
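
The effect on the index is easier to see on a tiny example. This sketch uses a hypothetical three-row DataFrame with a single ‘letter’ column and prints the index before and after the reset:

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b', 'c']})

# Reversing the rows keeps each row's original index label
reversed_rows = df.loc[::-1]
print(reversed_rows.index.tolist())  # [2, 1, 0]

# drop=True discards the old labels instead of adding them as a column
renumbered = reversed_rows.reset_index(drop=True)
print(renumbered.index.tolist())      # [0, 1, 2]
print(renumbered['letter'].tolist())  # ['c', 'b', 'a']
```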

6. Reverse column order

Reversing the column order works in a similar way as for the rows:

countries.loc[:, ::-1]

7. Split a DataFrame into two random subsets

In some cases, we want to split a DataFrame into two random subsets, for example when creating a training set and a test set out of the whole data set. For this, we make use of the sample function. Below, frac=0.6 draws a random 60% of the rows, random_state makes the draw reproducible, and drop removes the sampled rows to obtain the remainder:

countries_1 = countries.sample(frac=0.6, random_state=999)
countries_2 = countries.drop(countries_1.index)
countries_1
countries_2
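
As a quick sanity check, the sketch below repeats the split on a hypothetical ten-row DataFrame and confirms that the two parts do not overlap. The names data, part_1 and part_2 are only illustrative:

```python
import pandas as pd

data = pd.DataFrame({'value': range(10)})

# Draw 60% of the rows at random; random_state fixes the random draw
part_1 = data.sample(frac=0.6, random_state=999)

# Everything that was not sampled ends up in the second part
part_2 = data.drop(part_1.index)

print(len(part_1), len(part_2))  # 6 4

# No index label appears in both parts
print(part_1.index.intersection(part_2.index).empty)  # True
```

This approach relies on the index labels being unique. With duplicate labels, drop would remove every row that shares a sampled label.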

8. Create dummy variables

students = pd.DataFrame({
    'name': ['Ben', 'Tina', 'John', 'Eric'],
    'gender': ['male', 'female', 'male', 'male']})

We might want to convert categorical variables into dummy/indicator variables. We can do so by making use of the get_dummies function:

pd.get_dummies(students)

To get rid of the redundant columns, we have to add drop_first=True:

pd.get_dummies(students, drop_first=True)
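
By default, get_dummies encodes every text column, including ‘name’. If only some columns should be converted, one option is the columns parameter. The sketch below is a variation on the call above, not the article's original code, and the name encoded is only illustrative:

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Ben', 'Tina', 'John', 'Eric'],
    'gender': ['male', 'female', 'male', 'male']})

# Encode only 'gender' and leave 'name' untouched.
# drop_first=True drops the first category ('female'), so one column remains.
encoded = pd.get_dummies(students, columns=['gender'], drop_first=True)

print(list(encoded.columns))  # ['name', 'gender_male']
```

The data type of the indicator column depends on the pandas version (boolean in recent releases, integer in older ones), so cast it explicitly if a specific type is needed.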

9. Check equality of columns

When the goal is to check whether two columns are equal, one might at first think of the == operator, since it is the usual way to test equality. However, == compares element by element and treats NaN as unequal to NaN, so we use the equals function here. It returns a single boolean and considers NaN values in the same position to be equal. This goes as follows:

df = pd.DataFrame({'col_1': [1, 0], 'col_2': [0, 1], 'col_3': [1, 0]})
df['col_1'].equals(df['col_2'])

>>> False

df['col_1'].equals(df['col_3'])

>>> True
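
The example above contains no missing values, so it does not show the NaN behaviour. Here is a minimal sketch with a hypothetical DataFrame df_nan in which both columns hold a NaN in the same position:

```python
import pandas as pd

nan = float('nan')
df_nan = pd.DataFrame({'col_1': [1.0, nan], 'col_2': [1.0, nan]})

# == compares element by element, and NaN == NaN evaluates to False
elementwise = df_nan['col_1'] == df_nan['col_2']
print(elementwise.tolist())  # [True, False]

# equals treats NaN values in the same position as equal
print(df_nan['col_1'].equals(df_nan['col_2']))  # True
```

Keep in mind that equals also compares data types, so an integer column and a float column with the same values are not considered equal.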

10. Concatenate DataFrames

We might want to combine two DataFrames into one DataFrame that contains all data points. This can be achieved by using the concat function:

df_1 = pd.DataFrame({'col_1': [6, 7, 8], 'col_2': [1, 2, 3], 'col_3': [5, 6, 7]})
pd.concat([df, df_1]).reset_index(drop=True)
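
The index reset can also be done inside concat itself through its ignore_index parameter. The sketch below rebuilds both DataFrames so that it runs on its own; the name combined is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'col_1': [1, 0], 'col_2': [0, 1], 'col_3': [1, 0]})
df_1 = pd.DataFrame({'col_1': [6, 7, 8], 'col_2': [1, 2, 3], 'col_3': [5, 6, 7]})

# ignore_index=True renumbers the rows from 0 while concatenating
combined = pd.concat([df, df_1], ignore_index=True)

print(combined.index.tolist())      # [0, 1, 2, 3, 4]
print(combined['col_1'].tolist())   # [1, 0, 6, 7, 8]
```

Without either option, the result would keep the original labels 0, 1, 0, 1, 2, which makes label-based selection with loc ambiguous.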

Thanks for reading!

I hope this article helped you in some way, and I wish you good luck on your next project when making use of Pandas :).
