Please Stop Doing These 5 Things in Pandas
These mistakes are super common and super easy to fix.
As someone who spent over a decade in development before moving into Data Science, there are a lot of mistakes I see data scientists make while using Pandas. The good news is that they're really easy to avoid, and fixing them can also make your code more readable.
Mistake 1: Getting or Setting Values Slowly
It’s nobody’s fault that there are way too many ways to get and set values in Pandas. In some situations, you have to find a value using only an index or find the index using only the value. However, in many cases, you’ll have many different ways of selecting data at your disposal: index, value, label, etc.
In those situations, I prefer to use whatever is fastest. Here are some common choices from slowest to fastest; going by the timings below (22.3 s down to 254 ms), you could be missing out on a nearly 90x speedup!
Tests were run using a DataFrame of 20,000 rows. Here’s the notebook if you want to run it yourself.
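The snippets below assume a setup roughly like this; df_size, profile, and the fake data here are illustrative stand-ins for the notebook's actual setup, not the original code:
import pandas as pd

# A 20,000-row DataFrame of fake profile data (illustrative only)
df_size = 20_000
profile = {'address': '123 Main St', 'job': 'Engineer', 'company': 'Acme'}
df = pd.DataFrame([profile] * df_size)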
# .at - 22.3 seconds
for i in range(df_size):
    df.at[i] = profile
# Wall time: 22.3 s

# .iloc - 15% faster than .at
for i in range(df_size):
    df.iloc[i] = profile
# Wall time: 19.1 s

# .loc - 30% faster than .at
for i in range(df_size):
    df.loc[i] = profile
# Wall time: 16.5 s

# .iat - doesn't work for replacing multiple columns of data.
# Fast, but not comparable since I'm only replacing one column.
# (Note: df.iloc[i] can return a copy, so this chained assignment
# may not actually modify df.)
for i in range(df_size):
    df.iloc[i].iat[0] = profile['address']
# Wall time: 3.46 s

# .values / .to_numpy() - roughly 88x faster than .at
for i in range(df_size):
    df.values[i] = profile
# Recommend using to_numpy() instead if you have Pandas 1.0+
# df.to_numpy()[i] = profile
# Wall time: 254 ms
(As Alex Bruening and miraculixx noted in the comments, for loops are not the ideal way to perform actions like this; look at .apply() instead. I'm using them here purely to demonstrate the speed difference of the line inside the loop.)
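For instance, here's a minimal sketch of the vectorized route, using the illustrative setup above:
# Assign a whole column at once instead of looping row by row
df['address'] = profile['address']

# If you genuinely need per-row logic, .apply() keeps it to one pass
df['address_upper'] = df['address'].apply(str.upper)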
Mistake 2: Only Using 25% of Your CPU
Whether you’re on a server or just your laptop, the vast majority of people never use all the computing power they have. Most processors (CPUs) have 4 cores nowadays, and by default, Pandas will only ever use one.
Modin is a Python module built to enhance Pandas by making way better use of your hardware. Modin DataFrames don’t require any extra code and in most cases will speed up everything you do to DataFrames by 3x or more.
Modin acts as more of a plugin than a library since it uses Pandas as a fallback and cannot be used on its own.
The goal of Modin is to augment Pandas quietly and let you keep working without learning a new library. The only line of code most people will need is import modin.pandas as pd, replacing your normal import pandas as pd; if you want to learn more, check out the documentation here.
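Here's a minimal sketch of the swap, assuming Modin and one of its execution engines (e.g. the modin[ray] package) are installed:
import modin.pandas as pd  # drop-in replacement for "import pandas as pd"

# The API below is plain Pandas, but Modin runs it across all CPU cores
df = pd.read_csv('fake_profiles.csv')
print(df.head())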
In order to avoid recreating tests that have already been done, I’ve included this picture from the Modin documentation showing how much it can speed up the read_csv()
function on a standard laptop.
Please note that Modin is in development, and while I use it in production, you should expect some bugs. Check the Issues in GitHub and the Supported APIs for more information.
Mistake 3: Making Pandas Guess Data Types
When you import data into a DataFrame and don’t specifically tell Pandas the columns and datatypes, Pandas will read the entire dataset into memory just to figure out the data types.
For example, if you have a column full of text Pandas will read every value, see that they’re all strings, and set the data type to “string” for that column. Then it repeats this process for all your other columns.
You can use df.info() to see how much memory a DataFrame uses; that's roughly the same amount of memory Pandas will consume just to figure out the data types of each column.
Unless you're tossing around tiny datasets or your columns change constantly, you should always specify the data types. To do this, just add the dtype parameter and a dictionary mapping your column names to their data types as strings. For example:
pd.read_csv('fake_profiles.csv', dtype={
    'job': 'str',
    'company': 'str',
    'ssn': 'str'
})
Note: This also applies to DataFrames that don’t come from CSVs.
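For DataFrames you build in memory, a minimal sketch of the same idea (with illustrative column names) is to declare the types explicitly with astype():
import pandas as pd

# Without explicit dtypes, these columns default to object
df = pd.DataFrame({'job': ['engineer'], 'ssn': ['123-45-6789']})

# Declare the types instead of letting Pandas guess
df = df.astype({'job': 'str', 'ssn': 'str'})
print(df.dtypes)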
Mistake 4: Leftover DataFrames
One of the best qualities of DataFrames is how easy they are to create and change. The unfortunate side effect of this is most people end up with code like this:
# Change dataframe 1 and save it into a new dataframe
df1 = pd.read_csv('file.csv')
df2 = df1.dropna()
df3 = df2.groupby('thing')
What happens is that df1 and df2 are left sitting in Python memory, even though you've moved on to df3. Don't leave extra DataFrames in memory: on a laptop it hurts the performance of almost everything you do, and on a server it hurts everyone else on that server (or, at some point, you'll get an "out of memory" error).
Instead, here are some easy ways to keep your memory clean:
- Use df.info() to see how much memory a DataFrame is using.
- Install plugin support in Jupyter, then install the Variable Inspector plugin for Jupyter. If you're used to having a variable inspector in RStudio, you should know that RStudio now supports Python!
- If you're in a Jupyter session already, you can always erase variables without restarting by using del df2.
- Chain together multiple DataFrame modifications in one line (so long as it doesn't make your code unreadable): df = df.apply(thing1).dropna()
- As Roberto Bruno Martins pointed out, another way to ensure clean memory is to perform operations within functions (see the sketch after this list). You can still unintentionally abuse memory this way, and explaining scope is outside the scope of this article, but if you aren't familiar I'd encourage you to read this writeup.
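Here's a minimal sketch of that function-scope tip, reusing the file.csv example from above; the df1- and df2-style intermediates are freed automatically when the function returns:
import pandas as pd

def load_grouped(path):
    # raw and cleaned only live inside this function's scope
    raw = pd.read_csv(path)
    cleaned = raw.dropna()
    return cleaned.groupby('thing').sum(numeric_only=True)

df = load_grouped('file.csv')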
Mistake 5: Manually Configuring Matplotlib
This might be the most common mistake, but it lands at #5 because it’s the least impactful. I see this mistake happen even in tutorials and blog posts from experienced professionals.
Matplotlib is automatically imported by Pandas, and it even sets some chart configuration up for you on every DataFrame.
There’s no need to import and configure it for every chart when it’s already baked into Pandas for you.
Here's an example of doing it the wrong way; even though this is a basic chart, it's still a waste of code:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()  # manually create the figure and axes
ax.hist(x=df['x'])
ax.set_xlabel('label for column X')
plt.show()
And here’s the right way:
df['x'].plot()
Easier, right? You can do anything on these DataFrame plot objects that you can do to any other Matplotlib plot object. For example:
df['x'].plot.hist(title='Chart title')
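Since these calls return a regular Matplotlib Axes object, you can keep customizing the chart afterwards:
# .plot() hands back a Matplotlib Axes, so the usual Axes methods work
ax = df['x'].plot.hist(title='Chart title')
ax.set_xlabel('label for column X')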
I’m sure I’m making other mistakes I don’t know about, but hopefully sharing these known ones with you will help put your hardware to better use, let you write less code, and get more done!