Data Science and Data Processing

The Definitive Data Scientist Environment Setup

Two Data Scientists at work

Intro and motivation

In this post I would like to describe in detail our setup and development environment (hardware & software) and how to get it, step by step.

  • Most libraries just work out of the box with little extra configuration.
  • It covers the full spectrum of data-related tasks, from Small to Big Data, and from standard Machine Learning models to Deep Learning prototyping.
  • You do not need to break the bank to buy expensive hardware and software.

Hardware

Your laptop should have:

  • A powerful processor: at least an Intel i5 or i7 with 4 cores. It will save you a lot of time while processing data.
  • An NVIDIA GPU with at least 4GB of memory, but only if you need to prototype or fine-tune simple Deep Learning models. It will be orders of magnitude faster than almost any CPU for that task. Remember that you can’t train serious Deep Learning models from scratch on a laptop.
  • A good cooling system. You are going to run workloads for hours at a time. Make sure your laptop can handle it without melting.
  • An SSD of at least 256GB.
  • The possibility to upgrade its capabilities, such as adding a bigger SSD or more RAM, or easily replacing the battery.
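As a rough way to compare a machine against the checklist above, here is a minimal sketch that reads the core count and disk size with the Python standard library. The 4-core and 256GB thresholds come from the list; the function names are made up for illustration, and note that `os.cpu_count()` reports logical cores, which may be more than the physical ones.

```python
import os
import shutil

# Minimums taken from the checklist above; adjust to taste.
MIN_CORES = 4
MIN_DISK_GB = 256


def check_hardware(cores, disk_gb, min_cores=MIN_CORES, min_disk_gb=MIN_DISK_GB):
    """Return a list of human-readable warnings (empty if all checks pass)."""
    warnings = []
    if cores < min_cores:
        warnings.append(f"only {cores} cores, expected at least {min_cores}")
    if disk_gb < min_disk_gb:
        warnings.append(f"only {disk_gb:.0f}GB disk, expected at least {min_disk_gb}GB")
    return warnings


def read_local_specs(path="/"):
    """Read the logical core count and total disk size (decimal GB) for `path`."""
    cores = os.cpu_count() or 1  # logical cores, may exceed physical cores
    disk_gb = shutil.disk_usage(path).total / 1e9
    return cores, disk_gb


if __name__ == "__main__":
    for warning in check_hardware(*read_local_specs()):
        print("WARNING:", warning)
```

It says nothing about the GPU, the cooling or upgradeability, so treat it as a quick sanity check and not as a buying guide.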
david-laptop specs
Lenovo Thinkpad P52
  • You will suffer a terrible vendor lock-in, which means a huge cost to switch to any alternative.
  • You can’t have an NVIDIA GPU, so forget about Deep Learning prototyping on your laptop.
  • You cannot upgrade the hardware, as it is soldered to the motherboard. If you need more RAM, you have to buy a new laptop.

Operating System

Our go-to operating system for Data Science is the latest LTS (Long Term Support) release of Ubuntu. At the time of writing this post, that is Ubuntu 20.04 LTS.

Ubuntu 20.04 LTS
  • As you are going to be working with data, security must be at the core of your setup. Linux is, by default, more secure than Windows or OS X, and as it is used by a minority of people, most malicious software is not designed to run on it.
  • It is the most used Linux distro, both for desktops and servers, with a great and supportive community. When you find something is not working well, it will be easier to get help or find info about how to fix it.
  • Most servers are Linux based, and you probably want to deploy your code on those servers. The closer your environment is to the one you deploy to in production, the better. This is one of the main reasons to use Linux as your platform.
  • It has a great package manager you can use to install almost everything.
Uncheck third-party software during installation

NVIDIA Drivers

NVIDIA's Linux support has been a recurring complaint in the community for years. Remember this famous moment:

Linus Torvalds talking about NVIDIA
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt install nvidia-driver-440
NVIDIA X Server Settings
nvidia-smi
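If you want to check the driver from a script instead of reading the `nvidia-smi` table by eye, something along these lines should work. It is a minimal sketch that relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`; the function names are hypothetical, and it assumes every queried field comes back with a value.

```python
import shutil
import subprocess

# Ask nvidia-smi for one CSV line per GPU: name, total memory (MiB), driver version.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,driver_version",
    "--format=csv,noheader,nounits",
]


def parse_gpu_lines(output):
    """Parse nvidia-smi CSV output into a list of dicts, one per GPU."""
    gpus = []
    for line in output.strip().splitlines():
        name, memory_mib, driver = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "memory_mib": int(memory_mib), "driver": driver})
    return gpus


def list_gpus():
    """Return the detected GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_lines(result.stdout)


if __name__ == "__main__":
    print(list_gpus() or "No NVIDIA GPU or driver detected")
```

An empty list after installing the driver usually means a reboot is still pending or the driver did not load.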

Terminal

While the default GNOME Terminal is OK, I prefer Terminator, a powerful terminal emulator that lets you split the terminal window vertically (Ctrl + Shift + E) and horizontally (Ctrl + Shift + O), as well as broadcast commands to several terminals at the same time. This is useful for setting up several servers or a cluster.

Terminator
sudo apt install terminator

VirtualBox

VirtualBox is software that lets you run virtual machines inside your current operating system session. The guests can run a different operating system (Windows inside Linux, or the other way around).

sudo apt install virtualbox

Python, R and more (with Miniconda)

Python is already included with Ubuntu, but you should never use the system Python or install your analytics libraries system-wide. You can break the system Python by doing that, and fixing it is hard.
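A quick way to see which interpreter you are about to install packages into is to ask Python itself. The sketch below assumes the Ubuntu layout, where the system interpreter has `sys.prefix` equal to `/usr`, and reads the `CONDA_DEFAULT_ENV` variable that `conda activate` sets; the helper names are made up for illustration.

```python
import os
import posixpath
import sys

# Prefixes where an OS-managed Python usually lives on Ubuntu (assumption).
SYSTEM_PREFIXES = ("/usr", "/usr/local")


def is_system_python(prefix):
    """True if `prefix` looks like the OS-managed Python installation."""
    return posixpath.normpath(prefix) in SYSTEM_PREFIXES


def describe_interpreter():
    """Summarize the running interpreter and the active conda environment."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "<none>"),
        "system_python": is_system_python(sys.prefix),
    }


if __name__ == "__main__":
    info = describe_interpreter()
    print(info)
    if info["system_python"]:
        print("You are on the system Python: activate an environment before installing anything.")
```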

  • Miniconda: includes just the conda package manager. You still have access to all existing libraries through conda or pip, but those libraries are downloaded and installed only when they are needed. Go with this option, as it will save you time and disk space.
bash Miniconda3-latest-Linux-x86_64.sh
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/david/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/david/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/home/david/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/david/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda help
conda install -c conda-forge conda-pack

Jupyter

Jupyter is a must for a Data Scientist whenever you need an interactive programming environment.

conda create -n jupyter_env
conda activate jupyter_env
conda install python=3.7
conda install -c conda-forge jupyterhub jupyterlab nodejs nb_conda_kernels
[Unit]
Description=JupyterHub
After=network.target
[Service]
User=root
Environment="PATH=/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/home/<your_user>/miniconda3/envs/jupyter_env/bin:/home/<your_user>/miniconda3/bin"
ExecStart=/home/<your_user>/miniconda3/envs/jupyter_env/bin/jupyterhub
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl start jupyterhub
sudo systemctl enable jupyterhub
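To avoid editing the `<your_user>` placeholders in the unit file by hand, you could render it from a template. This is a minimal sketch using only `string.Template` from the standard library; `render_unit` is a hypothetical helper, and it assumes Miniconda lives in `/home/<your_user>/miniconda3` as in the unit file above.

```python
from string import Template

# Same unit file as above, with the user-specific paths turned into placeholders.
UNIT_TEMPLATE = Template("""\
[Unit]
Description=JupyterHub
After=network.target

[Service]
User=root
Environment="PATH=/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:$conda/envs/$env/bin:$conda/bin"
ExecStart=$conda/envs/$env/bin/jupyterhub

[Install]
WantedBy=multi-user.target
""")


def render_unit(user, env="jupyter_env"):
    """Fill in the Miniconda paths for `user` and the conda environment `env`."""
    conda = f"/home/{user}/miniconda3"
    return UNIT_TEMPLATE.substitute(conda=conda, env=env)


if __name__ == "__main__":
    # Print the unit file so it can be reviewed before saving it as root.
    print(render_unit("david"))
```

The script only prints the result, so you can review it before saving it wherever your systemd unit files live and running the `systemctl` commands above.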
JupyterHub login page
Jupyter (Classic)
Jupyter (Lab)
JupyterHub control panel

IDEs

Python

Python is our primary language at WhiteBox.

  • Python environments, including native conda support.
  • Debugger.
  • Docker integration.
  • Git integration.
  • Scientific mode (pandas DataFrame and NumPy arrays inspection).
PyCharm Professional
  • Jupyter: if you have doubts about when to use Jupyter and when to use PyCharm, and you call yourself a Data <whatever>, please attend one of the bootcamps we teach ASAP.
sudo snap install pycharm-community --classic

Scala

Scala is a language we use for Big Data projects with native Spark, although we are shifting to PySpark.

sudo snap install intellij-idea-community --classic

Big Data

Okay, you are not really going to do Big Data on your laptop. If you are on a Big Data project, your company or client is going to provide you with a proper Hadoop cluster.

conda install pyspark openjdk

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('your_app_name')
    .config('spark.sql.session.timeZone', 'UTC')
    .config('spark.driver.memory', '16G')
    .config('spark.driver.maxResultSize', '2G')
    .getOrCreate()
)
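The 16G driver memory above only makes sense on a machine with more RAM than that. As a sketch of how you might derive the settings from the RAM you actually have, the hypothetical helper below gives the driver half of it (an arbitrary assumption, not a Spark recommendation) and caps the result size at the same 2G used above. PySpark is imported lazily, so the sizing logic can be tried without it.

```python
def local_spark_conf(total_ram_gb, driver_fraction=0.5, max_result_gb=2):
    """Build Spark settings for a local session from the machine's RAM in GB."""
    driver_gb = max(1, int(total_ram_gb * driver_fraction))
    result_gb = min(max_result_gb, driver_gb)
    return {
        "spark.sql.session.timeZone": "UTC",
        "spark.driver.memory": f"{driver_gb}G",
        "spark.driver.maxResultSize": f"{result_gb}G",
    }


def build_session(app_name, conf):
    """Create (or reuse) a SparkSession with the given settings applied."""
    from pyspark.sql import SparkSession  # lazy import: needs pyspark installed

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # With 32GB of RAM this reproduces the 16G / 2G settings shown above.
    print(local_spark_conf(32))
```

You would then call `build_session('your_app_name', local_spark_conf(32))` in place of the hard-coded builder chain.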
  • Modin: replicates the pandas API with support for multi-core and out-of-core computations. It is especially useful when you are working on a powerful analytics server with lots of cores (32, 64) and want to use pandas. As per-core performance on such servers is usually poor, Modin lets you speed up your computation by spreading it across cores.

Database Tools

Sometimes you need a tool able to connect with a variety of different DB technologies, make queries and explore data. Our choice is DBeaver:

DBeaver Community
  • Advanced networking requirements to connect, like SSH tunnels and more.
sudo snap install dbeaver-ce

Others

Other specific tools and apps that are important for us are:

  • Slack: sudo snap install slack --classic.
  • Telegram Desktop: sudo snap install telegram-desktop.
  • Nextcloud: sudo apt install nextcloud-desktop nautilus-nextcloud.
  • Thunderbird Mail: installed by default.
  • Zoom: download manually from here.
  • Google Chrome: download manually from here.
