Scikit-Learn (Python): 6 Useful Tricks for Data Scientists

Tricks to improve your machine learning models in Python with scikit-learn (sklearn)

Jul 16 · 7 min read

Scikit-learn (sklearn) is a powerful open source machine learning library built on top of the Python programming language. This library contains a lot of efficient tools for machine learning and statistical modeling, including various classification, regression, and clustering algorithms.

In this article, I will show 6 tricks regarding the scikit-learn library to make certain programming practices a bit easier.

1. Generate random dummy data

To generate random ‘dummy’ data, we can make use of the make_classification() function in case of classification data, and make_regression() function in case of regression data. This is very useful in some cases when debugging or when you want to try out certain things on a (small) random data set.

Below, we generate 10 classification data points consisting of 4 features (found in X) and a class label (found in y), where the data points belong to either the negative class (0) or the positive class (1):

from sklearn.datasets import make_classification
import pandas as pd

X, y = make_classification(n_samples=10, n_features=4, n_classes=2, random_state=123)

Here, X consists of the 4 feature columns for the generated data points:

pd.DataFrame(X, columns=['Feature_1', 'Feature_2', 'Feature_3', 'Feature_4'])

And y contains the corresponding label of each data point:

pd.DataFrame(y, columns=['Label'])
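The regression counterpart mentioned above, make_regression(), can be used in the same way. The following is a minimal sketch, assuming the same number of samples and features as the classification example; the noise value and the column names are arbitrary choices for illustration:

```python
from sklearn.datasets import make_regression
import pandas as pd

# Generate 10 regression samples with 4 features; y_reg is a continuous target.
# noise=0.1 is an arbitrary illustrative value, not taken from the article.
X_reg, y_reg = make_regression(n_samples=10, n_features=4, noise=0.1, random_state=123)

features = pd.DataFrame(X_reg, columns=['Feature_1', 'Feature_2', 'Feature_3', 'Feature_4'])
target = pd.DataFrame(y_reg, columns=['Target'])
print(features.shape, target.shape)
```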

2. Impute missing values

Scikit-learn offers multiple ways to impute missing values. Here, we consider two approaches. The SimpleImputer class provides basic strategies for imputing missing values (through the mean or median, for example). A more sophisticated approach is the KNNImputer class, which fills in missing values using the K-Nearest Neighbors approach. Each missing value is imputed using values from the n_neighbors nearest neighbors that have a value for the particular feature. The values of the neighbors are averaged uniformly or weighted by distance to each neighbor.

Below, we show an example application using both imputation methods:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.datasets import make_classification
import pandas as pd

X, y = make_classification(n_samples=5, n_features=4, n_classes=2, random_state=123)
X = pd.DataFrame(X, columns=['Feature_1', 'Feature_2', 'Feature_3', 'Feature_4'])

print(X.iloc[1, 2])

>>> 2.21298305

Transform X[1, 2] to a missing value:

X.iloc[1, 2] = float('NaN')

X

First we make use of the simple imputer:

imputer_simple = SimpleImputer()

pd.DataFrame(imputer_simple.fit_transform(X))

Resulting in a value of -0.143476.

Next, we try the KNN imputer, where the 2 nearest neighbors are considered and the neighbors are weighted uniformly:

imputer_KNN = KNNImputer(n_neighbors=2, weights="uniform")

pd.DataFrame(imputer_KNN.fit_transform(X))

Resulting in a value of 0.997105 (= 0.5*(1.904188+0.090022)).
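To make the difference between the two imputers easier to check by hand, here is a small sketch on a hand-written array (the array is an assumption for illustration, not data from the article). With mean imputation each gap gets its column mean; with the KNN imputer each gap gets the average of that feature over the two nearest rows that have a value for it:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

nan = np.nan
# Tiny illustrative array with two missing entries.
X_small = np.array([[1, 2, nan],
                    [3, 4, 3],
                    [nan, 6, 5],
                    [8, 8, 7]])

# Mean imputation: each NaN is replaced by the mean of its column.
X_mean = SimpleImputer(strategy='mean').fit_transform(X_small)

# KNN imputation: each NaN is replaced by the uniform average of that
# feature over the 2 nearest rows that have a value for it.
X_knn = KNNImputer(n_neighbors=2, weights='uniform').fit_transform(X_small)

print(X_mean)
print(X_knn)
```

In the first row the mean imputer fills in 5 (the mean of 3, 5 and 7), while the KNN imputer fills in 4 (the average of 3 and 5, taken from the two closest rows).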

3. Make use of Pipelines to chain multiple steps together

The Pipeline tool in scikit-learn is very helpful to simplify your machine learning models. Pipelines can be used to chain multiple steps into one, so that the data will go through a fixed sequence of steps. Thus, instead of calling every step separately, the pipeline concatenates all steps into one system. To create such a pipeline, we make use of the make_pipeline function.

Below, a simple example is shown, where the pipeline consists of an imputer, which imputes missing values (if there are any), and a logistic regression classifier.

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
import pandas as pd

X, y = make_classification(n_samples=25, n_features=4, n_classes=2, random_state=123)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

imputer = SimpleImputer()
clf = LogisticRegression()

pipe = make_pipeline(imputer, clf)

Now, we can use the pipeline to fit our training data and to make predictions for the test data. First, the training data goes through the imputer, and then the logistic regression classifier is trained on the imputed data. After that, we are able to predict the classes for our test data:

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)

pd.DataFrame({'Prediction': y_pred, 'True': y_test})
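To illustrate what the pipeline does internally, the sketch below runs the same two steps by hand and compares the predictions. It assumes the same generated data as above; the variable names manual_imputer, manual_clf and manual_pred are made up for this example:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=25, n_features=4, n_classes=2, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

pipe = make_pipeline(SimpleImputer(), LogisticRegression())
pipe.fit(X_train, y_train)

# The same two steps, called one by one.
manual_imputer = SimpleImputer()
manual_clf = LogisticRegression()
manual_clf.fit(manual_imputer.fit_transform(X_train), y_train)
manual_pred = manual_clf.predict(manual_imputer.transform(X_test))

# make_pipeline names each step after its lowercased class name.
print(list(pipe.named_steps))
print((pipe.predict(X_test) == manual_pred).all())
```

Note that the imputer is fitted on the training data only and merely applied to the test data, which is the behavior the pipeline gives you automatically.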

4. Save a Pipeline model using joblib

Pipeline models created through scikit-learn can easily be saved by making use of joblib. In case your model contains large arrays of data, each array is stored in a separate file. Once saved locally, one can easily load (or, restore) their model for use in new applications.

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
import joblib

X, y = make_classification(n_samples=20, n_features=4, n_classes=2, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

imputer = SimpleImputer()
clf = LogisticRegression()

pipe = make_pipeline(imputer, clf)

pipe.fit(X_train, y_train)

joblib.dump(pipe, 'pipe.joblib')

Now, the fitted pipeline model is saved (dumped) on your computer through joblib.dump. This model is restored through joblib.load, and can be applied as usual afterwards:

new_pipe = joblib.load('.../pipe.joblib')

new_pipe.predict(X_test)
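A quick way to convince yourself that the restored pipeline behaves like the original is a round trip through a temporary directory. This is only a sketch; the temporary path is a stand-in for wherever you actually store the file:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=20, n_features=4, n_classes=2, random_state=123)
pipe = make_pipeline(SimpleImputer(), LogisticRegression()).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pipe.joblib')  # hypothetical location
    joblib.dump(pipe, path)       # save the fitted pipeline
    restored = joblib.load(path)  # load it back into a new object

# The restored pipeline should give the same predictions as the original.
same = (restored.predict(X) == pipe.predict(X)).all()
print(same)
```

Because joblib relies on pickle, files should only be loaded from sources you trust.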

5. Plot a confusion matrix

A confusion matrix is a table that is used to describe the performance of a classifier on a set of test data. Here, we focus on a binary classification problem, i.e., there are two possible classes that observations could belong to: “yes” (1) and “no” (0).

Let’s create an example binary classification problem, and display the corresponding confusion matrix, by making use of the plot_confusion_matrix function:

from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

clf = LogisticRegression()

clf.fit(X_train, y_train)

confmat = plot_confusion_matrix(clf, X_test, y_test, cmap="Blues")

Here, we have visualized in a nice way through the confusion matrix that there are:

93 true positives (TP);
97 true negatives (TN);
3 false positives (FP);
7 false negatives (FN).

So, we have reached an accuracy score of (93+97)/200 = 95%.
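The same arithmetic can be written out in a few lines. The four counts are the ones listed above; precision and recall are added here only as closely related quantities that follow from the same counts:

```python
# Counts taken from the confusion matrix described above.
tp, tn, fp, fn = 93, 97, 3, 7

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```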

6. Visualize decision trees

One of the most well known classification algorithms is the decision tree, characterized by its tree-like visualizations which are very intuitive. The idea of a decision tree is to split the data into smaller regions based on the descriptive features. Then, the most commonly occurring class amongst training observations in the region to which the test observation belongs is the prediction. To decide how the data is split into regions, one has to apply a splitting measure to determine the relevance and importance of each of the features. Some well known splitting measures are Information Gain, Gini index and Cross-entropy.

Below, we show an example on how to make use of the plot_tree function in scikit-learn:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=50, n_features=4, n_classes=2, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

clf = DecisionTreeClassifier()

clf.fit(X_train, y_train)

plot_tree(clf, filled=True)

In this example, we are fitting a decision tree on 40 training observations that belong to either the negative class (0) or the positive class (1), so we are dealing with a binary classification problem. In the tree, we have two kinds of nodes, namely internal nodes (nodes where the predictor space is split further) and terminal nodes (end points). The segments of the tree that connect two nodes are called branches.

Let's have a closer look at the information provided for each node in the decision tree:

The splitting criterion used in the particular node is shown as e.g. ‘F2 <= -0.052’. This means that every data point for which the value of the second feature is less than or equal to -0.052 belongs to the newly formed region to the left, and the data points that do not satisfy the condition belong to the region to the right of the internal node.
The Gini index is used as splitting measure here. The Gini index (called a measure of impurity) measures the degree or probability of a particular element being wrongly classified when it is randomly chosen.
The ‘samples’ of the node indicates how many training observations are found in the particular node.
The ‘value’ of the node indicates the number of training observations found in the negative class (0) and the positive class (1) respectively. So, value=[19,21] means that 19 observations belong to the negative class and 21 observations belong to the positive class in that particular node.
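As a worked example of the Gini index, the value shown in a node can be recomputed from its class counts as one minus the sum of the squared class proportions. The helper below is a hypothetical function written for this illustration, not part of scikit-learn:

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Node with value=[19, 21]: almost evenly mixed, so the impurity is close to 0.5.
print(round(gini_impurity([19, 21]), 5))

# A pure node (all observations in one class) has impurity 0.
print(gini_impurity([40, 0]))
```

For value=[19, 21] this gives 1 - (19/40)^2 - (21/40)^2 = 0.49875, close to the maximum of 0.5 for two classes.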

Conclusion

This article covered 6 useful scikit-learn tricks to improve your machine learning models in sklearn. I hope these tricks have helped you in some way, and I wish you good luck on your next project when making use of the scikit-learn library!



New Features in Python 3.9

A look at the best features included in the latest iteration of Python

James Briggs

Jun 13 · 5 min read

It’s that time again, a new version of Python is imminent. Now in beta (3.9.0b3), we will soon be seeing the full release of Python 3.9.

Some of the newest features are incredibly exciting, and it will be amazing to see them used after release. We’ll cover the following:

Dictionary Union Operators
Type Hinting
Two New String Methods
New Python Parser — this is very cool

Let’s take a first look at these new features and how we use them.

Dictionary Unions

One of my favorite new features with a sleek syntax. If we have two dictionaries a and b that we need to merge, we now use the union operators.

We have the merge operator |:

a = {1: 'a', 2: 'b', 3: 'c'}
b = {4: 'd', 5: 'e'}

c = a | b
print(c)

[Out]: {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

And the update operator |=, which updates the original dictionary:

a = {1: 'a', 2: 'b', 3: 'c'}
b = {4: 'd', 5: 'e'}

a |= b
print(a)

[Out]: {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

If our dictionaries share a common key, the key-value pair in the second dictionary will be used:

a = {1: 'a', 2: 'b', 3: 'c', 6: 'in both'}
b = {4: 'd', 5: 'e', 6: 'but different'}

print(a | b)

[Out]: {1: 'a', 2: 'b', 3: 'c', 6: 'but different', 4: 'd', 5: 'e'}

Dictionary Update with Iterables

Another cool behavior of the |= operator is the ability to update the dictionary with new key-value pairs using an iterable object — like a list or generator:

a = {'a': 'one', 'b': 'two'}
b = ((i, i**2) for i in range(3))

a |= b
print(a)

[Out]: {'a': 'one', 'b': 'two', 0: 0, 1: 1, 2: 4}

If we attempt the same with the standard union operator | we will get a TypeError as it will only allow unions between dict types.
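A short sketch of that difference, assuming Python 3.9 or later (the variable names are made up for illustration): the | operator rejects a non-dict operand, so the iterable has to be converted to a dict first.

```python
a = {'a': 'one', 'b': 'two'}
pairs = [(0, 0), (1, 1), (2, 4)]

try:
    merged = a | pairs  # dict | list is not supported
except TypeError as exc:
    merged = None
    print("TypeError:", exc)

# Converting the iterable to a dict first makes the union work.
merged_ok = a | dict(pairs)
print(merged_ok)
```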

Type Hinting

Python is dynamically typed, meaning we don’t need to specify datatypes in our code.

This is okay, but sometimes it can be confusing, and suddenly Python’s flexibility becomes more of a nuisance than anything else.

Since 3.5, we could specify types, but it was pretty cumbersome. This update has truly changed that. Let's use an example:

In our add_int function, we clearly want to add the same number to itself (for some mysterious undefined reason). But our editor doesn’t know that, and it is perfectly okay to add two strings together using + — so no warning is given.

What we can now do is specify the expected input type as int. Using this, our editor picks up on the problem immediately.
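The original code screenshots for this example are not reproduced here, so the following is only a sketch of what the paragraphs above describe; the function body is an assumption based on the description "add the same number to itself":

```python
def add_int(num: int) -> int:
    """Add a number to itself."""
    return num + num

print(add_int(2))

# Type hints are not enforced at runtime: this call still runs and
# concatenates the strings, but an editor or type checker that understands
# the int annotation can flag the str argument.
print(add_int("2"))
```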

We can get pretty specific about the types included too, for example:
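As a sketch of what such specific hints can look like in Python 3.9, the built-in collection types can be parameterized directly, for instance list[int] or dict[str, int], without importing List or Dict from typing. The function names below are made up for illustration:

```python
def sum_list(numbers: list[int]) -> int:
    """Sum a list that is expected to contain only integers."""
    return sum(numbers)

def word_lengths(words: list[str]) -> dict[str, int]:
    """Map each word to its length."""
    return {word: len(word) for word in words}

print(sum_list([1, 2, 3]))
print(word_lengths(["new", "parser"]))
```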

Type hinting can be used everywhere — and thanks to the new syntax, it now looks much cleaner:

String Methods

Not as glamorous as the other new features, but still worth a mention as they are particularly useful. Two new string methods for removing prefixes and suffixes have been added:

"Hello world".removeprefix("He")

[Out]: "llo world"

"Hello world".removesuffix("ld")

[Out]: "Hello wor"
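One reason these methods are handy is that the older strip-style methods treat their argument as a set of characters rather than as a substring. A small sketch, assuming Python 3.9 or later and a made-up file name:

```python
filename = "report.txt"

# removesuffix removes the exact suffix, once.
print(filename.removesuffix(".txt"))

# rstrip strips any trailing '.', 't' or 'x' characters, so it also
# eats the final 't' of "report".
print(filename.rstrip(".txt"))

# If the prefix is absent, the string is returned unchanged.
print(filename.removeprefix("draft_"))
```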

New Parser

This one is more of an out-of-sight change but has the potential of being one of the most significant changes for the future evolution of Python.

Python currently uses a predominantly LL(1)-based grammar, which in turn can be parsed by an LL(1) parser: one that parses code top-down, left-to-right, with a lookahead of just one token.

Now, I have almost no idea of how this works — but I can give you a few of the current issues in Python due to the use of this method:

Python contains non-LL(1) grammar; because of this, some parts of the current grammar use workarounds, creating unnecessary complexity.
LL(1) creates limitations in the Python syntax (without possible workarounds). This issue highlights that the following code simply cannot be implemented using the current parser (raising a SyntaxError):

with (open("a_really_long_foo") as foo,
      open("a_really_long_bar") as bar):
    pass

LL(1) breaks with left-recursion in the parser, meaning that particular recursive syntax can cause an infinite loop in the parse tree. Guido van Rossum, the creator of Python, explains this here.

All of these factors (and many more that I simply cannot comprehend) have one major impact on Python; they limit the evolution of the language.

The new parser, based on PEG, will allow the Python developers significantly more flexibility — something we will begin to notice from Python 3.10 onwards.

That is everything we can look forward to with the upcoming Python 3.9. If you really can’t wait, the most recent beta release — 3.9.0b3 — is available here.

If you have any questions or suggestions, feel free to reach out via Twitter or in the comments below.

Thanks for reading!

If you enjoyed this article and want to learn more about some of the lesser-known features in Python, you may be interested in my previous article:

Lesser known Python Features

A sample of some lesser known and underrated Python features

towardsdatascience.com



No More Basic Plots Please

A Quick Guide to Upgrade Your Data Visualizations using Seaborn and Matplotlib

dolcikey

Sep 2 · 4 min read

“If I see one more basic blue bar plot…”

After completing the first module in my studies at Flatiron School NYC, I started playing with plot customizations and design using Seaborn and Matplotlib. Much like doodling during class, I started coding other styled plots in our jupyter notebooks.

After reading this article, you’re expected to have at least one quick styled plot code in mind for every notebook.

No more default, store brand, basic plots, please!

If you can do nothing else, use Seaborn.

You have five seconds to make a decent looking plot or the world will implode; use Seaborn!

Seaborn, which is built on Matplotlib, can be an instant design upgrade. It automatically assigns axis labels from your x and y values and applies a default color scheme that’s less… basic. (IMO: it rewards good, clear, well-formatted column labeling and thorough data cleaning.) Matplotlib does not do this automatically, but it also does not ask for x and y to be defined at all times, depending on what you are looking to plot.

Here are the same plots, one using Seaborn and one Matplotlib with no customizations.

Style it from the top

Depending on the data you are visualizing, changing the style and backgrounds may increase interpretability and readability. You can carry this style throughout by implementing a style at the top of your code.

There is a whole documentation page on styling via Matplotlib.

Styling can be as simple as setting the style with a single line of code after your imported libraries. GGPlot changes the background to grey and uses a specific font. There are many more styles you can try out.
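A minimal sketch of that one-line style switch, with made-up data; the Agg backend line is only there so the snippet also runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# One line right after the imports; it applies to every figure created afterwards.
plt.style.use('ggplot')

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [10, 20, 15, 30])  # made-up values

# The names of all installed styles can be listed like this.
print(plt.style.available[:5])
```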

XKCD; a cheeky little extra

Fun. Not professional. But so fun.

Be aware that if you use this XKCD style it will continue until you reset the defaults by running plt.rcdefaults()…
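One way to keep the XKCD look from leaking into later plots is to use plt.xkcd() as a context manager, so the settings are restored when the block ends. A sketch with made-up data (Matplotlib may warn about missing comic fonts, which is harmless here):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

sketch_before = plt.rcParams['path.sketch']

# Only figures created inside the with-block get the hand-drawn look.
with plt.xkcd():
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 9])  # made-up values
    ax.set_title('Only this figure is hand-drawn')

# After the block, the sketch setting is back to what it was before.
sketch_after = plt.rcParams['path.sketch']
print(sketch_before == sketch_after)
```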

PRETTY COLORS OMG!

Make your plots engaging. Color theory comes into play here. Seaborn has a mix of palettes which can also be used in Matplotlib, plus you can also make your own.

Single Colors: One and Done

Above is a list of single color names you can call to change lines, scatter plots, and more.

Lazy? Seaborn’s Default Themes

Seaborn has six variations of its default palette:
deep, muted, pastel, bright, dark, and colorblind
pass the name as the palette argument after passing in x, y, and data
palette = 'colorblind'

Work Smarter Not Harder: Pre-Fab Palettes

color_palette() accepts any seaborn palette or matplotlib colormap

Personal favorites are ‘magma’ and ‘viridis’

Control Freak? Custom Palettes / Using Hex Codes

pretty_colors = ["#FF4653", "#EE7879", "#DDEDF4", "#2A3166"]
pass in hex codes which can be found online
create a kind and add in specifics, play around with the parameters for more customized palettes

Many ways to choose a color palette.
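The three approaches above can be sketched with seaborn's color_palette() function. This assumes seaborn is installed; the hex codes are the ones listed above, and the variable names are made up:

```python
import seaborn as sns

# 1. One of the named variations of seaborn's default palette.
named = sns.color_palette('colorblind')

# 2. Four colors sampled from a Matplotlib colormap.
from_cmap = sns.color_palette('viridis', 4)

# 3. A custom palette built from hex codes.
pretty_colors = ["#FF4653", "#EE7879", "#DDEDF4", "#2A3166"]
custom = sns.color_palette(pretty_colors)

# Make the custom palette the default color cycle for later plots.
sns.set_palette(custom)
print(len(named), len(from_cmap), len(custom))
```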

Everything Should Have a Label

Here we are using Matplotlib, but we have added a single color for each line, a title, x and y labels, and a legend for clear concise interpretation.
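Since the original figure is not reproduced here, the following sketch shows one way such a fully labeled Matplotlib plot could be built. All values, group names and label texts are hypothetical:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

years = [1, 2, 3, 4, 5]          # hypothetical x values
salary_a = [40, 45, 52, 60, 71]  # hypothetical y values
salary_b = [38, 44, 55, 58, 66]

fig, ax = plt.subplots()
# One explicit color and one legend label per line.
ax.plot(years, salary_a, color='#FF4653', label='Group A')
ax.plot(years, salary_b, color='#2A3166', label='Group B')

ax.set_title('Salary by years of experience')
ax.set_xlabel('Years of experience')
ax.set_ylabel('Salary (thousands)')
ax.legend()
```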

Every variable has a home, and it sparks joy now, right? Think about how Marie Kondo would code.

Simple, but clear.

Overall, pretty simple, right? Well, now you have no excuses for those ugly basic plots. I hope you found this helpful and maybe a little bit fun. There’s so much more on color and design in the documentation, so once you’ve mastered these quick tips, dive into the documentation below!

Enjoy? Let’s be friends. Follow me on GitHub, Instagram, and Medium

Today’s Data by Corey Schaffer: Developer Salaries Data

Documentation:

Seaborn

Matplotlib

Other Resources:

XKCD in Matplotlib


Rank	Game	Publisher
1	리니지M	NCSOFT
2	리니지2M	NCSOFT
3	KartRider Rush+	NEXON Company
4	바람의나라: 연	NEXON Company
5	Genshin Impact	miHoYo Limited
6	V4	NEXON Company
7	기적의 검	4399 KOREA
8	R2M	Webzen Inc.
9	뮤 아크엔젤	Webzen Inc.
10	라이즈 오브 킹덤즈	LilithGames
11	블레이드&소울 레볼루션	Netmarble
12	A3: 스틸얼라이브	Netmarble
13	AFK 아레나	LilithGames
14	Pmang Poker : Casino Royal	NEOWIZ corp
15	가디언 테일즈	Kakao Games Corp.
16	그랑삼국	YOUZU(SINGAPORE)PTE.LTD.
17	일루전 커넥트	ChangYou
18	슬램덩크	DeNA HONG KONG LIMITED
19	라그나로크 오리진	GRAVITY Co., Ltd.
20	리니지2 레볼루션	Netmarble
21	FIFA ONLINE 4 M by EA SPORTS™	NEXON Company
22	컴투스프로야구2020	Com2uS
23	Summoners War	Com2uS
24	PUBG MOBILE	PUBG CORPORATION
25	스테리테일	4399 KOREA
26	Roblox	Roblox Corporation
27	한게임 포커	NHN BIGFOOT
28	랑그릿사	ZlongGames
29	FIFA Mobile	NEXON Company
30	검은사막 모바일	PEARL ABYSS
31	마구마구 2020	Netmarble
32	Brawl Stars	Supercell
33	메이플스토리M	NEXON Company
34	Epic Seven	Smilegate Megaport
35	Gardenscapes	Playrix
36	Cookie Run: OvenBreak - Endless Running Platformer	Devsisters Corporation
37	동방불패 모바일	Perfect World Korea
38	케페우스M	Ujoy Games
39	Age of Z Origins	Camel Games Limited
40	프린세스 커넥트! Re:Dive	Kakao Games Corp.
41	뮤오리진2	Webzen Inc.
42	Empires & Puzzles: Epic Match 3	Small Giant Games
43	Lord of Heroes	CloverGames
44	황제라 칭하라	Clicktouch Co., Ltd.
45	Rise of Empires: Ice and Fire	Long Tech Network Limited
46	리니지M(12)	NCSOFT
47	명일방주	Yostar Limited.
48	Random Dice: PvP Defense	111%
49	Clash of Clans	Supercell
50	Homescapes	Playrix


Rank	Game	Publisher
1	리니지M	NCSOFT
2	리니지2M	NCSOFT
3	바람의나라: 연	NEXON Company
4	KartRider Rush+	NEXON Company
5	R2M	Webzen Inc.
6	V4	NEXON Company
7	기적의 검	4399 KOREA
8	Genshin Impact	miHoYo Limited
9	뮤 아크엔젤	Webzen Inc.
10	블레이드&소울 레볼루션	Netmarble
11	가디언 테일즈	Kakao Games Corp.
12	라이즈 오브 킹덤즈	LilithGames
13	A3: 스틸얼라이브	Netmarble
14	AFK 아레나	LilithGames
15	리니지2 레볼루션	Netmarble
16	일루전 커넥트	ChangYou
17	라그나로크 오리진	GRAVITY Co., Ltd.
18	그랑삼국	YOUZU(SINGAPORE)PTE.LTD.
19	슬램덩크	DeNA HONG KONG LIMITED
20	Pmang Poker : Casino Royal	NEOWIZ corp
21	FIFA Mobile	NEXON Company
22	FIFA ONLINE 4 M by EA SPORTS™	NEXON Company
23	스테리테일	4399 KOREA
24	Epic Seven	Smilegate Megaport
25	Summoners War	Com2uS
26	컴투스프로야구2020	Com2uS
27	랑그릿사	ZlongGames
28	PUBG MOBILE	PUBG CORPORATION
29	검은사막 모바일	PEARL ABYSS
30	Roblox	Roblox Corporation
31	메이플스토리M	NEXON Company
32	마구마구 2020	Netmarble
33	동방불패 모바일	Perfect World Korea
34	한게임 포커	NHN BIGFOOT
35	Empires & Puzzles: Epic Match 3	Small Giant Games
36	Brawl Stars	Supercell
37	Age of Z Origins	Camel Games Limited
38	Cookie Run: OvenBreak - Endless Running Platformer	Devsisters Corporation
39	케페우스M	Ujoy Games
40	Gardenscapes	Playrix
41	황제라 칭하라	Clicktouch Co., Ltd.
42	프린세스 커넥트! Re:Dive	Kakao Games Corp.
43	Rise of Empires: Ice and Fire	Long Tech Network Limited
44	Lord of Heroes	CloverGames
45	뮤오리진2	Webzen Inc.
46	Random Dice: PvP Defense	111%
47	Lords Mobile: Kingdom Wars	IGG.COM
48	Homescapes	Playrix
49	에오스 레드	BluePotion Games
50	파이브스타즈	SkyPeople




The Definitive Data Scientist Environment Setup

David Adrián Cañones

May 4 · 13 min read

Two Data Scientists at work

Intro and motivation

In this post I would like to describe in detail our setup and development environment (hardware & software) and how to get it, step by step.

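I have been using this setup for more than five years, across many companies and with few changes (mostly hardware improvements), and it has helped me develop dozens of data projects. I have never missed a single feature while using it. This is the standard setup Pedro and I use at WhiteBox.

Why this guide? Over time we have met many students and fellow data scientists looking for a solid environment with a few basic features:

Standard data science tools such as Python, R and their libraries are easy to install and maintain.
Most libraries just work out of the box, with no extra configuration.
It covers the full spectrum of data-related tasks, from small to big data, and from standard machine learning models to deep learning prototyping.
You do not need to break your bank account to buy expensive hardware and software.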

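Hardware

Your laptop should have:

At least 16 GB of RAM. This is the most important feature, since it limits the amount of data you can easily process in memory without tools such as Dask or Spark. The more the better; go for 32 GB if you can afford it.
A powerful processor. At least an Intel i5 or i7 with 4 cores. It will save you a lot of time while processing data, for obvious reasons.
An NVIDIA GPU with at least 4 GB of RAM. Only if you need to prototype or fine-tune simple deep learning models. It will be much faster than almost any CPU for that task. You cannot train serious deep learning models from scratch on a laptop.
A good cooling system. You are going to run workloads for at least hours. Make sure your laptop can handle that without melting.
An SSD of at least 256 GB should be enough.
The possibility to upgrade it: adding a bigger SSD or more RAM, or easily replacing the battery.

My personal recommendation is a second-hand Thinkpad workstation laptop. I have a second-hand P50 that I bought for 500 euros and that meets all of the features listed above.

Thinkpads are excellent professional laptops that we have used for years and that have never failed us. Their handicap is the price, but many big companies have leasing agreements and dispose of their laptops every two years, so you can find plenty of second-hand Thinkpads in very good condition. Many of these laptops end up on second-hand markets, which are a good place to start your search.

Most of these second-hand markets can provide a warranty and an invoice (in case you are a company). If you are reading this post and you belong to a medium or large organization, the best option is to reach a leasing agreement directly with the manufacturer.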

Apple MacBook을 피하십시오:

여러 가지 이유로 OSX를 정말 좋아하지 않는 한 Apple 노트북을 피해야합니다. 그들은디자인 분야의 전문가와 음악 제작자를위한, 사진 작가, 동영상 및 사진 편집자, UX / UI, 웹 개발자와 같이 무거운 작업을 실행할 필요가없는 개발자도 마찬가지입니다. 2011 년부터 2016 년까지 제 주 노트북은 맥북 이었기 때문에 그 한계를 잘 알고 있습니다. 하나를 사지 않는 주된 이유는 다음과 같습니다.

동일한 하드웨어에 대해 훨씬 더 많은 비용을 지불하게됩니다.
끔찍한 공급 업체 종속으로 인해 다른 대안으로 변경하는 데 막대한 비용이 듭니다.
NVIDIA GPU를 사용할 수 없으므로 랩톱에서 딥 러닝 프로토 타이핑은 잊어 버리십시오.
마더 보드에 납땜되어 있으므로 하드웨어를 업그레이드 할 수 없습니다. 더 많은 RAM이 필요한 경우 새 노트북을 구입해야합니다.

울트라 북 (일반)을 피하십시오:

대부분의 울트라 북은 가벼운 워크로드, 웹 브라우징, 사무 생산성 소프트웨어 등을 위해 설계되었습니다. 대부분은 위에 나열된 냉각 시스템 요구 사항을 충족하지 못하며 수명이 짧습니다. 또한 업그레이드 할 수 없습니다.

운영 체제

데이터 과학 용으로 사용되는 운영 체제는 Ubuntu의 최신 LTS (장기 지원)입니다. 이 글을 쓰는 시점에서Ubuntu 20.04 LTS최신입니다.

Ubuntu는 데이터 과학자로서 다른 운영 체제 및 기타 Linux 배포판에 비해 몇 가지 이점을 제공합니다.

가장 성공적인 데이터 과학 도구는 오픈 소스이며 무료 오픈 소스 인 Ubuntu에서 설치 및 사용하기 쉽습니다. 이러한 도구의 대부분의 개발자는 아마도 Linux를 사용하고있을 것입니다. TensorFlow, PyTorch 등과 같은 GPU를 지원하는 딥 러닝 프레임 워크의 경우 특히 그렇습니다.
데이터 작업을 할 때 보안은 설정의 핵심이되어야합니다. Linux는 기본적으로 Windows 또는 OS X보다 더 안전하며 소수의 사람들이 사용하기 때문에 대부분의 악성 소프트웨어는 Linux에서 실행되도록 설계되지 않았습니다.
데스크탑과 서버 모두에서 가장 많이 사용되는 Linux 배포판이며 훌륭하고 지원적인 커뮤니티입니다. 제대로 작동하지 않는 문제를 발견하면 도움을 받거나 문제 해결 방법에 대한 정보를 찾는 것이 더 쉽습니다.
대부분의 서버는 Linux 기반입니다., 해당 서버에 코드를 배포하고 싶을 것입니다. 프로덕션에 배포 할 환경에 가까울수록 좋습니다. 이것이 Linux를 플랫폼으로 사용하는 주된 이유 중 하나입니다.
그것은 위대한패키지 관리자거의 모든 것을 설치하는 데 사용할 수 있습니다.

Ubuntu 설치시주의 사항 :

운이 좋으면 전용 GPU를 사용할 수있는 경우 그래픽 전용 드라이버를 설치하지 마십시오 (설치하는 동안 확인란이 선택되지 않음). 기본 드라이버가 버그가 있고 특정 GPU (제 경우와 같이)에서 외부 모니터가 제대로 작동하지 않을 수 있으므로 나중에 설치할 수 있습니다.

Uncheck third-party software while installation — 설치하는 동안 타사 소프트웨어 선택 취소

설치 USB를 올바르게 만드십시오. Linux에 액세스 할 수 있으면 사용할 수 있습니다.시동 디스크 작성기, Windows 또는 OSX의 경우balenaEtcher확실한 선택입니다.

NVIDIA 드라이버

NVIDIA Linux 지원은 수년 동안 커뮤니티의 불만 중 하나였습니다. 유명한 것을 기억하십시오.

NVIDIA: 젠장!

Linus Torvalds가 NVIDIA에 대해 이야기

운 좋게도 상황이 바뀌었고 지금은 여전히 엉덩이에 통증이 있지만 모든 것이 더 쉽습니다.

다음은 NVIDIA 드라이버를 설치해야하는 방법입니다.

1. 추가독점 GPU 드라이버 PPA시스템에 :

sudo add-apt-repository ppa:graphics-drivers/ppa

2. Install the latest available drivers (440 at the time of writing this post; use the TAB key to check for available options):

sudo apt install nvidia-driver-440

설치가 완료 될 때까지 기다렸다가 PC를 재부팅하십시오.

이제 NVIDIA X 서버 설정에 액세스 할 수 있습니다.

이를 사용하여 절전 모드 (딥 러닝 작업을 수행하지 않을 경우 유용함)와 성능 모드 (GPU를 사용할 수 있지만 배터리를 소모 함)간에 전환 할 수 있습니다. 온 디맨드 모드는 여전히 제대로 작동하지 않으므로 피하십시오.

NVIDIA X Server Settings — NVIDIA X 서버 설정

또한 실행할 수 있어야합니다.nvidia-smiGPU 워크로드 (사용량, 온도, 메모리)에 대한 정보를 표시하는 애플리케이션입니다. GPU에서 딥 러닝 모델을 훈련하는 동안 많이 사용할 것입니다.

단말기

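While the default GNOME Terminal is OK, I prefer using Terminator, a powerful terminal emulator which allows you to split the terminal window vertically (Ctrl+Shift+E) and horizontally (Ctrl+Shift+O), as well as broadcast commands to several terminals at the same time. This is useful to set up various servers or a cluster.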

다음과 같이 Terminator를 설치합니다.

sudo apt install terminator

VirtualBox

VirtualBox는 다음을 실행할 수있는 소프트웨어입니다.사실상현재 운영 체제 세션 내의 다른 컴퓨터. 다른 운영 체제를 실행할 수도 있습니다 (Linux 내부의 Windows 또는 그 반대).

다음과 같은 BI 및 Dashboarding 도구와 같이 Linux에서 사용할 수없는 특정 소프트웨어가 필요한 경우에 유용합니다.

VirtualBox는 VM (Virtual Machine)을 만들고 필요한 것을 테스트하고 삭제할 수 있기 때문에 시스템을 손상시키지 않고 새로운 라이브러리와 소프트웨어를 테스트하는데도 유용합니다.

VirtualBox를 설치하려면 터미널을 열고 다음을 작성하십시오.

sudo apt install virtualbox

사용법은 매우 간단하지만 마스터하기는 어렵습니다. VirtualBox에 대한 광범위한 자습서를 보려면이.

Python, R 등 (Miniconda 포함)

Python은 이미 Ubuntu에 포함되어 있습니다. 하지만 너시스템 Python을 사용하거나 시스템 전체에 분석 라이브러리를 설치해서는 안됩니다.. 그렇게하면 시스템 파이썬이 깨질 수 있으며, 고치는 것은 어렵습니다.

저는 고립 된 가상 환경을 만드는 것을 선호합니다. 문제가 발생할 경우를 대비하여 삭제하고 다시 만들 수 있습니다. 이를 위해 사용할 수있는 가장 좋은 도구는콘다:

모든 언어 (Python, R, Ruby, Lua, Scala, Java, JavaScript, C / C ++, FORTRAN 등)에 대한 패키지, 종속성 및 환경 관리.

많은 사람들이 conda를 사용하지만 그것이 어떻게 작동하고 무엇을하는지 정말로 이해하는 사람은 거의 없습니다. 좌절감으로 이어질 수 있습니다.

conda는 두 가지 유형으로 제공됩니다.

Anaconda : conda 패키지 관리자 포함과많은 라이브러리 (500Mb)가 설치되었습니다. 이러한 라이브러리를 모두 사용하지는 않을 것이며 며칠 안에 구식이 될 것입니다. 나는이 풍미를 가지고가는 것을 추천하지 않는다.
Miniconda : conda 패키지 관리자 만 포함합니다. conda 또는 pip를 통해 모든 기존 라이브러리에 계속 액세스 할 수 있지만 해당 라이브러리는 필요할 때 다운로드 및 설치됩니다. 이 옵션을 사용하면 시간과 메모리를 절약 할 수 있습니다.

Miniconda 설치 스크립트 다운로드여기그것을 실행하십시오 :

bash Miniconda3-latest-Linux-x86_64.sh

Make sure you initialize conda (answer yes when the install script asks!) and that these lines are added to your .bashrc file:

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/david/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/david/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/home/david/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/david/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

This will add the conda app to your PATH so you can access it anytime. To check that conda is properly installed, run the following in your terminal:

conda help

conda 가상 환경에서 원하는 Python 버전은 물론 R, Java, Julia, Scala 등을 설치할 수 있습니다.

conda 및 pip 패키지 관리자 모두에서 라이브러리를 설치할 수도 있으며 동일한 가상 환경에서 완벽하게 호환되므로 둘 중 하나를 선택할 필요가 없습니다.

conda에 대해 한 가지 더 :

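conda offers a unique feature for deploying your code. It is a library called conda-pack, and it is a must for us. It has helped us many times to get our libraries deployed in internet-isolated clusters with no access to pip, no python3 and no simple way to install anything you need.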

이 라이브러리를 사용하면.tar.gz원하는 곳에서 압축을 풀 수 있습니다. 그런 다음 환경을 활성화하고 평소처럼 사용할 수 있습니다.

To install and use conda-pack visit this link:

conda install -c conda-forge conda-pack

이는 주어진 환경에서 작업 할 수있는 올바른 권한을 부여하지 않고 프로젝트 요구 사항에 맞게 구성 할 시간이없는 게으른 IT 직원에 대한 궁극적 인 무기입니다. ssh가 서버에 액세스 할 수 있습니까? 그런 다음 원하고 필요한 환경을 갖게됩니다.

다음은 공식 문서의 데모입니다.

https://asciinema.org/a/186862

Jupyter

Jupyter는 대화 형 프로그래밍 환경이 필요한 개발을 위해 데이터 과학자에게 필수입니다.

제가 몇 년 동안 배운 비결은 로컬 JupyterHub 서버를 만들고 시스템 서비스로 구성하여 매번 서버를 시작할 필요가 없도록하는 것입니다 (노트북이 시작 되 자마자 서버가 항상 가동되고 대기합니다). 또한 모든 환경에서 Python / R 커널을 감지하고 Jupyter에서 자동으로 사용할 수 있도록하는 라이브러리를 설치합니다.

이것을하기 위해:

1. 먼저 conda 가상 환경을 만듭니다 (일반적으로jupyter_env) :

conda create -n jupyter_env

2. 환경 활성화 :

conda activate jupyter_env

3. Python을 설치합니다.

conda install python=3.7

4. 필요한 라이브러리를 설치합니다.

conda install -c conda-forge jupyterhub jupyterlab nodejs nb_conda_kernels

5. Create a service file with sudo nano /etc/systemd/system/jupyterhub.service with the following content (adapt the paths, replacing <your_user> with your user name):

[Unit]
Description=JupyterHub
After=network.target

[Service]
User=root
Environment="PATH=/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/home/<your_user>/miniconda3/envs/jupyter_env/bin:/home/<your_user>/miniconda3/bin"
ExecStart=/home/<your_user>/miniconda3/envs/jupyter_env/bin/jupyterhub

[Install]
WantedBy=multi-user.target

6. 서비스 데몬을 다시로드합니다.

sudo systemctl daemon-reload

7. Start the jupyterhub service:

sudo systemctl start jupyterhub

8. Enable the jupyterhub service, so it starts automatically at boot time:

sudo systemctl enable jupyterhub

9. Now you can go to localhost:8000 and log in with your Linux user and password:

JupyterHub login page — JupyterHub 로그인 페이지

10. After login you have access to a fully fledged Jupyter server in classic mode (/tree) or the more recent JupyterLab (/lab):

이 Jupyter 설정의 가장 흥미로운 기능은 모든 conda 환경에서 커널을 감지하므로 여기에서 번거 로움없이 해당 커널에 액세스 할 수 있다는 것입니다. 해당 커널을 설치하십시오.원하는 환경에서(conda 설치 ipykernel, 또는conda 설치 irkernel) 및 JupyterHub 제어판에서 Jupyter 서버를 다시 시작합니다.

JupyterHub control panel — JupyterHub 제어판

커널을 설치하려는 환경을 이전에 활성화해야합니다! (conda activate <env_name>).

IDEs

파이썬

파이썬은 우리의 기본 언어입니다.WhiteBox.

아시다시피 저는FOSS특히 데이터 생태계의 솔루션입니다. 소수 중 하나소유권여기서 추천 할 소프트웨어는 우리가 사용하는 소프트웨어입니다.IDE:PyCharm. 코드에 대해 진지하게 생각한다면 PyCharm과 같은 IDE를 사용하고 싶을 것입니다.

코드 완성 및 환경 검사.
네이티브 conda 지원을 포함한 Python 환경.
디버거.
Docker 통합.
Git 통합.
과학 모드 (pandas DataFrame 및 NumPy 배열 검사).

다른 인기있는 선택에는 Pycharm의 안정성과 기능이 없습니다.

Visual Studio Code : IDE보다 텍스트 편집기에 가깝습니다. 플러그인을 사용하여 확장 할 수 있다는 것을 알고 있지만 PyCharm만큼 강력하지는 않습니다. 여러 언어로 된 프로젝트가있는 웹 개발자라면 Visual Studio Code가 좋은 선택 일 수 있습니다. 웹 개발자이고 Python이 백엔드 용으로 선택한 언어라면 데이터에 있지 않더라도 Pycharm을 사용하십시오.
Jupyter: if you have doubts about when you should be using Jupyter or PyCharm and call yourself a Data <whatever>, please attend one of the bootcamps we teach asap.

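Our advice for installing PyCharm is to use Snap, so your installation will be automatically updated and isolated from the rest of the system. For the community (free) version: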

sudo snap install pycharm-community --classic

스칼라

Scala는 기본 Spark로 빅 데이터 프로젝트에 사용하는 언어이지만 PySpark로 전환하고 있습니다.

여기에서 우리의 추천은IntelliJ IDEA. PyCharm 개발자 (JetBrains)의 JVM 기반 언어 (Java, Kotlin, Groovy, Scala) 용 IDE입니다. 가장 좋은 기능은 Scala에 대한 기본 지원과 PyCharm과의 유사점입니다. Eclipse에서 온 경우 키 바인딩 및 단축키를 수정하여 Eclipse를 복제 할 수 있습니다.

커뮤니티 (무료) 버전을 설치하려면 :

sudo snap install intellij-idea-community --classic

빅 데이터

좋아, 당신은 당신의 노트북에서 실제로 빅 데이터를 수행하지 않을 것입니다. 빅 데이터 프로젝트에 참여하는 경우 회사 또는 고객이 적절한 Hadoop 클러스터를 제공 할 것입니다.

그러나 랩톱 메모리에 쉽게 맞지 않는 데이터로 모델을 분석하거나 만들려는 상황이 있습니다. 이러한 경우 로컬 Spark 설치가 매우 유용합니다. 저의 겸손한 노트북을 사용하여 디스크에 GB 크기의 데이터 세트를 분쇄했습니다.

다음은 노트북에서 Spark를 시작하고 실행하기위한 권장 사항입니다.

1. conda 환경을 만들거나 활성화합니다.

2. PySpark 및 OpenJDK를 설치합니다.

conda install pyspark openjdk

3. 로컬 스파크 사용 :

from pyspark.sql import SparkSession

spark = SparkSession.builder. \
	appName('your_app_name'). \
	config('spark.sql.session.timeZone', 'UTC'). \
	config('spark.driver.memory', '16G'). \
	config('spark.driver.maxResultSize', '2G'). \
	getOrCreate()

여기에 훌륭한 언급이 있습니다 :

Dask: 일을 많이 단순화하는 Dask는 일종의 Python 용 기본 Spark입니다. pandas 및 NumPy API에 더 가깝지만 경험상 Spark만큼 강력하지는 않습니다. 우리는 때때로 그것을 사용합니다.
모딘: 멀티 코어 및 아웃 오브 코어 계산을 지원하는 Pandas API를 복제합니다. 많은 코어 (32, 64)가있는 강력한 분석 서버에서 작업하고 pandas를 사용하려는 경우 특히 유용합니다. 일반적으로 코어 당 성능이 좋지 않기 때문에 Modin을 사용하면 계산 속도를 높일 수 있습니다.

데이터베이스 도구

때로는 다양한 DB 기술과 연결하고 쿼리를 만들고 데이터를 탐색 할 수있는 도구가 필요합니다. 우리의 선택은DBeaver:

DBeaver는 다양한 데이터베이스의 드라이버를 자동으로 다운로드하는 도구입니다. 다음을 지원합니다.

데이터베이스, 스키마, 테이블 및 열 이름 완성.
연결을위한 고급 네트워킹 요구 사항 (예 : SSH 터널 등)

다음과 같이 DBeaver를 설치할 수 있습니다.

sudo snap install dbeaver-ce

가작 :

DataGrip: JetBrains의 데이터베이스 IDE는 때때로 DBeaver와 매우 유사하며 지원되는 기술은 적지 만 매우 안정적입니다.sudo snap install datagrip --classic

기타

우리에게 중요한 기타 특정 도구 및 앱은 다음과 같습니다.

Spotify: sudo snap install spotify.
Slack: sudo snap install slack --classic.
전보 데스크탑 :sudo snap install telegram-desktop.
Nextcloud :sudo apt install nextcloud-desktop nautilus-nextcloud.
Thunderbird Mail : 기본적으로 설치됩니다.
Zoom : 다음에서 수동으로 다운로드여기.
Google 크롬 : 다음에서 수동으로 다운로드여기.

그리고 이것이 전부입니다. 그것들은 많은 도구이고 우리는 아마도 뭔가를 잊었을 것입니다. 여기에 일부 카테고리를 놓치거나 조언이 필요한 경우 의견을 남겨 주시면 게시물을 연장하겠습니다.

원래 게시 된 게시물 :https://davidadrian.cc


The Definitive Data Scientist Environment Setup

David Adrián Cañones

May 4 · 13 min read

Intro and motivation

In this post I would like to describe in detail our setup and development environment (hardware & software) and how to get it, step by step.

I have been using this setup for more than 5 years with few changes (mainly hardware improvements) across many companies, and it has helped me in the development of dozens of data projects. I have never missed a single feature while using it. This is the standard setup that both Pedro and I use at WhiteBox.

Why this guide? Over time, we found many students and fellow Data Scientists looking for a solid environment with some fundamental features:

Standard Data Science tools like Python, R, and their libraries are easy to install and maintain.
Most libraries just work out of the box with little extra configuration.
It covers the full spectrum of data-related tasks, from Small to Big Data, and from standard Machine Learning models to Deep Learning prototyping.
You do not need to break the bank to buy expensive hardware and software.

Hardware

Your laptop should have:

At least 16GB of RAM. This is the most important feature, as it limits the amount of data you can easily process in memory (without using tools like Dask or Spark). The more the better. Go with 32GB if you can afford it.
A powerful processor. At least an Intel i5 or i7 with 4 cores. It will save you a lot of time while processing data.
An NVIDIA GPU with at least 4GB of memory. Only if you need to prototype or fine-tune simple Deep Learning models. It will be orders of magnitude faster than almost any CPU for that task. Remember that you can’t train serious Deep Learning models from scratch on a laptop.
A good cooling system. You are going to run workloads for hours at a time. Make sure your laptop can handle it without melting.
An SSD of at least 256GB should be enough.
The possibility to upgrade its capabilities, like adding a bigger SSD or more RAM, or easily replacing the battery.
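As a rough illustration of the checklist above, the following sketch compares a machine's RAM and core count against the 16GB and 4-core minimums. The function names and the 10% tolerance are assumptions made for this example (the kernel reserves some memory, so a 16GB laptop reports slightly less), and os.cpu_count() reports logical rather than physical cores.

```python
import os


def read_total_ram_gb(meminfo_path="/proc/meminfo"):
    """Return total RAM in GiB as reported by the Linux kernel."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                kib = int(line.split()[1])  # the kernel reports this value in kB
                return kib / (1024 ** 2)
    raise RuntimeError("MemTotal not found in " + meminfo_path)


def check_laptop(ram_gb, cpu_cores, min_ram_gb=16, min_cores=4, tolerance=0.9):
    """Return a list of problems; an empty list means the minimums are met.

    `tolerance` is a hypothetical fudge factor: part of the installed RAM is
    reserved, so the reported total is a bit below the nominal size.
    """
    problems = []
    if ram_gb < min_ram_gb * tolerance:
        problems.append(f"RAM: {ram_gb:.1f} GB is below {min_ram_gb} GB")
    if cpu_cores < min_cores:
        problems.append(f"CPU cores: {cpu_cores} is below {min_cores}")
    return problems


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    # os.cpu_count() counts logical cores (hyper-threads included).
    issues = check_laptop(read_total_ram_gb(), os.cpu_count() or 1)
    print("Meets the minimums" if not issues else "\n".join(issues))
```

GPU memory, cooling and upgradability are not covered here; those still need a look at the spec sheet.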

My personal recommendation is getting a second-hand ThinkPad workstation laptop. I have a second-hand P50 I bought for 500€ which meets all the features listed above.

ThinkPads are excellent professional laptops that we have been using for years and that have never failed us. Their handicap is the price, but you can find lots of second-hand ThinkPads in very good condition, as many big corporations have leasing agreements and dispose of their laptops every 2 years. Many of these laptops end up on the second-hand market. You can start your search in:

Many of these second-hand markets can provide a warranty and an invoice (in case you are a company). If you are reading this post and belong to a medium or large organization, the best option for you is probably reaching a leasing agreement directly with the manufacturer.

Avoid Apple MacBooks:

For a variety of reasons, you should avoid Apple laptops unless you really (and I mean, really) love OSX. They are intended for creative professionals, such as designers, music producers, photographers, video and photo editors and UX/UI specialists, and for developers who don’t need to run heavy workloads, like Web Developers. My main laptop from 2011 to 2016 was a MacBook, so I know its limitations very well. The main reasons not to buy one are:

You are going to pay much more for the same hardware.
You will suffer a terrible vendor lock-in, which means a huge cost to switch to any alternative.
You can’t have an NVIDIA GPU, so forget about Deep Learning prototyping on your laptop.
You cannot upgrade the hardware, as it is soldered to the motherboard. In case you need more RAM, you have to buy a new laptop.

Avoid Ultrabooks (in general):

Most ultrabooks are designed for light workloads such as web browsing, office productivity software and similar. Most of them do not meet the cooling system requirement listed above, and their life will be short. They are also not upgradable.

Operating System

Our go-to operating system for Data Science is the latest LTS (Long Term Support) of Ubuntu. At the time of writing this post, Ubuntu 20.04 LTS is the latest.

Ubuntu offers some advantages over other operating systems and other Linux distros for you as a Data Scientist:

Most successful Data Science tools are open-source and are easy to install and use in Ubuntu, which is also free and open-source. It makes sense, as most developers of those tools are probably using Linux. It is especially true when it comes to Deep Learning frameworks with GPU support, like TensorFlow, PyTorch, etc.
As you are going to be working with data, security must be at the core of your setup. Linux is, by default, more secure than Windows or OS X, and as it is used by a minority of people, most malicious software is not designed to run on Linux.
It is the most used Linux distro, both for desktops and servers, with a great and supportive community. When you find something is not working well, it will be easier to get help or find info about how to fix it.
Most servers are Linux based, and you probably want to deploy your code in those servers. The closer you are to the environment you are going to deploy to production, the better. This is one of the main reasons to use Linux as your platform.
It has a great package manager you can use to install almost everything.

Some caveats when installing Ubuntu:

If you are lucky enough to have a dedicated GPU, do not install the proprietary graphics drivers during installation (leave the box unchecked). You can install them later; the drivers the installer provides by default can be buggy and may cause external monitors to not work properly with certain GPUs (as in my case).

Create your installation USB properly. If you have access to Linux, you can use Startup Disk Creator; for Windows or OSX, balenaEtcher is a solid choice.

NVIDIA Drivers

NVIDIA Linux support has been one of the complaints of the community for years. Remember the famous quote:

NVIDIA: FUCK YOU!

Linus Torvalds talking about NVIDIA

Luckily, things have changed and now, although still a pain in the ass sometimes, everything is easier.

This is how you must install NVIDIA drivers:

1. Add the proprietary GPU drivers PPA to your system:

sudo add-apt-repository ppa:graphics-drivers/ppa

2. Install the latest available drivers (440 at the time of writing this post; use the TAB key to check for available options):

sudo apt install nvidia-driver-440

Wait for the installation to finish and reboot your PC.

You should now be able to access NVIDIA X Server Settings:

You can use this to switch between Power Saving Mode (useful if you are not going to do any Deep Learning Stuff) and Performance Mode (allows you to use GPU, but drains your battery). Avoid On-Demand mode as it is still not working properly.

You should also be able to run the nvidia-smi application, which displays information about GPU workloads (usage, temperature, memory). You are going to use it a lot while training Deep Learning models on GPU.
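If you want those numbers inside a script rather than on screen, a minimal sketch like the one below may help. It assumes nvidia-smi is on your PATH and uses its CSV query mode; the helper names (parse_gpu_csv, query_gpus) are made up for this example.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["utilization.gpu", "temperature.gpu", "memory.used", "memory.total"]


def _to_number(value):
    """Convert a CSV cell to float; unsupported readings such as [N/A] become None."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output: one line per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.split(",")]
        gpus.append(dict(zip(QUERY_FIELDS, (_to_number(c) for c in cells))))
    return gpus


def query_gpus():
    """Run nvidia-smi and return one dict per GPU, or [] if it is unavailable."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(result.stdout)
```

Calling query_gpus() between training epochs is one way to log GPU usage and temperature next to your loss values.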

Terminal

While the default GNOME Terminal is OK, I prefer using Terminator, a powerful terminal emulator which allows you to split the terminal window vertically (Ctrl+Shift+E) and horizontally (Ctrl+Shift+O), as well as broadcast commands to several terminals at the same time. This is useful to set up various servers or a cluster.

Install Terminator like this:

sudo apt install terminator

VirtualBox

VirtualBox is software that allows you to run other, virtual machines inside your current operating system session. You can also run different operating systems (Windows inside Linux, or the other way around).

It is useful in case you need specific software which is not available for Linux, such as BI and dashboarding tools like:

VirtualBox is also useful to test new libraries and software without compromising your system, as you can just create a VM (Virtual Machine), test whatever you need and delete it.

To install VirtualBox, open your terminal and write:

sudo apt install virtualbox

Although its usage is fairly simple, it is hard to master. For an extensive tutorial of VirtualBox, check this.

Python, R and more (with Miniconda)

Python is already included with Ubuntu. But you should never use system Python or install your analytics libraries system-wide. You can break system Python doing that, and fixing it is hard.

I prefer creating isolated virtual environments I can just delete and create again in case something goes wrong. The best tool you can use to do that is conda:

Package, dependency and environment management for any language: Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++, FORTRAN, and more.

Although many people use conda, few really understand how it works and what it does, which may lead to frustration.

conda is shipped in two flavors:

Anaconda: includes the conda package manager and a lot of preinstalled libraries (about 500 MB). You are not going to use all those libraries and, moreover, they will be outdated in a few days. I do not recommend going with this flavor.
Miniconda: includes just the conda package manager. You still have access to all existing libraries through conda or pip, but those libraries will be downloaded and installed only when they are needed. Go with this option, as it will save you time and disk space.

Download Miniconda install script from here and run it:

bash Miniconda3-latest-Linux-x86_64.sh

Make sure you initialize conda (answer yes when the install script asks!) and that these lines are added to your .bashrc file:

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/david/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/david/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/home/david/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/david/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

This will add the conda app to your PATH so you can access it anytime. To check that conda is properly installed, run the following in your terminal:

conda help

Remember that in a conda virtual environment you can install whatever Python version you want, as well as R, Java, Julia, Scala, and more…

Remember that you can also install libraries from both conda and pip package managers and don’t have to choose one of them as they are perfectly compatible in the same virtual environment.
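To confirm that a script or notebook is really running inside a conda environment and not on system Python, a small check like this sketch can be handy. It relies on the CONDA_PREFIX and CONDA_DEFAULT_ENV variables that conda activate sets; the function name and the list of "system" prefixes are assumptions for this example.

```python
import os
import sys

# Hypothetical list of prefixes treated as "system Python" on Ubuntu.
SYSTEM_PREFIXES = ("/usr", "/usr/local")


def describe_interpreter(prefix=None, env=None):
    """Report where the running Python lives and which conda env (if any) owns it."""
    prefix = sys.prefix if prefix is None else prefix
    env = os.environ if env is None else env
    conda_prefix = env.get("CONDA_PREFIX")
    # The interpreter belongs to the active env only if both prefixes match.
    in_conda_env = (
        conda_prefix is not None
        and os.path.realpath(prefix) == os.path.realpath(conda_prefix)
    )
    return {
        "prefix": prefix,
        "conda_env": env.get("CONDA_DEFAULT_ENV") if in_conda_env else None,
        "is_system_python": os.path.realpath(prefix) in SYSTEM_PREFIXES,
    }
```

Running print(describe_interpreter()) before a pip install is a cheap way to avoid installing libraries into the wrong place.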

One more thing about conda:

conda offers a unique feature for deploying your code. It is a library called conda-pack, and it is a must for us. It has helped us many times to get our libraries deployed in internet-isolated clusters with no access to pip, no python3 and no simple way to install anything you need.

This library allows you to create a .tar.gz file of your environment that you can just uncompress wherever you want. Then you can activate the environment and use it as usual.

To install and use conda-pack visit this link:

conda install -c conda-forge conda-pack

This is the ultimate weapon against lazy IT guys who don’t give you the right permissions to work in a given environment and don’t have time to configure it to suit your project needs. Have ssh access to a server? Then you have the environment you want and need.
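The pack-and-unpack round trip described above can be sketched in Python as follows. This is an illustration under assumptions, not the official workflow: pack_environment uses conda-pack's Python API on the machine that has conda, while unpack_environment only needs the standard library on the target server. The function names are hypothetical, and invoking bin/conda-unpack through the environment's own interpreter assumes it is a Python script, as it is on Linux.

```python
import subprocess
import tarfile
from pathlib import Path


def pack_environment(env_name, archive_path):
    """On the machine with conda: archive a named environment with conda-pack."""
    import conda_pack  # imported lazily; only needed on the source machine

    return conda_pack.pack(name=env_name, output=str(archive_path))


def unpack_environment(archive_path, target_dir, fix_prefixes=True):
    """On the isolated server: extract the archive and repair hard-coded paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target)
    unpack_script = target / "bin" / "conda-unpack"
    if fix_prefixes and unpack_script.exists():
        # Assumption: conda-unpack is run with the environment's own Python,
        # which conda-pack documents as usable before prefixes are fixed.
        subprocess.run([str(target / "bin" / "python"), str(unpack_script)], check=True)
    return target
```

After unpacking, source <target_dir>/bin/activate in a shell to use the environment as usual.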

Here is a demo from official documentation:

https://asciinema.org/a/186862

Jupyter

Jupyter is a must for a Data Scientist, for development work where you need an interactive programming environment.

A trick I learned over the years is to create a local JupyterHub server and configure it as a system service, so I don’t have to launch the server every time (it is always up and waiting as soon as the laptop starts). I also install a library that detects Python/R kernels in all my environments and automatically makes them available in Jupyter.

To do this:

1. First create a conda virtual environment (I usually call it jupyter_env):

conda create -n jupyter_env

2. Activate the environment:

conda activate jupyter_env

3. Install Python:

conda install python=3.7

4. Install needed libraries:

conda install -c conda-forge jupyterhub jupyterlab nodejs nb_conda_kernels

5. Create a service file with sudo nano /etc/systemd/system/jupyterhub.service with the following content (adapt the paths, replacing <your_user> with your user name):

[Unit]
Description=JupyterHub
After=network.target

[Service]
User=root
Environment="PATH=/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/home/<your_user>/miniconda3/envs/jupyter_env/bin:/home/<your_user>/miniconda3/bin"
ExecStart=/home/<your_user>/miniconda3/envs/jupyter_env/bin/jupyterhub

[Install]
WantedBy=multi-user.target

6. Reload the service daemon:

sudo systemctl daemon-reload

7. Start jupyterhub service:

sudo systemctl start jupyterhub

8. Enable jupyterhub service, so it starts automatically at boot time:

sudo systemctl enable jupyterhub

9. Now you can go to localhost:8000 and log in with your Linux user and password:

10. After login you have access to a fully fledged Jupyter server in classic mode (/tree) or the more recent JupyterLab (/lab):

The most interesting feature of this Jupyter setup is that it detects kernels in all conda environments, so you can access those kernels from here with no hassle. Just install the corresponding kernel in the desired environment (conda install ipykernel, or conda install r-irkernel) and restart the Jupyter server from the JupyterHub control panel.

Remember to activate the environment where you want to install the kernel first (conda activate <env_name>).
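Since the service file from step 5 only differs in the user name and paths, one way to avoid typos is to render it from a template. The sketch below does that; the function name render_unit and the default Miniconda location under /home/<user>/miniconda3 are assumptions taken from the paths used in this post.

```python
from string import Template

# Each section header sits on its own line, as systemd expects.
UNIT_TEMPLATE = Template("""\
[Unit]
Description=JupyterHub
After=network.target

[Service]
User=root
Environment="PATH=/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:$env_bin:$conda_bin"
ExecStart=$env_bin/jupyterhub

[Install]
WantedBy=multi-user.target
""")


def render_unit(user, env_name="jupyter_env", conda_root=None):
    """Return the text of jupyterhub.service for the given Linux user."""
    conda_root = conda_root or f"/home/{user}/miniconda3"
    return UNIT_TEMPLATE.substitute(
        env_bin=f"{conda_root}/envs/{env_name}/bin",
        conda_bin=f"{conda_root}/bin",
    )
```

Print render_unit("<your_user>") and paste the output into /etc/systemd/system/jupyterhub.service, then continue with the daemon reload from step 6.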

IDEs

Python

Python is our primary language at WhiteBox.

As you probably know, I am a supporter of FOSS solutions, especially in the Data ecosystem. One of the few pieces of proprietary software I am going to recommend here is the one we use as our IDE: PyCharm. If you are serious about your code, you want to use an IDE like PyCharm:

Code completion and environment introspection.
Python environments, including native conda support.
Debugger.
Docker integration.
Git integration.
Scientific mode (pandas DataFrame and NumPy arrays inspection).

Other popular choices do not have the stability and features of PyCharm:

Visual Studio Code: it is more a text editor than an IDE. I know you can extend it using plugins, but it is not as powerful as PyCharm. If you are a Web Dev with projects in multiple languages, Visual Studio Code may be a good choice for you. If you are a Web Developer and Python is your language of choice for the back-end, go with PyCharm even if you are not in Data.
Jupyter: if you have doubts about when you should be using Jupyter or PyCharm and call yourself a Data <whatever>, please attend one of the bootcamps we teach asap.

Our advice for installing PyCharm is using Snap, so your installation will be automatically updated and isolated from the rest of the system. For community (free) version:

sudo snap install pycharm-community --classic

Scala

Scala is a language we use for Big Data projects with native Spark, although we are shifting to PySpark.

Our recommendation here is IntelliJ IDEA. It is an IDE for JVM-based languages (Java, Kotlin, Groovy, Scala) from the PyCharm developers (JetBrains). Its best features are its native support for Scala and its similarities to PyCharm. If you come from Eclipse, you can adapt the key bindings and shortcuts to replicate the Eclipse ones.

To install community (free) version:

sudo snap install intellij-idea-community --classic

Big Data

Okay, you are not really going to do Big Data on your laptop. In case you are in a Big Data project, your company or client is going to provide you with a proper Hadoop cluster.

But there are situations where you may want to analyze or build a model with data that doesn’t fit easily in your laptop’s memory. In those cases, a local Spark installation is very helpful. Using my humble laptop, I have crushed datasets that take up gigabytes on disk, which translates into much more in memory.

This is our recommendation to get Spark up and running on your laptop:

1. Create or activate a conda environment.

2. Install PySpark and OpenJDK:

conda install pyspark openjdk

3. Use your local spark:

from pyspark.sql import SparkSession

# Build (or reuse) a local Spark session; adjust the memory settings to your laptop.
spark = (
    SparkSession.builder
    .appName('your_app_name')
    .config('spark.sql.session.timeZone', 'UTC')
    .config('spark.driver.memory', '16G')
    .config('spark.driver.maxResultSize', '2G')
    .getOrCreate()
)
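The '16G' driver memory above equals the entire RAM of a minimum-spec laptop, so you may prefer to derive the value from the machine instead of hard-coding it. The sketch below is one way to do so; the helper names and the 75% fraction are assumptions for illustration, not a Spark recommendation.

```python
import os


def suggest_driver_memory(total_ram_gb, fraction=0.75, floor_gb=1):
    """Return a Spark memory string (e.g. '12G') using a share of total RAM."""
    gb = max(floor_gb, int(total_ram_gb * fraction))
    return f"{gb}G"


def total_ram_gb():
    """Physical RAM in GiB, read through POSIX sysconf (works on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024 ** 3


def build_local_session(app_name="your_app_name"):
    """Create a local SparkSession whose driver memory scales with the laptop."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.sql.session.timeZone", "UTC")
        .config("spark.driver.memory", suggest_driver_memory(total_ram_gb()))
        .config("spark.driver.maxResultSize", "2G")
        .getOrCreate()
    )
```

Leaving some memory for the operating system, the browser and the IDE tends to keep the laptop responsive during long jobs.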

Honorable mentions here are:

Dask: simplifying things a lot, Dask is a kind of native Spark for Python. It is closer to the pandas and NumPy APIs, but in our experience it is by far not as robust as Spark. We use it from time to time.
Modin: replicates the pandas API with support for multi-core and out-of-core computations. It is especially useful when you are working on a powerful analytics server with lots of cores (32, 64) and want to use pandas. As per-core performance on such servers is usually poor, Modin will allow you to speed up your computation.

Database Tools

Sometimes you need a tool able to connect with a variety of different DB technologies, make queries and explore data. Our choice is DBeaver:

DBeaver is a tool that automatically downloads drivers for lots of different databases. It supports:

Database, schema, table and column name completion.
Advanced networking requirements to connect, like SSH tunnels and more.

You can install DBeaver like this:

sudo snap install dbeaver-ce

Honorable mention:

DataGrip: a database IDE by JetBrains that we use sometimes. It is very similar to DBeaver, with fewer technologies supported, but very stable: sudo snap install datagrip --classic

Others

Other specific tools and apps that are important for us are:

Spotify: sudo snap install spotify.
Slack: sudo snap install slack --classic.
Telegram Desktop: sudo snap install telegram-desktop.
Nextcloud: sudo apt install nextcloud-desktop nautilus-nextcloud.
Thunderbird Mail: installed by default.
Zoom: download manually from here.
Google Chrome: download manually from here.

And that is all. That is a lot of tools, and we probably forgot something. In case you miss some category here or need some advice, leave a comment and we will try to extend the post.

Post originally published at: https://davidadrian.cc
