Python Scikit-Learn Cheat Sheet Tutorial

Last updated on 2 July 2020

What is scikit-learn or sklearn?

Scikit-learn is one of the most widely used libraries for machine learning in Python. The sklearn library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction.

Please note that sklearn is meant for building machine learning models. It is not designed for reading, manipulating, or summarizing data; other libraries (e.g., NumPy and pandas) are better suited to those tasks.

Components of scikit-learn:

Scikit-learn comes loaded with a lot of features. Here are a few of them to help you understand the spread:

Supervised learning algorithms:

Think of any supervised machine learning algorithm you might have heard of, and there is a very high chance that it is part of scikit-learn. From generalized linear models (e.g., linear regression) and Support Vector Machines (SVM) to decision trees and Bayesian methods, all of them are part of the scikit-learn toolbox. The breadth of machine learning algorithms is one of the big reasons for scikit-learn’s wide use. I started using scikit-learn to solve supervised learning problems and would recommend the same to people new to scikit-learn or machine learning.

Cross-validation:

There are various methods to check the accuracy of supervised models on unseen data using sklearn.
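As a minimal sketch of the idea, assuming the bundled Iris data and a shallow decision tree (both chosen here only for illustration), cross_val_score runs k-fold cross-validation and returns one score per fold:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load a small toy dataset as (features, labels).
X, y = load_iris(return_X_y=True)

# The model and its settings are illustrative choices, not a recommendation.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# cv=5 splits the data into five folds; each fold is held out once for scoring.
scores = cross_val_score(clf, X, y, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

The mean of the fold scores is usually a more reliable estimate of performance on unseen data than a single train/test split.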

Unsupervised learning algorithms:

Again, a wide range of machine learning algorithms is on offer, from clustering, factor analysis, and principal component analysis to unsupervised neural networks.

Various toy datasets:

This came in handy while learning scikit-learn. I had learned SAS using various academic datasets (e.g. IRIS dataset, Boston House prices dataset). Having them handy while learning a new library helped a lot.

Feature extraction:

Scikit-learn provides tools for extracting features from images and text (e.g., bag of words).
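A bag-of-words representation can be sketched with CountVectorizer. The three short documents below are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny, made-up documents.
docs = ["the cat sat", "the dog sat", "the dog barked"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary and returns a sparse document-term matrix.
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index (columns are in alphabetical order).
print(sorted(vectorizer.vocabulary_))
print(counts.toarray())
```

Each row is a document and each column a word, so the first row has a 1 in the columns for "cat", "sat", and "the".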

Community / Organizations using scikit-learn:

One of the main reasons for using open source tools is the large community behind them, and the same is true for sklearn. At the time of writing there were about 35 contributors to scikit-learn, the most notable being Andreas Mueller (P.S. Andy’s machine learning cheat sheet is one of the best visualizations for understanding the spectrum of machine learning algorithms).

Organizations such as Evernote, Inria, and AWeber are listed as users on the scikit-learn home page, but I truly believe that actual usage is far wider.

In addition to these communities, there are various meetups across the globe. There was also a Kaggle knowledge contest, which finished recently but might still be one of the best places to start playing around with the library.

Machine Learning cheat sheet – see Original image for better resolution

Interfaces

The library is organized around three fundamental APIs (interfaces): Estimator, Predictor, and Transformer. Importantly, these interfaces are complementary: they do not represent hard boundaries between classes or a precise semantic separation, but rather overlap. For example, DecisionTreeClassifier is both an Estimator and a Predictor (more on this in a bit).

Estimators

Estimators represent the core interface in Scikit-Learn. All learning algorithms, whether supervised or unsupervised, classification, regression, or clustering, implement the Estimator interface and expose a fit method.

An Estimator’s fit method takes as input a (training) feature vector (“samples” or “predictors”) as well as (training) target labels (in the case of supervised learning), and in this way the estimator “learns” how to make predictions on unseen data (again, in the case of supervised learning).

A key design principle is that the instantiation of an Estimator (where, for example, you denote a model’s hyper-parameters) is decoupled from the learning process (where you fit the model with training data — your feature vectors, e.g., “X_train”; as well as your training labels/target variables, e.g., “Y_train”). That is, when you construct an Estimator (such as a DecisionTree classifier, as noted earlier), you pass in hyper-parameters, but you do not pass in the training data; the training data is passed in via the fit method. As noted in the “API Design…” paper, this separation is similar to the idea of “partial function application”, where certain arguments are bound — or frozen — to one function, and then that function (with its frozen arguments) is passed to another function — a kind of higher-order function, or functional composition. Indeed, this “pattern”, as it were, is supported by the Python standard library, through functools.partial.
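The separation between construction and learning can be sketched as follows. The helper train_tree is hypothetical and exists only to make the functools.partial analogy concrete; the dataset and hyper-parameter values are illustrative assumptions:

```python
from functools import partial

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = load_iris(return_X_y=True)

# Construction: only hyper-parameters are bound; no data has been seen yet.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)

# Learning: the training data arrives through fit, which returns the estimator.
fitted = clf.fit(X_train, y_train)


def train_tree(max_depth, X, y):
    """Hypothetical helper: build and fit a tree in a single call."""
    return DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X, y)


# partial freezes max_depth now; the data is supplied later, much like fit.
train_shallow_tree = partial(train_tree, 2)
model = train_shallow_tree(X_train, y_train)

print(fitted is clf, model.get_depth())
```

In both cases the configuration is fixed first and the data is supplied in a later step, which is what makes estimators easy to pass around, clone, and tune.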

Predictors

One of the confusing aspects of Scikit-Learn’s interface choices is the separation of Predictors from Estimators; Predictors extend the Estimator interface, and for a given model to “work” it must implement (and expose) a predict method. Indeed, Scikit-Learn’s glossary denotes a Predictor as “an estimator supporting predict and/or fit_predict.” Yet Scikit-Learn semantically separates these two notions through the API. Although this is not a tutorial on how to use the library, and though this may be stating the obvious, the predict method that a model implements as part of the Predictor interface does the work of predicting results — given test features; and after a model is trained, the predictor returns predicted labels for a given input feature vector. Predictors may also provide probabilities as well as prediction scores, which is part of another Design principle (discussed shortly).
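A short sketch of the Predictor methods, assuming a logistic regression model on the Iris data (any classifier that exposes predict_proba would do):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit comes from the Estimator interface.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict returns one label per test sample.
labels = clf.predict(X_test)

# predict_proba returns one probability per class for each sample.
probabilities = clf.predict_proba(X_test)

# score reports mean accuracy on the given test data.
accuracy = clf.score(X_test, y_test)

print(labels[:5], probabilities[0], accuracy)
```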

Transformers

Not surprisingly, Transformers modify data. The “API Design…” paper states it best: like models, transformers also need to be fitted and, once done, their transform method may be invoked. In keeping with the elegant and consistent simplicity of the library, the fit method always returns the estimator it was called on, allowing you to chain the fit and transform methods. For convenience, transformers also support a fit_transform method, so users may do both in one step.

The class diagram denoted below shows how the StandardScaler transformer is also an Estimator, and also mixes in the TransformerMixin class. However, the code to use this transformer is — once again — remarkably straightforward.
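A minimal sketch of using StandardScaler, assuming a small made-up array. It also checks the class relationships described above:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# Made-up data: three samples, two features on very different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()

# fit returns the scaler itself, so fit and transform can be chained.
X_scaled = scaler.fit(X).transform(X)

# fit_transform does both steps in one call.
X_scaled_again = StandardScaler().fit_transform(X)

print(isinstance(scaler, BaseEstimator), isinstance(scaler, TransformerMixin))
print(X_scaled)
```

After scaling, each column has zero mean and unit variance.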

Transparency

All of the hyper-parameters used to construct estimators and transformers, as well as the results of fitting and prediction, are visible to users of the interface as public attributes. Superficially this may seem to contravene the encapsulation and information hiding principles that have long been a hallmark of object-oriented design, where object state is usually made available only through the mediation of class methods. But this design choice greatly simplifies the library (by obviating the proliferation of access methods): because the interface is already consistent (and both lucid and cogent), adding gratuitous methods to fetch key model attributes or prediction results would only diminish its effectiveness.
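To illustrate, here is a sketch using a decision tree on the Iris data (an arbitrary choice). Hyper-parameters are plain attributes, and attributes learned during fit follow the trailing-underscore naming convention:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# Hyper-parameters are readable directly, with no getter method needed.
print(clf.max_depth)

clf.fit(X, y)

# Results of fitting are public attributes ending in an underscore.
print(clf.classes_)
print(clf.feature_importances_)
```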

Core Data Structures

The library’s designers likewise made a straightforward and concrete decision to base the core data representations needed for machine learning on NumPy multi-dimensional arrays, rather than introduce a bespoke set of classes to encapsulate features and labels/targets. As with many other choices, this lowers the barrier to entry, since a user doesn’t have to learn a new class hierarchy for machine learning data, and it ensures good performance in both time and space, given that NumPy’s core is implemented in C.

Composing Estimators: Pipelines

The consistent and uniform interface across all the core semantic components (Estimators, Predictors, and Transformers) affords the library additional power and flexibility by letting users compose new, or augmented, functionality by chaining estimators together. Keeping in mind that transformers are a kind of estimator (recall the class diagram for the StandardScaler above, deriving from BaseEstimator), Scikit-Learn provides a Pipeline class that allows a user to chain together multiple transformers, optionally ending in a final estimator. Because all transformers share the same interface, the pipeline can fit the user’s data to each transformer in turn (by iterating over the steps it was constructed with) and, when fit_transform is called, apply the transformations and return the transformed data. With a pipeline, filling in missing values with the mean of each feature (through the SimpleImputer class) and then scaling the data (through the StandardScaler class) takes only a few lines.
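A minimal sketch of such a pipeline, assuming a small made-up array with one missing value. The step names "impute" and "scale" are arbitrary labels:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data with a missing value in the first column.
X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])

pipeline = Pipeline(
    [
        ("impute", SimpleImputer(strategy="mean")),  # fill NaN with the column mean
        ("scale", StandardScaler()),                 # then standardize each column
    ]
)

# fit_transform fits each step in order and passes its output to the next step.
X_prepared = pipeline.fit_transform(X)

print(X_prepared)
```

The missing entry is replaced by the column mean (2.0), so after scaling it ends up at exactly zero.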

Extensibility Through Duck-Typing

The last design principle I want to mention is the library’s reliance on so-called duck-typing to allow for extensions to the library. Duck-typing means that if a class supports a specific method — i.e., if it “looks like a duck” — then it can be used interchangeably with other objects of the same interface — i.e., it is a duck. This avoids the need for inheritance, which in theory makes the code less tightly coupled, brittle, and complex. That is, if a user wants to extend the library, for example, by writing a custom transformer, they do not necessarily need to inherit from Scikit-Learn classes. In this way the library’s creators bring Pythonic flexibility and pragmatism to Scikit-Learn.
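The idea can be sketched with a transformer that inherits from nothing in scikit-learn. Both LogTransformer and the helper apply_steps are hypothetical names written for this illustration. The helper only relies on objects having fit and transform methods, so it treats the custom class and StandardScaler alike:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


class LogTransformer:
    """Hypothetical transformer with no scikit-learn base class."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        # log(1 + x) keeps zeros at zero and compresses large values.
        return np.log1p(X)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


def apply_steps(steps, X):
    """Hypothetical helper: fit and apply each step in order."""
    for step in steps:
        X = step.fit(X).transform(X)
    return X


X = np.array([[0.0, 1.0], [1.0, 3.0], [3.0, 7.0]])
result = apply_steps([LogTransformer(), StandardScaler()], X)

print(result)
```

In practice, scikit-learn's own utilities may expect more than fit and transform (for example, get_params for cloning), so inheriting from BaseEstimator and TransformerMixin is often the more convenient route when a custom class has to work everywhere in the library.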

Linear Regression

Linear regression is a supervised machine learning algorithm that performs a regression task: it models a target value as a function of one or more independent variables. It is mostly used for finding relationships between variables and for forecasting. Regression models differ in the kind of relationship they assume between the dependent and independent variables, and in the number of independent variables used.
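As a small sketch, assume made-up data generated exactly from y = 1·x0 + 2·x1 + 3. LinearRegression should then recover those coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two independent variables, target built from a known formula.
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 2.0], [2.0, 3.0]])
y = X @ np.array([1.0, 2.0]) + 3.0

reg = LinearRegression().fit(X, y)

# Learned relationship between the variables.
print(reg.coef_, reg.intercept_)

# Forecast for a new, unseen point.
pred = reg.predict(np.array([[3.0, 5.0]]))
print(pred)
```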

Naive Bayes

The multinomial Naive Bayes classifier (MultinomialNB) is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
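A minimal sketch with MultinomialNB, assuming a made-up word-count matrix over a hypothetical four-word vocabulary ("goal", "match", "vote", "election"):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are documents; columns are counts of the four hypothetical words.
X = np.array(
    [
        [3, 2, 0, 0],
        [2, 4, 0, 1],
        [0, 0, 3, 2],
        [0, 1, 2, 4],
    ]
)
y = np.array(["sports", "sports", "politics", "politics"])

clf = MultinomialNB().fit(X, y)

# A new document that mentions "goal" once and "match" twice.
pred = clf.predict(np.array([[1, 2, 0, 0]]))
print(pred)
```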

K-Nearest Neighbors

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
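The majority vote can be sketched with KNeighborsClassifier, assuming two made-up, well-separated groups of points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training points: class 0 near the origin, class 1 near (5, 5).
X = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
)
y = np.array([0, 0, 0, 1, 1, 1])

# fit simply stores the training instances; no general model is built.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Each query point takes the majority class among its 3 nearest neighbors.
pred = knn.predict(np.array([[0.5, 0.5], [5.5, 5.5]]))
print(pred)
```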

K-Means

The K-Means algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares. The algorithm requires the number of clusters to be specified. It scales well to large numbers of samples and has been used across a wide range of application areas in many different fields.
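A short sketch with KMeans, assuming two made-up, well-separated groups of points. The number of clusters is passed through n_clusters, and the minimized criterion is available afterwards as inertia_:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: three points near the origin and three near (5, 5).
X = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
)

# n_clusters must be chosen up front; n_init and random_state make runs repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_    # cluster index assigned to each sample
inertia = kmeans.inertia_  # within-cluster sum of squared distances

print(labels, inertia)
```

Cluster indices are arbitrary, so only the grouping matters, not whether a group is called 0 or 1.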

Scikit-Learn Cheat Sheet: Python Machine Learning

This easy-to-follow scikit-learn cheat sheet will help you get started with machine learning in Python. Each table below lists commonly used classes and functions together with a brief description.

Pre-Processing

Function	Description
sklearn.preprocessing.StandardScaler	Standardize features by removing the mean and scaling to unit variance
sklearn.impute.SimpleImputer	Imputation transformer for completing missing values (replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22)
sklearn.preprocessing.LabelBinarizer	Binarize labels in a one-vs-all fashion
sklearn.preprocessing.OneHotEncoder	Encode categorical features using a one-hot (a.k.a. one-of-K) scheme
sklearn.preprocessing.PolynomialFeatures	Generate polynomial and interaction features
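
Two of the pre-processing classes above can be sketched on tiny made-up inputs:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, PolynomialFeatures

# PolynomialFeatures: for features (a, b) and degree 2 the output columns are
# 1, a, b, a^2, a*b, b^2.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(np.array([[2.0, 3.0]]))
print(X_poly)

# LabelBinarizer: one column per class, with classes sorted alphabetically.
binarizer = LabelBinarizer()
Y = binarizer.fit_transform(["cat", "dog", "bird", "dog"])
print(binarizer.classes_)
print(Y)
```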

Regression

Function	Description
sklearn.tree.DecisionTreeRegressor	A decision tree regressor
sklearn.svm.SVR	Epsilon-Support Vector Regression
sklearn.linear_model.LinearRegression	Ordinary least squares Linear Regression
sklearn.linear_model.Lasso	Linear model trained with L1 prior as regularizer (a.k.a. the lasso)
sklearn.linear_model.SGDRegressor	Linear model fitted by minimizing a regularized empirical loss with SGD
sklearn.linear_model.ElasticNet	Linear regression with combined L1 and L2 priors as regularizer
sklearn.ensemble.RandomForestRegressor	A random forest regressor
sklearn.ensemble.GradientBoostingRegressor	Gradient Boosting for regression
sklearn.neural_network.MLPRegressor	Multi-layer Perceptron regressor

Classification

Function	Description
sklearn.neural_network.MLPClassifier	Multi-layer Perceptron classifier
sklearn.tree.DecisionTreeClassifier	A decision tree classifier
sklearn.svm.SVC	C-Support Vector Classification
sklearn.linear_model.LogisticRegression	Logistic Regression (a.k.a logit, Max Ent) classifier
sklearn.linear_model.SGDClassifier	Linear classifiers (SVM, logistic regression, a.o.) with SGD training
sklearn.naive_bayes.GaussianNB	Gaussian Naive Bayes
sklearn.neighbors.KNeighborsClassifier	Classifier implementing the k-nearest neighbors vote
sklearn.ensemble.RandomForestClassifier	A random forest classifier
sklearn.ensemble.GradientBoostingClassifier	Gradient Boosting for classification

Clustering

Function	Description
sklearn.cluster.KMeans	K-Means clustering
sklearn.cluster.DBSCAN	Perform DBSCAN clustering from vector array or distance matrix
sklearn.cluster.AgglomerativeClustering	Agglomerative clustering
sklearn.cluster.SpectralBiclustering	Spectral bi-clustering

Dimensionality Reduction

Function	Description
sklearn.decomposition.PCA	Principal component analysis (PCA)
sklearn.decomposition.LatentDirichletAllocation	Latent Dirichlet Allocation with online variational Bayes algorithm
sklearn.decomposition.SparseCoder	Sparse coding
sklearn.decomposition.DictionaryLearning	Dictionary learning
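
As a brief sketch of dimensionality reduction, assuming the Iris data, PCA can project the four original features onto two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Keep the two directions that capture the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Share of the total variance explained by each kept component.
explained = pca.explained_variance_ratio_

print(X_reduced.shape, explained)
```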

Model Selection

Function	Description
sklearn.model_selection.KFold	K-Folds cross-validator
sklearn.model_selection.StratifiedKFold	Stratified K-Folds cross-validator
sklearn.model_selection.TimeSeriesSplit	Time Series cross-validator
sklearn.model_selection.train_test_split	Split arrays or matrices into random train and test subsets
sklearn.model_selection.GridSearchCV	Exhaustive search over specified parameter values for an estimator
sklearn.model_selection.cross_val_score	Evaluate a score by cross-validation
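
Several of the model selection tools are typically used together. A minimal sketch, assuming the Iris data, an SVC model, and an arbitrary three-value grid for C:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Try every value in the grid, scoring each by cross-validation.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0, 10.0]}, cv=cv)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```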

Metric

Function	Description
sklearn.metrics.accuracy_score	Classification Metric: Accuracy classification score
sklearn.metrics.log_loss	Classification Metric: Log loss, a.k.a logistic loss or cross-entropy loss
sklearn.metrics.roc_auc_score	Classification Metric: Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC)
sklearn.metrics.mean_absolute_error	Regression Metric: Mean absolute error regression loss
sklearn.metrics.r2_score	Regression Metric: R^2 (coefficient of determination) regression score
sklearn.metrics.label_ranking_loss	Ranking Metric: Compute Ranking loss measure
sklearn.metrics.mutual_info_score	Clustering Metric: Mutual Information between two clusterings
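
Metric functions take the true values first and the predictions second. A small sketch on made-up values that are easy to check by hand:

```python
from sklearn.metrics import accuracy_score, mean_absolute_error, r2_score

# Classification: three of the four predicted labels match.
y_true_cls = [0, 1, 1, 0]
y_pred_cls = [0, 1, 0, 0]
acc = accuracy_score(y_true_cls, y_pred_cls)

# Regression: absolute errors are 0.5, 0.5, 0.0 and 1.0.
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred_reg = [2.5, 0.0, 2.0, 8.0]
mae = mean_absolute_error(y_true_reg, y_pred_reg)
r2 = r2_score(y_true_reg, y_pred_reg)

print(acc, mae, r2)
```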

Miscellaneous

Function	Description
sklearn.datasets.load_boston	Load and return the Boston house prices data set (regression); deprecated in scikit-learn 1.0 and removed in 1.2
sklearn.datasets.make_classification	Generate a random n-class classification problem
sklearn.feature_extraction.FeatureHasher	Implements feature hashing, a.k.a the hashing trick
sklearn.feature_selection.SelectKBest	Select features according to the k highest scores
sklearn.pipeline.Pipeline	Pipeline of transforms with a final estimator
sklearn.semi_supervised.LabelPropagation	Label Propagation classifier for semi-supervised learning

Conclusion

Data scientists, whether students or professionals, are fortunate to have a library as rich and well-designed as Scikit-Learn, which allows us to tackle complex machine learning problems with clean, consistent code.
