Scientific Computing
Sklearn
or scikit-learn
, originally conceived as an extension to SciPy.pip
because, well, sometimes it is called “sklearn” and sometimes “scikit-learn”.$ python3 -m pip install sklearn
Collecting sklearn
Downloading sklearn-0.0.post12.tar.gz (2.6 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [15 lines of output]
The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
rather than 'sklearn' for pip commands.
scikit-learn
python3 -m pip install scikit-learn
python3 -m pip show scikit-learn # show scikit-learn version and location
python3 -m pip freeze # show all installed packages in the environment
python3 -c "import sklearn; sklearn.show_versions()"
python
/python3
andnp
or pd
of handwritten digits
{'data': array([[ 0., 0., 5., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 10., 0., 0.],
[ 0., 0., 0., ..., 16., 9., 0.],
...,
[ 0., 0., 1., ..., 6., 0., 0.],
[ 0., 0., 2., ..., 12., 0., 0.],
[ 0., 0., 10., ..., 12., 1., 0.]]),
'target': array([0, 1, 2, ..., 8, 9, 8]),
'frame': None,
'feature_names': ['pixel_0_0',
'pixel_0_1',
'pixel_0_2',
'pixel_0_3',
'pixel_0_4',
'pixel_0_5',
'pixel_0_6',
'pixel_0_7',
'pixel_1_0',
'pixel_1_1',
'pixel_1_2',
'pixel_1_3',
'pixel_1_4',
'pixel_1_5',
'pixel_1_6',
'pixel_1_7',
'pixel_2_0',
'pixel_2_1',
'pixel_2_2',
'pixel_2_3',
'pixel_2_4',
'pixel_2_5',
'pixel_2_6',
'pixel_2_7',
'pixel_3_0',
'pixel_3_1',
'pixel_3_2',
'pixel_3_3',
'pixel_3_4',
'pixel_3_5',
'pixel_3_6',
'pixel_3_7',
'pixel_4_0',
'pixel_4_1',
'pixel_4_2',
'pixel_4_3',
'pixel_4_4',
'pixel_4_5',
'pixel_4_6',
'pixel_4_7',
'pixel_5_0',
'pixel_5_1',
'pixel_5_2',
'pixel_5_3',
'pixel_5_4',
'pixel_5_5',
'pixel_5_6',
'pixel_5_7',
'pixel_6_0',
'pixel_6_1',
'pixel_6_2',
'pixel_6_3',
'pixel_6_4',
'pixel_6_5',
'pixel_6_6',
'pixel_6_7',
'pixel_7_0',
'pixel_7_1',
'pixel_7_2',
'pixel_7_3',
'pixel_7_4',
'pixel_7_5',
'pixel_7_6',
'pixel_7_7'],
'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
'images': array([[[ 0., 0., 5., ..., 1., 0., 0.],
[ 0., 0., 13., ..., 15., 5., 0.],
[ 0., 3., 15., ..., 11., 8., 0.],
...,
[ 0., 4., 11., ..., 12., 7., 0.],
[ 0., 2., 14., ..., 12., 0., 0.],
[ 0., 0., 6., ..., 0., 0., 0.]],
[[ 0., 0., 0., ..., 5., 0., 0.],
[ 0., 0., 0., ..., 9., 0., 0.],
[ 0., 0., 3., ..., 6., 0., 0.],
...,
[ 0., 0., 1., ..., 6., 0., 0.],
[ 0., 0., 1., ..., 6., 0., 0.],
[ 0., 0., 0., ..., 10., 0., 0.]],
[[ 0., 0., 0., ..., 12., 0., 0.],
[ 0., 0., 3., ..., 14., 0., 0.],
[ 0., 0., 8., ..., 16., 0., 0.],
...,
[ 0., 9., 16., ..., 0., 0., 0.],
[ 0., 3., 13., ..., 11., 5., 0.],
[ 0., 0., 0., ..., 16., 9., 0.]],
...,
[[ 0., 0., 1., ..., 1., 0., 0.],
[ 0., 0., 13., ..., 2., 1., 0.],
[ 0., 0., 16., ..., 16., 5., 0.],
...,
[ 0., 0., 16., ..., 15., 0., 0.],
[ 0., 0., 15., ..., 16., 0., 0.],
[ 0., 0., 2., ..., 6., 0., 0.]],
[[ 0., 0., 2., ..., 0., 0., 0.],
[ 0., 0., 14., ..., 15., 1., 0.],
[ 0., 4., 16., ..., 16., 7., 0.],
...,
[ 0., 0., 0., ..., 16., 2., 0.],
[ 0., 0., 4., ..., 16., 2., 0.],
[ 0., 0., 5., ..., 12., 0., 0.]],
[[ 0., 0., 10., ..., 1., 0., 0.],
[ 0., 2., 16., ..., 1., 0., 0.],
[ 0., 0., 15., ..., 15., 0., 0.],
...,
[ 0., 4., 16., ..., 16., 6., 0.],
[ 0., 8., 16., ..., 16., 8., 0.],
[ 0., 1., 8., ..., 12., 1., 0.]]]),
'DESCR': ".. _digits_dataset:\n\nOptical recognition of handwritten digits dataset\n--------------------------------------------------\n\n**Data Set Characteristics:**\n\n:Number of Instances: 1797\n:Number of Attributes: 64\n:Attribute Information: 8x8 image of integer pixels in the range 0..16.\n:Missing Attribute Values: None\n:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n:Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttps://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixels are counted in each block. This generates\nan input matrix of 8x8 where each element is an integer in the range\n0..16. This reduces dimensionality and gives invariance to small\ndistortions.\n\nFor info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\nT. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\nL. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n1994.\n\n.. dropdown:: References\n\n - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n Graduate Studies in Science and Engineering, Bogazici University.\n - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n Linear dimensionalityreduction using relevance weighted LDA. School of\n Electrical and Electronic Engineering Nanyang Technological University.\n 2005.\n - Claudio Gentile. A New Approximate Maximal Margin Classification\n Algorithm. NIPS. 2000.\n"}
digits
is a “bunch”, which I’d never actually heard of before in my life, but appears almost identical to a Python dict
or “dictionary.dict
[]
or the .get()
method..get()
returns None
or a default value if the key is not found, preventing errors.del
keyword or the .pop()
method..pop()
also returns the value of the removed item.my_dict
:np.ndarray
is a homogeneous data structure, dictionaries can store heterogeneous data.pd.DataFrame
objects.pd.DataFrame
.sklearn
.GridSearchCV
or RandomizedSearchCV
to specify parameter grids for tuning.from sklearn.linear_model import LogisticRegression
params = {"penalty": "l1", "C": 0.1, "solver": "liblinear"}
model = LogisticRegression(**params)
model.get_params()
{'C': 0.1,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 100,
'multi_class': 'deprecated',
'n_jobs': None,
'penalty': 'l1',
'random_state': None,
'solver': 'liblinear',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
sklearn
pipeline or model.f
(or F
) and use curly braces {}
to contain expressions..format()
or %
.DictVectorizer
in text processing.from sklearn.feature_extraction import DictVectorizer
data = [{"color": "red", "size": "small"}, {"color": "blue", "size": "large"}]
vec = DictVectorizer(sparse=False)
features = vec.fit_transform(data)
print(features)
print(vec.get_feature_names_out())
[[0. 1. 0. 1.]
[1. 0. 1. 0.]]
['color=blue' 'color=red' 'size=large' 'size=small']
pd.DataFrame
.sklearn
, dictionaries are indispensable for hyperparameter tuning and model configuration, making your machine learning workflows more organized and reproducible.array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])
im
is an 8-by-8 array of values from 0 to 16'Greys'
in there.'spline16'
looked good).1794
arrays of length 64
(was: 8-by-8).reshape
array([[ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10.,
15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4.,
12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8.,
0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5.,
10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.],
[ 0., 0., 0., 12., 13., 5., 0., 0., 0., 0., 0., 11., 16.,
9., 0., 0., 0., 0., 3., 15., 16., 6., 0., 0., 0., 7.,
15., 16., 16., 2., 0., 0., 0., 0., 1., 16., 16., 3., 0.,
0., 0., 0., 1., 16., 16., 6., 0., 0., 0., 0., 1., 16.,
16., 6., 0., 0., 0., 0., 0., 11., 16., 10., 0., 0.],
[ 0., 0., 0., 4., 15., 12., 0., 0., 0., 0., 3., 16., 15.,
14., 0., 0., 0., 0., 8., 13., 8., 16., 0., 0., 0., 0.,
1., 6., 15., 11., 0., 0., 0., 1., 8., 13., 15., 1., 0.,
0., 0., 9., 16., 16., 5., 0., 0., 0., 0., 3., 13., 16.,
16., 11., 5., 0., 0., 0., 0., 3., 11., 16., 9., 0.]])
X
) and corresponding outcomes (y
).d = {'size': [1000, 1500, 1200, 1800], 'bedrooms': [2, 3, 2, 4], 'price': [200, 300, 240, 360]}
df = pd.DataFrame(d)
X_train, y_train = df[['size', 'bedrooms']], df['price']
model = LinearRegression()
model.fit(X_train, y_train)
model.coef_, model.intercept_
(array([ 2.00000000e-01, -1.32379986e-14]), np.float64(0.0))
The artificial neuron network was invented in 1943 by Warren McCulloch and Walter Pitts in A logical calculus of the ideas immanent in nervous activity.
from sklearn.linear_model import Perceptron
clf = Perceptron() # This will be the model that will contain coefficients
clf # At first, not fitted to anything!
Perceptron()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
Perceptron()
.fit()
the model to the data.Perceptron()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
Perceptron()
metrics
from sklearn import metrics
# We print so it looks a bit nicer
print(metrics.classification_report(y_test, y_pred))
precision recall f1-score support
0 0.98 0.98 0.98 44
1 0.83 0.98 0.90 44
2 1.00 0.95 0.97 39
3 0.98 0.95 0.97 44
4 1.00 0.95 0.97 38
5 0.96 1.00 0.98 51
6 1.00 0.98 0.99 48
7 1.00 1.00 1.00 37
8 0.84 0.92 0.88 63
9 0.97 0.74 0.84 42
accuracy 0.94 450
macro avg 0.96 0.94 0.95 450
weighted avg 0.95 0.94 0.94 450
array([ 4, 8, 35, 36, 45, 56, 63, 67, 113, 158, 184, 191, 198,
199, 204, 273, 305, 316, 326, 352, 360, 366, 415, 437, 444])
y
).k
predefined clusters.k
parameter (number of clusters) must be specified in advance.k
is a common challenge in \(k\)-means.k
often depends on domain knowledge and the goal of the clustering.(array([[0.39250186, 0.86027584, 0.0088714 , 0.09031264, 0.92406155],
[0.02823733, 0.66593355, 0.91355326, 0.38840791, 0.86576046],
[0.13334149, 0.67490569, 0.47429288, 0.74461236, 0.62912872],
[0.45579094, 0.29786937, 0.85740943, 0.08848002, 0.78133508],
[0.07463008, 0.96488199, 0.27140834, 0.93849056, 0.72967892]]),
array([-0.80350742, -1.76749167]))
StandardScaler().fit_transform()
x
and y