Data Preparation

A. How to represent features

Features are properties of some entity. This can be the height and weight of a person. Or the count of a number of words or sentences in a document.

Features must be defined before any classification or clustering.

Scikit Learn uses numpy 2-dimensional arrays to represent data for machine learning.

In [1]:
import numpy as np 
a, b = np.arange(10).reshape((5, 2)), range(5)
In [2]:
a
Out[2]:
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
In [3]:
b
Out[3]:
[0, 1, 2, 3, 4]

A.1 Continuous Features

Continuous features are numbers.

Make sure they are not 'NaN' (not a number), or 'inf'. This can create problems.

In [10]:
data = [[1, 2, 3], [4, 5, 6]]
np_data = np.array(data)
In [7]:
import math
math.sqrt(-1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-6557c4e6d55f> in <module>()
      1 import math
----> 2 math.sqrt(-1)

ValueError: math domain error
In [12]:
data
Out[12]:
[[1, 2, 3], [4, 5, 6]]
In [13]:
np_data
Out[13]:
array([[1, 2, 3],
       [4, 5, 6]])

A.2 Categorical Features

Categorical features represent properties that have discrete classes like words, parts of speech, native language, etc.

Since most classifiers (including those in sklearn) expect all continuous features, these need to be converted to continuous values.m

A one-hot representation is used for this. Construct a zero vector with as many entries as the number of categories. Identify each catogory with one index. Set the index corresponding to the vector to 1.

In [8]:
categories = ['MALE', 'FEMALE']
In [9]:
def toOneHot(cats, v):
    out = np.zeros(len(cats))
    idx = cats.index(v)
    out[idx] = 1
    return out
    
print toOneHot(categories, 'FEMALE')
print toOneHot(categories, 'MALE')
[ 0.  1.]
[ 1.  0.]

In [10]:
print toOneHot(categories, 'M')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-cbef20385927> in <module>()
----> 1 print toOneHot(categories, 'M')

<ipython-input-9-2e8424e12d98> in toOneHot(cats, v)
      1 def toOneHot(cats, v):
      2     out = np.zeros(len(cats))
----> 3     idx = cats.index(v)
      4     out[idx] = 1
      5     return out

ValueError: 'M' is not in list
In [34]:
def toOneHot(cats, v):
    out = np.zeros(len(cats))
    try:
        idx = cats.index(v)
        out[idx] = 1
    except ValueError:
        ''
    return out
In [35]:
print toOneHot(categories, 'M')
[ 0.  0.]

B. How to represent labels (aka classes, or targets)

Classes are typically binary labels, represented as either 0/1. Or sometimes -1/+1

For multi-class classification, integer labels from 0 to N where N is the number of classes can be used.

In [41]:
data, labels = np.arange(10).reshape((5, 2)), range(5)
In [44]:
data
Out[44]:
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
In [43]:
labels
Out[43]:
[0, 1, 2, 3, 4]

Load some sample data

In [11]:
from sklearn import datasets
iris = datasets.load_iris()
/usr/local/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2507: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)

In [12]:
iris.data
Out[12]:
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5.4,  3.7,  1.5,  0.2],
       [ 4.8,  3.4,  1.6,  0.2],
       [ 4.8,  3. ,  1.4,  0.1],
       [ 4.3,  3. ,  1.1,  0.1],
       [ 5.8,  4. ,  1.2,  0.2],
       [ 5.7,  4.4,  1.5,  0.4],
       [ 5.4,  3.9,  1.3,  0.4],
       [ 5.1,  3.5,  1.4,  0.3],
       [ 5.7,  3.8,  1.7,  0.3],
       [ 5.1,  3.8,  1.5,  0.3],
       [ 5.4,  3.4,  1.7,  0.2],
       [ 5.1,  3.7,  1.5,  0.4],
       [ 4.6,  3.6,  1. ,  0.2],
       [ 5.1,  3.3,  1.7,  0.5],
       [ 4.8,  3.4,  1.9,  0.2],
       [ 5. ,  3. ,  1.6,  0.2],
       [ 5. ,  3.4,  1.6,  0.4],
       [ 5.2,  3.5,  1.5,  0.2],
       [ 5.2,  3.4,  1.4,  0.2],
       [ 4.7,  3.2,  1.6,  0.2],
       [ 4.8,  3.1,  1.6,  0.2],
       [ 5.4,  3.4,  1.5,  0.4],
       [ 5.2,  4.1,  1.5,  0.1],
       [ 5.5,  4.2,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5. ,  3.2,  1.2,  0.2],
       [ 5.5,  3.5,  1.3,  0.2],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 4.4,  3. ,  1.3,  0.2],
       [ 5.1,  3.4,  1.5,  0.2],
       [ 5. ,  3.5,  1.3,  0.3],
       [ 4.5,  2.3,  1.3,  0.3],
       [ 4.4,  3.2,  1.3,  0.2],
       [ 5. ,  3.5,  1.6,  0.6],
       [ 5.1,  3.8,  1.9,  0.4],
       [ 4.8,  3. ,  1.4,  0.3],
       [ 5.1,  3.8,  1.6,  0.2],
       [ 4.6,  3.2,  1.4,  0.2],
       [ 5.3,  3.7,  1.5,  0.2],
       [ 5. ,  3.3,  1.4,  0.2],
       [ 7. ,  3.2,  4.7,  1.4],
       [ 6.4,  3.2,  4.5,  1.5],
       [ 6.9,  3.1,  4.9,  1.5],
       [ 5.5,  2.3,  4. ,  1.3],
       [ 6.5,  2.8,  4.6,  1.5],
       [ 5.7,  2.8,  4.5,  1.3],
       [ 6.3,  3.3,  4.7,  1.6],
       [ 4.9,  2.4,  3.3,  1. ],
       [ 6.6,  2.9,  4.6,  1.3],
       [ 5.2,  2.7,  3.9,  1.4],
       [ 5. ,  2. ,  3.5,  1. ],
       [ 5.9,  3. ,  4.2,  1.5],
       [ 6. ,  2.2,  4. ,  1. ],
       [ 6.1,  2.9,  4.7,  1.4],
       [ 5.6,  2.9,  3.6,  1.3],
       [ 6.7,  3.1,  4.4,  1.4],
       [ 5.6,  3. ,  4.5,  1.5],
       [ 5.8,  2.7,  4.1,  1. ],
       [ 6.2,  2.2,  4.5,  1.5],
       [ 5.6,  2.5,  3.9,  1.1],
       [ 5.9,  3.2,  4.8,  1.8],
       [ 6.1,  2.8,  4. ,  1.3],
       [ 6.3,  2.5,  4.9,  1.5],
       [ 6.1,  2.8,  4.7,  1.2],
       [ 6.4,  2.9,  4.3,  1.3],
       [ 6.6,  3. ,  4.4,  1.4],
       [ 6.8,  2.8,  4.8,  1.4],
       [ 6.7,  3. ,  5. ,  1.7],
       [ 6. ,  2.9,  4.5,  1.5],
       [ 5.7,  2.6,  3.5,  1. ],
       [ 5.5,  2.4,  3.8,  1.1],
       [ 5.5,  2.4,  3.7,  1. ],
       [ 5.8,  2.7,  3.9,  1.2],
       [ 6. ,  2.7,  5.1,  1.6],
       [ 5.4,  3. ,  4.5,  1.5],
       [ 6. ,  3.4,  4.5,  1.6],
       [ 6.7,  3.1,  4.7,  1.5],
       [ 6.3,  2.3,  4.4,  1.3],
       [ 5.6,  3. ,  4.1,  1.3],
       [ 5.5,  2.5,  4. ,  1.3],
       [ 5.5,  2.6,  4.4,  1.2],
       [ 6.1,  3. ,  4.6,  1.4],
       [ 5.8,  2.6,  4. ,  1.2],
       [ 5. ,  2.3,  3.3,  1. ],
       [ 5.6,  2.7,  4.2,  1.3],
       [ 5.7,  3. ,  4.2,  1.2],
       [ 5.7,  2.9,  4.2,  1.3],
       [ 6.2,  2.9,  4.3,  1.3],
       [ 5.1,  2.5,  3. ,  1.1],
       [ 5.7,  2.8,  4.1,  1.3],
       [ 6.3,  3.3,  6. ,  2.5],
       [ 5.8,  2.7,  5.1,  1.9],
       [ 7.1,  3. ,  5.9,  2.1],
       [ 6.3,  2.9,  5.6,  1.8],
       [ 6.5,  3. ,  5.8,  2.2],
       [ 7.6,  3. ,  6.6,  2.1],
       [ 4.9,  2.5,  4.5,  1.7],
       [ 7.3,  2.9,  6.3,  1.8],
       [ 6.7,  2.5,  5.8,  1.8],
       [ 7.2,  3.6,  6.1,  2.5],
       [ 6.5,  3.2,  5.1,  2. ],
       [ 6.4,  2.7,  5.3,  1.9],
       [ 6.8,  3. ,  5.5,  2.1],
       [ 5.7,  2.5,  5. ,  2. ],
       [ 5.8,  2.8,  5.1,  2.4],
       [ 6.4,  3.2,  5.3,  2.3],
       [ 6.5,  3. ,  5.5,  1.8],
       [ 7.7,  3.8,  6.7,  2.2],
       [ 7.7,  2.6,  6.9,  2.3],
       [ 6. ,  2.2,  5. ,  1.5],
       [ 6.9,  3.2,  5.7,  2.3],
       [ 5.6,  2.8,  4.9,  2. ],
       [ 7.7,  2.8,  6.7,  2. ],
       [ 6.3,  2.7,  4.9,  1.8],
       [ 6.7,  3.3,  5.7,  2.1],
       [ 7.2,  3.2,  6. ,  1.8],
       [ 6.2,  2.8,  4.8,  1.8],
       [ 6.1,  3. ,  4.9,  1.8],
       [ 6.4,  2.8,  5.6,  2.1],
       [ 7.2,  3. ,  5.8,  1.6],
       [ 7.4,  2.8,  6.1,  1.9],
       [ 7.9,  3.8,  6.4,  2. ],
       [ 6.4,  2.8,  5.6,  2.2],
       [ 6.3,  2.8,  5.1,  1.5],
       [ 6.1,  2.6,  5.6,  1.4],
       [ 7.7,  3. ,  6.1,  2.3],
       [ 6.3,  3.4,  5.6,  2.4],
       [ 6.4,  3.1,  5.5,  1.8],
       [ 6. ,  3. ,  4.8,  1.8],
       [ 6.9,  3.1,  5.4,  2.1],
       [ 6.7,  3.1,  5.6,  2.4],
       [ 6.9,  3.1,  5.1,  2.3],
       [ 5.8,  2.7,  5.1,  1.9],
       [ 6.8,  3.2,  5.9,  2.3],
       [ 6.7,  3.3,  5.7,  2.5],
       [ 6.7,  3. ,  5.2,  2.3],
       [ 6.3,  2.5,  5. ,  1.9],
       [ 6.5,  3. ,  5.2,  2. ],
       [ 6.2,  3.4,  5.4,  2.3],
       [ 5.9,  3. ,  5.1,  1.8]])
In [55]:
iris.data[:, :2]  # first two columns
Out[55]:
array([[ 5.1,  3.5],
       [ 4.9,  3. ],
       [ 4.7,  3.2],
       [ 4.6,  3.1],
       [ 5. ,  3.6],
       [ 5.4,  3.9],
       [ 4.6,  3.4],
       [ 5. ,  3.4],
       [ 4.4,  2.9],
       [ 4.9,  3.1],
       [ 5.4,  3.7],
       [ 4.8,  3.4],
       [ 4.8,  3. ],
       [ 4.3,  3. ],
       [ 5.8,  4. ],
       [ 5.7,  4.4],
       [ 5.4,  3.9],
       [ 5.1,  3.5],
       [ 5.7,  3.8],
       [ 5.1,  3.8],
       [ 5.4,  3.4],
       [ 5.1,  3.7],
       [ 4.6,  3.6],
       [ 5.1,  3.3],
       [ 4.8,  3.4],
       [ 5. ,  3. ],
       [ 5. ,  3.4],
       [ 5.2,  3.5],
       [ 5.2,  3.4],
       [ 4.7,  3.2],
       [ 4.8,  3.1],
       [ 5.4,  3.4],
       [ 5.2,  4.1],
       [ 5.5,  4.2],
       [ 4.9,  3.1],
       [ 5. ,  3.2],
       [ 5.5,  3.5],
       [ 4.9,  3.1],
       [ 4.4,  3. ],
       [ 5.1,  3.4],
       [ 5. ,  3.5],
       [ 4.5,  2.3],
       [ 4.4,  3.2],
       [ 5. ,  3.5],
       [ 5.1,  3.8],
       [ 4.8,  3. ],
       [ 5.1,  3.8],
       [ 4.6,  3.2],
       [ 5.3,  3.7],
       [ 5. ,  3.3],
       [ 7. ,  3.2],
       [ 6.4,  3.2],
       [ 6.9,  3.1],
       [ 5.5,  2.3],
       [ 6.5,  2.8],
       [ 5.7,  2.8],
       [ 6.3,  3.3],
       [ 4.9,  2.4],
       [ 6.6,  2.9],
       [ 5.2,  2.7],
       [ 5. ,  2. ],
       [ 5.9,  3. ],
       [ 6. ,  2.2],
       [ 6.1,  2.9],
       [ 5.6,  2.9],
       [ 6.7,  3.1],
       [ 5.6,  3. ],
       [ 5.8,  2.7],
       [ 6.2,  2.2],
       [ 5.6,  2.5],
       [ 5.9,  3.2],
       [ 6.1,  2.8],
       [ 6.3,  2.5],
       [ 6.1,  2.8],
       [ 6.4,  2.9],
       [ 6.6,  3. ],
       [ 6.8,  2.8],
       [ 6.7,  3. ],
       [ 6. ,  2.9],
       [ 5.7,  2.6],
       [ 5.5,  2.4],
       [ 5.5,  2.4],
       [ 5.8,  2.7],
       [ 6. ,  2.7],
       [ 5.4,  3. ],
       [ 6. ,  3.4],
       [ 6.7,  3.1],
       [ 6.3,  2.3],
       [ 5.6,  3. ],
       [ 5.5,  2.5],
       [ 5.5,  2.6],
       [ 6.1,  3. ],
       [ 5.8,  2.6],
       [ 5. ,  2.3],
       [ 5.6,  2.7],
       [ 5.7,  3. ],
       [ 5.7,  2.9],
       [ 6.2,  2.9],
       [ 5.1,  2.5],
       [ 5.7,  2.8],
       [ 6.3,  3.3],
       [ 5.8,  2.7],
       [ 7.1,  3. ],
       [ 6.3,  2.9],
       [ 6.5,  3. ],
       [ 7.6,  3. ],
       [ 4.9,  2.5],
       [ 7.3,  2.9],
       [ 6.7,  2.5],
       [ 7.2,  3.6],
       [ 6.5,  3.2],
       [ 6.4,  2.7],
       [ 6.8,  3. ],
       [ 5.7,  2.5],
       [ 5.8,  2.8],
       [ 6.4,  3.2],
       [ 6.5,  3. ],
       [ 7.7,  3.8],
       [ 7.7,  2.6],
       [ 6. ,  2.2],
       [ 6.9,  3.2],
       [ 5.6,  2.8],
       [ 7.7,  2.8],
       [ 6.3,  2.7],
       [ 6.7,  3.3],
       [ 7.2,  3.2],
       [ 6.2,  2.8],
       [ 6.1,  3. ],
       [ 6.4,  2.8],
       [ 7.2,  3. ],
       [ 7.4,  2.8],
       [ 7.9,  3.8],
       [ 6.4,  2.8],
       [ 6.3,  2.8],
       [ 6.1,  2.6],
       [ 7.7,  3. ],
       [ 6.3,  3.4],
       [ 6.4,  3.1],
       [ 6. ,  3. ],
       [ 6.9,  3.1],
       [ 6.7,  3.1],
       [ 6.9,  3.1],
       [ 5.8,  2.7],
       [ 6.8,  3.2],
       [ 6.7,  3.3],
       [ 6.7,  3. ],
       [ 6.3,  2.5],
       [ 6.5,  3. ],
       [ 6.2,  3.4],
       [ 5.9,  3. ]])
In [58]:
iris.target
Out[58]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [18]:
 
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-7206067b80aa> in <module>()
----> 1 print iris.descr

AttributeError: 'Bunch' object has no attribute 'descr'

Classification

In [19]:
from sklearn.cross_validation import train_test_split
In [28]:
train_data, test_data, train_labels, test_labels = train_test_split(iris.data, iris.target)
In [29]:
len(test_data), len(test_labels)
Out[29]:
(38, 38)
In [30]:
len(train_data), len(train_labels)
Out[30]:
(112, 112)
In [31]:
train_data[0]
Out[31]:
array([ 5.4,  3.9,  1.3,  0.4])

A. Naive Bayes

In [37]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
classifier = nb.fit(train_data, train_labels)

test_pred = classifier.predict(test_data)
print("Number of mislabeled points out of a total %d points : %d"
      % (test_data.shape[0],(test_labels != test_pred).sum()))
Number of mislabeled points out of a total 38 points : 1

In [36]:
(test_labels != test_pred).sum()
Out[36]:
1
In [38]:
nb.score(test_data, test_labels)
Out[38]:
0.97368421052631582
In [39]:
from sklearn.metrics import confusion_matrix
confusion_matrix(test_labels, test_pred)
Out[39]:
array([[13,  0,  0],
       [ 0, 12,  0],
       [ 0,  1, 12]])

B. K-NN

In [40]:
from sklearn import neighbors
n_neighbors = 5
knn = neighbors.KNeighborsClassifier(n_neighbors)
In [41]:
## This should look very familiar
classifier = knn.fit(train_data, train_labels)

test_pred = classifier.predict(test_data)
print("Number of mislabeled points out of a total %d points : %d" % (test_data.shape[0],(test_labels != test_pred).sum()))
Number of mislabeled points out of a total 38 points : 0

In [42]:
confusion_matrix(test_labels, test_pred)
Out[42]:
array([[13,  0,  0],
       [ 0, 12,  0],
       [ 0,  0, 13]])

C. Logistic Regression

In [43]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
In [44]:
## This should look very familiar
classifier = lr.fit(train_data, train_labels)

test_pred = classifier.predict(test_data)
print("Number of mislabeled points out of a total %d points : %d" % (test_data.shape[0],(test_labels != test_pred).sum()))
Number of mislabeled points out of a total 38 points : 0

In [45]:
confusion_matrix(test_labels, test_pred)
Out[45]:
array([[13,  0,  0],
       [ 0, 12,  0],
       [ 0,  0, 13]])

Clustering

A. K-Means

In [46]:
from sklearn.cluster import KMeans
In [58]:
km = KMeans(n_clusters=3)
km.fit(iris.data)
Out[58]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
    verbose=0)
In [59]:
test_labels = km.predict(iris.data)
In [61]:
confusion_matrix(iris.target, test_pred)
Out[61]:
array([[50,  0,  0],
       [ 0, 48,  2],
       [ 0, 13, 37]])
In [52]:
km = KMeans(n_clusters=2)
km.fit(train_data)
Out[52]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
    n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
    verbose=0)
In [53]:
test_labels = km.predict(test_data)
confusion_matrix(test_labels, test_pred)
Out[53]:
array([[ 0, 12, 13],
       [13,  0,  0],
       [ 0,  0,  0]])

B. Spectral Clustering

In [54]:
from sklearn.cluster import SpectralClustering
sc = SpectralClustering(n_clusters=3)
sc.fit(train_data)
Out[54]:
SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,
          eigen_solver=None, eigen_tol=0.0, gamma=1.0, kernel_params=None,
          n_clusters=3, n_init=10, n_neighbors=10, random_state=None)
In [55]:
test_labels = sc.predict(test_data)
confusion_matrix(test_labels, test_pred)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-55-c7033a69786c> in <module>()
----> 1 test_labels = sc.predict(test_data)
      2 confusion_matrix(test_labels, test_pred)

AttributeError: 'SpectralClustering' object has no attribute 'predict'
In [56]:
from sklearn.cluster import SpectralClustering
sc = SpectralClustering(n_clusters=3)
In [57]:
test_pred = sc.fit_predict(iris.data)
confusion_matrix(iris.target, test_pred)
Out[57]:
array([[50,  0,  0],
       [ 0, 48,  2],
       [ 0, 13, 37]])

More Resources

In []: