The objective of this tutorial is to introduce the concept of model ensembles and then to cover one specific type of ensemble in detail: Maximum Margin Output Coding (MMOC).
So what exactly is a model ensemble? Instead of relying on a single model for prediction, an ensemble uses several models; their outputs are combined using a specified voting scheme to produce a single prediction for each record.
MMOC is an approach used for multi-class classification (binary classification distinguishes between just two classes, A and B, whereas multi-class classification can handle any number N of classes: A, B, C, D, etc.). MMOC handles the multi-class problem by using N binary classifiers, one per class, and then combining their results into a single prediction for each record.
MMOC is similar to Error Correcting Output Coding (ECOC), but it is simpler because it takes advantage of the real-valued probabilistic outputs of margin-based classifiers (in other words, it uses the probability output between 0.0 and 1.0). ECOC requires individual cutoff values to be chosen before the outputs can be combined, whereas MMOC uses the probabilistic output directly as a proxy for "confidence". The simplest version of MMOC chooses the final class by selecting the model with the highest confidence score (the highest probability of class membership). An extremely simple example is shown below, and we'll also implement this in practice later:
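As a toy illustration (the numbers below are made up purely for demonstration, not taken from the Iris data we'll use later), suppose three binary classifiers each output the probability that a record belongs to their class. MMOC simply picks, for each record, the class whose classifier is most confident:

# toy illustration of MMOC's "pick the most confident classifier" rule (hypothetical numbers)
import pandas as pd

toy = pd.DataFrame({'probclass_A': [0.91, 0.20, 0.33],
                    'probclass_B': [0.05, 0.75, 0.30],
                    'probclass_C': [0.12, 0.40, 0.88]})
# the winning class for each record is simply the column with the highest probability
toy['pred_class'] = toy.idxmax(axis=1).str.replace('probclass_', '')
print toy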
For a detailed research paper on MMOC, refer to this publication:
And for an easier-to-digest slide deck on margin-based encoding:
There are other equally valid approaches to multi-class classification that do not use an ensemble at all (such as a Bayes net or softmax regression), but in some scenarios it makes more sense to use an ensemble of several models rather than a single model. For example, if the different classes are driven by different and mutually exclusive features, then an ensemble is more appropriate; this is the classic scenario where MMOC would be preferred over a single multinomial classifier.
MMOC ensembling also allows the N individual binary classifiers to use different features and even different model types. Class "A" might use logistic regression while class "B" uses a Support Vector Machine (just to illustrate; this scenario would actually be unusual).
Now let's get to our example!
# import necessary libraries
from sklearn import datasets
from sklearn import metrics
import sklearn as sk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cross_validation import KFold
from sklearn import svm
from matplotlib import colors
%matplotlib inline
For this exercise, we are going to use the standard Iris dataset found in the scikit-learn library.
iris_dict = datasets.load_iris()
print iris_dict.keys()
The Iris data has 3 possible target classes:
target_names = list(iris_dict['target_names'])
feature_names = iris_dict['feature_names']
print target_names
print feature_names
# put the X variables into a dataframe
irisX = pd.DataFrame(iris_dict['data'], columns=iris_dict['feature_names'])
# let's make a copy so we can add the target variable
iris = irisX.copy()
# add the numeric target
iris['target'] = iris_dict['target']
# let's shuffle the rows so we get a more representative sample of targets when we preview the data using head()
np.random.seed(seed=4)
# reset the index after shuffling so that row labels match row positions (the k-fold indices used later are positional)
iris = iris.iloc[np.random.choice(len(iris), len(iris), replace=False), :].reset_index(drop=True)
# create mapping of numeric target to target actual name
tn_dict = {i:target_names[i] for i in range(len(target_names))}
# add target name as well
iris['target_name'] = iris['target'].map(tn_dict)
# preview final data
iris.head()
Now let's show some descriptive statistics for each column. Note the uniform distribution of the target. This uniform distribution will allow us to simplify our approach to MMOC later on:
# Prints descriptive stats on the data.
# For numerics, it prints mean, min, max, and standard deviation
# For categorical, it prints the count distribution
def desc_statistics(df):
    for col in df:
        x = df[col]
        # print name & datatype
        print col, "[datatype=" + str(x.dtype) + "]"
        # handle categorical
        if x.dtype == 'object':
            print "\tCounts:"
            print "\t", str(x.value_counts()).replace("\n", "\n\t")
        # handle numeric
        else:
            print "\tMean:", str(np.mean(x))
            print "\tMin:", str(np.min(x))
            print "\tMax:", str(np.max(x))
            print "\tStd Dev:", str(np.std(x))
desc_statistics(iris)
You may have thought we were ready to begin modeling, but remember: since we have multiple classes (3 in this case), we first need to create 3 binary target variables, one for each class.
# binarize the target
dummies = pd.get_dummies(iris['target_name'])
# add/update binarized cols to df
for predcol in dummies.columns:
iris[predcol] = dummies[predcol]
iris.head()
Now it's time to build our binary models. Each model should be trained and cross-validated separately. We could use different sets of features for each class and even use different types of models. However, the purpose of this tutorial is not to teach feature selection, model selection, and cross-validation, so I won't explain the details here. I just want to show how to combine output from several models. For more background information on classification and cross-validation, please refer to these lecture notes from Carnegie Mellon University's Practical Data Science course.
Now let's create our 3 binary classifiers using SVM models:
# loop through each target class, building one binary SVM classifier per class
for target_name in target_names:
    # split data into X and y (y is the binarized column for this class)
    X = iris[feature_names]
    y = iris[target_name]
    # use one fold per row (leave-one-out style cross-validation)
    k = len(X)
    kf = KFold(len(X), n_folds=k, shuffle=True, random_state=5)
    # loop through each fold
    for train_index, hold_index in kf:
        # split into train and holdout (positional indexing)
        trainx = X.iloc[train_index, :]
        trainy = y.iloc[train_index]
        holdx = X.iloc[hold_index, :]
        # build SVM
        model = svm.SVC(probability=True, C=1, gamma=.05, random_state=5)
        # fit the model on train and get probabilistic predictions for the holdout
        fit = model.fit(trainx, trainy)
        prob = fit.predict_proba(holdx)
        # add the probabilistic output to the original dataframe
        # (column 1 of predict_proba is the probability of class membership, i.e. y=1)
        iris.loc[hold_index, 'probclass_' + target_name] = prob[:, 1]
As you can see below, we added a probabilistic output for each classifier (the columns with the "probclass_" prefix).
iris.head()
Even though this is a multinomial classification problem, we still want to check the validity of the individual binary classifiers before we combine results. However, even if all of our binary classifiers have good AUCs, this tells us NOTHING about the error of our final output. In the next section we'll combine the binary classifiers and then measure the overall error using a confusion matrix. But for now, let's take a look at how our binary classifiers did:
# take a dataframe and plot all of the binary classifier ROC curves on a single figure
def plotROC(df, pred_col_prefix):
    df = df.copy()
    # get a list of the prediction columns
    predcols = [col for col in df.columns if col.startswith(pred_col_prefix)]
    fig, ax = plt.subplots()
    mycolors = ['blue', 'red', 'green'] + list(colors.cnames)
    # set axis labels and title, and plot the random-choice diagonal once
    plt.ylabel("TPR")
    plt.xlabel("FPR")
    plt.title("ROC Curve (using k-fold Cross Validation)")
    plt.plot([0, 1])
    for i, (target_name, predcol) in enumerate(zip(target_names, predcols)):
        truth = df[target_name]
        pred = df[predcol]
        # calculate fpr/tpr metrics
        fpr, tpr, _ = metrics.roc_curve(truth, pred)
        # calc the AUC
        auc = metrics.roc_auc_score(truth, pred)
        # plot tpr/fpr on the ROC, labeled with the class name and AUC
        leg_label = target_name + ", AUC=" + str(round(auc, 2))
        ax.plot(fpr, tpr, color=mycolors[i], label=leg_label)
    # Begin citation for legend: http://matplotlib.org/1.3.0/examples/pylab_examples/legend_demo.html
    # Now add the legend with some customizations.
    legend = ax.legend(loc='lower right')
    plt.show()
plotROC(iris,"probclass_")
We'll discuss detailed interpretation of TPR and FPR later. For now, just recognize that a random choice model would produce the diagonal line. A perfect model (0 error) would be a line that goes straight up from (0,0) to (0,1), then takes a 90 degree turn right to (1,1). Any curve above the diagonal is doing better than random chance, and the closer it bulges towards (0,1) the better. AUC measures the area under the curve. AUC=1 indicates a perfect model, and AUC=0.50 indicates a random choice model (the diagonal line).
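For a bit of extra intuition about AUC, here is a minimal sketch with made-up labels and scores (separate from our Iris output): a classifier whose scores rank every true positive above every true negative gets an AUC of 1.0, while one that ranks them exactly backwards gets 0.0.

# toy AUC sketch (hypothetical labels and scores, not our Iris results)
truth = [0, 0, 1, 1]
good_scores = [0.1, 0.2, 0.8, 0.9]   # ranks all positives above all negatives
bad_scores = [0.8, 0.7, 0.2, 0.1]    # ranks them exactly backwards
print metrics.roc_auc_score(truth, good_scores)   # 1.0
print metrics.roc_auc_score(truth, bad_scores)    # 0.0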
Now that we validated our individual binary classifiers, it's time to combine their results into a single final prediction for each record.
As we alluded to earlier, we will do this using the simplest form of MMOC: pick the classifier that has the highest confidence (in this case, we simply use the real-valued probability as the confidence). This approach is only valid if the distribution of our classes is relatively uniform. If the distribution were not uniform, we would either need another measure of confidence that takes the distribution into account, or we would need to apply cutoff values and employ ECOC instead.
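For contrast, here is a minimal sketch of the cutoff-based alternative mentioned above (the cutoffs and probabilities are hypothetical, purely for illustration): each probability is first binarized against its own threshold, producing an ECOC-style code word that would then be matched against each class's target code.

# hypothetical cutoff-based binarization (ECOC-style), shown only for contrast with MMOC
cutoffs = {'probclass_A': 0.6, 'probclass_B': 0.5, 'probclass_C': 0.4}   # made-up per-class cutoffs
probs = {'probclass_A': 0.55, 'probclass_B': 0.52, 'probclass_C': 0.30}  # one record's outputs
codeword = [int(probs[c] >= cutoffs[c]) for c in sorted(probs)]
print codeword   # [0, 1, 0] -> would be matched against each class's target code word

We won't use cutoffs here; the function below sticks with the simpler MMOC rule of picking the most confident classifier.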
# this function takes a dataframe, looks for columns that have probabilistic output (denoted by prefix probclass_)
# it then picks the winning classifier based on which one had the highest confidence
# it records the winner in both a pred_target_name column, and also in binarized predclass_ columns
def applyMMOC(df):
    df = df.copy()
    # get a list of the probability columns
    predcols = [col for col in df.columns if col.startswith('probclass_')]
    # choose the winning classifier by picking the one with the highest probability/confidence
    df['pred_target_name'] = df[predcols].idxmax(axis=1)
    # trim the prefix off the predicted target name
    df['pred_target_name'] = df['pred_target_name'].apply(lambda x: x.replace('probclass_', ""))
    # get dummies of the predictions as well; we can use them later to easily calc error metrics
    pred_dummies = pd.get_dummies(df['pred_target_name'])
    pred_dummies.columns = ["predclass_" + str(col) for col in pred_dummies.columns]
    # add the dummies to the df
    for predcol in pred_dummies.columns:
        df[predcol] = pred_dummies[predcol]
    return df
# preview final data with predictions
iris = applyMMOC(iris)
iris.head()
Now that we have combined the output of our 3 binary models, we can measure the overall error using a confusion matrix.
# This function returns a nicely formatted confusion matrix as a dataframe.
# sklearn.metrics has its own confusion_matrix function, but this one is easier to read.
def getConfusionMatrix(df, y_truth, y_pred):
    df = df.copy()
    # create a column of ones to count with
    df['ones'] = np.ones(len(df))
    # create a pivoted df that shows truth vs. predicted value counts (with "All" margins)
    confmat = pd.pivot_table(df, values='ones', index=[y_truth], columns=[y_pred], aggfunc=np.sum, margins=True)
    return confmat
confmat = getConfusionMatrix(iris,'target_name','pred_target_name')
confmat
The matrix rows contain the true label counts, and the columns show the predicted label counts. A perfect model would result in a diagonal matrix (non-zero counts along the diagonal, with zero everywhere else).
You can think of overall accuracy as the sum of the values along the diagonal in the confusion matrix, divided by the total record count:
accuracy = sk.metrics.accuracy_score(iris['target_name'],iris['pred_target_name'])
accuracy
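As a quick sanity check, we can compute the same number directly from the confusion matrix above (a minimal sketch assuming the confmat variable we just built, whose last row and column are the "All" margins added by pivot_table):

# sanity check: accuracy = sum of the diagonal / total record count
# drop the "All" margin row/column first; fillna(0) guards against truth/prediction combinations that never occur
core = confmat.loc[target_names, target_names].fillna(0)
print np.trace(core.values) / confmat.loc['All', 'All']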
Now we will calculate the True Positive Rate (TPR) and the False Positive Rate (FPR) for each class. If you look back at the ROC curves, you'll notice these are the two metrics on the axes. The TPR for a class is the fraction of records that truly belong to that class which the model correctly classifies as that class. The FPR is the fraction of records that do NOT belong to the class which the model nonetheless classifies as that class. A perfect model would have a TPR of 1.00 (100%) and an FPR of 0.00 (0%); in practice, there is usually a tradeoff between these 2 metrics. They can also be expressed as the equations below:

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)
# given a confusion matrix df, return a df of the TPR and FPR of each class
def getTPRFPR(confusion_matrix_df):
    df = confusion_matrix_df.copy()
    # get target_names and remove the "All" margin
    target_names = list(df.index)
    target_names.remove("All")
    # take a subset of the confusion matrix, removing the "All" row/column
    df = df.loc[target_names, target_names]
    # create a dict to fill with a dictionary of metrics for each label
    label_metrics = {}
    for label in target_names:
        # add a sub-dictionary for this label's metrics
        label_metrics[label] = {}
        # FPR = false positives / all records that truly do NOT belong to this label
        label_metrics[label]["FPR"] = df.loc[df.index != label, label].sum() / (df.values.sum() - df.loc[label, :].sum())
        # TPR = true positives / all records that truly belong to this label
        label_metrics[label]["TPR"] = df.loc[label, label] / df.loc[label, :].sum()
    # make it into a nice dataframe for display
    tprfpr_df = pd.DataFrame(label_metrics).transpose()
    return tprfpr_df
getTPRFPR(confmat)
Not bad! Looks like our TPRs are relatively high yet our FPRs are relatively low.
Although the Iris dataset is not necessarily the ideal use case for a model ensemble using MMOC, my objective was to convey the methodology as simply as possible. This model ensemble framework can be greatly extended with several other important data science techniques that were purposely glossed over in this tutorial (visualization, feature selection, cross validation, etc.).
For additional background on other aspects of data science, a terrific resource is the website for Carnegie Mellon University's 15-688 Practical Data Science Course. Hope you enjoyed my tutorial!
Written by: