Making a Scikit-Learn Classifier

Sklearn has many built-in classifiers that can be imported and used. In this post I will describe how to make your own classifier that is compatible with the other sklearn modules, such as cross-validation. Kaggle provides a dataset based on the sinking of the Titanic. It lists the passengers who were on board, along with whether or not they survived, their age, sex, cabin class and a few other things. I am going to build a very basic classifier here that asks just one question: is the passenger male or female? If they are female, predict that they survived; if they are male, predict that they perished. This classifier predicts with about 75% accuracy.

Import the relevant Python modules: pandas to handle the data as a DataFrame; the cross-validation module from sklearn (cross_validation in older releases, model_selection in current ones) to allow splitting of the data into a training set and a test set; preprocessing from sklearn for some basic munging functions; and ClassifierMixin and BaseEstimator, the sklearn base classes that give us the required structure for our estimator.
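The imports described above might look like the following on a current scikit-learn release. This is a sketch rather than the original code: it assumes the model_selection module, which took over from cross_validation.

```python
import pandas as pd

# model_selection is assumed here; older releases exposed the same
# helpers under sklearn.cross_validation.
from sklearn import model_selection, preprocessing

# Base classes that give a custom estimator the structure sklearn expects.
from sklearn.base import BaseEstimator, ClassifierMixin
```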

Import the data set (it will need to be in your working directory), load it into a pandas DataFrame and drop the columns that won't be required. The first row of the summary variable will hold the mean values for each feature.
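A rough sketch of the loading step is shown here. So that it runs without the download, it reads a few invented rows that only mimic the column layout of Kaggle's train.csv. The choice of dropped columns and the use of describe() to build summary are assumptions, not necessarily what the original code did. Note that describe() labels its rows, so the means are read by label here.

```python
import io

import pandas as pd

# Toy stand-in for Kaggle's train.csv: same column names, invented rows.
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T1,7.25,,S
2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C
3,1,3,Passenger C,female,26,0,0,T3,7.92,,S
4,0,3,Passenger D,male,35,0,0,T4,8.05,,S
"""

# With the real file in the working directory: pd.read_csv("train.csv")
data = pd.read_csv(io.StringIO(csv_text))

# Drop columns a sex-only classifier does not need (assumed selection).
data = data.drop(columns=["PassengerId", "Name", "Ticket", "Cabin", "Embarked"])

# describe() returns count, mean, std, ... for each numeric column.
summary = data.describe()
print(summary.loc["mean"])
```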

Now define some processing functions. MinMaxScaler scales each feature into a given range. LabelEncoder is used to transform the categorical data of the Sex column into numerical data.
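As a small illustration of these two helpers, the sketch below applies them to a handful of invented values. One thing worth knowing is that LabelEncoder assigns codes in sorted order of the labels, so "female" becomes 0 and "male" becomes 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Scale a numeric feature into the range [0, 1] (toy ages, invented).
scaler = MinMaxScaler(feature_range=(0, 1))
ages = np.array([[22.0], [38.0], [26.0], [35.0]])
scaled_ages = scaler.fit_transform(ages)

# Encode the categorical sex values as integers.
encoder = LabelEncoder()
sex_codes = encoder.fit_transform(["male", "female", "female", "male"])

print(encoder.classes_)  # labels in the order of their integer codes
print(sex_codes)
print(scaled_ages.ravel())
```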

Define a class that inherits from the BaseEstimator and ClassifierMixin classes. It must contain two methods: fit and predict. The predict method simply returns a list of 1s and 0s, one for each row of the data set: a 1 if the passenger is female and a 0 if the passenger is male.
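A minimal sketch of such a class follows. The name SexClassifier and the female_code parameter are hypothetical, and the code assumes that sex has been encoded with LabelEncoder so that female is 0 and male is 1. ClassifierMixin is listed first because recent scikit-learn releases recommend placing mixins before BaseEstimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class SexClassifier(ClassifierMixin, BaseEstimator):
    """Predict survival from sex alone: female -> 1, male -> 0."""

    def __init__(self, female_code=0):
        # Assumed encoding: LabelEncoder maps "female" to 0, "male" to 1.
        self.female_code = female_code

    def fit(self, X, y=None):
        # Nothing is learned from the data; only the class labels are stored.
        self.classes_ = np.array([0, 1])
        return self

    def predict(self, X):
        X = np.asarray(X)
        # The first (and only) column holds the encoded sex.
        return (X[:, 0] == self.female_code).astype(int)


clf = SexClassifier().fit(np.array([[0], [1]]), np.array([1, 0]))
print(clf.predict(np.array([[0], [1], [0]])))
```

Inheriting from ClassifierMixin supplies a score method that reports accuracy, and BaseEstimator supplies get_params and set_params, which tools such as cross-validation rely on to clone the estimator.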

Turn the DataFrame into a NumPy array and split it into the features [Sex] and the target [Survived].
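One way to do this split is sketched below on a tiny invented frame in which Sex is already encoded as integers. The names X and y are conventions, not taken from the original code.

```python
import pandas as pd

# Toy frame with sex already encoded (0 = female, 1 = male); rows invented.
data = pd.DataFrame({"Survived": [0, 1, 1, 0], "Sex": [1, 0, 0, 1]})

# Features as a 2-D array (one column), target as a 1-D array.
X = data[["Sex"]].to_numpy()
y = data["Survived"].to_numpy()

print(X.shape, y.shape)
```

The double brackets in data[["Sex"]] keep the features two-dimensional, which is the shape sklearn estimators expect even when there is only one feature.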

Create an instance of the class and pass it to the cross-validation function.
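The sketch below shows what that call could look like with cross_val_score from model_selection, assumed here as the current home of the cross-validation helpers. The class is repeated in compact form so the block runs on its own, and the eight data rows are invented, so the printed score says nothing about the real Titanic data.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import cross_val_score


class SexClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical sex-only classifier: female (code 0) -> survived."""

    def fit(self, X, y=None):
        self.classes_ = np.array([0, 1])
        return self

    def predict(self, X):
        return (np.asarray(X)[:, 0] == 0).astype(int)


# Invented toy data: encoded sex (0 = female, 1 = male) and survival.
X = np.array([[0], [1], [0], [1], [0], [1], [0], [1]])
y = np.array([1, 0, 1, 0, 0, 0, 1, 1])

# Each fold is scored with the accuracy-based score() from ClassifierMixin.
scores = cross_val_score(SexClassifier(), X, y, cv=2)
print(scores.mean())
```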

This is obviously a very basic classifier. The aim of this post was to get across the concept of building your own classifier. More details can be found here with information about how to use training data with the fit method.
