Tutorial: Welcome

Overview

In this tutorial, we walk through how to build a logistic regressor in Tflon using Wisconsin Diagnostic Breast Cancer dataset as our example. We will build on this example extensively to demonstrate how to fully utilize the Tflon deep learning package and extend it with custom features. We will assume a working knowledge of pandas and object oriented programming in python. Futher, we recommend Deep Learning by Benjio as this tutorial doesn’t serve as a replacement for solid practical and theoretical understanding of deep learning.

Setup

Clone the latest tflon repository and ensure the following requirements:

  • pandas 0.21.0+
  • tensorflow 1.6.0+
  • horovod 0.13.4+
  • mpi4py 3.0.0+
  • mpirun 2.1.0+

Instalation

You can find the source code at https://bitbucket.org/mkmatlock/tflon, and install it using Mercurial:

hg install https://<your username>@bitbucket.org/mkmatlock/tflon/src/default/
pip install ./tflon

Imports

Before using tflon in python, we use the following lines to import the necessary dependancies

import pandas as pd
import tensorflow as tf
import tflon

Environment Reset

By the nature of Tensorflow which maintains a stateful representation of the network being trained. In order to instantiate another network, we have to reset the system with

tflon.system.reset()

This will reset the tensorflow stored network. So, that you can define and train another model within the same python file or jupyter notebook kernel.

Example Dataset

For the purposes of this tutorial, we will be using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset found at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data

The data has 30 real valued features including

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter^2 / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)

We seek to construct a classifier to determine if the biometrics correspond to a malignant (212) or benign (357) using the 569 samples.

Logistic Regressor

Check out http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/ if you’re not familiar with it. Basically, you can think of it as a best linear fit of the log odds ratio minimizing the pseudo-\(R^2 = 1-\frac {\text{deviance}}{\text{null deviance}}\).

From the provided link:

  • Logistic regression models are multiplicative in their inputs.
  • The exponent of each coefficient tells you how a unit change in that input variable affects the odds ratio of the response being true.
  • Logistic regression is coordinate-free: translations, rotations, and rescaling of the input variables will not affect the resulting probabilities.
  • Logistic regression preserves the marginal probabilities of the training data.
  • Overly large coefficient magnitudes, overly large error bars on the coefficient estimates, and the wrong sign on a coefficient could be indications of correlated inputs.
  • Coefficients that tend to infinity could be a sign that an input is perfectly correlated with a subset of your responses. Or put another way, it could be a sign that this input is only really useful on a subset of your data, so perhaps it is time to segment the data.
  • Pseudo-\(R^2\) is a useful goodness-of-fit heuristic.

Pre-Proccessing

The data is in the form of comma separated values(csv): ID, Malignant/Benign, Features. Using pandas, we extract the targets (‘targ’) with a binary encoding 1 for malignant and 0 otherwise. Informally, the targets are the observed labels we are trying to learn and ‘desc’ for the descriptors which are the features we are learning from.

df = pd.read_csv("./wdbc.data" , header=None)
df[1] = df[1].apply(lambda x : 1 if (x == 'M') else 0)
df = df.iloc[:,1:]
targ, desc = tflon.data.Table(df).split([1])

Then, we create a tflon Tablefeed which essentially stores the ‘targ’ and ‘desc’ as a dictionary making for efficient training.

feed = tflon.data.TableFeed({'desc':desc, 'targ':targ})

Defining a Model

For this example, we are simply building a ridge logistic regressor so it suffices to feed the normalized input into a single dense neuron with a sigmoidal activation.

To define a new model, we define a new subclass of tflon.model.Model and override the _model attribute. Using add_input and add_target, we indicate that the input should have 30 features and a single target. Again, a target is the observed labels that we want our network to learn from.

class MyModel(tflon.model.Model):
    def _model(self):
        I = self.add_input('desc', shape=[None, 30])
        T = self.add_target('targ', shape=[None, 1])

Architecture

Now in defining the actual architecture, we appeal to the toolkit subpackage which has layers available for us.

WIn = tflon.toolkit.WindowInput()
Dense = tflon.toolkit.Dense(1)
net = WIn(Dense)
out = net(I)

To be more concise, we can stack layers ontop of each other with the ‘|’ symbol allowing us to construct L which is the instantitation of the defined architecture

net = tflon.toolkit.WindowInput() |\
      tflon.toolkit.Dense(1)

out = net(I)

In the case of the WindowInput function, it normalizes each feature with the following function

\[\frac {x - min}{max-min}\]

As for Dense(\(x\)), it simply constructs fully connected neuron layer with \(x\) neurons. For us, we only use 1 neuron, so \(x = 1\).

To be truly minimalistic, this layer can be skipped all together:

net= tflon.toolkit.Dense(1)

This is the main area of design for customizing the model that we build. Additionally, the recommend way of adding new custom functionality to tflon is by adding a new layer with the ‘|’ symbol. We’ll discuss in more detail how to define new layers here

Prediction

The prediction is the final pass made on the output which in our case by choosing sigmoidal activation we result in a logistic regressor.

self.add_output( 'pred', tf.nn.sigmoid(out) )

Objective Function

In a logistic regressor there is a objective function or a cost function that is minimiazed to definine the best fit. The L2 is an artifact of our choice to build a ridge regressor.

self.add_loss( 'xent', tflon.toolkit.xent_uniform_sum(T, out) )
self.add_loss( 'l2', tflon.toolkit.l2_penalty(self.weights) )

Checkout the full list of loss functions in the toolkit here.

Metrics

Finally, we define some metrics that allow us to judge the performance of our logistic regressor. In particular, here we have used the Area Under the Reciever Operating Characteristic curve (AUC) as our metric which gauges sensitivity (true positive rate) and specificity (false positive rate). We can then retrieve the value later as will be mentioned shortly.

self.add_metric( 'auc', tflon.toolkit.auc(T, out) )

Putting it together

``` python class MyModel(tflon.model.Model): def _model(self): I = self.add_input(‘desc’, shape=[None, 30]) T = self.add_target(‘targ’, shape=[None, 1])

net = tflon.toolkit.WindowInput() |\
      tflon.toolkit.Dense(1)
out = net(I)

self.add_output( 'pred', tf.nn.sigmoid(out) )
self.add_loss( 'xent', tflon.toolkit.xent_uniform_sum(T, out) )
self.add_loss( 'l2', tflon.toolkit.l2_penalty(self.weights) )
self.add_metric( 'auc', tflon.toolkit.auc(T, out) )
```

Train the Model

To train the model, we first instantiate the model and define our trainer:

python  LR = MyModel() trainer = tflon.train.OpenOptTrainer( iterations=100)

Then exactly synonomous to tensorflow, we execute the training from within a session which can be thought of as denoting an environment after which background bookkeeping will be erased.

with tf.Session():
    LR.fit( feed, trainer, restarts=2 )
    metrics = LR.evaluate(feed)
    print "AUC:", metrics['auc']

Note that here we are able to use evaluate to retrieve our AUC.asdf

Save and Load Trained Model

Once we’re trained a model we want to be able to reuse it in the future. So it’s crucial that we mention how to save and load models here.

To save a model we simply use the save function within the session but after we do the fit. This will produce a pickle file

LR.save("saved_lr.pkl")

Then next time when we want to load the model back, we can simply use load. Note that you have to manually initialize the model before use.

LR = tflon.model.Model.load("saved_lr.pkl")
with tf.Session():
    LR.initialize()
    x = LR.evaluate(feed)
    print "AUC:", metrics['auc']

Putting it all together

Congratulations our ridge logistic regressor is ready!

import pandas as pd
import tensorflow as tf
import tflon

class MyModel(tflon.model.Model):
    def _model(self):
        I = self.add_input('desc', shape=[None, 30])
        T = self.add_target('targ', shape=[None, 1])

        net = tflon.toolkit.WindowInput() |\
              tflon.toolkit.Dense(1)
        out = net(I)

        self.add_output( 'pred', tf.nn.sigmoid(out) )
        self.add_loss( 'xent', tflon.toolkit.xent_uniform_sum(T, out) )
        self.add_loss( 'l2', tflon.toolkit.l2_penalty(self.weights) )
        self.add_metric( 'auc', tflon.toolkit.auc(T, out) )

if __name__=='__main__':

    df = pd.read_csv("./wdbc.data" , header=None)
    df[1] = df[1].apply(lambda x : 1 if (x == 'M') else 0)
    df = df.iloc[:,1:]
    targ, desc = tflon.data.Table(df).split([1])
    feed = tflon.data.TableFeed({'desc':desc, 'targ':targ})

    LR = MyModel()

    trainer = tflon.train.OpenOptTrainer( iterations=100)

    with tf.Session():
        LR.fit( feed, trainer, restarts=2 )
        LR.save("saved_lr.p")
        metrics = LR.evaluate(feed)
        print "AUC:", metrics['auc']

Sample Output

2018-06-11 13:28:38.304284: I Found a solution with loss 7.27e+01 AUC: 0.99299717