Linear Regression from scratch
OK! Let's talk about linear regression. We're
going to code up a linear regressor FROM SCRATCH.
And as we go through this section, I want you to
FOCUS NOT SO MUCH ON THE CODE, but ON THE
INGREDIENTS: what they are, and how they fit together.
Linear Regression
Here's a typical linear regression problem.
We're trying to PREDICT PRICES OF INDIVIDUAL HOUSES.
And we're given three pieces of information
about each house, three features:
!CLICK! FLOOR AREA,
!CLICK! DISTANCE FROM PUBLIC TRANSPORT,
!CLICK! Number of rooms.
Inputs
And we're going to represent our features in a matrix
with as many rows as we have houses and three columns,
one for each of our input features. We'll call that matrix X.
And we're trying to predict this vector Y, which represents
the housing prices.
Inputs
X_train = np.array([
[1250, 350, 3],
[1700, 900, 6],
[1400, 600, 3]
])
Y_train = np.array([345000, 580000, 360000])
Here's what that looks like in numpy.
Model
Multiply each feature by a weight and add them up.
Add an intercept to get our final estimate.
Next we have to consider our MODEL. The model
is the set of functions that we're going to
consider in mapping X to Y. Since this is
linear regression, we'll multiply each feature
by a weight, add those up, and add an intercept to get our estimate.
Model
And that corresponds to drawing the line of best fit
through the data.
Model - Parameters
weights = np.array([300, -10, -1])
intercept = -26497
So the parameters of this model will be the
three weights that correspond to each feature,
and the intercept.
Model - Operations
And the key operation of this model will be matrix
multiplication of X by the weights. Then we'll add
the intercept element-wise to get our
final prediction.
Model - Operations
def model(X, weights, intercept):
    return X.dot(weights) + intercept
Y_hat = model(X_train, weights, intercept)
Model - Cost function
Now the next ingredient we'll need is a COST FUNCTION,
also called a LOSS FUNCTION. We need this to measure how
good or bad a set of parameters is, how close our predictions
are getting to the actual values. For example, this
is a really badly-fit line.
Model - Cost function
So what we'll do is take the difference between the prediction
and the actual value, and square it.
Model - Cost function
Cost function
def cost(Y_hat, Y):
    return np.sum((Y_hat - Y)**2)
Optimization
Hold X and Y constant. Adjust parameters to minimize cost.
Now we need to actually find the parameters that give us the best fit.
In other words, holding X and Y constant, we'll adjust our parameters
to minimize the cost.
Optimization
Each set of parameters will yield a cost, so we can
plot cost against parameter values. Our goal in
optimization is to find the parameters that correspond
to that lowest point.
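To make that concrete, here's a toy sketch of my own (not from the talk): sweep a single weight, w0, holding the other parameters fixed at the values from the earlier slide, and the cost traces out exactly this kind of curve.
import numpy as np

# sweep candidate values for w0, keeping the other parameters fixed
w0_candidates = np.linspace(0, 600, 200)
costs = []
for w0 in w0_candidates:
    candidate_weights = np.array([w0, -10, -1])
    Y_hat = X_train.dot(candidate_weights) + intercept
    costs.append(np.sum((Y_hat - Y_train)**2))

best_w0 = w0_candidates[np.argmin(costs)]  # the bottom of the curve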
Trial and error
Image source: Wikimedia Commons
And we're going to do that by trial and error. By this I don't mean
just trying random sets of parameters and seeing what works best,
but the trial and error you do when you're, say, practising how
to shoot hoops and you're trying to adjust your angle. So you shoot,
and you miss by a couple inches. You're too far to the right. So
you adjust your angle to the left and try again.
Optimization
That's what we're going to do also. We'll try a set of parameters,
then we'll calculate our cost, and then we'll follow the gradient
of the cost curve at that point down towards the minimum. This
process is called GRADIENT DESCENT.
Optimization
Optimization - Gradient Calculation
$$\hat{y} = w_0x_0 + w_1x_1 + w_2x_2 + b$$
$$\epsilon = (y-\hat{y})^2$$
Goal: \(\frac{\partial\epsilon}{\partial w_i}, \frac{\partial\epsilon}{\partial b}\)
So we need to be able to calculate the gradient of
the cost, epsilon, with respect to each of the weights and
the intercept.
Optimization - Gradient Calculation
Chain rule: \(\frac{\partial\epsilon}{\partial w_i} =
\frac{d\epsilon}{d\hat{y}}\frac{\partial\hat{y}}{\partial w_i} \)
Applying the chain rule, we can break that up into
two pieces: the gradient of the cost with respect to
the predicted y, y hat, and the gradient of y hat
with respect to the weight. So let's calculate those.
Optimization - Gradient Calculation
$$\hat{y} = w_0x_0 + w_1x_1 + w_2x_2 + b$$
\(\frac{\partial\hat{y}}{\partial w_0} = x_0\)
The gradient of y hat with respect to w naught is pretty simple.
All the terms are constant with respect to w naught so those go
to zero and we're left with x naught.
Optimization - Gradient Calculation
$$\epsilon = (y-\hat{y})^2$$
\(\frac{d\epsilon}{d\hat{y}} = -2(y-\hat{y})\)
For the second gradient, we bring down the power and then apply
the chain rule again to bring out that negative sign.
Optimization - Gradient Calculation
\(\frac{\partial\hat{y}}{\partial w_0} = x_0\)
\(\frac{d\epsilon}{d\hat{y}} = -2(y-\hat{y})\)
\(\frac{\partial\epsilon}{\partial w_0} =
-2(y-\hat{y})x_0 \)
So to get our desired gradient, we multiply those together
to get this expression. And that goes for all the weights.
Optimization - Gradient Calculation
$$\hat{y} = w_0x_0 + w_1x_1 + w_2x_2 + b\cdot1$$
\(\frac{\partial\epsilon}{\partial b} =
-2(y-\hat{y})\cdot 1 \)
As for the intercept b, we can consider that a special
weight where the x it corresponds to is always 1. So
that's the form the gradient will take with respect to b.
Optimization - Gradient Calculation
delta_y = y - y_hat
gradient_weights = -2 * delta_y * x
gradient_intercept = -2 * delta_y * 1
Optimization - Parameter Update
weights = weights - gradient_weights
intercept = intercept - gradient_intercept
And then we just want to move the weights against the
gradient, downhill, and we do that by subtracting.
Optimization - Overshoot
But, just like when you're practising basketball,
you might overcorrect. You're too far to the right, you
adjust your angle to the left, and you wind up too far
to the left. So you move right, and now you've overshot
in the other direction. And maybe you get angrier
and angrier so you wind up even more wildly off as
time goes on. We can do this in gradient descent
also.
Optimization - Undershoot
Or you might have the opposite problem: you're too timid
in making your corrections, so it takes you forever to get
to the minimum. You converge really slowly.
And if you have a cost curve that's uglier than this,
with lots of local minima, you may get stuck inside
a local minimum.
Optimization - Parameter Update
learning_rate = 0.05
weights = weights - \
learning_rate * gradient_weights
intercept = intercept - \
learning_rate * gradient_intercept
So we're going to try to be Goldilocks, and
aim for something in between those two.
We regulate this using a hyperparameter
called the learning rate. The larger the
learning rate, the bigger the steps you take.
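Here's a tiny 1-D illustration of my own (not from the talk) of what the learning rate does, minimizing cost(w) = w**2 by gradient descent:
def descend(learning_rate, w=1.0, steps=5):
    # gradient descent on cost(w) = w**2, whose gradient is 2*w
    for _ in range(steps):
        w = w - learning_rate * (2 * w)
    return w

print(descend(1.1))    # too large: each step overshoots and |w| grows
print(descend(0.001))  # too small: w barely moves, convergence is slow
print(descend(0.3))    # in between: w rapidly approaches the minimum at 0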
Training
def training_round(x, y, weights, intercept,
                   alpha=learning_rate):
    # calculate our estimate
    y_hat = model(x, weights, intercept)
    # calculate error
    delta_y = y - y_hat
    # calculate gradients
    gradient_weights = -2 * delta_y * x
    gradient_intercept = -2 * delta_y
    # update parameters
    weights = weights - alpha * gradient_weights
    intercept = intercept - alpha * gradient_intercept
    return weights, intercept
Putting all that together, here's how training goes.
Whatever our current weights and intercept are,
we calculate our prediction, calculate our error,
compute the gradients, and update our parameters
by gradient descent.
Training
NUM_EPOCHS = 100

def train(X, Y):
    # initialize parameters
    weights = np.random.randn(3)
    intercept = 0
    # training rounds
    for i in range(NUM_EPOCHS):
        for (x, y) in zip(X, Y):
            weights, intercept = training_round(x, y,
                                                weights, intercept)
    return weights, intercept
That was a single round of training. The entire
training process involves first initializing
our parameters and doing some number of training rounds.
Whatever the weights and intercept are at the end,
that's what we'll use to predict with.
Testing
def test(X_test, Y_test, weights, intercept):
    Y_predicted = model(X_test, weights, intercept)
    error = cost(Y_predicted, Y_test)
    # cost returns a sum of squares, so divide by n before the root
    return np.sqrt(error / len(Y_test))

>>> test(X_test, Y_test, final_weights, final_intercept)
6052.79
And testing is simple: we get our estimate and figure out
how far off we were on average. And on this dataset, we
were about $6000 off.
Uh, wasn't this supposed to be a talk about neural networks?
Why are we talking about linear regression?
Okay, so you may be wondering: I came here to hear about
deep learning and neural networks. Why are we doing something
so basic as linear regression?
Surprise! You've already made a neural network!
Surprise! We actually just made a neural network!
Linear regression = Simplest neural network
Linear regression is one of the simplest possible neural networks.
It's so simple that we don't even call it a neural network, because
it preceded neural networks. But if you look at the definition of
a neural network, linear regression fits the bill. We have an input
layer, consisting of three neurons, we have an output layer, consisting
of a single neuron, and we have weights on the edges between those
neurons.
Once more, with TensorFlow
Now that we know that linear regression is a neural
network in disguise, we can rewrite it in TensorFlow.
Inputs
Model - Parameters
Model - Operations
Cost function
Optimization
Train
Test
What we're going to do is take those seven ingredients we
went through in numpy and recast them in TensorFlow.
Inputs → Placeholders
import tensorflow as tf
X = tf.placeholder(tf.float32, [None, 3])
Y = tf.placeholder(tf.float32, [None, 1])
Ingredient 1. Inputs. Already this looks very different.
Instead of supplying the data directly in numpy arrays,
we're going to have placeholders. The X placeholder says
I'll be a matrix of floats, and I'll have three columns.
The "None" here means I'm going to push through a variable
number of houses each time. You can specify a number here,
but then you'll be stuck with it. For flexibility, we'll just
say None. Similarly, the Y placeholder corresponds to
a single value, so it'll be a single column.
Parameters → Variables
# create tf.Variable(s)
W = tf.get_variable("weights", [3, 1],
initializer=tf.random_normal_initializer())
b = tf.get_variable("intercept", [1],
initializer=tf.constant_initializer(0))
Our parameters will be represented by TensorFlow
variables. We're mapping three neurons to one so the shape
of our weights will be three rows and one column. And
we can specify here how we want to initialize our
weights, which we'll sample from a random normal distribution.
The intercept we'll set to zero.
Operations
Y_hat = tf.matmul(X, W) + b
Our operation will be matrix multiplication and addition.
Cost function
cost = tf.reduce_mean(tf.square(Y_hat - Y))
And this will be our cost function. reduce_mean just means
mean.
Optimization
learning_rate = 0.05
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate).minimize(cost)
We'll specify that we're using gradient descent with
a certain learning rate, and the quantity we want to
minimize is cost.
Training
with tf.Session() as sess:
    # initialize variables
    sess.run(tf.global_variables_initializer())
    # train
    for _ in range(NUM_EPOCHS):
        for (X_batch, Y_batch) in get_minibatches(
                X_train, Y_train, BATCH_SIZE):
            sess.run(optimizer,
                     feed_dict={
                         X: X_batch,
                         Y: Y_batch
                     })
Now for the training process. First of all, notice
that we're going to do this within a TensorFlow session.
In TensorFlow, nothing happens outside of a session.
It's only within a session that you can start writing
to the CPU or GPU, performing computations. You can't even
add two numbers in TensorFlow without going into a session.
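As a quick illustration (my sketch, not from the slides): even adding two constants only defines graph nodes; the arithmetic itself runs inside the session.
import tensorflow as tf

a = tf.constant(2)
b = tf.constant(3)
total = a + b  # just builds a graph node; nothing is computed yet

with tf.Session() as sess:
    print(sess.run(total))  # 5 -- the addition actually happens here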
Training
with tf.Session() as sess:
    # initialize variables
    sess.run(tf.global_variables_initializer())
    # train
    for _ in range(NUM_EPOCHS):
        for (X_batch, Y_batch) in get_minibatches(
                X_train, Y_train, BATCH_SIZE):
            sess.run(optimizer,
                     feed_dict={
                         X: X_batch,
                         Y: Y_batch
                     })
Then we initialize the variables according to how
we defined them outside of the session.
Training
with tf.Session() as sess:
    # initialize variables
    sess.run(tf.global_variables_initializer())
    # train
    for _ in range(NUM_EPOCHS):
        for (X_batch, Y_batch) in get_minibatches(
                X_train, Y_train, BATCH_SIZE):
            sess.run(optimizer,
                     feed_dict={
                         X: X_batch,
                         Y: Y_batch
                     })
Now we're going to feed through the actual data. X_train and Y_train
are the actual numpy arrays. We're also going to use minibatches,
where we random.shuffle the data and feed the data through batch
by batch.
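The slides never show get_minibatches itself, so here's one possible implementation, assuming X and Y are equal-length numpy arrays:
import numpy as np

def get_minibatches(X, Y, batch_size):
    # shuffle once per epoch, then yield successive batches
    indices = np.arange(len(X))
    np.random.shuffle(indices)
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], Y[batch]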
Training
with tf.Session() as sess:
    # initialize variables
    sess.run(tf.global_variables_initializer())
    # train
    for _ in range(NUM_EPOCHS):
        for (X_batch, Y_batch) in get_minibatches(
                X_train, Y_train, BATCH_SIZE):
            sess.run(optimizer,
                     feed_dict={
                         X: X_batch,
                         Y: Y_batch
                     })
And we pass the batches into the optimizer,
inserting them into their respective placeholders.
# Placeholders
X = tf.placeholder(tf.float32, [None, 3])
Y = tf.placeholder(tf.float32, [None, 1])
# Parameters/Variables
W = tf.get_variable("weights", [3, 1],
    initializer=tf.random_normal_initializer())
b = tf.get_variable("intercept", [1],
    initializer=tf.constant_initializer(0))
# Operations
Y_hat = tf.matmul(X, W) + b
# Cost function
cost = tf.reduce_mean(tf.square(Y_hat - Y))
# Optimization
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate).minimize(cost)
# ------------------------------------------------
# Train
with tf.Session() as sess:
    # initialize variables
    sess.run(tf.global_variables_initializer())
    # run training rounds
    for _ in range(NUM_EPOCHS):
        for X_batch, Y_batch in get_minibatches(
                X_train, Y_train, BATCH_SIZE):
            sess.run(optimizer,
                     feed_dict={X: X_batch, Y: Y_batch})
So here's all that code in one place. I want you
to notice a couple of things. First of all, remember
all that math we did to calculate the gradients? The
code that came out of that is gone.
# Placeholders
X = tf.placeholder(tf.float32, [None, 3])
Y = tf.placeholder(tf.float32, [None, 1])
# Parameters/Variables
W = tf.get_variable("weights", [3, 1],
    initializer=tf.random_normal_initializer())
b = tf.get_variable("intercept", [1],
    initializer=tf.constant_initializer(0))
# Operations
Y_hat = tf.matmul(X, W) + b
# Cost function
cost = tf.reduce_mean(tf.square(Y_hat - Y))
# Optimization
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate).minimize(cost)
# ------------------------------------------------
# Train
with tf.Session() as sess:
    # initialize variables
    sess.run(tf.global_variables_initializer())
    # run training rounds
    for _ in range(NUM_EPOCHS):
        for X_batch, Y_batch in get_minibatches(
                X_train, Y_train, BATCH_SIZE):
            sess.run(optimizer,
                     feed_dict={X: X_batch, Y: Y_batch})
The other thing is that the code is divided into two parts.
The part outside the session, and the part inside the session. And I
want you to think of that first part as like the blueprints for
something, while the part within the session is like actually building
that thing.
Computation graph
And the thing that we're building is the computation graph.
This is where we take all the variables and the operations
and sequence them together. So here's the dot product,
the addition of the intercept. But that's not all the computation we need to do. There's also
the computation of the error, so let's add that on.
Computation graph
Forward propagation
So what do we do with the computation graph? First, forward propagation.
We take our current weights and the x's and y's that got fed through
and propagate them through the graph by performing the designated operations.
Forward propagation
Forward propagation
Forward propagation
Forward propagation
Forward propagation
def training_round(x, y, weights, intercept,
                   alpha=learning_rate):
    # calculate our estimate
    y_hat = model(x, weights, intercept)
    # calculate error
    delta_y = y - y_hat
    # calculate gradients
    gradient_weights = -2 * delta_y * x
    gradient_intercept = -2 * delta_y
    # update parameters
    weights = weights - alpha * gradient_weights
    intercept = intercept - alpha * gradient_intercept
    return weights, intercept
That corresponds to these two lines from our numpy code.
Backpropagation
Next, we need to calculate the gradient of the cost with respect
to each of our variables. This process is called back-propagation,
or backprop for short. We start at the end and work backwards.
Backpropagation
First off, the derivative of the error with respect to the error
is just going to be 1. Simple enough.
Backpropagation
And from this point on, what we're going to do is calculate
the local gradient and multiply it by the gradient computed up
to that point. Let's work through a concrete example.
The function here is taking the square. Derivative of that is
just 2 delta. Delta here was -2, so our local gradient is -4.
We multiply that by the gradient calculated so far, 1. So
the gradient of the error with respect to delta is going to be -4.
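In code, that single backprop step is just the chain rule as arithmetic (restating the numbers above):
delta = -2.0                                 # value entering the squaring node
upstream_grad = 1.0                          # d(error)/d(error) = 1
local_grad = 2 * delta                       # derivative of delta**2 -> -4.0
grad_wrt_delta = upstream_grad * local_grad  # -4.0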
Backpropagation
And we just continue working backwards. I won't go through it here
but you're welcome to work through it yourself with the help of the slides.
Backpropagation
Backpropagation
Backpropagation
Backpropagation
Backpropagation
Backpropagation
Backpropagation
So the gradient with respect to w naught is -8. That's -2 times
delta times its corresponding x naught...
Backpropagation
def training_round(x, y, weights, intercept,
                   alpha=learning_rate):
    # calculate our estimate
    y_hat = model(x, weights, intercept)
    # calculate error
    delta_y = y - y_hat
    # calculate gradients
    gradient_weights = -2 * delta_y * x
    gradient_intercept = -2 * delta_y
    # update parameters
    weights = weights - alpha * gradient_weights
    intercept = intercept - alpha * gradient_intercept
    return weights, intercept
...which is what we computed should be the case in the numpy code.
Variable Update
Lastly, we have to update the variables. Let's just drop everything
that isn't a variable.
Variable Update
And then that weight w naught will be updated to the top number minus
the learning rate times the bottom number, the gradient.
Variable Update
Variable Update
def training_round(x, y, weights, intercept,
                   alpha=learning_rate):
    # calculate our estimate
    y_hat = model(x, weights, intercept)
    # calculate error
    delta_y = y - y_hat
    # calculate gradients
    gradient_weights = -2 * delta_y * x
    gradient_intercept = -2 * delta_y
    # update parameters
    weights = weights - alpha * gradient_weights
    intercept = intercept - alpha * gradient_intercept
    return weights, intercept
That corresponds to these last two lines.
Numpy → TensorFlow
sess.run(optimizer,
feed_dict={
X: X_batch,
Y: Y_batch
})
I just want to take a moment here to appreciate how much
work TensorFlow saved us. All those lines of
code in the previous slide
basically correspond to this one line of TensorFlow code.
Once we defined the computation graph, which is implicitly
encoded in the optimizer variable, all the steps within
training were handled by TensorFlow, including the gradient
computation. Which wasn't very painful in the case of linear
regression, but it can get tedious fast once we start
adding more operations to our model.
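As an aside (this isn't on the slides), you can peek at what TensorFlow derives for you via tf.gradients:
# the gradients of the cost with respect to W and b,
# derived automatically from the computation graph
grad_W, grad_b = tf.gradients(cost, [W, b])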
Testing
with tf.Session() as sess:
    # train
    # ... (code from above)
    # test
    Y_predicted = sess.run(Y_hat,
        feed_dict={X: X_test})
    squared_error = np.mean(
        np.square(Y_test - Y_predicted))

>>> np.sqrt(squared_error)
5967.39
Our last step is testing. Within the same session,
we run a separate set of data, X_test, through the model
to get our predictions and compute our error.
Logistic regression
So that was linear regression. What if we want to do
classification?
Problem
An example classification problem is the MNIST
dataset. These are small images of handwritten digits
0 through 9. Our features are the values of the pixels
and we're trying to predict which digit is which.
Binary classification
But before we get to the ten-way classification, let's
talk about how we would do binary classification. We have
a bunch of samples that are positive and a bunch that are negative
and we want to be able to classify them. We can do this
with logistic regression.
Binary logistic regression - Model
Take a weighted sum of the features
and add a bias term to get the logit.
Convert the logit to a probability
via the logistic-sigmoid function.
Our model will be this. First take a weighted sum of the
features and add a number that we'll call the bias. This
should sound a lot like linear regression, except that
we give the outcome a funny name, the logit. Then
we'll convert the logit to a probability of belonging
to the positive class via the logistic sigmoid function.
Binary logistic regression - Model
This is what it looks like in neural network graphical style.
We compute the logit and then apply this non-linear activation
function, which I'll depict with this red semi-circle.
Logistic-sigmoid function
$f(x) = \frac{e^x}{1+e^x}$
This is what the logistic sigmoid function looks like.
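In numpy, mirroring the formula above, that's a one-liner:
import numpy as np

def sigmoid(x):
    # squashes any real-valued logit into (0, 1)
    return np.exp(x) / (1 + np.exp(x))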
Classification with logistic regression
Image generated with playground.tensorflow.org
We'll take 0.5 as the cut-off. If the probability is greater than
0.5, we'll declare a sample positive. Otherwise, negative.
The further away you are from the line in the positive direction,
the more likely you are to be positive.
Model
Okay, now let's go back to the ten-dimensional problem. We'll
have ten neurons in our output layer corresponding to each
digit. Therefore, we'll have ten times the number of weights
and ten different bias terms. And we need a way to turn those
ten logits into probabilities. We can't just apply the logistic
function to each of those logits, because that won't give you
numbers that sum up to 1.
Softmax
Z = np.sum(np.exp(logits))
softmax = np.exp(logits) / Z
Instead, we're going to use the softmax function, which
is just the multinomial version of the logistic function.
It takes all of those logits and transforms them into
a probability distribution.
Model
So that's the graphical representation of that.
And we can start coding!
Placeholders
# X = vector length 784 (= 28 x 28 pixels)
# Y = one-hot vectors
# digit 0 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
X = tf.placeholder(tf.float32, [None, 28*28])
Y = tf.placeholder(tf.float32, [None, 10])
Our X placeholder is going to have 784 columns, one for each pixel.
We're going to use one-hot vectors to encode our digits, with a 1
in the position corresponding to the correct digit.
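The talk doesn't show the encoding step, but one simple way to build those one-hot vectors is something like this (a sketch, assuming integer labels 0 through 9):
import numpy as np

def one_hot(labels, num_classes=10):
    # e.g. one_hot([0]) -> [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1
    return encoded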
Variables
# Parameters/Variables
W = tf.get_variable("weights", [784, 10],
initializer=tf.random_normal_initializer())
b = tf.get_variable("bias", [10],
initializer=tf.constant_initializer(0))
Now for the variables.
We're taking 784 input neurons to 10 output neurons,
so our weight matrix will be 784 by 10.
And we have ten biases.
Operations
Y_logits = tf.matmul(X, W) + b
Our operation is going to be matrix multiplication again.
And you're probably thinking, wait, you're missing one
operation. What happened to the softmax?
But remember from the computation graph that the model
and the cost function meld into each other, so we just
push that to the cost function.
Cost function
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
        logits=Y_logits, labels=Y))
So here's our cost function, softmax cross entropy,
and you can see that this particular function
expects logits and does the softmax internally.
If we'd computed the softmax in the operations part
and then supplied the probabilities to this function,
we'd be implicitly doing the softmax twice.
Cost function
Cross Entropy
$H(\hat{y}) = -\sum\limits_i y_i \log(\hat{y}_i)$
This cross-entropy, incidentally, is a cost function
very commonly used in classification. And it basically
says, whatever the correct y is, I want the probability
of being that y to be as close to 1 as possible. And it
imposes a logarithmic cost for your distance away from 1.
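A quick numpy sketch of that behaviour (my illustration, assuming y is one-hot and y_hat is a probability distribution):
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

y = np.array([0, 0, 1])  # the correct class is the third one
print(cross_entropy(y, np.array([0.1, 0.1, 0.8])))  # ~0.22: small cost
print(cross_entropy(y, np.array([0.8, 0.1, 0.1])))  # ~2.30: large cost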
Optimization
learning_rate = 0.05
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate).minimize(cost)
Our optimization code is exactly the same as in linear regression.
Training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(NUM_EPOCHS):
        for (X_batch, Y_batch) in get_minibatches(
                X_train, Y_train, BATCH_SIZE):
            sess.run(optimizer,
                     feed_dict={X: X_batch,
                                Y: Y_batch})
Ditto with the training code.
Testing
predict = tf.argmax(Y_logits, 1)

with tf.Session() as sess:
    # training code from above
    predictions = sess.run(predict,
        feed_dict={X: X_test})
    accuracy = np.mean(
        np.argmax(Y_test, axis=1) == predictions)

>>> accuracy
0.925
Testing is a bit different, because when we compute
how accurate our classification algorithm is, we don't
want to use the vector of probabilities, we want the predicted
digit itself. So we'll define a new operation
called predict that takes the argmax of the logits.
So we run our test data through this operation and get the predictions,
and we can compute our accuracy. Which in this case is 92.5%.
It turns out that 92.5% for MNIST is pretty bad. The state of the art
on this task is upwards of 99%. And one reason for this is that we're
using a linear model, and linear models are pretty weak.
Deficiencies of linear models
Image generated with playground.tensorflow.org
When all you can do is draw a straight line, you can't approximate
a function like exclusive OR.
Deficiencies of linear models
Image generated with playground.tensorflow.org
And you can't do something like concentric circles, either.
So what can we do?
Let's go deeper!
Well, I've heard that there's this magic thing called
deep learning, so let's go deeper.
Adding another layer
So what we'll do is add a hidden layer to our neural network.
It's called hidden because we have no idea what the values
are supposed to be. We know what our X's are and what our Y's
are, but we have no idea about the values of those hidden neurons.
Adding another layer - Variables
HIDDEN_NODES = 128
W1 = tf.get_variable("weights1", [784, HIDDEN_NODES],
initializer=tf.random_normal_initializer())
b1 = tf.get_variable("bias1", [HIDDEN_NODES],
initializer=tf.constant_initializer(0))
W2 = tf.get_variable("weights2", [HIDDEN_NODES, 10],
initializer=tf.random_normal_initializer())
b2 = tf.get_variable("bias2", [10],
initializer=tf.constant_initializer(0))
So we'll need two sets of weights and biases.
Adding another layer - operations
hidden = tf.matmul(X, W1) + b1
y_logits = tf.matmul(hidden, W2) + b2
And we'll do two rounds of matrix multiplications.
The rest of the code is just the same, so let's check our results...
Results
# hidden layers Train accuracy Test accuracy
0 93.0 92.5
1 89.2 88.8
Wait a minute.
What!
I was told going deeper was going to HELP,
but my accuracy went down!
Is Deep Learning just hype?
(Well, it's a little bit over-hyped...)
Problem
A linear transformation of a linear
transformation is still a
linear transformation!
We need to add non-linearity to the system.
And the problem is that we just did a linear transform
of a linear transform, which is still linear!
We need to add non-linearity.
Adding non-linearity
And we already know how to do that, right? Before, we applied
a non-linear activation function to our output neurons.
Now we just do the same thing for the hidden neurons.
Adding non-linearity
Non-linear activation functions
There's a bunch of non-linear activation functions
we can consider, some work better than others. The one
on the right, Rectified Linear Units, or ReLU for short,
is pretty popular, so we'll use that.
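ReLU itself is nothing fancy; in numpy it would just be:
import numpy as np

def relu(x):
    # pass positive values through unchanged, zero out the negatives
    return np.maximum(0, x)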
Adding non-linearity
Here's what that looks like.
Operations
hidden = tf.nn.relu(tf.matmul(X, W1) + b1)
y_logits = tf.matmul(hidden, W2) + b2
So we'll just amend our operations to apply ReLU to the
weighted sums of the hidden layer.
Results
# hidden layers Train accuracy Test accuracy
0 93.0 92.5
1 97.9 95.2
And yay, our accuracy went up!
What the hidden layer bought us
Image generated with playground.tensorflow.org
So what does having a hidden layer buy us? Well, it can classify
things like XOR. Here's what the classification boundary looks like,
I think this is with 4 hidden neurons.
What the hidden layer bought us
Image generated with playground.tensorflow.org
And we can also classify concentric circles with 3 hidden neurons.
Adding hidden neurons
2 hidden neurons
Image generated with ConvNetJS by Andrej Karpathy
Let's take a look at how classification boundaries change as we
add more hidden neurons. Here's what it looks like with 2 neurons.
Adding hidden neurons
3 hidden neurons
Image generated with ConvNetJS by Andrej Karpathy
And you can see that as we add neurons, our classification boundary
gets more and more complex.
Adding hidden neurons
Image generated with ConvNetJS by Andrej Karpathy
And what we did was to transform our 2-D space into a 5-D space
where the positive and negative samples could be linearly separated.
Universal approximation theorem
A feedforward network with a single hidden layer
containing a finite number of neurons can approximate
(basically) any interesting function
So we saw that we can make our classification boundary
more and more complex by adding neurons to the hidden layer.
And it turns out that there's a theorem that says,
you can approximate basically any
interesting function using a single hidden layer. You may need
a LOT of hidden neurons, and it may be almost impossible
to actually train, but theoretically it exists.
Regularization
But all that modelling power comes with a risk: the network can
overfit, contorting itself to the training data. So what do we
do about that? Regularization.
Regularization
Put the brakes on the training data
by enforcing constraints on weights .
And the idea behind regularization is this. Up till now,
our data has been in the driver's seat. Whatever the data says, goes.
If I have an outlier over here and there's no other data points
to argue against it, I'm going to contort my boundary to accommodate
that outlier.
Now we're going to put the brakes on the data with regularization.
And we'll do that by enforcing constraints on the parameters we learn.
Regularization
L2 regularization: weights should be small.
$L = \sum{w_i^2}$
You're probably familiar with L1 and L2 regularization, because
you also see them used in linear and logistic regression, SVMs,
etc. And what L2 regularization says is simply: I want my weights to be small.
And it does that by imposing a cost, or loss, on the weights
that's the sum of the squares of the weights. The larger the weights,
the bigger that cost.
L2 Regularization in TensorFlow
cost += REGULARIZATION_CONSTANT * \
    (tf.nn.l2_loss(W1) +
     tf.nn.l2_loss(W2) +
     tf.nn.l2_loss(W3))
And we'll add that regularization loss to our existing
data loss, which we already encoded into the cost function.
We have another hyperparameter here that determines how
tightly we leash the data.
Results
Regularization Train accuracy Test accuracy
None 95.5 92.9
L2 95.1 95.1
And here are the results! Pretty good!
Dropout - Train
Now let's talk about a different regularization method,
called dropout. This isn't really applicable to linear
methods, but it's something we can do once we have hidden layers.
Dropout - Train
And this is what we're going to do: at each training step,
we're going to randomly knock out half of our hidden neurons.
Dropout - Train
Dropout: why it works
"Averaging" over several models
Forces redundancy of useful features
No conspiracies! Hidden neurons must be individually useful
And this probably seems CRAZY to you. Why would we want to forfeit
half the modelling power of that layer? Why would we want to lose half
the information in our weights? Well, we do it because it works.
And here are some reasons people cite for why it works.
Firstly, every time we drop out some neurons, it's like we're building
a new prediction model. And when in the end, we take all the neurons
together, it's like we're averaging together all those models. If you're
familiar with ensemble learning, it's just like that, except that you're
doing it internally. Kinda freaky, I know.
Second reason is that it forces useful features to repeat themselves.
And that makes our network more robust.
Thirdly, we can't have a good conspiracy if your co-conspirator might
drop out at any time. So hidden neurons must be individually useful,
and again, we can't have three neurons over here conspiring to accommodate
an outlier because one or two of them may be missing at any time.
Dropout - Test
Okay, so during training we drop out half the neurons.
At test time, we want to keep all the hidden nodes.
But that creates a problem: our logits at each point are
going to be twice as much as they were in training.
Dropout - Train
To solve that problem, for the neurons that stayed alive
in training, we'll double their outputs (in general, scale them
by 1/keep_prob), so the scale matches what the network sees at test time.
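Here's a sketch of that scaling trick, usually called inverted dropout (my own illustration of the idea, not the talk's code):
import numpy as np

def dropout_train(activations, keep_prob=0.5):
    # randomly zero out neurons, then scale survivors by 1/keep_prob
    # so the expected output matches what the network sees at test time
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob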
Dropout in TensorFlow
# add a new placeholder
keep_prob = tf.placeholder(tf.float32)
# add a step to the model
hidden = tf.nn.relu(tf.matmul(X, W1) + b1)
dropout = tf.nn.dropout(hidden, keep_prob)
y_logits = tf.matmul(dropout, W2) + b2
Now let's look at the code. We need to add a new placeholder.
What the placeholder does is define the probability of keeping
a neuron. It's usually set to 0.5.
And then we're going to add a dropout operation between the hidden
layer and the output layer, supplying it the keep probability.
Dropout in TensorFlow
with tf.Session() as sess:
    # ... init, then train:
    for _ in range(NUM_EPOCHS * 2):
        for (X_batch, Y_batch) in get_minibatches(
                X_train, Y_train, BATCH_SIZE):
            sess.run(optimizer,
                     feed_dict={
                         X: X_batch, Y: Y_batch,
                         keep_prob: 0.5
                     })
    # test
    sess.run(predict, feed_dict={X: X_test,
                                 keep_prob: 1.0})
And during training we pass it a keep_prob of 0.5, dropping out
half the neurons, and during test time we supply it a keep_prob of 1.0.
TensorFlow handles all the dropping out of neurons and the scaling of
the logits. Thank you, TensorFlow!
Another thing to note is that training is generally a lot slower with
dropout, because you're considering all these different models. So
you'll want to bump up your number of training rounds.
Results
Regularization Train accuracy Test accuracy
None 95.5 92.9
L2 95.1 95.1
Dropout 93.3 93.1
And with MNIST I found that I got a modest increase in test accuracy.
Where to from here?
Okay! We made it to the end! Where do we go from here?
Ingredients
Placeholders
Model - Variables
Model - Operations
Cost function
Optimization
Train/Test
Regularization
We saw that there were a number of ingredients
we had to define and consider in every single
model we created.
A guide to further exploration
Placeholders
Model - Variables
Model - Operations
Cost function
Optimization
Train/Test
Regularization
These ingredients also constitute a good roadmap
of exploring deep learning further.
A guide to further exploration
Placeholders
Model - Variables
Model - Operations
Cost function
Optimization
Train/Test
Regularization
We'll ignore placeholders and cost function
because those are pretty much driven by the problem.
We also saw that training and testing didn't differ
much by problem. So let's explore the others.
Model - Variables
# layers, # neurons / layer
Image source: Fjodor van Veen (2016) Neural Network Zoo
In terms of variables, we can consider varying the number of
hidden layers, and the number of hidden neurons per layer.
Model - Variables
tf.random_normal_initializer
tf.random_uniform_initializer
tf.truncated_normal_initializer
tf.constant_initializer
tf.contrib.layers.xavier_initializer
We can also explore different methods of initializing variables.
Model - Operations
Activation functions: ReLU, tanh, leaky ReLU, Maxout...
Image source: Fjodor van Veen (2016) Neural Network Zoo
And different non-linear activation functions. ReLU is the current
recommended starting point but you can play around with these others.
Model
Convolutional neural networks (images)
Recurrent neural networks (sequences & time series)
Image source: Fjodor van Veen (2016) Neural Network Zoo
You can also consider an entirely different architecture than feedforward neural networks.
If you're doing anything with images, take a look at convolutional neural networks and
if you're doing anything with text, sequences, time series, look at recurrent neural networks.
Optimization
Try Adam
Image source: Alec Radford
We used gradient descent exclusively in this talk, but it's actually one of the slowest-converging
optimizers. It's the red straggler in this GIF. The current recommended
one to try first is the Adam optimizer, so give that a shot.
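In the TensorFlow 1.x API used throughout this talk, that's a one-line swap:
# replace the GradientDescentOptimizer line with:
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)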
Optimization & Regularization
L1, L2 regularization
Dropout
Batch normalization
Layer normalization
Lastly, there are a bunch of regularization slash
optimization techniques that can help with training and
reducing overfitting. We looked at L2 regularization and dropout,
also check out batchnorm and layernorm.
Other toolkits
Torch (PyTorch)
Caffe
mxnet
DyNet
Many others...
We used the TensorFlow library in this talk, but there are
other deep learning toolkits out there. These are some I've heard
good things about.
Keras
$$
\begin{align*}
\textrm{numpy} &: \textrm{scikit-learn} \\
&:: \\
\textrm{TensorFlow} &: \textrm{Keras}
\end{align*}
$$
I also want to give a plug for Keras. This is a higher-level
library that sits atop TensorFlow (there's also a Theano back-end).
Keras is to TensorFlow roughly as scikit-learn is to numpy:
it simplifies things considerably.
Keras
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(input_dim=784, units=128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=10, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer=keras.optimizers.SGD(lr=0.05),
metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=100, batch_size=120)
model.evaluate(X_test, Y_test)
This is how the TensorFlow code we've been using translates to Keras.
Hopefully with the background provided in this talk, you can understand
what's happening in every line intuitively.
Final thoughts
If you're familiar with traditional ML, you can do deep learning!
But you'll need data. Lots of it.
So try traditional ML first.
Go forth and experiment!
Thank you!
Slides: michelleful.github.io/PyCon2017