Derivative and Backpropagation

Introduction

This post revisits the fundamental concepts of derivatives and highlights their crucial role in training neural networks. We will begin by methodically calculating the derivative of a mathematical expression, refreshing our understanding of its principles and implications.

Along the way, we will cover the chain rule, a critical technique for computing derivatives of composite functions, and explore how it is applied during backpropagation.

After covering these basics, we will put the concepts into practice using an incomplete implementation of Andrej Karpathy’s Micrograd Library, performing a manual backpropagation process on a simple neural network.

If you are already well-versed in Calculus, the chain rule, and their applications in neural networks, feel free to skip ahead.

Derivative

A Neural Network Perspective

When I first revisited the core concepts of derivatives to understand how a neural network trains itself, all sorts of definitions I had learned in high school came flooding back to me:

  • Instantaneous rate of change
  • Slope of the tangent line
  • Sensitivity of a function’s change with respect to its input

These are all valid descriptions of derivatives, and we generally have a good understanding of how each ties to the underlying concept. However, for me, the last description—sensitivity of change—resonates most clearly when thinking about neural network training.

In essence, a neural network is built upon matrix multiplications of input data, weights, and biases. Training involves iteratively adjusting these parameters to minimize the loss function and better represent the data. Derivatives play a crucial role here, as they measure how sensitive the loss function is to each parameter, guiding their updates through optimization algorithms like gradient descent.
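
To make this concrete, here is a minimal sketch of gradient descent on a single parameter; the quadratic loss, starting point, and learning rate below are illustrative choices, not parts of a real network:

# A minimal gradient descent sketch: minimize loss(w) = (w - 3)**2.
# The loss, starting point, and learning rate are illustrative only.
def loss(w):
    return (w - 3) ** 2

def dloss_dw(w):
    return 2 * (w - 3)  # analytic derivative of the loss

w = 0.0   # initial parameter
lr = 0.1  # learning rate

for _ in range(50):
    w -= lr * dloss_dw(w)  # step against the gradient to reduce the loss

print(w)  # approaches 3.0, the minimizer of the loss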

Definition

The derivative is defined as the limit of a difference quotient, which measures how the function’s output changes as its input ($x=a$) changes by a small amount $h$: \(f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}\)

  • $f'(a)$ : The derivative of $f$ at $x=a$
  • $h$ : A small change in $a$
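
For instance, applying this definition to $f(x) = x^2$ recovers the familiar power rule result:

\[f'(a) = \lim_{h \to 0} \frac{(a+h)^2 - a^2}{h} = \lim_{h \to 0} \frac{2ah + h^2}{h} = \lim_{h \to 0} (2a + h) = 2a\]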

Example

Example 1.1

Let’s play around with the mathematical expression: \(y = 7x^2 + 6x + 3\)

[Figure: graph of $y = 7x^2 + 6x + 3$]

Using the power rule, we get: \(y' = 14x + 6\)

However, let’s instead use the limit definition above to estimate the derivative numerically at $x=5$.

def f(x):
    return (7 * x**2) + (6 * x) + 3

x = 5
h = 0.0001

print((f(x+h) - f(x)) / h)  # should be y' = 14(5) + 6 = 76
76.00069999995185

A derivative of 76.0 tells us that, for a small change in $x$, the value of $y$ increases by approximately 76 times that change.

\[\Delta y \approx y' \cdot h = 76 \cdot 0.0001 = 0.0076\]
diff = f(x+h) - f(x)
derivative = (f(x+h) - f(x)) / h

print(f"Difference by h: {diff}, derivative * h: {derivative * h}")
Difference by h: 0.007600069999995185, derivative * h: 0.007600069999995185
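
As an aside, a symmetric (central) difference usually gives a more accurate estimate for the same $h$, since its truncation error shrinks faster as $h \to 0$:

print((f(x + h) - f(x - h)) / (2 * h))  # ≈ 76.0, closer to the true derivative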

Example 1.2

This time, let’s compute the partial derivatives of the expression \(y = a \cdot b + c\)

Partial derivatives:

  • With respect to $a$: \(\frac{\partial y}{\partial a} = b\)

  • With respect to $b$: \(\frac{\partial y}{\partial b} = a\)

  • With respect to $c$: \(\frac{\partial y}{\partial c} = 1.0\)

def f(a, b, c):
    return a * b + c

a = 2.0
b = -3.0
c = 10.0

h = 0.0001

base = f(a, b, c)

# With respect to a: -3.0
print(f"df/da: {(f(a+h, b, c) - base) / h}")

# With respect to b: 2.0
print(f"df/db: {(f(a, b+h, c) - base) / h}")

# With respect to c: 1.0
print(f"df/dc: {(f(a, b, c+h) - base) / h}")
df/da: -3.000000000010772
df/db: 2.0000000000042206
df/dc: 0.9999999999976694

Manual Backpropagation on Micrograd

Micrograd is an automatic differentiation engine designed to work with scalar values. To perform gradient calculations and backpropagation by hand, we will use a simplified and incomplete implementation of its Value class.

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0              # dL/d(this node), filled in during backprop
        self._prev = set(_children)  # nodes that produced this value
        self._op = _op               # operation that produced this value
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out

Using the Value class, we will create a tiny network:

a = Value(2.0);         a.label = 'a'
b = Value(-3.0);        b.label = 'b'
c = Value(10.0);        c.label = 'c'
e = a * b;              e.label = 'e'
d = e + c;              d.label = 'd'
f = Value(-2.0);        f.label = 'f'
L = d * f;              L.label = 'L'
[Figure: computation graph of the network, before gradients are computed]
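
If the figure above does not render, a small helper (an illustrative sketch, not part of the Value class) can print the same structure by walking the `_prev` links recursively:

def trace(node, indent=0):
    # Print a node, then recurse into the children that produced it.
    # _prev is a set, so sibling order may vary between runs.
    op = f" [{node._op}]" if node._op else ""
    print(" " * indent + f"{node.label} = {node.data}{op}")
    for child in node._prev:
        trace(child, indent + 2)

trace(L)  # L at the top, leaves a, b, c, f at the deepest levels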

In this neural network, our goal is to compute the derivative of $L$ (the loss) with respect to the variables or weights in the expression. Why is this important? It helps us understand how much each parameter contributes to changes in the output. This concept will become clearer as we explore the loss function and gradient descent. For now, focus on the idea that our objective is to compute the gradient at each node.

Now, let’s compute the derivatives starting from the last layer (i.e., the output layer) of the neural network.

Derivatives of $L$:

  • With respect to $L$: \(\frac{dL}{dL} = 1.0\)
  • With respect to $f$: \(\frac{dL}{df} = d\)
  • With respect to $d$: \(\frac{dL}{dd} = f\)

Remember to use the chain rule to express the derivative of a composition of two differentiable functions:

\[\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}\]
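
For instance, with $y = 3x$ and $z = y^2$:

\[\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} = 2y \cdot 3 = 6(3x) = 18x\]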

Therefore, since $L = d \cdot f = (c + e) \cdot f$:

  • With respect to $c$: \(\frac{dL}{dc} = \frac{dL}{dd} \cdot \frac{dd}{dc} = f \cdot 1.0 = f\)
  • With respect to $e$: \(\frac{dL}{de} = \frac{dL}{dd} \cdot \frac{dd}{de} = f \cdot 1.0 = f\)
  • With respect to $a$: \(\frac{dL}{da} = \frac{dL}{dd} \cdot \frac{dd}{de} \cdot \frac{de}{da} = f \cdot 1.0 \cdot b = bf\)
  • With respect to $b$: \(\frac{dL}{db} = \frac{dL}{dd} \cdot \frac{dd}{de} \cdot \frac{de}{db} = f \cdot 1.0 \cdot a = af\)
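
We can sanity-check one of these results numerically, just as in Example 1.2; the `forward` helper below is an illustrative sketch that recomputes $L$ with plain floats:

def forward(a, b, c, f):
    # Recompute L = (a * b + c) * f outside the Value class.
    return (a * b + c) * f

h = 0.0001
base = forward(2.0, -3.0, 10.0, -2.0)

# dL/da should be b * f = (-3.0) * (-2.0) = 6.0
print((forward(2.0 + h, -3.0, 10.0, -2.0) - base) / h)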

Based on this derivation, we can manually set the gradient of each node to:

a.grad = 6.0
b.grad = -4.0
c.grad = -2.0
e.grad = -2.0
d.grad = -2.0
f.grad = 4.0
L.grad = 1.0
[Figure: computation graph of the network, annotated with the gradient at each node]

And with that, we have just completed our first backpropagation!

Backpropagation, by definition, is an algorithm that computes the gradient of a loss function (in our example, $L$) with respect to the weights of the network. The process relies on the chain rule to break the computation into pieces (recall how we multiplied each local gradient by the global gradient propagated from the output layer).
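
In the full micrograd library, this bookkeeping is automated: each operation records a small `_backward` closure that applies exactly the chain-rule step we performed by hand, and a `backward()` call replays those closures from the output node. Here is a minimal sketch of that idea built on top of our incomplete class (the `AutoValue` subclass and its details are ours for illustration; micrograd bakes this into `Value` directly):

class AutoValue(Value):
    # Each op returns a node carrying a _backward closure that applies
    # its local chain-rule step when called.
    def __add__(self, other):
        out = AutoValue(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad   # d(out)/d(self) = 1.0
            other.grad += 1.0 * out.grad  # d(out)/d(other) = 1.0
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = AutoValue(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad  # local grad * upstream grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Order nodes topologically, seed dL/dL = 1.0, then replay the
        # closures from the output back toward the leaves.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            if hasattr(v, '_backward'):  # leaf nodes carry no closure
                v._backward()

Rebuilding our expression with it reproduces the gradients we set by hand:

a2 = AutoValue(2.0); b2 = AutoValue(-3.0); c2 = AutoValue(10.0); f2 = AutoValue(-2.0)
L2 = (a2 * b2 + c2) * f2
L2.backward()
print(a2.grad, b2.grad, c2.grad, f2.grad)  # 6.0 -4.0 -2.0 4.0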

Final Step: Optimization

Now, we will see how we can increase the output $L$ by nudging each variable in the direction of its gradient.

a.data += 0.01 * a.grad
b.data += 0.01 * b.grad
c.data += 0.01 * c.grad
f.data += 0.01 * f.grad

e = a * b
d = e + c
L = d * f

print(L.data)
-7.286496

Key Observations

  • Nodes with a positive gradient indicate that increasing the value of the corresponding node will increase \(L\) (positive correlation).
  • Nodes with a negative gradient indicate that increasing the value of the corresponding node will decrease \(L\) (negative correlation).

Remember that the gradient represents the rate of change of \(L\) with respect to a variable and points in the direction of steepest ascent.

Each variable is updated by: \(x_t = x_{t-1} + \eta \cdot \frac{\partial L}{\partial x}\)

  • $\eta$: Learning Rate
  • $\frac{\partial L}{\partial x}$: Gradient of \(L\) with respect to \(x\)
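
The plus sign above performs gradient ascent, since our goal was to increase $L$; gradient descent on a loss subtracts the gradient term instead. As a quick illustrative check, rebuilding the graph from the original values (reusing the gradients we computed by hand) and stepping against the gradient decreases $L$:

# Rebuild the graph from the original values, reuse the hand-computed
# gradients, and step *against* the gradient to decrease L.
a = Value(2.0); b = Value(-3.0); c = Value(10.0); f = Value(-2.0)
a.grad, b.grad, c.grad, f.grad = 6.0, -4.0, -2.0, 4.0

for p in (a, b, c, f):
    p.data -= 0.01 * p.grad

L = (a * b + c) * f
print(L.data)  # about -8.7263, down from the original -8.0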

Conclusion

We have explored the fundamental concepts of derivatives and their role in training neural networks. Along the way, we delved into the chain rule, manually computed gradients, and performed backpropagation using Micrograd. Finally, we touched on an optimization technique, providing a solid foundation for understanding gradient descent.
