Vasu Menon

Introduction to Deep Learning

What is deep learning?

Deep learning is a family of function approximation techniques built from compositions of parameterised, differentiable operations. A network with $L$ layers computes

\[f(x) = f^{(L)}(f^{(L-1)}(\cdots f^{(1)}(x;\theta^{(1)})\cdots;\theta^{(L-1)});\theta^{(L)})\]

where each $f^{(\ell)}$ is typically an affine map followed by a pointwise nonlinearity $\sigma$:

\[f^{(\ell)}(z) = \sigma(W^{(\ell)} z + b^{(\ell)})\]

The parameters $\theta = \{W^{(\ell)}, b^{(\ell)}\}_{\ell=1}^{L}$ are learned by minimising a loss $\mathcal{L}$ over a dataset $\{(x_i, y_i)\}_{i=1}^N$.
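
As a minimal sketch of the layer composition above, the following plain-Python code applies each layer in turn. The names `affine` and `forward`, the choice of tanh for $\sigma$, and the hand-picked weights are illustrative assumptions, not something these notes prescribe; a practical implementation would use an array library rather than nested lists.

```python
import math

def affine(W, b, z):
    """Return W z + b for a matrix W (list of rows) and vectors b, z."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + b_i
            for row, b_i in zip(W, b)]

def forward(x, params, sigma=math.tanh):
    """Apply f^(L) o ... o f^(1) to x, where f^(l)(z) = sigma(W z + b)."""
    z = x
    for W, b in params:  # params = [(W1, b1), ..., (WL, bL)]
        z = [sigma(u) for u in affine(W, b, z)]
    return z

# Hypothetical 2 -> 3 -> 1 network; the weights are arbitrary.
params = [
    ([[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]], [0.0, 0.1, -0.1]),
    ([[1.0, -2.0, 0.5]], [0.2]),
]
y = forward([0.3, -0.7], params)
```

Because every layer has the same form, the depth $L$ is simply the length of `params`.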


Why depth?

A network with a single hidden layer is already a universal approximator, but depth buys representational efficiency: some functions that need exponentially many neurons in a shallow network can be represented by a deep network with only polynomially many. Intuitively, each layer builds a higher-level abstraction from the output of the layer below it.


Backpropagation

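Training relies on computing $\nabla_\theta \mathcal{L}$ efficiently via the chain rule, applied backwards through the computation graph. Write $a^{(\ell-1)}$ for the input to layer $\ell$ (with $a^{(0)} = x$), $z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}$ for its pre-activation, and $a^{(\ell)} = \sigma(z^{(\ell)})$ for its output. For a scalar loss:

\[\frac{\partial \mathcal{L}}{\partial W^{(\ell)}} = \delta^{(\ell)} (a^{(\ell-1)})^\top, \qquad \frac{\partial \mathcal{L}}{\partial b^{(\ell)}} = \delta^{(\ell)}\]

where the error signal $\delta^{(\ell)} = \frac{\partial \mathcal{L}}{\partial z^{(\ell)}}$ is propagated from the output layer back, starting from $\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial a^{(L)}} \odot \sigma'(z^{(L)})$:

\[\delta^{(\ell)} = \left((W^{(\ell+1)})^\top \delta^{(\ell+1)}\right) \odot \sigma'(z^{(\ell)})\]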
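
The recursion can be checked numerically. The sketch below is a plain-Python illustration under stated assumptions: $\sigma = \tanh$, a squared-error loss $\mathcal{L} = \tfrac{1}{2}\lVert a^{(L)} - y \rVert^2$, and arbitrary weights, none of which are fixed by these notes. Here `zs` holds the pre-activations, `acts` holds each layer's input and output (with `acts[0]` the network input), and all function names are hypothetical. The backpropagated gradient is compared against a central finite difference on one weight.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def dtanh(u):
    return 1.0 - math.tanh(u) ** 2

def forward(x, params):
    """Return pre-activations zs and activations acts (acts[0] = x)."""
    acts, zs = [x], []
    for W, b in params:
        z = [u + b_i for u, b_i in zip(matvec(W, acts[-1]), b)]
        zs.append(z)
        acts.append([math.tanh(u) for u in z])
    return zs, acts

def loss(x, y, params):
    # Assumption: squared-error loss, 0.5 * ||output - y||^2
    _, acts = forward(x, params)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(acts[-1], y))

def backprop(x, y, params):
    """Return [(dL/dW, dL/db)] for each layer, first layer first."""
    zs, acts = forward(x, params)
    # Output layer: delta = (output - y) * sigma'(z)
    delta = [(a - t) * dtanh(u) for a, t, u in zip(acts[-1], y, zs[-1])]
    grads = [None] * len(params)
    for l in reversed(range(len(params))):
        # dL/dW = delta (layer input)^T and dL/db = delta
        dW = [[d * a for a in acts[l]] for d in delta]
        grads[l] = (dW, list(delta))
        if l > 0:
            W = params[l][0]
            # Previous delta = (W^T delta) * sigma'(previous pre-activation)
            back = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(W[0]))]
            delta = [g * dtanh(u) for g, u in zip(back, zs[l - 1])]
    return grads

# Hypothetical 2 -> 3 -> 1 network and a single training pair.
params = [
    ([[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.4, 0.3]], [0.05]),
]
x, y = [0.3, -0.7], [0.5]
grads = backprop(x, y, params)

# Central finite difference on one first-layer weight.
eps = 1e-6
W1 = params[0][0]
W1[0][1] += eps
up = loss(x, y, params)
W1[0][1] -= 2 * eps
down = loss(x, y, params)
W1[0][1] += eps  # restore the weight
numeric = (up - down) / (2 * eps)
analytic = grads[0][0][0][1]
```

If the recursion is implemented correctly, `numeric` and `analytic` should agree to within the finite-difference error. A check of this kind is a common way to catch sign or indexing mistakes in hand-written gradients.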

Common activation functions

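Name      Formula                            Notes
-------   --------------------------------   ------------------------------------------------------------
ReLU      $\max(0, z)$                       Sparse, fast; dead neuron problem
GELU      $z \cdot \Phi(z)$                  Smooth approximation to ReLU ($\Phi$ is the standard normal CDF); common in transformers
Sigmoid   $1/(1+e^{-z})$                     Saturates; mostly replaced by ReLU in hidden layers
Tanh      $(e^z - e^{-z})/(e^z + e^{-z})$    Zero-centred, rescaled sigmoid; also saturates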
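
The four formulas translate directly into scalar Python functions. This is a sketch for illustration only: the function names are assumptions, $\Phi$ is taken to be the standard normal CDF and written via `math.erf`, and the sigmoid branches on the sign of its argument so that `math.exp` never overflows.

```python
import math

def relu(z):
    return max(0.0, z)

def gelu(z):
    # z * Phi(z), with Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sigmoid(z):
    # Branch on the sign so the exponent passed to exp is never positive.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh(z):
    return math.tanh(z)

# Evaluate each activation at a few sample points.
samples = [-3.0, 0.0, 3.0]
table = [(z, relu(z), gelu(z), sigmoid(z), tanh(z)) for z in samples]
```

The "zero-centred sigmoid" description of tanh can be made precise: $\tanh(z) = 2\,\mathrm{sigmoid}(2z) - 1$, so the two differ only by a rescaling of input and output.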

Gradient descent variants

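Vanilla SGD updates parameters as $\theta \leftarrow \theta - \eta g$, where $g = \nabla_\theta \mathcal{L}$ (usually estimated on a minibatch) and $\eta$ is the learning rate. In practice, momentum and adaptive methods dominate:

  • Momentum accumulates a velocity $v \leftarrow \mu v - \eta g$ and then steps $\theta \leftarrow \theta + v$, dampening oscillation.
  • Adam maintains per-parameter estimates $m, v$ of the first and second moments of the gradient (here $v$ is the second-moment estimate, not the momentum velocity) and uses the bias-corrected update $\theta \leftarrow \theta - \eta \hat{m}/(\sqrt{\hat{v}} + \epsilon)$.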
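
The three update rules can be compared on a toy problem. The sketch below makes several assumptions that are not part of these notes: the loss is a two-dimensional quadratic $\tfrac{1}{2}\sum_i c_i \theta_i^2$ with hypothetical curvatures $c = (1, 10)$, momentum applies its velocity as $\theta \leftarrow \theta + v$, and Adam uses the standard exponential moving averages for $m$ and $v$ with the commonly used defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. All function names are illustrative.

```python
import math

CURV = (1.0, 10.0)  # hypothetical curvatures of the toy quadratic

def loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURV, theta))

def grad(theta):
    return [c * t for c, t in zip(CURV, theta)]

def sgd_step(theta, g, lr):
    return [t - lr * gi for t, gi in zip(theta, g)]

def momentum_step(theta, v, g, lr, mu=0.9):
    v = [mu * vi - lr * gi for vi, gi in zip(v, g)]   # velocity update
    theta = [t + vi for t, vi in zip(theta, v)]       # assumed parameter step
    return theta, v

def adam_step(theta, m, v, g, step, lr, b1=0.9, b2=0.999, eps=1e-8):
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - b1 ** step) for mi in m]       # step counts from 1
    v_hat = [vi / (1 - b2 ** step) for vi in v]
    theta = [t - lr * mh / (math.sqrt(vh) + eps)
             for t, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

start, lr, steps = [1.0, 1.0], 0.05, 200

th_sgd = list(start)
th_mom, vel = list(start), [0.0, 0.0]
th_adam, m, v = list(start), [0.0, 0.0], [0.0, 0.0]
for k in range(1, steps + 1):
    th_sgd = sgd_step(th_sgd, grad(th_sgd), lr)
    th_mom, vel = momentum_step(th_mom, vel, grad(th_mom), lr)
    th_adam, m, v = adam_step(th_adam, m, v, grad(th_adam), k, lr)

final_losses = {"sgd": loss(th_sgd), "momentum": loss(th_mom), "adam": loss(th_adam)}
```

On an exact quadratic the gradient carries no noise, so this only illustrates the mechanics of each update. Whether one method converges faster than another depends heavily on the learning rate and on the curvature of the problem.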