What is deep learning?
Deep learning is a family of function approximation techniques built from compositions of parameterised, differentiable operations. A network with $L$ layers computes
\[f(x) = f^{(L)}(f^{(L-1)}(\cdots f^{(1)}(x;\theta^{(1)})\cdots;\theta^{(L-1)});\theta^{(L)})\]
where each $f^{(\ell)}$ is typically an affine map followed by a pointwise nonlinearity $\sigma$:
\[f^{(\ell)}(z) = \sigma(W^{(\ell)} z + b^{(\ell)})\]
The parameters $\theta = \{W^{(\ell)}, b^{(\ell)}\}_{\ell=1}^{L}$ are learned by minimising a loss $\mathcal{L}$ over a dataset $\{(x_i, y_i)\}_{i=1}^N$.
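The layer-by-layer composition can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the helper names `init_params` and `forward`, the layer sizes, the choice of `tanh` for $\sigma$, and the $1/\sqrt{n_\text{in}}$ weight scaling are all assumptions made for the example.

```python
import numpy as np

def init_params(layer_sizes, seed=0):
    """Return a list of (W, b) pairs, one per layer (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return [
        # Scaling by 1/sqrt(n_in) is an assumed, commonly used initialisation.
        (rng.standard_normal((n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

def forward(x, params, sigma=np.tanh):
    """Apply f^(1), ..., f^(L) in order: each layer is sigma(W z + b)."""
    z = x
    for W, b in params:
        z = sigma(W @ z + b)
    return z

params = init_params([3, 5, 2])          # L = 2 layers: 3 -> 5 -> 2
y = forward(np.array([0.1, -0.2, 0.3]), params)
```

Here `y` is a length-2 vector; because the last layer also applies `tanh` in this sketch, every entry lies strictly between -1 and 1.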
Why depth?
Shallow networks (one hidden layer) are universal approximators, but depth buys representational efficiency: some functions that need exponentially many neurons in a single hidden layer can be represented with only polynomially many neurons in a deeper network. Intuitively, each layer can build a higher-level abstraction from the features computed by the layer before it.
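One standard illustration of this gap, chosen here as an example and not taken from the text above, is the tent map on $[0, 1]$. A single layer with two ReLU units computes it, and composing it $k$ times gives a sawtooth with $2^k$ linear pieces, whereas a one-hidden-layer ReLU network with $m$ units on a scalar input has at most $m + 1$ pieces. The function names and the grid resolution below are assumptions for the sketch.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    """Tent map on [0, 1], written as one layer with two ReLU units."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(depth, grid_bits=12):
    """Count linear pieces of tent composed `depth` times (depth <= grid_bits)."""
    x = np.linspace(0.0, 1.0, 2 ** grid_bits + 1)  # dyadic grid hits every kink
    y = x
    for _ in range(depth):
        y = tent(y)
    slopes = np.diff(y) / np.diff(x)
    # A new piece starts wherever the slope changes between adjacent intervals.
    changes = np.sum(~np.isclose(slopes[1:], slopes[:-1]))
    return int(changes) + 1

pieces = [count_linear_pieces(k) for k in range(1, 6)]
```

The deep version uses $2k$ ReLU units for $2^k$ pieces, while a single hidden layer would need at least $2^k - 1$ units to match that count.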
Backpropagation
Training relies on computing $\nabla_\theta \mathcal{L}$ efficiently via the chain rule, applied backwards through the computation graph. For a scalar loss:
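\[\frac{\partial \mathcal{L}}{\partial W^{(\ell)}} = \delta^{(\ell)} (a^{(\ell-1)})^\top\]
where $a^{(\ell-1)}$ is the input to layer $\ell$ (with $a^{(0)} = x$), $z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}$ is its pre-activation, $a^{(\ell)} = \sigma(z^{(\ell)})$ is its output, and the error signal $\delta^{(\ell)} = \frac{\partial \mathcal{L}}{\partial z^{(\ell)}}$ is propagated from the output layer back:
\[\delta^{(\ell)} = \left((W^{(\ell+1)})^\top \delta^{(\ell+1)}\right) \odot \sigma'(z^{(\ell)})\]
The recursion starts at the output layer with $\delta^{(L)} = \frac{\partial \mathcal{L}}{\partial a^{(L)}} \odot \sigma'(z^{(L)})$, and the bias gradient is $\frac{\partial \mathcal{L}}{\partial b^{(\ell)}} = \delta^{(\ell)}$.
Common activation functions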
| Name | Formula | Notes |
|---|---|---|
| ReLU | $\max(0, z)$ | Sparse, fast; dead neuron problem |
| GELU | $z \cdot \Phi(z)$, with $\Phi$ the standard normal CDF | Smooth alternative to ReLU; common in transformers |
| Sigmoid | $1/(1+e^{-z})$ | Saturates; mostly replaced by ReLU in hidden layers |
| Tanh | $(e^z - e^{-z})/(e^z + e^{-z})$ | Zero-centred, rescaled sigmoid; also saturates |
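The activations in the table enter the backward recursion from the Backpropagation section only through $\sigma$ and $\sigma'$. The following sketch pairs each table row with its derivative, runs the recursion, and compares the weight gradients against central finite differences. The squared-error loss, the helper names, the layer sizes, and the random initialisation are assumptions made for the example; the source does not fix any of them.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # elementwise erf, avoids a SciPy dependency

def _normal_cdf(z):
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)  # convention: derivative 0 at z = 0

def gelu(z):
    return z * _normal_cdf(z)

def gelu_grad(z):
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return _normal_cdf(z) + z * pdf

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

# name -> (sigma, sigma')
ACTIVATIONS = {"relu": (relu, relu_grad), "gelu": (gelu, gelu_grad),
               "sigmoid": (sigmoid, sigmoid_grad), "tanh": (np.tanh, tanh_grad)}

def forward(params, x, act):
    sigma, _ = ACTIVATIONS[act]
    a, cache = x, []
    for W, b in params:
        z = W @ a + b
        cache.append((a, z))  # layer input a^(l-1) and pre-activation z^(l)
        a = sigma(z)
    return a, cache

def loss(params, x, y, act):
    out, _ = forward(params, x, act)
    return 0.5 * np.sum((out - y) ** 2)  # assumed squared-error loss

def backward(params, x, y, act):
    """Return [(dL/dW, dL/db), ...] via the delta recursion."""
    _, dsigma = ACTIVATIONS[act]
    out, cache = forward(params, x, act)
    grads = [None] * len(params)
    delta = (out - y) * dsigma(cache[-1][1])  # delta at the output layer
    for l in reversed(range(len(params))):
        a_prev, _ = cache[l]
        grads[l] = (np.outer(delta, a_prev), delta.copy())
        if l > 0:
            W, _ = params[l]
            delta = (W.T @ delta) * dsigma(cache[l - 1][1])
    return grads

def numerical_grad_W(params, x, y, act, layer, eps=1e-6):
    """Central finite differences on one weight matrix (restored afterwards)."""
    W, _ = params[layer]
    num = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        orig = W[idx]
        W[idx] = orig + eps
        plus = loss(params, x, y, act)
        W[idx] = orig - eps
        minus = loss(params, x, y, act)
        W[idx] = orig
        num[idx] = (plus - minus) / (2.0 * eps)
    return num

def max_grad_error(act, sizes=(3, 4, 2), seed=0):
    rng = np.random.default_rng(seed)
    params = [(0.5 * rng.standard_normal((n_out, n_in)), 0.1 * rng.standard_normal(n_out))
              for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    x, y = rng.standard_normal(sizes[0]), rng.standard_normal(sizes[-1])
    grads = backward(params, x, y, act)
    return max(np.max(np.abs(grads[l][0] - numerical_grad_W(params, x, y, act, l)))
               for l in range(len(params)))
```

Calling `max_grad_error("tanh")` should return a value close to zero for the smooth activations. For ReLU the comparison can disagree when a pre-activation sits within `eps` of the kink at zero, which is a limitation of the finite-difference check, not of the recursion.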
Gradient descent variants
Vanilla SGD updates parameters as $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$, where $\eta$ is the learning rate. In practice, momentum and adaptive methods dominate:
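- Momentum accumulates a velocity $v \leftarrow \mu v - \eta g$ from the gradient $g = \nabla_\theta \mathcal{L}$ and then steps $\theta \leftarrow \theta + v$, dampening oscillation.
- Adam maintains per-parameter exponential moving averages of the gradient and of its elementwise square (the first and second moment estimates $m, v$) and uses the bias-corrected update $\theta \leftarrow \theta - \eta \hat{m}/(\sqrt{\hat{v}} + \epsilon)$, where $\hat{m} = m/(1-\beta_1^t)$ and $\hat{v} = v/(1-\beta_2^t)$ at step $t$.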
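The three update rules above can be written as small pure functions. This is a sketch under stated assumptions: the function names are hypothetical, and the decay rates $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$ are commonly used defaults that the text itself does not specify.

```python
import numpy as np

def sgd_step(theta, g, lr):
    """theta <- theta - lr * g"""
    return theta - lr * g

def momentum_step(theta, v, g, lr, mu=0.9):
    """v <- mu * v - lr * g, then theta <- theta + v"""
    v = mu * v - lr * g
    return theta + v, v

def adam_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * g        # first moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # second moment estimate
    m_hat = m / (1.0 - beta1 ** t)           # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem (an assumption for the demo): L(theta) = 0.5 * ||theta||^2, so g = theta.
theta = np.array([1.0, -2.0])
for _ in range(200):
    theta = sgd_step(theta, theta, lr=0.1)
```

On the first Adam step the bias-corrected moments reduce to $\hat{m} = g$ and $\hat{v} = g^2$, so each parameter moves by roughly $\eta$ in the direction opposite to its gradient, regardless of the gradient's magnitude.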