Introduction

Neural Network Basics

Feature scaling is an essential first step for neural network training:
- Ensures consistent scale across features
- Speeds up training
- Improves model stability
Common techniques:
- Standardization: $x_{new} = \frac{x - \mu}{\sigma}$
  - Zero mean, unit variance
  - Ideal for normal-like distributions
- Min-Max Scaling: $x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$
  - Scales to [0,1] range
  - Preserves zero values only when $x_{min} = 0$
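
The two scaling formulas above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names `standardize` and `min_max_scale` are hypothetical, and the population standard deviation is assumed for $\sigma$.

```python
import statistics

def standardize(values):
    """Rescale to zero mean and unit variance: (x - mu) / sigma."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population std (assumption)
    if sigma == 0:
        raise ValueError("constant feature: standard deviation is zero")
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant feature: max equals min")
    return [(v - lo) / (hi - lo) for v in values]

feature = [2.0, 4.0, 6.0, 8.0]
print(standardize(feature))
print(min_max_scale(feature))
```

In practice the statistics ($\mu$, $\sigma$, $x_{min}$, $x_{max}$) would be computed on the training set only and reused for validation and test data.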

Data flows sequentially through the network:
- Inputs are weighted and summed for each neuron
- An activation function introduces non-linearity
- Each layer's output becomes the input for the next layer

Layer computation: $Z = X \cdot W^T + b$, then $A = \text{activation}(Z)$
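
A single layer's forward pass can be sketched as follows. This is a plain-Python illustration, assuming `X` is a batch of samples (one row each) and `W` holds one row of weights per output neuron, which matches the $W^T$ in the formula; the helper name `dense_forward` and the sample values are hypothetical.

```python
def dense_forward(X, W, b, activation):
    """Compute Z = X . W^T + b and A = activation(Z) for a batch X.

    X: list of samples, each a list of n_in features
    W: list of n_out weight rows, each of length n_in
    b: list of n_out biases
    """
    Z = [[sum(x_j * w_j for x_j, w_j in zip(x, w_row)) + b_k
          for w_row, b_k in zip(W, b)]
         for x in X]
    A = [[activation(z) for z in row] for row in Z]
    return Z, A

def relu(z):
    return max(0.0, z)

X = [[1.0, 2.0], [0.5, -1.0]]                # 2 samples, 2 features
W = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]]   # 3 neurons, 2 inputs each
b = [0.0, 0.1, -0.1]
Z, A = dense_forward(X, W, b, relu)
print(Z)
print(A)
```

Stacking layers means feeding `A` from one call in as `X` for the next.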

Purpose of activation functions:
- Add non-linearity to the network
- Enable learning of complex patterns
- Transform neuron outputs
Common activation functions:
- ReLU: $f(x) = \max(0, x)$
  - Most widely used
  - Efficient computation
  - Mitigates vanishing gradients (gradient is 1 for positive inputs)
  - Potential "dead neurons" issue
- LeakyReLU: $f(x) = \max(0.01x, x)$
  - Prevents "dead neurons"
  - Small gradient for negative values

- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
  - Output range: (0,1)
  - Used in binary classification output layers
  - Historical importance in neural networks
- Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
  - Output range: (-1,1)
  - Zero-centered outputs
  - Often used in recurrent networks
- Softmax: $f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$
  - Used in multiclass classification output layers
  - Converts outputs into probabilities (sums to 1)
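
The activation functions listed above can be written directly from their formulas. The sketch below uses only the `math` module. The branch in `sigmoid` and the max-subtraction in `softmax` are additions for numerical safety, not part of the formulas themselves.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Same as max(slope * x, x) for 0 < slope < 1
    return x if x > 0 else slope * x

def sigmoid(x):
    # Two branches so exp() is only ever called on a non-positive argument
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    return math.tanh(x)

def softmax(xs):
    # Subtracting the max leaves the result unchanged but avoids overflow
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), leaky_relu(-2.0), sigmoid(0.0), tanh(0.0))
print(softmax([1.0, 2.0, 3.0]))
```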

Role of the loss function:
- Quantifies the error between predicted ( $\hat{y}$ ) and true values ( $y$ )
- Guides weight and bias adjustments through optimization
- Essential for training the neural network
Where it fits:
- Occurs after the forward pass
- Calculates the error, which is used in backpropagation
- Optimizers (e.g., gradient descent) minimize the loss
Common Loss Functions:
- Binary Cross-Entropy Loss: $L = -(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}))$
  - Used for binary classification problems
  - Heavily penalizes confident but incorrect predictions
- Mean Squared Error (MSE): $L = \frac{1}{n}\sum(y - \hat{y})^2$
  - Used for regression problems
  - Sensitive to outliers
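
Both losses above can be sketched as short functions. Averaging the cross-entropy over samples and clipping predictions by a small `eps` (to avoid $\log(0)$) are assumptions added here; the formula in the notes is the per-sample form.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over a batch of (label, probability) pairs."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Mean of squared differences."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# A confident wrong prediction costs far more than a confident right one
print(binary_cross_entropy([1], [0.9]))
print(binary_cross_entropy([1], [0.1]))
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```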

Softmax: Converts raw scores into probabilities: $\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$
Multi-class Cross-Entropy Loss: Quantifies the distance between predicted and true distributions: $L = -\sum_{i} y_i \log(\hat{y}_i)$
Efficient implementation:
- Combine Softmax and Cross-Entropy for numerical stability.
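
One common way to combine the two steps is the log-sum-exp identity: for a one-hot target with true class $t$, $L = \log\sum_j e^{z_j} - z_t$. The sketch below assumes a one-hot target given as a class index; the function name is hypothetical.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against a one-hot target.

    Computes logsumexp(z) - z[target] without forming the probabilities,
    which avoids both exp() overflow and log(0).
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

print(softmax_cross_entropy([2.0, 1.0, 0.1], 0))
# Logits this large would overflow a naive exp()-based softmax
print(softmax_cross_entropy([1000.0, 0.0], 1))
```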

| Architecture | Best-Suited Data | Use Cases |
| --- | --- | --- |
| Feedforward Networks (FFN) | Structured, tabular data | Regression, simple classification tasks |
| Convolutional Neural Networks (CNN) | Grid-like data (e.g., images) | Image recognition, object detection, spatial pattern recognition |
| Recurrent Neural Networks (RNN) | Sequential data (e.g., time series, text) | Natural Language Processing (NLP), time series forecasting |
| Transformers | Long sequences, complex relationships | Language modeling, translation, long-range dependencies |

Purpose and Benefits of Batch Normalization:
- Stabilizes and accelerates training
- Enables higher learning rates
- Provides regularization effect
Core Concept:
- Normalizes layer inputs to zero mean and unit variance
- Learns optimal scale ( $\gamma$ ) and shift ( $\beta$ )

Training Process:
1. Compute the batch mean $\mu$ and variance $\sigma^2$, then normalize: $\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$
2. Apply learnable parameters: $y = \gamma \hat{x} + \beta$
Inference Phase:
- Use running averages of mean and variance
- Ensures consistent normalization
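
The training and inference behavior can be sketched for a single scalar feature. The class name `BatchNorm1D`, the exponential-moving-average update with `momentum=0.1`, and the initial running statistics (mean 0, variance 1) are assumptions for illustration; frameworks differ in these conventions.

```python
import math

class BatchNorm1D:
    """Batch normalization for one scalar feature per sample (sketch)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma = 1.0          # learnable scale
        self.beta = 0.0           # learnable shift
        self.momentum = momentum  # weight of the newest batch statistic
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def forward(self, batch, training):
        if training:
            mu = sum(batch) / len(batch)
            var = sum((x - mu) ** 2 for x in batch) / len(batch)
            # Track running averages for use at inference time
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mu, var = self.running_mean, self.running_var
        scale = math.sqrt(var + self.eps)
        return [self.gamma * (x - mu) / scale + self.beta for x in batch]

bn = BatchNorm1D()
print(bn.forward([1.0, 2.0, 3.0, 4.0], training=True))
print(bn.forward([2.5], training=False))
```

At inference a single sample can be normalized, because no batch statistics are needed.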

Dynamic Learning Rates:
- Start with larger rates, decrease over time
- Adapts to training progress
Decay Strategies: $\text{lr}_{\text{new}} = \text{lr}_{\text{initial}} \times \text{decay\_rate}^{\text{epoch} / \text{decay\_step}}$
Adaptive Methods:
- Adam: Adapts per-parameter learning rates
- SGD with momentum (non-adaptive baseline): Can give better final convergence when paired with a decay schedule
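
The decay formula above translates directly into code. The `staircase` option, which drops the rate in discrete steps by using integer division, is an assumption added for illustration and is not part of the formula in these notes.

```python
def exponential_decay(lr_initial, decay_rate, epoch, decay_step, staircase=False):
    """lr_new = lr_initial * decay_rate ** (epoch / decay_step)."""
    exponent = epoch // decay_step if staircase else epoch / decay_step
    return lr_initial * decay_rate ** exponent

for epoch in (0, 5, 10, 15, 20):
    smooth = exponential_decay(0.1, 0.5, epoch, decay_step=10)
    stepped = exponential_decay(0.1, 0.5, epoch, decay_step=10, staircase=True)
    print(epoch, round(smooth, 5), round(stepped, 5))
```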

Classification:
- Accuracy: $\frac{TP + TN}{Total}$
- Precision: $\frac{TP}{TP + FP}$
- Recall: $\frac{TP}{TP + FN}$
- F1 Score: $2 \times \frac{precision \times recall}{precision + recall}$
Regression:
- MSE: $\frac{1}{n}\sum(y - \hat{y})^2$
- MAE: $\frac{1}{n}\sum|y - \hat{y}|$
- R² Score: $1 - \frac{\sum(y - \hat{y})^2}{\sum(y - \bar{y})^2}$, the proportion of variance explained by the model
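
The metrics above can be computed from confusion-matrix counts and prediction lists. In this sketch, returning 0.0 when a denominator is zero is a convention chosen here, not something the notes specify; the example counts are made up.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def regression_metrics(y_true, y_pred):
    """MSE, MAE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    return {"mse": ss_res / n, "mae": mae, "r2": 1 - ss_res / ss_tot}

print(classification_metrics(tp=8, fp=2, tn=85, fn=5))
print(regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 8.0]))
```

With 8 true positives out of 13 actual positives, accuracy is high while recall is much lower, which is why accuracy alone can mislead on imbalanced data.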

Backpropagation: a method to compute gradients for all network parameters efficiently
- Forward pass: compute predictions
- Loss calculation: measure prediction error
- Backward pass: compute gradients
Practical Example:
- Input: $x = 2$
- Target: $y = 1$
- Weights: Initial values $w_1 = 3$, $w_2 = -2$ (to illustrate calculations)
- Hidden: $h = \sigma(w_1 \cdot x) = \sigma(3 \cdot 2) = \sigma(6) \approx 0.9975$
- Output: $\hat{y} = w_2 \cdot h = -2 \cdot 0.9975 \approx -1.995$
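- Loss (squared error with a $\frac{1}{2}$ factor, so that $\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$): $L = \frac{1}{2}(y - \hat{y})^2 = \frac{1}{2}(1 - (-1.995))^2 = \frac{1}{2}(2.995)^2 \approx 4.49$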

Gradients measure how the loss changes with respect to each parameter
Chain rule enables layer-by-layer computation: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W}$
Example Backward Pass:
- $\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial w_2} = (\hat{y} - y) \times h = (-2.995) \times 0.9975 \approx -2.988$
- $\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial h} \times \frac{\partial h}{\partial w_1} = (\hat{y} - y) \times w_2 \times \sigma'(w_1 x) \times x = (-2.995) \times (-2) \times 0.00247 \times 2 \approx 0.030$, using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
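
The worked example can be recomputed step by step to check each chain-rule factor. This sketch follows the definitions in the example ($x = 2$, $w_1 = 3$, $w_2 = -2$, a sigmoid hidden unit, a linear output), with the target $y = 1$ taken from the loss expression; variable names are chosen here for readability.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 2.0, 1.0
w1, w2 = 3.0, -2.0

# Forward pass
z = w1 * x
h = sigmoid(z)
y_hat = w2 * h
loss = 0.5 * (y - y_hat) ** 2

# Backward pass, one chain-rule factor at a time
dL_dyhat = y_hat - y       # derivative of 0.5 * (y - y_hat)^2
dL_dw2 = dL_dyhat * h      # y_hat = w2 * h
dL_dh = dL_dyhat * w2
dh_dz = h * (1.0 - h)      # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
dL_dw1 = dL_dh * dh_dz * x # z = w1 * x

print(f"h={h:.4f} y_hat={y_hat:.4f} loss={loss:.4f}")
print(f"dL/dw2={dL_dw2:.4f} dL/dw1={dL_dw1:.4f}")
```

Because $\sigma(6)$ is close to saturation, $\sigma'(6)$ is tiny, so the gradient reaching $w_1$ is about two orders of magnitude smaller than the one reaching $w_2$.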

Goal: Minimize loss function through parameter updates
Basic Gradient Descent: $W = W - \eta \frac{\partial L}{\partial W}$, where $\eta$ is the learning rate
Key Variants:
- Batch GD: Full dataset updates (stable, slow)
- Mini-Batch GD: Subset updates (balanced)
- SGD: Single-sample updates (fast, noisy)
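
The three variants differ only in how many samples feed each update, so one loop with a `batch_size` parameter covers them all. The sketch below fits a one-parameter model $\hat{y} = w x$ with an MSE loss; the model, data, and hyperparameters are hypothetical choices for illustration.

```python
import random

def train_linear(xs, ys, lr=0.05, epochs=200, batch_size=None, seed=0):
    """Fit y = w * x by gradient descent on mean squared error.

    batch_size=None -> batch GD, 1 -> SGD, anything else -> mini-batch GD.
    """
    rng = random.Random(seed)
    n = len(xs)
    batch_size = batch_size or n
    indices = list(range(n))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch-mean squared error with respect to w
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # W = W - eta * dL/dW
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by w = 2
print(train_linear(xs, ys))                # batch GD
print(train_linear(xs, ys, batch_size=2))  # mini-batch GD
print(train_linear(xs, ys, batch_size=1))  # SGD
```

On this noise-free toy data all three reach the same weight; on real data the smaller batch sizes give noisier but cheaper updates.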

Momentum:
- Incorporates previous updates: $v = \beta v - \eta \nabla L; \, W = W + v$
- Helps roll through shallow local minima and plateaus, and damps oscillation in narrow valleys
Adaptive Methods:
- Dynamic learning rates (RMSProp, Adam)
- Better handling of sparse gradients
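
The momentum update above can be sketched as a single step function. The toy objective $L(w) = w^2$ and the hyperparameters are assumptions chosen only to show the velocity term at work.

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One update: v = beta * v - lr * grad, then w = w + v."""
    v = beta * v - lr * grad
    return w + v, v

# Minimize L(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for step in range(300):
    w, v = momentum_step(w, v, grad=2 * w)
print(w)
```

The velocity `v` carries over a decaying sum of past gradients, so the parameter keeps moving even where the current gradient is small.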

Xavier Initialization: $W \sim \mathcal{N}(0, \frac{1}{\text{input\_size}})$, where the second argument is the variance
- Balances the variance of inputs and outputs across layers.
- Effective for sigmoid or tanh activation.
He Initialization: $W \sim \mathcal{N}(0, \frac{2}{\text{input\_size}})$
- Compensates for ReLU zeroing roughly half of its inputs, which helps keep gradients from vanishing.
- More robust for deep networks.
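
Both schemes draw from a zero-mean normal distribution and differ only in the variance. A minimal sketch, treating the second argument of $\mathcal{N}$ as a variance (so the standard deviation is its square root); the function name and `scheme` keyword are hypothetical.

```python
import math
import random

def init_weights(n_in, n_out, scheme="he", seed=None):
    """Return an n_out x n_in weight matrix drawn from N(0, variance)."""
    variance = {"xavier": 1.0 / n_in, "he": 2.0 / n_in}[scheme]
    std = math.sqrt(variance)  # random.gauss takes a standard deviation
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

W = init_weights(n_in=200, n_out=100, scheme="he", seed=0)
flat = [w for row in W for w in row]
print(sum(w * w for w in flat) / len(flat))  # empirical variance, near 2/200
```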

Numerical Gradient Computation: $\frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon}$
Verification Process:
- Compare numerical vs. analytical gradients
- Acceptable relative error: < 1e-7 (in double precision)
- Only use during development
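
A gradient check compares the central-difference estimate with a hand-derived gradient. The one-parameter loss below is a hypothetical example chosen only because its derivative is easy to write down; the relative-error formula is one common choice, assumed here.

```python
import math

def numerical_gradient(loss, w, eps=1e-5):
    """Central-difference estimate of dL/dw at w."""
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

def relative_error(a, b):
    """|a - b| scaled by the larger magnitude, guarded against zero."""
    return abs(a - b) / max(abs(a), abs(b), 1e-12)

# Hypothetical loss for illustration: L(w) = 0.5 * (tanh(w) - 0.5)^2
def loss(w):
    return 0.5 * (math.tanh(w) - 0.5) ** 2

def analytical_gradient(w):
    t = math.tanh(w)
    return (t - 0.5) * (1.0 - t * t)  # chain rule through tanh

w = 0.3
err = relative_error(numerical_gradient(loss, w), analytical_gradient(w))
print(f"relative error: {err:.2e}")
```

A relative error far above the threshold usually points to a bug in the analytical gradient, such as a missing chain-rule factor.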

Core Metrics:
- Accuracy and confusion matrix analysis
- Precision: $\frac{TP}{TP + FP}$
- Recall: $\frac{TP}{TP + FN}$
- F1-score for balanced assessment
Advanced Analysis:
- ROC and PR curves
- Per-class performance breakdown
- Error pattern analysis
Model Interpretability:
- Feature importance rankings
- LIME/SHAP explanations
- Example-based interpretation