Paris-Saclay University

Introduction to Neural Networks

Introduction

Applications of Neural Networks

Healthcare: Predicting patient outcomes, disease prediction
Finance: Algorithmic trading, fraud detection
E-commerce: Personalized recommendations
Transportation: Self-driving cars
Social Media: Content curation, recommendations
Technology: Voice assistants

Why Neural Networks?

  • Traditional ML limitations:
    • Manual feature engineering required
    • Difficulty with complex patterns
    • Poor scalability with high-dimensional data
  • Neural Network advantages:
    • Automatic feature learning
    • Superior performance on complex tasks
    • Ability to handle multiple data types
    • Transfer learning capabilities

What are Neural Networks?

  • A computational model inspired by biological neural networks.
    [Figure: Example of a neuron]
  • Composed of layers of interconnected nodes (neurons).
  • Can model complex patterns and relationships in data.

Common Misconceptions

  • "Neural networks work like the human brain"
    • Reality: Only loosely inspired by biological neurons
    • Major differences in structure and learning process
  • "More layers always mean better performance"
    • Reality: Deeper isn't always better
    • Architecture should match problem complexity
  • "Neural networks are a black box"
    • Reality: Various interpretation techniques exist
    • Gradients and activations can be analyzed
  • "Neural networks need massive datasets"
    • Reality: Transfer learning enables use with smaller datasets
    • Data efficiency techniques exist

Brief History

Our Objective

  • Understand the foundational concepts of neural networks.
  • Learn how they process and classify data.

Neural Network Basics

Overview

  • Structure of neural networks
  • Data preprocessing
  • Forward pass
  • Activation functions
  • Loss functions
  • Neural network architectures

Structure of a Neural Network

  • Composed of layers:
    • Input layer: Receives input features
    • Hidden layers: Extract patterns and features
    • Output layer: Produces predictions
  • Each layer contains interconnected nodes (neurons)
  • Weights and biases determine the importance of connections
[Figure: Artificial neuron]
[Figure: Multilayer neural network]

Data Preprocessing

  • Essential first step for neural network training:
    • Ensures consistent scale across features
    • Speeds up training
    • Improves model stability
  • Common techniques:
    • Standardization: $x_{new} = \frac{x - \mu}{\sigma}$
      • Zero mean, unit variance
      • Ideal for normal-like distributions
    • Min-Max Scaling: $x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$
      • Scales to [0,1] range
      • Preserves zero values only when $x_{min} = 0$; sensitive to outliers
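
The two scaling formulas above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names and the sample `feature` column are assumptions made for the example, and the population standard deviation is used for $\sigma$.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit variance."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

feature = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature column
standardized = standardize(feature)
scaled = min_max_scale(feature)
```

In practice, $\mu$, $\sigma$, $x_{min}$ and $x_{max}$ should be computed on the training set only and then reused on validation and test data.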

Forward Pass

  • Data flows sequentially through the network:
    • Inputs are weighted and summed for each neuron
    • An activation function introduces non-linearity
  • Each layer's output becomes the input for the next layer
\[ Z = X \cdot W^T + b, \qquad A = \text{activation}(Z) \]
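
As a rough sketch of one layer's forward pass for a single input vector, assuming `W` stores one row of weights per output neuron (so that the weighted sum matches $X \cdot W^T + b$). The helper names and numeric values are made up for illustration.

```python
def relu(z):
    """Apply ReLU element-wise to a list of pre-activations."""
    return [max(0.0, v) for v in z]

def dense_forward(x, W, b, activation):
    """Return (z, a) with z = x . W^T + b and a = activation(z)."""
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    return z, activation(z)

x = [1.0, 2.0]                    # hypothetical input vector
W = [[0.5, 0.25], [-1.0, 0.5]]    # 2 neurons, 2 input weights each
b = [0.0, -0.5]
z, a = dense_forward(x, W, b, relu)  # z = [1.0, -0.5], a = [1.0, 0.0]
```

Stacking layers amounts to feeding `a` from one call as `x` to the next.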

Activation Functions

  • Purpose:
    • Add non-linearity to the network
    • Enable learning of complex patterns
    • Transform neuron outputs
  • Common activation functions:
    • ReLU: $f(x) = \max(0, x)$
      • Most widely used
      • Efficient computation
      • Mitigates vanishing gradients (gradient is 1 for positive inputs)
      • Potential "dead neurons" issue
    • LeakyReLU: $f(x) = \max(0.01x, x)$
      • Prevents "dead neurons"
      • Small gradient for negative values

Activation Functions (continued) ⏰

  • Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
    • Output range: (0,1)
    • Used in binary classification output layers
    • Historical importance in neural networks
  • Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
    • Output range: (-1,1)
    • Zero-centered outputs
    • Often used in recurrent networks
  • Softmax: $f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$
    • Used in multiclass classification output layers
    • Converts outputs into probabilities (sums to 1)
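
The activation functions listed in this part of the deck can be written directly from their formulas. The sketch below is illustrative only: the function names are assumptions, and the softmax subtracts the maximum input before exponentiating, a standard trick that leaves the result unchanged but avoids overflow.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def softmax(xs):
    """Map a list of scores to probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```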

Loss Functions

  • Role:
    • Quantifies the error between predicted (\( \hat{y} \)) and true values (\( y \))
    • Guides weight and bias adjustments through optimization
    • Essential for training the neural network
  • Where it fits:
    • Occurs after the forward pass
    • Calculates the error, which is used in backpropagation
    • Optimizers (e.g., gradient descent) minimize the loss
  • Common Loss Functions:
    • Binary Cross-Entropy Loss: \( L = -(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})) \)
      • Used for binary classification problems
      • Penalizes incorrect confident predictions
    • Mean Squared Error (MSE): \( L = \frac{1}{n}\sum(y - \hat{y})^2 \)
      • Used for regression problems
      • Sensitive to outliers
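
A minimal sketch of the two losses above. The clipping constant `eps` is an assumption added to keep `log` away from zero; it is not part of the formulas on the slide.

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy for one binary label y in {0, 1} and prediction y_hat."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def mean_squared_error(ys, y_hats):
    """Average squared difference between targets and predictions."""
    return sum((y - yh) ** 2 for y, yh in zip(ys, y_hats)) / len(ys)

confident_wrong = binary_cross_entropy(1, 0.1)
confident_right = binary_cross_entropy(1, 0.9)
mse = mean_squared_error([1.0, 2.0], [1.0, 4.0])  # (0 + 4) / 2 = 2.0
```

Comparing `confident_wrong` with `confident_right` shows how cross-entropy penalizes confident mistakes much more heavily.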

Softmax and Cross-Entropy Loss

  • Softmax: Converts raw scores into probabilities: \[ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}} \]
  • Multi-class Cross-Entropy Loss: Quantifies the distance between predicted and true distributions: \[ L = -\sum_{i} y_i \log(\hat{y}_i) \]
  • Efficient implementation:
    • Combine Softmax and Cross-Entropy for numerical stability.
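
One common way to combine the two steps is the log-sum-exp trick: for a one-hot target with true class $k$, the loss reduces to $L = \log\sum_j e^{z_j} - z_k$, computed after subtracting the largest score. The sketch below assumes a one-hot target given as a class index; the function name is hypothetical.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against a one-hot target.

    Subtracting max(logits) leaves the result unchanged but keeps
    exp() from overflowing on large scores.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

# Scores this large would overflow a naive exp(z) implementation.
loss = softmax_cross_entropy([1000.0, 1001.0, 1002.0], target_index=2)
```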

Neural Network Architectures

| Architecture | Suited Data | Use Cases |
| --- | --- | --- |
| Feedforward Networks (FFN) | Structured, tabular data | Regression, simple classification tasks |
| Convolutional Neural Networks (CNN) | Grid-like data (e.g., images) | Image recognition, object detection, spatial pattern recognition |
| Recurrent Neural Networks (RNN) | Sequential data (e.g., time series, text) | Natural Language Processing (NLP), time series forecasting |
| Transformers | Long sequences, complex relationships | Language modeling, translation, long-range dependencies |

Transformer Architecture

  • Core Components:
    • Self-attention mechanism
    • Multi-head attention
    • Position encodings
    • Feed-forward networks
  • Key Benefits:
    • Captures long-range dependencies
    • Enables parallel computation
    • Mitigates the vanishing gradient issues of recurrent networks (no step-by-step recurrence)
  • Applications:
    • Language models (GPT, BERT)
    • Machine translation
    • Vision tasks (ViT)
[Figure: BERT embeddings]

Practice: Forward Pass Calculation

  • Given:
    • Inputs: $X = [1, 0.5]$
    • Weights: $W = [[0.2, 0.8], [0.4, 0.3]]$
    • Biases: $b = [0.1, 0.1]$
  • Calculate:
    • 1. Matrix multiplication: $W \cdot X$
    • 2. Add biases: $+ b$
    • 3. Apply ReLU: $\text{ReLU}(z)$

Key Takeaways

  • Neural networks learn by adjusting weights and biases through layers
  • Proper preprocessing improves training speed and model stability
  • Activation functions enable learning of non-linear patterns
  • Choosing the right architecture depends on data and task requirements
  • Practice calculating forward passes to understand data transformations
  • Transformers are increasingly popular for sequential and complex tasks

Enhancing Neural Networks

Overview

  • Foundation Concepts
  • Training Optimization
  • Hyperparameter Tuning
  • Evaluation and Validation

Understanding Model Fit

  • Three Fundamental Scenarios:
    • Underfitting: Model too simple to capture patterns
      • High training error, high validation error
      • Solution: Increase model complexity
    • Good fit: Model captures true patterns
      • Low training error, low validation error
      • Similar performance on training and test data
    • Overfitting: Model learns noise
      • Low training error, high validation error
      • Solution: Apply regularization techniques

Batch Normalization: Foundation

  • Purpose and Benefits:
    • Stabilizes and accelerates training
    • Enables higher learning rates
    • Provides regularization effect
  • Core Concept:
    • Normalizes layer inputs to zero mean and unit variance
    • Learns optimal scale ($\gamma$) and shift ($\beta$)

Batch Normalization: Implementation

  • Training Process:
    1. Compute the batch mean $\mu$ and variance $\sigma^2$, then normalize: \[ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \]
    2. Apply learnable parameters: \[ y = \gamma \hat{x} + \beta \]
  • Inference Phase:
    • Use running averages of mean and variance
    • Ensures consistent normalization
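
The training-time steps and the running averages can be sketched for a single feature as follows. The function names and the `momentum` value of the moving average are assumptions for illustration; frameworks differ in their exact conventions.

```python
import statistics

def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mu = statistics.fmean(batch)
    var = statistics.pvariance(batch)  # biased (population) variance
    x_hat = [(x - mu) / (var + eps) ** 0.5 for x in batch]
    return [gamma * xh + beta for xh in x_hat], mu, var

def update_running(running, batch_stat, momentum=0.9):
    """Exponential moving average kept for use at inference time."""
    return momentum * running + (1.0 - momentum) * batch_stat

batch = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations of one feature
out, mu, var = batch_norm_train(batch)
running_mean = update_running(0.0, mu)
```

At inference, the same normalization is applied with the running mean and variance in place of the batch statistics.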

Regularization Techniques

  • Dropout:
    • Training: Randomly disable neurons (p = 0.2-0.5)
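    • Inference: Keep all neurons and scale outputs by the keep probability (1 − p); with inverted dropout, scale by 1/(1 − p) during training instead and leave inference unchanged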
    • Prevents co-adaptation of neurons
  • Weight Regularization:
    • L2 regularization: Penalizes large weights
    • L1 regularization: Promotes sparsity
  • Batch Normalization:
    • Acts as implicit regularizer
    • Reduces internal covariate shift
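
A minimal sketch of dropout in its widely used "inverted" form, where surviving activations are rescaled during training so that no rescaling is needed at inference. The function signature and the `rng` parameter are assumptions made for the example.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: drop each unit with probability p while training.

    Kept units are divided by (1 - p) so the expected activation is
    unchanged; at inference the input is returned as-is.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded for reproducibility
train_out = dropout([1.0] * 1000, p=0.5, rng=rng)
eval_out = dropout([1.0, 2.0], p=0.5, training=False)
```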

Learning Rate Optimization ⏰

  • Dynamic Learning Rates:
    • Start with larger rates, decrease over time
    • Adapts to training progress
  • Decay Strategies: \[ \text{lr}_{\text{new}} = \text{lr}_{\text{initial}} \times \text{decay\_rate}^{\text{epoch} / \text{decay\_step}} \]
  • Adaptive Methods:
    • Adam: Adapts per-parameter learning rates
    • SGD with momentum: Often reaches solutions that generalize better, but usually needs more learning-rate tuning
[Figure: Learning rate variations]
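
The exponential decay formula above translates directly into code. The function name and the sample values (initial rate 0.1, halving every 10 epochs) are chosen only for illustration.

```python
def exponential_decay(lr_initial, decay_rate, epoch, decay_step):
    """Learning rate after `epoch` epochs of exponential decay."""
    return lr_initial * decay_rate ** (epoch / decay_step)

# Hypothetical schedule: start at 0.1 and halve every 10 epochs.
schedule = [exponential_decay(0.1, 0.5, epoch, decay_step=10)
            for epoch in (0, 10, 20)]
```

With these values the rate is 0.1 at epoch 0, 0.05 at epoch 10, and 0.025 at epoch 20.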

Comprehensive Hyperparameter Guide

  • Network Architecture:
    • Start simple, add complexity if underfitting
    • First hidden layer ≈ 2/3 of the input size (a rule of thumb, not a guarantee)
    • Layer sizes decrease progressively
  • Training Parameters:
    • Learning rate: Start with [0.1, 0.01, 0.001]
    • Batch size: [32, 64, 128, 256] (GPU dependent)
    • Dropout: 0.2-0.5 (higher for larger layers)

Hyperparameter Optimization ⏰

  • Systematic Approaches:
    • Grid Search: Exhaustive parameter space exploration
    • Random Search: Often more efficient than grid
    • Bayesian Optimization: Learns from previous trials
  • Best Practices:
    • Monitor validation metrics
    • Use logarithmic scales for learning rates
    • Consider computational constraints
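
A rough sketch of random search that samples the learning rate on a logarithmic scale. `toy_score` is a hypothetical stand-in for "train the model and return a validation metric"; the search ranges are assumptions, not recommendations from the slides.

```python
import math
import random

def random_search(score_fn, n_trials, rng):
    """Return (best_score, best_params) over n_trials random configurations."""
    best = None
    for _ in range(n_trials):
        params = {
            # log-uniform sample in [1e-4, 1e-1]
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "batch_size": rng.choice([32, 64, 128, 256]),
        }
        score = score_fn(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

def toy_score(params):
    """Hypothetical validation score that peaks at learning_rate = 1e-2."""
    return -abs(math.log10(params["learning_rate"]) + 2.0)

best_score, best_params = random_search(toy_score, 50, random.Random(0))
```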

Cross-Validation Strategies ⏰

  • K-Fold Cross-Validation:
    • Split data into k parts (k=5 or 10)
    • Train k models with different validation folds
    • Average performance for robust evaluation
  • Special Cases:
    • Stratified K-Fold: Preserves class distribution
    • Time Series Split: Respects temporal order
    • Hold-out Validation: For large datasets
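
The k-fold procedure can be sketched as an index generator. This minimal version uses contiguous folds and assumes the data has already been shuffled (which would not be appropriate for time series); the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(k_fold_indices(10, k=5))
```

Each sample appears in exactly one validation fold; the k validation scores are then averaged.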

Comprehensive Evaluation Metrics

  • Classification:
    • Accuracy: $\frac{TP + TN}{Total}$
    • Precision: $\frac{TP}{TP + FP}$
    • Recall: $\frac{TP}{TP + FN}$
    • F1 Score: $2 \times \frac{precision \times recall}{precision + recall}$
  • Regression:
    • MSE: $\frac{1}{n}\sum(y - \hat{y})^2$
    • MAE: $\frac{1}{n}\sum|y - \hat{y}|$
    • R² Score: Explained variance ratio
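
The classification formulas above can be checked on a small confusion matrix. The counts below are invented to mimic an imbalanced dataset, and the sketch assumes at least one predicted positive and one actual positive (otherwise the divisions are undefined).

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 12 actual positives out of 100 samples.
accuracy, precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=4, tn=86)
```

Here accuracy is 0.94 while recall is only about 0.67, which illustrates why accuracy alone can mislead on imbalanced classes.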

Choosing the Right Metrics

  • Context Considerations:
    • Class balance/imbalance
    • Cost of different error types
    • Business requirements
  • Application Examples:
    • Medical: High recall priority
    • Spam Detection: Precision-recall balance
    • Recommendations: Top-k metrics
  • Metric Strategy:
    • Define primary and secondary metrics
    • Consider multiple metric combinations
    • Align with stakeholder goals

Key Takeaways

  • A good fit balances model complexity and generalization, avoiding underfitting or overfitting
  • Batch normalization accelerates training, stabilizes learning, and provides implicit regularization
  • Regularization techniques like dropout and weight penalties prevent overfitting
  • Dynamic learning rates and adaptive optimizers like Adam improve training efficiency
  • Hyperparameter tuning systematically enhances performance; grid search, random search, and Bayesian methods are effective approaches
  • Cross-validation gives a more robust performance estimate; fit preprocessing inside each fold to avoid data leakage
  • Metric selection should align with the task and business objectives (e.g., a precision-recall balance for spam detection, recall for medical diagnoses)

Backpropagation and Optimization

Overview

  • Understanding Backpropagation and the Chain Rule
  • Optimization Fundamentals and Variants
  • Common Challenges and Solutions
  • Implementation Best Practices

Understanding Backpropagation

  • A method to compute gradients for all network parameters efficiently
    • Forward pass: compute predictions
    • Loss calculation: measure prediction error
    • Backward pass: compute gradients
  • Practical Example:
    • Input: \( x = 2 \), target: \( y = 1 \) (the target used in the loss below)
    • Weights: Initial values \( w_1 = 3 \), \( w_2 = -2 \) (to illustrate calculations)
    • Hidden: \( h = \sigma(w_1 \cdot x) = \sigma(3 \cdot 2) = \sigma(6) \approx 0.9975 \)
    • Output: \( \hat{y} = w_2 \cdot h = -2 \cdot 0.9975 \approx -1.995 \)
    • Loss (MSE): \( L = \frac{1}{2}(1 - (-1.995))^2 = \frac{1}{2}(2.995)^2 \approx 4.49 \)

The Chain Rule in Action

  • Gradients measure loss changes relative to parameters
  • Chain rule enables layer-by-layer computation: \[ \frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W} \]
  • Example Backward Pass:
    • \(\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial w_2} = (\hat{y} - y) \times h = (-2.995) \times 0.9975 \approx -2.988\)
    • \(\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial h} \times \frac{\partial h}{\partial w_1} = (\hat{y} - y) \times w_2 \times \sigma'(w_1 x) \cdot x = (-2.995) \times (-2) \times 0.002467 \times 2 \approx 0.0295\), using \(\sigma'(6) = \sigma(6)(1 - \sigma(6)) \approx 0.002467\)
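
The worked example can be reproduced numerically as a sanity check. The sketch below assumes the target $y = 1$ implied by the loss computation and uses the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$; variable names are chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 2.0, 1.0        # input and target (y = 1 is assumed from the loss term)
w1, w2 = 3.0, -2.0     # initial weights from the example

# Forward pass
h = sigmoid(w1 * x)
y_hat = w2 * h
loss = 0.5 * (y - y_hat) ** 2

# Backward pass via the chain rule
dL_dyhat = y_hat - y
dL_dw2 = dL_dyhat * h
dL_dw1 = dL_dyhat * w2 * h * (1.0 - h) * x  # sigma'(z) = h * (1 - h)
```

Printing `loss`, `dL_dw2` and `dL_dw1` gives values to compare against the hand calculation; note that `dL_dw1` is positive because both `dL_dyhat` and `w2` are negative.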

Optimization Fundamentals

  • Goal: Minimize loss function through parameter updates
  • Basic Gradient Descent: \[ W = W - \eta \frac{\partial L}{\partial W} \] Where $\eta$ is the learning rate
  • Key Variants ⏰:
    • Batch GD: Full dataset updates (stable, slow)
    • Mini-Batch GD: Subset updates (balanced)
    • SGD: Single-sample updates (fast, noisy)
[Figure: Gradient descent optimization landscape]
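
The basic update rule can be sketched on a one-parameter problem. The quadratic loss $L(w) = (w - 3)^2$ is a made-up example whose minimum is known, so convergence is easy to check.

```python
def gradient_descent(grad, w, lr=0.1, steps=100):
    """Repeatedly apply w <- w - lr * grad(w)."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Hypothetical loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
```

Batch, mini-batch and stochastic variants differ only in how much data is used to estimate `grad(w)` at each step.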

Advanced Optimization Methods

  • Momentum:
    • Incorporates previous updates: \[ v = \beta v - \eta \nabla L; \, W = W + v \]
    • Helps overcome local minima
  • Adaptive Methods:
    • Dynamic learning rates (RMSProp, Adam)
    • Better handling of sparse gradients
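
The momentum update above, $v = \beta v - \eta \nabla L$ followed by $W = W + v$, can be sketched on the same kind of toy problem. The quadratic loss and the hyperparameter values are assumptions for illustration.

```python
def momentum_descent(grad, w, lr=0.1, beta=0.9, steps=500):
    """Gradient descent with a velocity term that accumulates past updates."""
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w = w + v
    return w

# Hypothetical loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_momentum = momentum_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
```

On this simple problem the iterate oscillates around the minimum before settling, which is the typical signature of momentum.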

Common Challenges and Solutions

  • Vanishing Gradients:
    • Problem: Gradients become too small
    • Solution: ReLU activation, proper initialization
  • Exploding Gradients:
    • Problem: Unstable large gradients
    • Solution: Gradient clipping, layer normalization
  • Local Minima:
    • Problem: Suboptimal convergence
    • Solution: Momentum, adaptive learning rates

Weight Initialization ⏰

  • Initialization plays a crucial role in training neural networks.
  • Improper initialization can lead to:
    • Vanishing gradients: Gradients become too small to update weights.
    • Exploding gradients: Gradients become too large, destabilizing training.
  • Common strategies:
    • Random Initialization: Random weights, but may lead to instability.
    • Xavier Initialization: Suited for activations like sigmoid and tanh.
    • He Initialization: Optimized for ReLU and its variants.

Xavier and He Initialization

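  • Xavier Initialization: \[ W \sim \mathcal{N}(0, \frac{1}{\text{input\_size}}) \] (the second argument is the variance; the original Glorot formulation uses $\frac{2}{\text{input\_size} + \text{output\_size}}$)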
    • Balances the variance of inputs and outputs across layers.
    • Effective for sigmoid or tanh activation.
  • He Initialization: \[ W \sim \mathcal{N}(0, \frac{2}{\text{input\_size}}) \]
    • Prevents vanishing gradients for ReLU-based activations.
    • More robust for deep networks.
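
A minimal sketch of sampling weights with the two variances given above ($1/\text{input\_size}$ and $2/\text{input\_size}$). The function names are assumptions, and the matrices are plain nested lists with one row per output neuron.

```python
import random

def xavier_normal(input_size, output_size, rng=random):
    """Sample weights from N(0, 1 / input_size), as on the slide."""
    std = (1.0 / input_size) ** 0.5
    return [[rng.gauss(0.0, std) for _ in range(input_size)]
            for _ in range(output_size)]

def he_normal(input_size, output_size, rng=random):
    """Sample weights from N(0, 2 / input_size), intended for ReLU layers."""
    std = (2.0 / input_size) ** 0.5
    return [[rng.gauss(0.0, std) for _ in range(input_size)]
            for _ in range(output_size)]

rng = random.Random(0)  # seeded for reproducibility
W_he = he_normal(200, 100, rng)
W_xavier = xavier_normal(200, 100, rng)
```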

Exercise 1: Basic Backpropagation

  • Classification Problem Setup:
    • Output activation: $a = [0.7, 0.2, 0.1]$
    • True labels: $y = [1, 0, 0]$
    • Loss function: Cross-Entropy
  • Tasks:
    • 1. Compute loss gradient $\frac{\partial L}{\partial a}$
    • 2. Calculate weight updates
    • 3. Verify gradient magnitudes

Gradient Checking and Verification ⏰

  • Numerical Gradient Computation: \[ \frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon} \]
  • Verification Process:
    • Compare numerical vs. analytical gradients
    • Acceptable relative error: < 1e-7
    • Only use during development
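
The central-difference check can be sketched for a single scalar parameter. The cubic test loss is a made-up example with a known derivative, and the relative-error formula shown is one common choice rather than the only one.

```python
def numerical_gradient(loss_fn, w, eps=1e-5):
    """Central-difference estimate of dL/dw at w."""
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2.0 * eps)

def relative_error(a, b):
    """Scale-independent difference between two gradient values."""
    return abs(a - b) / max(abs(a) + abs(b), 1e-12)

# Hypothetical loss L(w) = w^3 with analytical gradient 3 * w^2.
w = 2.0
analytical = 3.0 * w ** 2
numerical = numerical_gradient(lambda v: v ** 3, w)
error = relative_error(analytical, numerical)
```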

Implementation Best Practices

  • Development Strategy:
    • Implement incrementally with testing
    • Use small, verifiable test cases
    • Monitor intermediate values
  • Common Pitfalls to Avoid:
    • Gradient accumulation errors
    • Broadcasting mistakes
    • Missing activation derivatives
    • Incorrect batch dimension handling

Exercise 2: Complete Implementation

  • Network Architecture:
    • 2 → 2 → 1 neurons
    • ReLU + Sigmoid activations
    • Binary cross-entropy loss
  • Given Parameters:
    • x = [0.5, -0.2]
    • W₁ = [[0.1, 0.3], [-0.2, 0.4]]
    • W₂ = [0.2, -0.5]
    • y = 1
  • Implement:
    • Forward pass
    • Loss calculation
    • Complete backward pass

Key Takeaways

  • Backpropagation efficiently computes gradients through chain rule
  • Different optimization methods suit different scenarios
  • Common challenges have established solutions
  • Proper implementation requires careful testing and debugging
  • Gradient checking during development helps verify that the implementation is correct

Text Classification

Overview

  • Text Preprocessing Pipeline
  • Feature Engineering
  • Model Training and Implementation
  • Common Challenges
  • Evaluation and Interpretation

Text Preprocessing Pipeline

  • Basic Cleaning:
    • Convert to lowercase
    • Remove special characters and numbers
    • Handle HTML tags and URLs
    • Remove extra whitespace
  • Text Normalization:
    • Tokenization: Split text into words/subwords
    • Stemming vs. Lemmatization
      • Stemming: running → run (faster, rougher)
      • Lemmatization: better → good (slower, more accurate)
    • Handle contractions (don't → do not)
  • Domain-Specific Processing:
    • Handle hashtags and mentions
    • Process emoticons and emojis
    • Deal with domain jargon
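
The basic cleaning steps can be sketched with regular expressions. This is a deliberately simple illustration: the patterns are assumptions that will not cover every case, and contractions should be expanded before the special-character step, since it would otherwise split "don't" into "don" and "t".

```python
import re

def clean_text(text):
    """Lowercase, strip HTML tags and URLs, drop non-letters, tidy spaces."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # remove digits and special characters
    return re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace

tokens = clean_text("<p>Visit https://example.com NOW!!! 3 times</p>").split()
```

Splitting on whitespace is the simplest form of tokenization; subword tokenizers are a separate step.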

Text Representation & Feature Engineering

  • Basic Representations:
    • Bag of Words: Simple word frequency counts
    • TF-IDF: Term importance weighting
    • Word Embeddings: Dense semantic vectors
  • Advanced Feature Engineering:
    • N-gram Features
      • Capture phrase context (e.g., "not good" vs "good")
      • Balance with dimensionality concerns
    • Subword Tokenization
      • Handles unknown words
      • Examples: BPE, WordPiece

Feature Selection & Enhancement ⏰

  • Statistical Selection Methods:
    • Chi-squared test for feature importance
    • Mutual information scoring
    • Variance thresholding
  • Text Augmentation Techniques:
    • Synonym replacement
    • Back-translation
    • Random insertion/deletion
    • When to use each technique

CountVectorizer

  • Transforms text into a sparse matrix of token counts.
  • Key parameters:
    • max_features: Limits the vocabulary size.
    • stop_words: Removes common words (e.g., "the", "and").
    • min_df: Filters rare words appearing in few documents.
    • max_df: Filters frequent words appearing in most documents.

Model Training & Implementation

  • Data Management:
    • Dataset splitting (80-10-10 ratio)
    • Handling class imbalance
    • Implementing stratified sampling
  • Training Process:
    • Mini-batch gradient descent
    • Learning rate scheduling
    • Early stopping based on validation
  • Performance Monitoring:
    • Training vs. validation metrics
    • Learning curves analysis
    • Model convergence indicators

Common Challenges & Solutions ⏰

  • Technical Challenges:
    • Challenge: Memory issues with large vocabularies
    • Solution: Sparse matrices, batch processing
    • Challenge: Data leakage in preprocessing
    • Solution: Proper train-test isolation
  • Linguistic Challenges:
    • Challenge: Ambiguity and sarcasm
    • Solution: Context-aware features
    • Challenge: Domain-specific terminology
    • Solution: Custom vocabularies, domain adaptation
  • Model Challenges:
    • Challenge: Overfitting to training data
    • Solution: Regularization, dropout
    • Challenge: Poor generalization
    • Solution: Cross-validation, robust evaluation

Evaluation & Interpretation

  • Core Metrics:
    • Accuracy and confusion matrix analysis
    • Precision: $\frac{TP}{TP + FP}$
    • Recall: $\frac{TP}{TP + FN}$
    • F1-score for balanced assessment
  • Advanced Analysis ⏰:
    • ROC and PR curves
    • Per-class performance breakdown
    • Error pattern analysis
  • Model Interpretability:
    • Feature importance rankings
    • LIME/SHAP explanations
    • Example-based interpretation

Key Takeaways

  • Text preprocessing is critical for effective model performance
  • Feature representation (e.g., embeddings, n-grams) impacts results significantly
  • Proper training practices and regularization prevent overfitting
  • Model evaluation requires comprehensive metrics and interpretability
  • Address challenges proactively with tailored solutions