Neural Networks Exercises: Corrections

Exercise 1: Forward Pass Calculation

Step 1: Calculate the Weighted Sum

The formula for the weighted sum is: $Z = W \cdot X + b$

Given:

  • $W = \begin{bmatrix} 0.2 & 0.8 \\ 0.4 & 0.3 \end{bmatrix}$
  • $X = \begin{bmatrix} 1 \\ 0.5 \end{bmatrix}$
  • $b = \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix}$

Perform the matrix multiplication $W \cdot X$:

$W \cdot X = \begin{bmatrix} 0.2 & 0.8 \\ 0.4 & 0.3 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 0.5 \end{bmatrix} = \begin{bmatrix} (0.2 \cdot 1) + (0.8 \cdot 0.5) \\ (0.4 \cdot 1) + (0.3 \cdot 0.5) \end{bmatrix} = \begin{bmatrix} 0.2 + 0.4 \\ 0.4 + 0.15 \end{bmatrix} = \begin{bmatrix} 0.6 \\ 0.55 \end{bmatrix}$

Add the biases $b$ to the result: $Z = \begin{bmatrix} 0.6 \\ 0.55 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix} = \begin{bmatrix} 0.7 \\ 0.65 \end{bmatrix}$

Thus, the weighted sum $Z$ is: $Z = \begin{bmatrix} 0.7 \\ 0.65 \end{bmatrix}$

Step 2: Apply the ReLU Activation Function

The ReLU (Rectified Linear Unit) activation function is defined as: $\text{ReLU}(x) = \max(0, x)$

Apply ReLU element-wise to $Z$: $A = \text{ReLU}(Z) = \begin{bmatrix} \text{ReLU}(0.7) \\ \text{ReLU}(0.65) \end{bmatrix} = \begin{bmatrix} \max(0, 0.7) \\ \max(0, 0.65) \end{bmatrix} = \begin{bmatrix} 0.7 \\ 0.65 \end{bmatrix}$

Thus, the output after applying ReLU is: $A = \begin{bmatrix} 0.7 \\ 0.65 \end{bmatrix}$

Final Results

  1. Weighted Sum $Z$: $Z = \begin{bmatrix} 0.7 \\ 0.65 \end{bmatrix}$

  2. Activation Output $A$: $A = \begin{bmatrix} 0.7 \\ 0.65 \end{bmatrix}$

Additional Remarks for Students

  • Matrix Multiplication Order: Ensure the order of operations is correct ($W \cdot X$). Verify dimensions are compatible: $W$ is $2 \times 2$ and $X$ is $2 \times 1$, resulting in a $2 \times 1$ vector.
  • Bias Addition: Biases are added element-wise, aligning dimensions appropriately.
  • ReLU Behavior: The ReLU activation function outputs the input directly if it is positive; otherwise, it outputs zero.
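
The whole computation can be checked in a few lines of NumPy (a minimal sketch; array names mirror the exercise):

```python
import numpy as np

W = np.array([[0.2, 0.8],
              [0.4, 0.3]])
X = np.array([1.0, 0.5])
b = np.array([0.1, 0.1])

Z = W @ X + b         # weighted sum W·X + b
A = np.maximum(0, Z)  # ReLU, applied element-wise

print(Z)  # [0.7  0.65]
print(A)  # [0.7  0.65]
```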

Exercise 2: Backpropagation

Step 1: Forward Pass

Hidden Layer Activation

The formula for the hidden layer’s weighted sum and activation is: $Z_1 = W_1 \cdot X, \quad A_1 = \text{ReLU}(Z_1)$

Given:

  • $X = \begin{bmatrix} 0.5 \\ -0.2 \end{bmatrix}$
  • $W_1 = \begin{bmatrix} 0.1 & 0.3 \\ -0.2 & 0.4 \end{bmatrix}$

Compute $Z_1$: $Z_1 = \begin{bmatrix} 0.1 & 0.3 \\ -0.2 & 0.4 \end{bmatrix} \cdot \begin{bmatrix} 0.5 \\ -0.2 \end{bmatrix} = \begin{bmatrix} (0.1 \cdot 0.5) + (0.3 \cdot -0.2) \\ (-0.2 \cdot 0.5) + (0.4 \cdot -0.2) \end{bmatrix} = \begin{bmatrix} 0.05 - 0.06 \\ -0.1 - 0.08 \end{bmatrix} = \begin{bmatrix} -0.01 \\ -0.18 \end{bmatrix}$

Apply ReLU: $A_1 = \text{ReLU}(Z_1) = \begin{bmatrix} \max(0, -0.01) \\ \max(0, -0.18) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$

Hidden Layer Output: $A_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$

Output Layer Activation

The output layer applies the formula: $Z_2 = W_2 \cdot A_1, \quad A_2 = \sigma(Z_2)$ where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the Sigmoid function.

Given:

  • $W_2 = \begin{bmatrix} 0.2 & -0.5 \end{bmatrix}$ (a $1 \times 2$ row vector, so that $Z_2$ is a scalar)
  • $A_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$

Compute $Z_2$: $Z_2 = W_2 \cdot A_1 = \begin{bmatrix} 0.2 & -0.5 \end{bmatrix} \cdot \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0$

Apply Sigmoid: $A_2 = \sigma(Z_2) = \frac{1}{1 + e^{-0}} = 0.5$

Output Layer Result: $A_2 = 0.5$

Step 2: Calculate Binary Cross-Entropy Loss

The binary cross-entropy loss is: $L = -\left(y \log(A_2) + (1 - y) \log(1 - A_2)\right)$

Given:

  • $y = 1$
  • $A_2 = 0.5$

Compute $L$: $L = -\left(1 \cdot \log(0.5) + (1 - 1) \cdot \log(1 - 0.5)\right) = -\log(0.5) = -(-0.693) = 0.693$

Loss: $L = 0.693$

Step 3: Backpropagation

Gradient for $W_2$

The gradient of the loss with respect to $W_2$ is: $\frac{\partial L}{\partial W_2} = \delta_2 \cdot A_1^T$ where: $\delta_2 = A_2 - y$

Compute $\delta_2$: $\delta_2 = 0.5 - 1 = -0.5$

Compute $\frac{\partial L}{\partial W_2}$: $\frac{\partial L}{\partial W_2} = -0.5 \cdot \begin{bmatrix} 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 \end{bmatrix}$

Gradient for $W_2$: $\frac{\partial L}{\partial W_2} = \begin{bmatrix} 0 & 0 \end{bmatrix}$

Gradient for $W_1$

The gradient of the loss with respect to $W_1$ is: $\frac{\partial L}{\partial W_1} = \delta_1 \cdot X^T$ where: $\delta_1 = (\delta_2 \cdot W_2^T) \odot \text{ReLU}'(Z_1)$

Step 1: Compute $\delta_2 \cdot W_2^T$: $\delta_2 \cdot W_2^T = -0.5 \cdot \begin{bmatrix} 0.2 \\ -0.5 \end{bmatrix} = \begin{bmatrix} -0.1 \\ 0.25 \end{bmatrix}$

Step 2: Apply $\text{ReLU}'(Z_1)$: The derivative of ReLU is: $\text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$

Since $Z_1 = \begin{bmatrix} -0.01 \\ -0.18 \end{bmatrix}$, the derivative is: $\text{ReLU}'(Z_1) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$

Step 3: Compute $\delta_1$: $\delta_1 = \begin{bmatrix} -0.1 \\ 0.25 \end{bmatrix} \odot \begin{bmatrix} 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$

Step 4: Compute $\frac{\partial L}{\partial W_1}$: $\frac{\partial L}{\partial W_1} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \cdot \begin{bmatrix} 0.5 \\ -0.2 \end{bmatrix}^T = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$

Gradient for $W_1$: $\frac{\partial L}{\partial W_1} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$

Final Results

  1. Forward Pass:
    • $A_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$
    • $A_2 = 0.5$
  2. Binary Cross-Entropy Loss:
    • $L = 0.693$
  3. Gradients:
    • $\frac{\partial L}{\partial W_2} = \begin{bmatrix} 0 & 0 \end{bmatrix}$
    • $\frac{\partial L}{\partial W_1} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$

Additional Remarks for Students

  1. ReLU Activation’s Impact: Since all values in $Z_1$ were negative, the hidden layer output was completely zero, leading to no gradient flow back to $W_1$ and $W_2$.
  2. Numerical Stability: Always check intermediate values during calculations to avoid numerical issues, particularly with Sigmoid or softmax activations.
  3. Chain Rule in Backpropagation: Gradients are computed step-by-step using the chain rule, highlighting dependencies between layers.
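
For reference, the full forward and backward pass can be reproduced in NumPy. This is a minimal sketch; $W_2$ is stored as a $1 \times 2$ row so that $Z_2$ comes out as a scalar, matching the derivation above:

```python
import numpy as np

X = np.array([[0.5], [-0.2]])        # input, shape (2, 1)
W1 = np.array([[0.1, 0.3],
               [-0.2, 0.4]])
W2 = np.array([[0.2, -0.5]])         # shape (1, 2)
y = 1.0

# Forward pass
Z1 = W1 @ X                          # (2, 1)
A1 = np.maximum(0, Z1)               # ReLU
Z2 = W2 @ A1                         # (1, 1) scalar
A2 = 1 / (1 + np.exp(-Z2))           # sigmoid

# Binary cross-entropy loss
L = -(y * np.log(A2) + (1 - y) * np.log(1 - A2))

# Backward pass
delta2 = A2 - y                      # dL/dZ2
dW2 = delta2 @ A1.T                  # (1, 2) -> [[0, 0]]
delta1 = (W2.T @ delta2) * (Z1 > 0)  # chain rule with ReLU'
dW1 = delta1 @ X.T                   # (2, 2) -> all zeros

print(L.item())  # ≈ 0.693
```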

Exercise 3: Data Preprocessing

Step 1: Apply Min-Max Scaling

Min-Max scaling scales each feature to a specific range, typically $[0, 1]$. The formula is: $X_{\text{scaled}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}$

Given: $X = \begin{bmatrix} 5 & 20 & 10 \\ 15 & 5 & 25 \\ 10 & 30 & 15 \end{bmatrix}$

Step 1.1: Identify the minimum and maximum for each feature

  • For feature 1: $X_{\text{min}} = 5$, $X_{\text{max}} = 15$
  • For feature 2: $X_{\text{min}} = 5$, $X_{\text{max}} = 30$
  • For feature 3: $X_{\text{min}} = 10$, $X_{\text{max}} = 25$

Step 1.2: Apply the formula for each feature

For feature 1: $X_{\text{scaled},1} = \frac{X_1 - 5}{15 - 5} = \frac{X_1 - 5}{10}$

For feature 2: $X_{\text{scaled},2} = \frac{X_2 - 5}{30 - 5} = \frac{X_2 - 5}{25}$

For feature 3: $X_{\text{scaled},3} = \frac{X_3 - 10}{25 - 10} = \frac{X_3 - 10}{15}$

Step 1.3: Perform the calculations

$X_{\text{scaled}} = \begin{bmatrix} \frac{5 - 5}{10} & \frac{20 - 5}{25} & \frac{10 - 10}{15} \\ \frac{15 - 5}{10} & \frac{5 - 5}{25} & \frac{25 - 10}{15} \\ \frac{10 - 5}{10} & \frac{30 - 5}{25} & \frac{15 - 10}{15} \end{bmatrix} = \begin{bmatrix} 0 & 0.6 & 0 \\ 1 & 0 & 1 \\ 0.5 & 1 & 0.333 \end{bmatrix}$

Min-Max Scaled Dataset: $X_{\text{scaled}} = \begin{bmatrix} 0 & 0.6 & 0 \\ 1 & 0 & 1 \\ 0.5 & 1 & 0.333 \end{bmatrix}$

Step 2: Standardize Features

Standardization rescales features to have zero mean and unit variance. The formula is: $X_{\text{standardized}} = \frac{X - \mu}{\sigma}$ where $\mu$ is the mean of each feature, and $\sigma$ is the standard deviation of each feature.

Step 2.1: Calculate the mean ($\mu$) and standard deviation ($\sigma$) for each feature

For feature 1: $\mu_1 = \frac{5 + 15 + 10}{3} = 10, \quad \sigma_1 = \sqrt{\frac{(5-10)^2 + (15-10)^2 + (10-10)^2}{3}} = \sqrt{\frac{25 + 25 + 0}{3}} = \sqrt{\frac{50}{3}} \approx 4.08$

For feature 2: $\mu_2 = \frac{20 + 5 + 30}{3} = 18.33, \quad \sigma_2 = \sqrt{\frac{(20-18.33)^2 + (5-18.33)^2 + (30-18.33)^2}{3}} \approx \sqrt{\frac{2.78 + 177.78 + 136.11}{3}} \approx \sqrt{105.56} \approx 10.28$

For feature 3: $\mu_3 = \frac{10 + 25 + 15}{3} = 16.67, \quad \sigma_3 = \sqrt{\frac{(10-16.67)^2 + (25-16.67)^2 + (15-16.67)^2}{3}} \approx \sqrt{\frac{44.44 + 69.44 + 2.78}{3}} \approx \sqrt{38.89} \approx 6.24$

Step 2.2: Apply the formula for each feature

For feature 1: $X_{\text{standardized},1} = \frac{X_1 - 10}{4.08}$

For feature 2: $X_{\text{standardized},2} = \frac{X_2 - 18.33}{10.28}$

For feature 3: $X_{\text{standardized},3} = \frac{X_3 - 16.67}{6.24}$

Step 2.3: Perform the calculations

$X_{\text{standardized}} = \begin{bmatrix} \frac{5 - 10}{4.08} & \frac{20 - 18.33}{10.28} & \frac{10 - 16.67}{6.24} \\ \frac{15 - 10}{4.08} & \frac{5 - 18.33}{10.28} & \frac{25 - 16.67}{6.24} \\ \frac{10 - 10}{4.08} & \frac{30 - 18.33}{10.28} & \frac{15 - 16.67}{6.24} \end{bmatrix} = \begin{bmatrix} -1.225 & 0.162 & -1.069 \\ 1.225 & -1.297 & 1.336 \\ 0 & 1.135 & -0.267 \end{bmatrix}$

Standardized Dataset: $X_{\text{standardized}} = \begin{bmatrix} -1.225 & 0.162 & -1.069 \\ 1.225 & -1.297 & 1.336 \\ 0 & 1.135 & -0.267 \end{bmatrix}$

Step 3: Comparison of Approaches

Min-Max Scaling:

  • Advantages:
    • Ensures all features are in the same range (e.g., [0, 1]).
    • Useful for algorithms sensitive to the scale of inputs (e.g., gradient descent, neural networks).
  • Disadvantages:
    • Sensitive to outliers, as extreme values can dominate the range and skew scaling.

Standardization:

  • Advantages:
    • Centers the data around zero and scales by variance, which is beneficial for distance-based algorithms (e.g., SVM, KNN).
    • Less sensitive to outliers compared to Min-Max scaling.
  • Disadvantages:
    • Does not bound data to a specific range, which may not suit algorithms requiring normalized inputs.

When to Use:

  • Min-Max Scaling: Use when the algorithm assumes data within a fixed range or when features have the same importance but different ranges (e.g., neural networks).
  • Standardization: Use when the algorithm is distance-based or sensitive to variance and when outliers are present.

Final Results

  1. Min-Max Scaled Dataset: $X_{\text{scaled}} = \begin{bmatrix} 0 & 0.6 & 0 \\ 1 & 0 & 1 \\ 0.5 & 1 & 0.333 \end{bmatrix}$

  2. Standardized Dataset: $X_{\text{standardized}} = \begin{bmatrix} -1.225 & 0.162 & -1.069 \\ 1.225 & -1.297 & 1.336 \\ 0 & 1.135 & -0.267 \end{bmatrix}$

Additional Remarks for Students

  • Always preprocess data before training a neural network, as inconsistent feature scales can hinder convergence.
  • Experiment with both methods on a validation set to determine which works better for your specific model and dataset.
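
Both schemes can be applied per feature (column-wise) in NumPy. A minimal sketch, using the population standard deviation (`ddof=0`) to match the hand calculations above:

```python
import numpy as np

X = np.array([[5.0, 20.0, 10.0],
              [15.0, 5.0, 25.0],
              [10.0, 30.0, 15.0]])

# Min-Max scaling to [0, 1], one min/max per column
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)

# Standardization: zero mean, unit variance per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
print(X_std)
```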

Exercise 4: Activation Functions

Step 1: Compute Outputs for Each Activation Function

ReLU (Rectified Linear Unit)

The ReLU function is defined as: $\text{ReLU}(x) = \max(0, x)$

Given: $X = [-2, -1, 0, 1, 2]$

Compute ReLU for each value in $X$: $\text{ReLU}(-2) = \max(0, -2) = 0$ $\text{ReLU}(-1) = \max(0, -1) = 0$ $\text{ReLU}(0) = \max(0, 0) = 0$ $\text{ReLU}(1) = \max(0, 1) = 1$ $\text{ReLU}(2) = \max(0, 2) = 2$

ReLU Outputs: $\text{ReLU}(X) = [0, 0, 0, 1, 2]$

Leaky ReLU

The Leaky ReLU function introduces a small slope $\alpha$ for negative inputs: $\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}$

Given: $X = [-2, -1, 0, 1, 2]$, with $\alpha = 0.01$

Compute Leaky ReLU for each value in $X$: $\text{LeakyReLU}(-2) = 0.01 \cdot -2 = -0.02$ $\text{LeakyReLU}(-1) = 0.01 \cdot -1 = -0.01$ $\text{LeakyReLU}(0) = 0$ $\text{LeakyReLU}(1) = 1$ $\text{LeakyReLU}(2) = 2$

Leaky ReLU Outputs: $\text{LeakyReLU}(X) = [-0.02, -0.01, 0, 1, 2]$

Sigmoid

The Sigmoid function squashes inputs into the range $[0, 1]$: $\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$

Given: $X = [-2, -1, 0, 1, 2]$

Compute Sigmoid for each value in $X$: $\text{Sigmoid}(-2) = \frac{1}{1 + e^{2}} \approx 0.119$ $\text{Sigmoid}(-1) = \frac{1}{1 + e^{1}} \approx 0.269$ $\text{Sigmoid}(0) = \frac{1}{1 + e^{0}} = 0.5$ $\text{Sigmoid}(1) = \frac{1}{1 + e^{-1}} \approx 0.731$ $\text{Sigmoid}(2) = \frac{1}{1 + e^{-2}} \approx 0.881$

Sigmoid Outputs: $\text{Sigmoid}(X) = [0.119, 0.269, 0.5, 0.731, 0.881]$

Tanh (Hyperbolic Tangent)

The Tanh function squashes inputs into the range $[-1, 1]$: $\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

Given: $X = [-2, -1, 0, 1, 2]$

Compute Tanh for each value in $X$: $\text{Tanh}(-2) = \frac{e^{-2} - e^{2}}{e^{-2} + e^{2}} \approx -0.964$ $\text{Tanh}(-1) = \frac{e^{-1} - e^{1}}{e^{-1} + e^{1}} \approx -0.761$ $\text{Tanh}(0) = 0$ $\text{Tanh}(1) = \frac{e^{1} - e^{-1}}{e^{1} + e^{-1}} \approx 0.761$ $\text{Tanh}(2) = \frac{e^{2} - e^{-2}}{e^{2} + e^{-2}} \approx 0.964$

Tanh Outputs: $\text{Tanh}(X) = [-0.964, -0.761, 0, 0.761, 0.964]$
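
All four activations can be evaluated on the same inputs with a short NumPy sketch:

```python
import numpy as np

X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
alpha = 0.01  # Leaky ReLU slope for negative inputs

relu = np.maximum(0, X)
leaky_relu = np.where(X > 0, X, alpha * X)
sigmoid = 1 / (1 + np.exp(-X))
tanh = np.tanh(X)

print(relu)        # [0. 0. 0. 1. 2.]
print(leaky_relu)  # [-0.02 -0.01  0.    1.    2.  ]
print(sigmoid)     # ≈ [0.119 0.269 0.5 0.731 0.881]
print(tanh)        # ≈ [-0.964 -0.761 0. 0.761 0.964]
```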

Step 2: Sketch the Graphs

Graphs of the functions show their unique behavior:

  1. ReLU: Passes positive inputs directly; outputs 0 for negative inputs.
  2. Leaky ReLU: Similar to ReLU but allows a small negative slope for negative inputs.
  3. Sigmoid: Smoothly maps inputs into $[0, 1]$, with a steep slope near 0.
  4. Tanh: Smoothly maps inputs into $[-1, 1]$, with a steep slope near 0.

Step 3: Compare Advantages and Disadvantages

ReLU

  • Advantages:
    • Computationally efficient.
    • Helps mitigate the vanishing gradient problem for positive inputs.
  • Disadvantages:
    • Suffers from the “dying ReLU” problem where some neurons output 0 and stop learning (no gradient flow).

Leaky ReLU

  • Advantages:
    • Addresses the “dying ReLU” problem by allowing small gradients for negative inputs.
    • Simple modification of ReLU with minimal computational cost.
  • Disadvantages:
    • The slope parameter $\alpha$ may need to be tuned.

Sigmoid

  • Advantages:
    • Outputs are always in $[0, 1]$, making it useful for binary classification.
    • Differentiable and smooth.
  • Disadvantages:
    • Suffers from the vanishing gradient problem for large positive or negative inputs.
    • Outputs not zero-centered, which may slow down convergence.

Tanh

  • Advantages:
    • Outputs are zero-centered, aiding convergence during training.
    • Works well when inputs need to be scaled to $[-1, 1]$.
  • Disadvantages:
    • Suffers from the vanishing gradient problem for large positive or negative inputs.
    • Computationally more expensive than ReLU.

Final Results

  1. Outputs:

| Input ($X$) | ReLU | Leaky ReLU | Sigmoid | Tanh   |
|-------------|------|------------|---------|--------|
| -2          | 0    | -0.02      | 0.119   | -0.964 |
| -1          | 0    | -0.01      | 0.269   | -0.761 |
| 0           | 0    | 0          | 0.5     | 0      |
| 1           | 1    | 1          | 0.731   | 0.761  |
| 2           | 2    | 2          | 0.881   | 0.964  |

  2. Graph Descriptions:

    • ReLU: Linear increase for $x > 0$, flat at 0 for $x \leq 0$.
    • Leaky ReLU: Similar to ReLU but with a slight negative slope for $x \leq 0$.
    • Sigmoid: S-shaped curve bounded between 0 and 1.
    • Tanh: S-shaped curve bounded between -1 and 1.
  3. Usage Recommendations:

    • ReLU: Default for most hidden layers in deep networks.
    • Leaky ReLU: Use when ReLU suffers from the “dying ReLU” problem.
    • Sigmoid: Suitable for binary outputs, particularly in the last layer.
    • Tanh: Useful for zero-centered outputs, especially in intermediate layers.

Exercise 5: Gradient Checking

Step 1: Analytical Gradient Calculation

The loss function is: $L = \frac{1}{2}(y - \hat{y})^2$

The predicted value is: $\hat{y} = W \cdot x + b$

The analytical gradient of $L$ with respect to $W$ is obtained using the chain rule: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial W}$

Step 1.1: Compute Partial Derivatives

  • The derivative of $L$ with respect to $\hat{y}$: $\frac{\partial L}{\partial \hat{y}} = -(y - \hat{y})$

  • The derivative of $\hat{y}$ with respect to $W$: $\frac{\partial \hat{y}}{\partial W} = x$

Step 1.2: Combine the Results

$\frac{\partial L}{\partial W} = -(y - \hat{y}) \cdot x$

Step 1.3: Substitute Values

Given:

  • $y = 1$,
  • $W = 0.5$,
  • $x = 2$,
  • $b = 0.1$,

Compute $\hat{y}$: $\hat{y} = W \cdot x + b = 0.5 \cdot 2 + 0.1 = 1.1$

Substitute into the gradient formula: $\frac{\partial L}{\partial W} = -(1 - 1.1) \cdot 2 = -(-0.1) \cdot 2 = 0.2$

Analytical Gradient: $\frac{\partial L}{\partial W} = 0.2$

Step 2: Numerical Gradient Approximation

The numerical approximation for the gradient is: $\frac{\partial L}{\partial W} \approx \frac{L(W + \epsilon) - L(W - \epsilon)}{2\epsilon}$

Step 2.1: Compute $L(W + \epsilon)$

Let $\epsilon = 10^{-4}$.

  • For $W + \epsilon = 0.5 + 10^{-4} = 0.5001$: $\hat{y}(W + \epsilon) = 0.5001 \cdot 2 + 0.1 = 1.1002$ $L(W + \epsilon) = \frac{1}{2}(1 - 1.1002)^2 = \frac{1}{2}(-0.1002)^2 = \frac{1}{2}(0.01004) = 0.00502$

Step 2.2: Compute $L(W - \epsilon)$

  • For $W - \epsilon = 0.5 - 10^{-4} = 0.4999$: $\hat{y}(W - \epsilon) = 0.4999 \cdot 2 + 0.1 = 1.0998$ $L(W - \epsilon) = \frac{1}{2}(1 - 1.0998)^2 = \frac{1}{2}(-0.0998)^2 = \frac{1}{2}(0.00996) = 0.00498$

Step 2.3: Compute the Numerical Gradient

$\frac{\partial L}{\partial W} \approx \frac{L(W + \epsilon) - L(W - \epsilon)}{2\epsilon}$ Substitute the computed values: $\frac{\partial L}{\partial W} \approx \frac{0.00502 - 0.00498}{2 \cdot 10^{-4}} = \frac{0.00004}{0.0002} = 0.2$

Numerical Gradient: $\frac{\partial L}{\partial W} \approx 0.2$

Step 3: Comparison of Results

  • Analytical Gradient: $0.2$
  • Numerical Gradient: $0.2$

The results are identical, confirming that the analytical gradient is correct.
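
A minimal sketch of this check in Python, with the same values and $\epsilon = 10^{-4}$:

```python
y, W, x, b = 1.0, 0.5, 2.0, 0.1
eps = 1e-4

def loss(w):
    """Squared-error loss L = 0.5 * (y - (w*x + b))^2."""
    return 0.5 * (y - (w * x + b)) ** 2

analytical = -(y - (W * x + b)) * x                      # -(y - y_hat) * x
numerical = (loss(W + eps) - loss(W - eps)) / (2 * eps)  # central difference

print(analytical, numerical)  # 0.2 and ≈ 0.2
```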

Step 4: Discussion

Why Perform Gradient Checking?

Gradient checking is a way to verify that the backpropagation implementation is correct by comparing the analytical gradients to numerical approximations.

Accuracy of the Results

The two results match, meaning the analytical gradient calculation and implementation of backpropagation for this simple case are likely correct.

Potential Differences

If there were a discrepancy between the two results, possible causes could include:

  1. Implementation Errors: Mistakes in the analytical gradient derivation or coding.
  2. Finite Differences Approximation: Numerical gradients are approximations and may suffer from rounding errors if $\epsilon$ is too large or too small.
  3. Non-smooth Functions: Numerical gradient approximation may struggle with non-smooth functions (e.g., ReLU).

Practical Advice

  • Use gradient checking for debugging during the implementation phase, but avoid using it during regular training as it is computationally expensive.
  • Ensure $\epsilon$ is small enough to minimize approximation errors but not too small to cause numerical instability.

Final Results

  1. Analytical Gradient: $\frac{\partial L}{\partial W} = 0.2$

  2. Numerical Gradient: $\frac{\partial L}{\partial W} \approx 0.2$

  3. Comparison:

    • The results are identical, confirming correctness.
  4. Key Takeaways:

    • Gradient checking is a crucial debugging tool for verifying backpropagation.
    • Analytical gradients are computationally efficient and precise, while numerical gradients are used as a validation benchmark.

Exercise 6: Regularization

Step 1: Compute the Weight Penalty Term

The L2 regularization penalty is given by: $\text{Penalty} = \lambda \sum W^2$

Given:

  • $W = [1, -2, 0.5]$
  • $\lambda = 0.01$

Step 1.1: Compute $\sum W^2$

$\sum W^2 = 1^2 + (-2)^2 + (0.5)^2 = 1 + 4 + 0.25 = 5.25$

Step 1.2: Compute the penalty

$\text{Penalty} = \lambda \cdot \sum W^2 = 0.01 \cdot 5.25 = 0.0525$

Weight Penalty Term: $\text{Penalty} = 0.0525$

Step 2: Update the Weights Using Gradient Descent

In gradient descent with L2 regularization, the weight update rule is: $W_{\text{new}} = W_{\text{old}} - \eta \left( \frac{\partial L}{\partial W} + \lambda W \right)$

Assumptions for This Exercise

  • The gradient of the loss with respect to the weights ($\frac{\partial L}{\partial W}$) is not provided, so we use $\frac{\partial L}{\partial W} = [0.1, -0.4, 0.2]$ as an example.
  • Similarly, we assume a learning rate of $\eta = 0.1$ for the updates below.

Step 2.1: Compute Regularization Term

The regularization term for each weight is: $\lambda W = [\lambda \cdot 1, \lambda \cdot -2, \lambda \cdot 0.5] = [0.01 \cdot 1, 0.01 \cdot -2, 0.01 \cdot 0.5] = [0.01, -0.02, 0.005]$

Step 2.2: Update Each Weight

For each weight $W_i$: $W_{\text{new},i} = W_{\text{old},i} - \eta \left( \frac{\partial L}{\partial W_i} + \lambda W_i \right)$

  • For $W_1 = 1$: $W_{\text{new},1} = 1 - 0.1 \cdot (0.1 + 0.01) = 1 - 0.1 \cdot 0.11 = 1 - 0.011 = 0.989$

  • For $W_2 = -2$: $W_{\text{new},2} = -2 - 0.1 \cdot (-0.4 + (-0.02)) = -2 - 0.1 \cdot -0.42 = -2 + 0.042 = -1.958$

  • For $W_3 = 0.5$: $W_{\text{new},3} = 0.5 - 0.1 \cdot (0.2 + 0.005) = 0.5 - 0.1 \cdot 0.205 = 0.5 - 0.0205 = 0.4795$

Updated Weights: $W_{\text{new}} = [0.989, -1.958, 0.4795]$
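
The penalty and the update can be verified with a short NumPy sketch; the gradient and learning rate are the example values assumed above:

```python
import numpy as np

W = np.array([1.0, -2.0, 0.5])
grad = np.array([0.1, -0.4, 0.2])  # assumed example gradient dL/dW
lam = 0.01                         # regularization strength lambda
eta = 0.1                          # assumed learning rate

penalty = lam * np.sum(W ** 2)     # L2 penalty term
W_new = W - eta * (grad + lam * W) # regularized gradient-descent step

print(penalty)  # 0.0525
print(W_new)    # [ 0.989  -1.958   0.4795]
```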

Step 3: Discuss How Regularization Affects Model Training

Purpose of L2 Regularization

L2 regularization, also known as weight decay, adds a penalty proportional to the square of the weights to the loss function. This encourages the model to:

  1. Keep weights small.
  2. Reduce overfitting by discouraging reliance on large weights, which may cause the model to memorize the training data.

Impact on Weight Updates

  • Smaller Weights: The regularization term $\lambda W$ reduces the magnitude of the weights during updates.
  • Regularization Tradeoff: The penalty balances the model’s goal of minimizing the loss and keeping weights small. A larger $\lambda$ increases this effect, potentially leading to underfitting if too strong.

When to Use Regularization

  • Use L2 Regularization When:
    • Overfitting is a concern.
    • Input features have varying scales (in conjunction with feature scaling).
  • Avoid Overregularization:
    • If $\lambda$ is too large, the model may fail to capture important patterns in the data.

Final Results

  1. Weight Penalty Term: $\text{Penalty} = 0.0525$

  2. Updated Weights: $W_{\text{new}} = [0.989, -1.958, 0.4795]$

  3. Key Takeaways:

    • Regularization reduces overfitting by penalizing large weights.
    • Weight updates incorporate both the gradient of the loss and the regularization term.
    • Choosing the regularization strength $\lambda$ is crucial to balance bias and variance in the model.

Exercise 7: Neural Network Error Analysis

Step 1: Calculate the Loss

The binary cross-entropy loss is given by: $L = - \left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right)$

Given:

  • $y = 1$
  • $\hat{y} = 0.4$

Step 1.1: Substitute Values

$L = - \left( 1 \cdot \log(0.4) + (1 - 1) \cdot \log(1 - 0.4) \right) = - \log(0.4)$

Step 1.2: Compute the Logarithm

$\log(0.4) \approx -0.916$

$L = -(-0.916) = 0.916$

Binary Cross-Entropy Loss: $L \approx 0.916$
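
A quick numerical check of this value (natural logarithm, as in the calculation above):

```python
import numpy as np

y, y_hat = 1.0, 0.4
L = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(L)  # ≈ 0.916
```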

Step 2: Determine Potential Issues with Weight Initialization

Weight initialization affects gradient flow, influencing whether gradients vanish (become too small) or explode (become too large). This can lead to slow learning or unstable training.

Step 2.1: Analyze Current Initialization

Weights $W_1$: $W_1 = \begin{bmatrix} 0.5 & 0.2 \\ -0.3 & 0.8 \end{bmatrix}$. These values are moderately sized, neither too large nor too small, which should avoid immediate vanishing or exploding gradients. However:

  • Positive and negative values balance the signal, which is good.
  • The range (from $-0.3$ to $0.8$) could still introduce some instability during backpropagation, depending on the input scale.

Weights $W_2$: $W_2 = [0.7, -0.6]$. These values are larger than those in $W_1$, which may amplify gradients at later stages in the network. This could cause instability, especially when combined with the biases.

Biases $b_1$ and $b_2$: $b_1 = [0.1, -0.1]$, $b_2 = 0.2$. The biases are small and likely do not contribute significantly to vanishing or exploding gradients.

Step 2.2: Analyze Activation Functions

  • ReLU in the Hidden Layer:

    • ReLU outputs zero for negative inputs, which can cause the “dying ReLU” problem (where neurons output 0 and stop learning if weights and inputs align poorly).
    • The weights in $W_1$ could lead to this issue for certain inputs.
  • Sigmoid in the Output Layer:

    • Sigmoid squashes outputs to $[0, 1]$. For inputs of large magnitude, its gradient is close to zero, which can contribute to the vanishing gradient problem during backpropagation.

Conclusion:

The current weight initialization and activation functions are not inherently problematic but could contribute to:

  • Vanishing Gradients: Due to Sigmoid activation and potentially imbalanced weights.
  • Dying ReLU Neurons: If initial weights and inputs lead to zero activations in the hidden layer.

Step 3: Propose Changes to Improve Performance

1. Adjust Weight Initialization

Why?

  • Proper initialization ensures balanced gradients and stable training.
  • Avoid very large or very small initial weights to prevent exploding or vanishing gradients.

Proposed Method:

  • Use Xavier Initialization (also called Glorot Initialization):
    • For weights $W$, initialize as: $W \sim \mathcal{U} \left( -\frac{\sqrt{6}}{\sqrt{n_{\text{in}} + n_{\text{out}}}}, \frac{\sqrt{6}}{\sqrt{n_{\text{in}} + n_{\text{out}}}} \right)$ where $n_{\text{in}}$ and $n_{\text{out}}$ are the number of input and output neurons for the layer.
  • Alternatively, use He Initialization (specific for ReLU):
    • For weights $W$, initialize as: $W \sim \mathcal{N}(0, \frac{2}{n_{\text{in}}})$
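
A minimal NumPy sketch of both schemes (the layer sizes are illustrative):

```python
import numpy as np

n_in, n_out = 2, 2  # fan-in and fan-out of the layer
rng = np.random.default_rng(seed=0)

# Xavier/Glorot: uniform in [-limit, +limit]
limit = np.sqrt(6 / (n_in + n_out))
W_xavier = rng.uniform(-limit, limit, size=(n_out, n_in))

# He: zero-mean normal with variance 2 / n_in, suited to ReLU layers
W_he = rng.normal(0.0, np.sqrt(2 / n_in), size=(n_out, n_in))
```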

2. Replace Sigmoid Activation in the Output Layer

Why?

  • Sigmoid activations can cause vanishing gradients for large inputs.

Proposed Change:

  • Use Softmax for multi-class classification or Tanh for zero-centered outputs. However, for binary classification:
    • Retain Sigmoid, but make sure its inputs are scaled so it does not saturate.

3. Modify Network Architecture

Why?

  • Deeper networks often benefit from regularization and additional layers.

Proposed Changes:

  • Add Batch Normalization after the hidden layer to normalize activations, improving gradient flow and reducing dependence on initialization.
  • Add Dropout to mitigate overfitting.
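
As an illustration, a PyTorch sketch of where these layers might sit in this exercise's 2-2-1 network (the dropout rate of 0.2 is an assumed example):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 2),    # hidden layer
    nn.BatchNorm1d(2),  # normalize hidden activations
    nn.ReLU(),
    nn.Dropout(p=0.2),  # randomly zero activations during training
    nn.Linear(2, 1),    # output layer
    nn.Sigmoid(),
)
```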

4. Experiment with Learning Rate

Why?

  • A high learning rate can cause gradients to explode, while a low learning rate can lead to vanishing updates.

Proposed Change:

  • Use a learning rate scheduler to adapt the learning rate during training.

Final Results

  1. Loss Calculation: $L \approx 0.916$

  2. Weight Initialization Analysis:

    • Current weights and biases are moderately scaled but could still cause vanishing gradients (due to Sigmoid) or dying ReLU neurons.
    • Issues are not severe but may worsen as the network becomes deeper.
  3. Proposed Improvements:

    • Adjust weight initialization with Xavier or He Initialization.
    • Replace Sigmoid activation or normalize inputs to avoid saturation.
    • Add Batch Normalization and Dropout for better gradient flow and reduced overfitting.
    • Use a learning rate scheduler for stable training.
  4. Key Takeaways:

    • Proper initialization and architecture design are crucial to prevent vanishing or exploding gradients.
    • Activation functions play a significant role in stabilizing training and improving model performance.