Neural Networks Exercises Correction
Exercise 1: Forward Pass Calculation
Step 1: Calculate the Weighted Sum
The formula for the weighted sum is:
$$z = Wx + b$$
Given:
- Perform the matrix multiplication $Wx$.
- Add the biases: $z = Wx + b$.
Thus, we obtain the weighted sum $z$.
Step 2: Apply the ReLU Activation Function
The ReLU (Rectified Linear Unit) activation function is defined as:
$$\text{ReLU}(z) = \max(0, z)$$
Apply ReLU element-wise to $z$.
Thus, the output after applying ReLU is:
Final Results
- Weighted Sum: $z = Wx + b$
- Activation Output: $a = \text{ReLU}(z)$
Additional Remarks for Students
- Matrix Multiplication Order: Ensure the order of operations is correct ($Wx$, not $xW$). Verify that dimensions are compatible: if $W$ is $m \times n$ and $x$ is $n \times 1$, the result is an $m \times 1$ vector.
- Bias Addition: Biases are added element-wise, so $b$ must have the same dimension as $Wx$.
- ReLU Behavior: The ReLU activation function outputs the input directly if it is positive; otherwise, it outputs zero.
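As a final sanity check, the whole computation fits in a few lines of NumPy. The values below are illustrative placeholders, since the exercise's actual $W$, $x$, and $b$ are not reproduced in this correction:

```python
import numpy as np

# Illustrative placeholders -- substitute the exercise's actual values.
W = np.array([[0.2, -0.5],
              [0.7,  0.1]])   # weight matrix (2 x 2)
x = np.array([1.0, 2.0])      # input vector
b = np.array([0.1, -0.2])     # bias vector

# Step 1: weighted sum z = Wx + b
z = W @ x + b

# Step 2: ReLU applied element-wise
a = np.maximum(0, z)

print("z =", z)   # weighted sum
print("a =", a)   # activation output
```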
Exercise 2: Backpropagation
Step 1: Forward Pass
Hidden Layer Activation
The formula for the hidden layer’s weighted sum and activation is:
$$z^{(1)} = W^{(1)} x + b^{(1)}, \qquad h = \text{ReLU}(z^{(1)})$$
Given:
Compute $z^{(1)}$:
Apply ReLU:
Hidden Layer Output:
Output Layer Activation
The output layer applies the formula:
$$z^{(2)} = W^{(2)} h + b^{(2)}, \qquad \hat{y} = \sigma(z^{(2)})$$
Given:
Compute $z^{(2)}$:
Apply Sigmoid:
Output Layer Result:
Step 2: Calculate Binary Cross-Entropy Loss
The binary cross-entropy loss is:
$$L = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$
Given:
Compute $L$:
Loss:
Step 3: Backpropagation
Gradient for $W^{(2)}$
The gradient of the loss with respect to $W^{(2)}$ follows from the chain rule. With a sigmoid output and binary cross-entropy loss, the error term simplifies to $\hat{y} - y$:
$$\frac{\partial L}{\partial W^{(2)}} = (\hat{y} - y)\, h^{\top}$$
Compute $\hat{y} - y$:
Compute $\frac{\partial L}{\partial W^{(2)}}$:
Gradient for $b^{(2)}$
$$\frac{\partial L}{\partial b^{(2)}} = \hat{y} - y$$
Gradient for $W^{(1)}$
The gradient of the loss with respect to $W^{(1)}$ requires propagating the error back through the hidden layer:
Step 1: Compute $\frac{\partial L}{\partial h} = W^{(2)\top} (\hat{y} - y)$
Step 2: Apply the ReLU derivative, $\frac{\partial h}{\partial z^{(1)}} = \mathbb{1}[z^{(1)} > 0]$
Since every entry of $z^{(1)}$ is negative, this derivative is zero everywhere.
Step 3: Compute $\frac{\partial L}{\partial z^{(1)}} = \frac{\partial L}{\partial h} \odot \mathbb{1}[z^{(1)} > 0] = 0$
Step 4: Compute $\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial z^{(1)}}\, x^{\top} = 0$
Gradient for $b^{(1)}$
$$\frac{\partial L}{\partial b^{(1)}} = \frac{\partial L}{\partial z^{(1)}} = 0$$
Final Results
- Forward Pass:
- Binary Cross-Entropy Loss:
- Gradients:
Additional Remarks for Students
- ReLU Activation’s Impact: Since all values in $z^{(1)}$ were negative, the hidden layer output was completely zero, leading to no gradient flow back to $W^{(1)}$ and $b^{(1)}$.
- Numerical Stability: Always check intermediate values during calculations to avoid numerical issues, particularly with Sigmoid or softmax activations.
- Chain Rule in Backpropagation: Gradients are computed step-by-step using the chain rule, highlighting dependencies between layers.
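The full forward and backward pass can be checked in a few lines of NumPy; all parameter values below are illustrative stand-ins, not the exercise's:

```python
import numpy as np

# Illustrative parameters -- substitute the exercise's actual values.
x  = np.array([0.5, -1.0])
W1 = np.array([[-0.4, 0.2],
               [-0.3, -0.6]])
b1 = np.array([-0.1, -0.2])
W2 = np.array([[0.7, -0.5]])
b2 = np.array([0.3])
y  = 1.0                               # target label

# Forward pass
z1 = W1 @ x + b1
h  = np.maximum(0, z1)                 # ReLU
z2 = W2 @ h + b2
y_hat = 1 / (1 + np.exp(-z2))          # sigmoid

# Binary cross-entropy loss
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Backward pass (chain rule); sigmoid + BCE simplifies to (y_hat - y)
dz2 = y_hat - y
dW2 = np.outer(dz2, h)
db2 = dz2
dh  = W2.T @ dz2
dz1 = dh * (z1 > 0)                    # ReLU derivative: 1 where z1 > 0, else 0
dW1 = np.outer(dz1, x)
db1 = dz1
# When every entry of z1 is negative (as in the exercise), dz1 is all zeros,
# so dW1 and db1 vanish: no gradient flows back to the first layer.
```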
Exercise 3: Data Preprocessing
Step 1: Apply Min-Max Scaling
Min-Max scaling scales each feature to a specific range, typically $[0, 1]$. The formula is:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Given:
Step 1.1: Identify the minimum and maximum for each feature
- For feature 1: identify $x_{\min}$ and $x_{\max}$.
- For feature 2: identify $x_{\min}$ and $x_{\max}$.
- For feature 3: identify $x_{\min}$ and $x_{\max}$.
Step 1.2: Apply the formula for each feature
For feature 1:
For feature 2:
For feature 3:
Step 1.3: Perform the calculations
Min-Max Scaled Dataset:
Step 2: Standardize Features
Standardization rescales features to have zero mean and unit variance. The formula is:
$$x' = \frac{x - \mu}{\sigma}$$
Step 2.1: Calculate the mean ($\mu$) and standard deviation ($\sigma$) for each feature
For feature 1:
For feature 2:
For feature 3:
Step 2.2: Apply the formula for each feature
For feature 1:
For feature 2:
For feature 3:
Step 2.3: Perform the calculations
Standardized Dataset:
Step 3: Comparison of Approaches
Min-Max Scaling:
- Advantages:
- Ensures all features are in the same range (e.g., [0, 1]).
- Useful for algorithms sensitive to the scale of inputs (e.g., gradient descent, neural networks).
- Disadvantages:
- Sensitive to outliers, as extreme values can dominate the range and skew scaling.
Standardization:
- Advantages:
- Centers the data around zero and scales by variance, which is beneficial for distance-based algorithms (e.g., SVM, KNN).
- Less sensitive to outliers compared to Min-Max scaling.
- Disadvantages:
- Does not bound data to a specific range, which may not suit algorithms requiring normalized inputs.
When to Use:
- Min-Max Scaling: Use when the algorithm assumes data within a fixed range or when features have the same importance but different ranges (e.g., neural networks).
- Standardization: Use when the algorithm is distance-based or sensitive to variance and when outliers are present.
Final Results
Min-Max Scaled Dataset:
Standardized Dataset:
Additional Remarks for Students
- Always preprocess data before training a neural network, as inconsistent feature scales can hinder convergence.
- Experiment with both methods on a validation set to determine which works better for your specific model and dataset.
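Both transformations are one-liners in NumPy; the dataset below is an illustrative stand-in for the exercise's data:

```python
import numpy as np

# Rows are samples, columns are features (illustrative values).
X = np.array([[1.0, 200.0, 0.5],
              [2.0, 300.0, 0.1],
              [3.0, 400.0, 0.9],
              [4.0, 500.0, 0.3]])

# Min-Max scaling per feature: (x - min) / (max - min)
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization per feature: (x - mean) / std
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)    # every column now lies in [0, 1]
print(X_standard)  # every column now has mean 0 and unit variance
```

In practice, scikit-learn's `MinMaxScaler` and `StandardScaler` implement the same formulas behind a fit/transform interface, which also makes it easy to apply the training-set statistics to the test set.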
Exercise 4: Activation Functions
Step 1: Compute Outputs for Each Activation Function
ReLU (Rectified Linear Unit)
The ReLU function is defined as:
$$\text{ReLU}(x) = \max(0, x)$$
Given: $x = [-2, -1, 0, 1, 2]$
Compute ReLU for each value in $x$:
ReLU Outputs: $[0, 0, 0, 1, 2]$
Leaky ReLU
The Leaky ReLU function introduces a small slope $\alpha$ for negative inputs:
$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases}$$
Given: $\alpha = 0.01$
Compute Leaky ReLU for each value in $x$:
Leaky ReLU Outputs: $[-0.02, -0.01, 0, 1, 2]$
Sigmoid
The Sigmoid function squashes inputs into the range $[0, 1]$:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Given: the same inputs $x$
Compute Sigmoid for each value in $x$:
Sigmoid Outputs: $[0.119, 0.269, 0.5, 0.731, 0.881]$
Tanh (Hyperbolic Tangent)
The Tanh function squashes inputs into the range $[-1, 1]$:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Given: the same inputs $x$
Compute Tanh for each value in $x$:
Tanh Outputs: $[-0.964, -0.761, 0, 0.761, 0.964]$
Step 2: Sketch the Graphs
Graphs of the functions show their unique behavior:
- ReLU: Passes positive inputs directly; outputs 0 for negative inputs.
- Leaky ReLU: Similar to ReLU but allows a small negative slope for negative inputs.
- Sigmoid: Smoothly maps inputs into $[0, 1]$, with a steep slope near 0.
- Tanh: Smoothly maps inputs into $[-1, 1]$, with a steep slope near 0.
Step 3: Compare Advantages and Disadvantages
ReLU
- Advantages:
- Computationally efficient.
- Helps mitigate the vanishing gradient problem for positive inputs.
- Disadvantages:
- Suffers from the “dying ReLU” problem where some neurons output 0 and stop learning (no gradient flow).
Leaky ReLU
- Advantages:
- Addresses the “dying ReLU” problem by allowing small gradients for negative inputs.
- Simple modification of ReLU with minimal computational cost.
- Disadvantages:
- The slope parameter $\alpha$ may need to be tuned.
Sigmoid
- Advantages:
- Outputs always lie in $[0, 1]$, making it useful for binary classification.
- Differentiable and smooth.
- Disadvantages:
- Suffers from the vanishing gradient problem for large positive or negative inputs.
- Outputs not zero-centered, which may slow down convergence.
Tanh
- Advantages:
- Outputs are zero-centered, aiding convergence during training.
- Works well when inputs need to be scaled to $[-1, 1]$.
- Disadvantages:
- Suffers from the vanishing gradient problem for large positive or negative inputs.
- Computationally more expensive than ReLU.
Final Results
- Outputs:
| Input $x$ | ReLU | Leaky ReLU | Sigmoid | Tanh |
|---|---|---|---|---|
| -2 | 0 | -0.02 | 0.119 | -0.964 |
| -1 | 0 | -0.01 | 0.269 | -0.761 |
| 0 | 0 | 0 | 0.5 | 0 |
| 1 | 1 | 1 | 0.731 | 0.761 |
| 2 | 2 | 2 | 0.881 | 0.964 |
Graph Descriptions:
- ReLU: Linear increase for $x > 0$, flat at 0 for $x \le 0$.
- Leaky ReLU: Similar to ReLU but with a slight negative slope for $x < 0$.
- Sigmoid: S-shaped curve bounded between 0 and 1.
- Tanh: S-shaped curve bounded between -1 and 1.
Usage Recommendations:
- ReLU: Default for most hidden layers in deep networks.
- Leaky ReLU: Use when ReLU suffers from the “dying ReLU” problem.
- Sigmoid: Suitable for binary outputs, particularly in the last layer.
- Tanh: Useful for zero-centered outputs, especially in intermediate layers.
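The output table above can be reproduced with a few lines of NumPy:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
alpha = 0.01  # Leaky ReLU slope used in this exercise

relu       = np.maximum(0, x)
leaky_relu = np.where(x > 0, x, alpha * x)
sigmoid    = 1 / (1 + np.exp(-x))
tanh       = np.tanh(x)

for name, out in [("ReLU", relu), ("Leaky ReLU", leaky_relu),
                  ("Sigmoid", sigmoid), ("Tanh", tanh)]:
    print(f"{name:>10}: {np.round(out, 3)}")
```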
Exercise 5: Gradient Checking
Step 1: Analytical Gradient Calculation
The loss function is:
The predicted value is:
The analytical gradient of $L$ with respect to $w$ is obtained via the chain rule.
Step 1.1: Compute Partial Derivatives
- The derivative of $L$ with respect to $\hat{y}$:
- The derivative of $\hat{y}$ with respect to $w$:
Step 1.2: Combine the Results
By the chain rule:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w}$$
Step 1.3: Substitute Values
Given:
Compute $\hat{y}$:
Substitute into the gradient formula:
Analytical Gradient:
Step 2: Numerical Gradient Approximation
The numerical approximation for the gradient (central difference) is:
$$\frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon}$$
Step 2.1: Compute $L(w + \epsilon)$
Let $\epsilon$ be the small perturbation given in the exercise.
- For $w + \epsilon$: evaluate the prediction and the loss.
Step 2.2: Compute $L(w - \epsilon)$
- For $w - \epsilon$: evaluate the prediction and the loss.
Step 2.3: Compute the Numerical Gradient
Numerical Gradient:
Step 3: Comparison of Results
- Analytical Gradient:
- Numerical Gradient:
The two results agree to numerical precision, confirming that the analytical gradient is correct.
Step 4: Discussion
Why Perform Gradient Checking?
Gradient checking is a way to verify that the backpropagation implementation is correct by comparing the analytical gradients to numerical approximations.
Accuracy of the Results
The two results match, meaning the analytical gradient calculation and implementation of backpropagation for this simple case are likely correct.
Potential Differences
If there were a discrepancy between the two results, possible causes could include:
- Implementation Errors: Mistakes in the analytical gradient derivation or coding.
- Finite-Difference Approximation: Numerical gradients are approximations and may suffer from rounding errors if $\epsilon$ is too large or too small.
- Non-smooth Functions: Numerical gradient approximation may struggle with non-smooth functions (e.g., ReLU).
Practical Advice
- Use gradient checking for debugging during the implementation phase, but avoid using it during regular training as it is computationally expensive.
- Ensure $\epsilon$ is small enough to minimize approximation error but not so small that it causes numerical instability.
Final Results
Analytical Gradient:
Numerical Gradient:
Comparison:
- The results agree, confirming correctness.
Key Takeaways:
- Gradient checking is a crucial debugging tool for verifying backpropagation.
- Analytical gradients are computationally efficient and precise, while numerical gradients are used as a validation benchmark.
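A minimal gradient-checking sketch, assuming a squared-error loss and a linear model $\hat{y} = wx + b$ for illustration (the exercise's exact loss and values are not reproduced here):

```python
import numpy as np

def predict(w, b, x):
    return w * x + b

def loss(w, b, x, y):
    return 0.5 * (predict(w, b, x) - y) ** 2   # squared-error loss (assumed)

w, b, x, y = 0.5, 0.1, 2.0, 1.0                # illustrative values
eps = 1e-5

# Analytical gradient via the chain rule: dL/dw = (y_hat - y) * x
analytical = (predict(w, b, x) - y) * x

# Numerical gradient via central differences
numerical = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

print(analytical, numerical)
assert np.isclose(analytical, numerical, atol=1e-6)   # should agree closely
```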
Exercise 6: Regularization
Step 1: Compute the Weight Penalty Term
The L2 regularization penalty is given by:
$$\Omega = \lambda \sum_{i} w_i^2$$
Given:
Step 1.1: Compute $\sum_i w_i^2$
Step 1.2: Compute the penalty $\lambda \sum_i w_i^2$
Weight Penalty Term:
Step 2: Update the Weights Using Gradient Descent
In gradient descent with L2 regularization, the weight update rule is:
$$w_i \leftarrow w_i - \eta \left( \frac{\partial L}{\partial w_i} + 2 \lambda w_i \right)$$
Assumptions for This Exercise
- The gradient of the loss with respect to the weights ($\partial L / \partial w_i$) is not provided, so an example value is used.
Step 2.1: Compute Regularization Term
The regularization term for each weight is:
$$2 \lambda w_i$$
Step 2.2: Update Each Weight
For each weight $w_i$, apply the update rule above:
- For $w_1$:
- For $w_2$:
- For $w_3$:
Updated Weights:
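A short sketch of this update step; the weights, gradient, $\lambda$, and learning rate below are illustrative assumptions, since the exercise's numbers are not reproduced here:

```python
import numpy as np

w    = np.array([0.5, -0.3, 0.8])   # current weights (illustrative)
grad = np.array([0.1, 0.1, 0.1])    # example loss gradient (assumed, as above)
lam  = 0.01                          # regularization strength lambda
lr   = 0.1                           # learning rate eta

penalty = lam * np.sum(w ** 2)            # L2 penalty term
w_new = w - lr * (grad + 2 * lam * w)     # update rule with the 2*lambda*w term

print("penalty:", penalty)
print("updated weights:", w_new)
```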
Step 3: Discuss How Regularization Affects Model Training
Purpose of L2 Regularization
L2 regularization, also known as weight decay, adds a penalty proportional to the square of the weights to the loss function. This encourages the model to:
- Keep weights small.
- Reduce overfitting by discouraging reliance on large weights, which may cause the model to memorize the training data.
Impact on Weight Updates
- Smaller Weights: The regularization term $2\lambda w_i$ reduces the magnitude of the weights during updates.
- Regularization Tradeoff: The penalty balances the model’s goal of minimizing the loss against keeping weights small. A larger $\lambda$ increases this effect, potentially leading to underfitting if too strong.
When to Use Regularization
- Use L2 Regularization When:
- Overfitting is a concern.
- Input features have varying scales (in conjunction with feature scaling).
- Avoid Overregularization:
- If $\lambda$ is too large, the model may fail to capture important patterns in the data.
Final Results
Weight Penalty Term:
Updated Weights:
Key Takeaways:
- Regularization reduces overfitting by penalizing large weights.
- Weight updates incorporate both the gradient of the loss and the regularization term.
- Choosing the regularization strength $\lambda$ is crucial to balance bias and variance in the model.
Exercise 7: Neural Network Error Analysis
Step 1: Calculate the Loss
The binary cross-entropy loss is given by:
$$L = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$
Given:
Step 1.1: Substitute Values
Step 1.2: Compute the Logarithm
Binary Cross-Entropy Loss:
Step 2: Determine Potential Issues with Weight Initialization
Weight initialization affects gradient flow, influencing whether gradients vanish (become too small) or explode (become too large). This can lead to slow learning or unstable training.
Step 2.1: Analyze Current Initialization
Hidden-layer weights:
- Positive and negative values balance the signal, which is good.
- The range (from $-0.3$ to $0.8$) could still introduce some instability during backpropagation, depending on the input scale.
Output-layer weights:
Biases:
Step 2.2: Analyze Activation Functions
ReLU in the Hidden Layer:
- ReLU outputs zero for negative inputs, which can cause the “dying ReLU” problem (where neurons output 0 and stop learning if weights and inputs align poorly).
- The weights in the hidden layer could lead to this issue for certain inputs.
Sigmoid in the Output Layer:
- Sigmoid squashes outputs to $[0, 1]$. For inputs of large magnitude, its gradient is close to zero, which can contribute to the vanishing gradient problem during backpropagation.
Conclusion:
The current weight initialization and activation functions are not inherently problematic but could contribute to:
- Vanishing Gradients: Due to Sigmoid activation and potentially imbalanced weights.
- Dying ReLU Neurons: If initial weights and inputs lead to zero activations in the hidden layer.
Step 3: Propose Changes to Improve Performance
1. Adjust Weight Initialization
Why?
- Proper initialization ensures balanced gradients and stable training.
- Avoid very large or very small initial weights to prevent exploding or vanishing gradients.
Proposed Method:
- Use Xavier Initialization (also called Glorot Initialization):
- For weights $W$, initialize as:
$$W \sim \mathcal{U}\!\left(-\sqrt{\tfrac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\tfrac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$$
where $n_{\text{in}}$ and $n_{\text{out}}$ are the number of input and output neurons for the layer.
- Alternatively, use He Initialization (designed for ReLU):
- For weights $W$, initialize as:
$$W \sim \mathcal{N}\!\left(0,\ \tfrac{2}{n_{\text{in}}}\right)$$
Both schemes are sketched in code below.
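A minimal NumPy sketch of both initialization schemes (layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def xavier_uniform(n_in, n_out):
    # Xavier/Glorot: uniform in [-limit, limit], limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_normal(n_in, n_out):
    # He: zero-mean normal with variance 2 / n_in, suited to ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W1 = he_normal(3, 4)        # ReLU hidden layer
W2 = xavier_uniform(4, 1)   # sigmoid output layer
```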
2. Replace Sigmoid Activation in the Output Layer
Why?
- Sigmoid activations can cause vanishing gradients for large inputs.
Proposed Change:
- Use Softmax for multi-class classification or Tanh for zero-centered outputs. However, for binary classification:
- Retain Sigmoid, but ensure inputs are scaled to avoid saturation.
3. Modify Network Architecture
Why?
- Deeper networks often benefit from normalization and regularization layers.
Proposed Changes:
- Add Batch Normalization after the hidden layer to normalize activations, improving gradient flow and reducing dependence on initialization.
- Add Dropout to mitigate overfitting. (Both changes are sketched below.)
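A PyTorch sketch of these two additions, assuming PyTorch is available; the layer sizes and dropout rate are illustrative, not taken from the exercise:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3, 16),     # hidden layer (sizes illustrative)
    nn.BatchNorm1d(16),   # normalize activations to improve gradient flow
    nn.ReLU(),
    nn.Dropout(p=0.2),    # randomly zero 20% of activations to reduce overfitting
    nn.Linear(16, 1),
    nn.Sigmoid(),         # probability output for binary classification
)
```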
4. Experiment with Learning Rate
Why?
- A high learning rate can cause gradients to explode, while a low learning rate can lead to vanishing updates.
Proposed Change:
- Use a learning rate scheduler to adapt the learning rate during training, as sketched below.
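For example, a step-decay schedule in PyTorch (the model, data, and decay parameters below are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

x = torch.randn(8, 3)                       # dummy batch
y = torch.randint(0, 2, (8, 1)).float()     # dummy binary targets

for epoch in range(60):
    optimizer.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()   # learning rate drops 10x at epochs 20 and 40
```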
Final Results
Loss Calculation:
Weight Initialization Analysis:
- Current weights and biases are moderately scaled but could still cause vanishing gradients (due to Sigmoid) or dying ReLU neurons.
- Issues are not severe but may worsen as the network becomes deeper.
Proposed Improvements:
- Adjust weight initialization with Xavier or He Initialization.
- Replace Sigmoid activation or normalize inputs to avoid saturation.
- Add Batch Normalization and Dropout for better gradient flow and reduced overfitting.
- Use a learning rate scheduler for stable training.
Key Takeaways:
- Proper initialization and architecture design are crucial to prevent vanishing or exploding gradients.
- Activation functions play a significant role in stabilizing training and improving model performance.