Paris-Saclay University

Introduction to Machine Learning

Foundations

Summary

  • What is Machine Learning?
  • Types of Machine Learning Tasks
  • Numpy and Pandas
  • Fundamentals of Probability in ML
  • Workflow and Data Preprocessing Techniques
  • Feature Engineering
  • Data Visualization Techniques
  • Conclusion
  • Glossary

What is Machine Learning?

What is Machine Learning?

Machine Learning (ML) is a subset of artificial intelligence that focuses on enabling computers to learn from data and improve over time without explicit programming.

  • Learning from data
  • Making predictions or decisions
  • Improving performance through experience

Why Machine Learning?

  • Motivation: Handling large volumes of data, automating model building, uncovering hidden patterns

Real-world Applications

  • Healthcare: Predicting patient outcomes, disease prediction
  • Finance: Algorithmic trading, fraud detection
  • E-commerce: Personalized recommendations
  • Transportation: Self-driving cars
  • Social Media: Content curation, recommendations
  • Technology: Voice assistants

Historical Context and Evolution of ML

Understanding where machine learning fits within the evolution of computing and AI:

  • 1950s: Alan Turing proposes the idea of machines that learn (Turing Test).
  • 1960s-1980s: Early rule-based systems and symbolic AI.
  • 1980s-1990s: Rise of statistical approaches (e.g., decision trees, SVMs).
  • 2000s: Explosion of data (Big Data) and computational power enable larger models.
  • 2010s: Deep learning revolutionizes fields like image recognition and NLP.
  • Today: ML powers applications in almost every industry, from healthcare to finance.

Machine learning has evolved from basic algorithms to sophisticated models shaping modern technology.

Ethical and Social Considerations in ML

Machine learning can have profound social impacts. Key considerations include:

  • Fairness: Ensuring ML models do not discriminate against certain groups.
  • Bias: Recognizing and mitigating biases in training data and algorithms.
  • Privacy: Protecting user data when training and deploying ML models.
  • Transparency: Making models interpretable and decisions explainable.
  • Accountability: Determining who is responsible for the outcomes of ML systems.

These issues are essential for deploying ML responsibly and building trust in AI systems.

Limitations of Machine Learning

While powerful, ML has inherent challenges:

  • Data Dependency: ML models require high-quality, large-scale data.
  • Interpretability: Complex models (e.g., deep learning) can be hard to understand.
  • Overfitting: Models may perform well on training data but fail to generalize.
  • Resource Intensive: Training large models can be computationally and energy expensive.
  • Limited Generalization: ML struggles with tasks outside its training data (e.g., edge cases).

Recognizing these limitations is crucial for effectively using ML in real-world applications.

Types of Machine Learning Tasks

Overview of ML Tasks

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

Supervised Learning

Learning from labeled data to make predictions.

  • Types:
    • Classification: Predict categorical outcomes
    • Regression: Predict continuous outcomes
  • Examples: Spam detection, stock price prediction

Unsupervised Learning

Discovering patterns in unlabeled data.

  • Types:
    • Clustering
    • Dimensionality Reduction
  • Examples: Customer segmentation, anomaly detection

Reinforcement Learning

Learning by interacting with an environment to maximize cumulative rewards.

  • Core Elements:
    • Agent: The decision-maker
    • Actions: Choices made by the agent
    • Rewards: Feedback for actions
  • Examples: Game playing (e.g., AlphaGo), robotics, self-driving cars
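
A minimal sketch of the agent/action/reward loop, using a toy two-armed bandit with an epsilon-greedy strategy (all values are hypothetical, purely for illustration):

import numpy as np
rng = np.random.default_rng(0)
true_rewards = [0.3, 0.7]          # hypothetical success probability of each action
estimates, counts = np.zeros(2), np.zeros(2)
epsilon = 0.1                      # exploration rate
for step in range(1000):
    # Agent: explore with probability epsilon, otherwise exploit the best estimate
    action = rng.integers(2) if rng.random() < epsilon else int(np.argmax(estimates))
    reward = rng.random() < true_rewards[action]   # environment feedback
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]
print("Estimated action values:", estimates)       # the agent learns action 1 pays more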

Classification

Assigning inputs to predefined categories.

  • Use Cases: Email spam vs. not spam, image recognition (e.g., cats vs. dogs)
Figure: decision boundary for classification
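
A minimal classification sketch with scikit-learn (hypothetical features and labels):

from sklearn.tree import DecisionTreeClassifier
X = [[180, 8], [160, 4], [175, 7], [150, 3]]   # hypothetical features (e.g., size, tail length)
y = ['dog', 'cat', 'dog', 'cat']               # categorical labels
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[170, 6]]))                 # predicted category for a new input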

Regression

Predicting a continuous numerical value.

  • Use Cases: House price prediction, forecasting sales
Figure: scatter plot with a regression line
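
A minimal regression sketch with scikit-learn (hypothetical sizes and prices):

from sklearn.linear_model import LinearRegression
X = [[50], [80], [120], [200]]     # hypothetical house sizes (m^2)
y = [150, 240, 350, 580]           # hypothetical prices (in thousands)
reg = LinearRegression().fit(X, y)
print(reg.predict([[100]]))        # predicted price for a 100 m^2 house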

Clustering

Grouping similar data points without predefined labels.

  • Use Cases: Market segmentation, document classification
Figure: scatter plot showing clusters
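
A minimal clustering sketch with scikit-learn (hypothetical 2D points, no labels):

from sklearn.cluster import KMeans
X = [[1, 1], [1.5, 2], [8, 8], [8.5, 9], [0.5, 1.5], [9, 8.5]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster assignment for each point, learned without labels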

Recommendation Systems

Predicting user preferences to recommend items.

  • Types:
    • Collaborative Filtering
    • Content-based Filtering
  • Examples: Deezer, Netflix, Amazon
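
A minimal user-based collaborative filtering sketch using cosine similarity (hypothetical ratings):

import numpy as np
# Hypothetical user-item rating matrix (rows: users, columns: items, 0 = unrated)
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4]])
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
# Find the user most similar to user 0; their highly rated items become candidates
sims = [cosine(ratings[0], ratings[i]) for i in range(1, 3)]
most_similar = 1 + int(np.argmax(sims))
print("Most similar user:", most_similar)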

Numpy and Pandas

Introduction to NumPy

NumPy (Numerical Python): The foundation for machine learning in Python

  • Core data structure: ndarray (N-dimensional array)
  • Essential features for ML:
    • Matrix operations for neural networks
    • Efficient numerical computations
    • Statistical functions for data preprocessing
    • Random sampling for train/test splits
    • Linear algebra for feature transformations
  • Integration with major ML libraries (scikit-learn, TensorFlow, PyTorch)

ndarray: The Building Block of ML

  • Why crucial for ML:
    • Efficient storage of large datasets
    • Fast matrix operations for model training
    • Memory-efficient data types for large-scale ML
import numpy as np
# Create feature matrix and labels
X = np.array(
    [[1, 2, 3], # sample 1
     [4, 5, 6], # sample 2
     [7, 8, 9]] # sample 3
)  # Feature matrix (3 samples, 3 features -> 9 elements)
y = np.array([0, 1, 1])  # Labels
# Convert types (common in ML preprocessing)
X = X.astype(float)  # Convert to float for ML algorithms
# [[1. 2. 3.]
#  [4. 5. 6.]
#  [7. 8. 9.]]

⚠️ All the code examples of this course can be found here.

Essential NumPy Operations for ML

  • Matrix operations for neural networks
  • Statistical operations for feature scaling
  • Shape manipulation for batch processing
import numpy as np
# Feature matrix from the previous slide (3 samples, 3 features)
X = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
# Matrix multiplication (common in neural networks)
weights = np.random.randn(3, 2)
layer_output = np.dot(X, weights)
# Feature scaling (standardization)
X_normalized = (X - X.mean(axis=0)) / X.std(axis=0)
# Reshape into mini-batches (the number of samples must be divisible by batch_size)
batch_size = 1
X_batches = X.reshape(-1, batch_size, X.shape[1])

NumPy Universal Functions for ML

  • Essential operations for model implementation:
    • Activation functions: np.exp() for softmax
    • Loss calculations: np.log() for cross-entropy
    • Metrics: np.mean(), np.sum()
import numpy as np
# Softmax activation
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
# Binary cross-entropy calculation
def binary_cross_entropy(y_true, y_pred):
    return -np.mean(y_true * np.log(y_pred) + 
                   (1 - y_true) * np.log(1 - y_pred))
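
For example, applied to hypothetical logits and predictions:

logits = np.array([[2.0, 1.0, 0.1]])
print(softmax(logits))                                          # probabilities summing to 1
print(binary_cross_entropy(np.array([1.0]), np.array([0.9])))   # about 0.105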

Introduction to Pandas for ML

Pandas: Essential for ML data preprocessing and feature engineering

  • Key ML applications:
    • Loading and cleaning datasets
    • Feature engineering and selection
    • Handling missing values
    • Categorical variable encoding

DataFrames: ML Data Preparation

import pandas as pd
import numpy as np
# Load and prepare ML dataset
df = pd.DataFrame({
    'feature1': [1, 2, np.nan, 4],
    'feature2': ['A', 'B', 'A', 'C'],
    'target': [0, 1, 1, 0]
})
# Handle missing values (mean imputation)
df['feature1'] = df['feature1'].fillna(df['feature1'].mean())
# Encode categorical variables
df_encoded = pd.get_dummies(df, columns=['feature2'])

ML-Specific Pandas Operations

  • Feature engineering techniques:
    • Creating interaction features
    • Time-based feature extraction
    • Statistical aggregations
import pandas as pd
from sklearn.model_selection import train_test_split
# Interaction feature (both columns must be numerical)
df['interaction'] = df['feature1'] * df['feature2']
# Stratified train/test split (preserves class proportions)
train_df, test_df = train_test_split(df, train_size=0.8, stratify=df['target'])
# Statistical features
df['rolling_mean'] = df['feature1'].rolling(window=3).mean()

From Pandas to NumPy for ML

  • Converting preprocessed data to ML-ready format
  • Splitting features and targets
  • Final preprocessing steps
import pandas as pd
import numpy as np
# Convert DataFrame to NumPy arrays
X = df_encoded.drop('target', axis=1).to_numpy()
y = df_encoded['target'].to_numpy()
# Final scaling for ML
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Ready for ML algorithms
print("Feature matrix shape:", X_scaled.shape)
print("Target vector shape:", y.shape)

Selecting Specific Data Types

Use select_dtypes to filter columns by data type:

import pandas as pd
# Create a sample DataFrame
data = pd.DataFrame({
    'numerical': [1, 2, 3],
    'categorical': ['A', 'B', 'C'],
    'boolean': [True, False, True]
})
# Select only numerical columns
numerical_data = data.select_dtypes(include=['number'])
# Select only categorical columns
categorical_data = data.select_dtypes(include=['object'])
print("Numerical Columns:\n", numerical_data)
print("Categorical Columns:\n", categorical_data)

Why? Useful for applying operations to specific types of data (e.g., scaling numerical features).

Summary Statistics and Absolute Values

  • median(): Compute the median (middle value).
  • std(): Calculate standard deviation (measure of spread).
  • np.abs(): Compute absolute values of numerical data.
import pandas as pd
import numpy as np
# Sample data
data = pd.DataFrame({'values': [-10, 20, -30, 40, -50]})
# Median
median_value = data['values'].median()
# Standard deviation
std_value = data['values'].std()
# Absolute values
absolute_values = np.abs(data['values'])
print("Median:", median_value)
print("Standard Deviation:", std_value)
print("Absolute Values:\n", absolute_values)

Why? These functions are essential for understanding data distributions and normalizing values.

Deep Copy vs. Shallow Copy

Use df.copy() to create a true (deep) copy of a DataFrame:

import pandas as pd
# Sample DataFrame
data = pd.DataFrame({'values': [1, 2, 3]})
# Plain assignment creates a reference to the same object (not an independent copy)
shallow_copy = data
# Deep copy (independent of original)
deep_copy = data.copy()
# Modify original
data.loc[0, 'values'] = 999
print("Original:\n", data)
print("Shallow Copy:\n", shallow_copy)  # Changes with original
print("Deep Copy:\n", deep_copy)        # Stays unchanged

Why? Use df.copy() to avoid unintended modifications to the original DataFrame.

Combining DataFrames

Use pd.concat to combine DataFrames vertically or horizontally:

import pandas as pd
# Sample DataFrames
data1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
data2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Concatenate vertically (default)
vertical_concat = pd.concat([data1, data2])
# Concatenate horizontally
horizontal_concat = pd.concat([data1, data2], axis=1)
print("Vertical Concatenation:\n", vertical_concat)
print("Horizontal Concatenation:\n", horizontal_concat)

Why? pd.concat is ideal for combining datasets during preprocessing.

Scikit-learn Basics

  • Preprocessing Tools: Imputation, scaling, encoding.
  • Algorithms: Classification, regression, clustering.
  • Model Evaluation: Cross-validation, evaluation metrics.
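
A minimal end-to-end sketch combining these pieces (a hypothetical pipeline on scikit-learn's built-in iris dataset; imputation is included for illustration even though iris has no missing values):

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
# Preprocessing (imputation, scaling) chained with a classifier
pipeline = make_pipeline(SimpleImputer(strategy='mean'),
                         StandardScaler(),
                         LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)   # 5-fold cross-validation
print("Mean accuracy:", scores.mean())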

Fundamentals of Probability in ML

Importance of Probability in ML

Probabilities quantify the uncertainty in predictions.

  • Understanding uncertainty
  • Working with probability distributions
  • Calculating conditional probabilities

Random Variables and Distributions

A random variable is a variable whose value is subject to variations due to chance.

  • Discrete: Binomial distribution
  • Continuous: Normal distribution
Figures: probability density function of the normal distribution; probability mass function of the binomial distribution

Expected Value, Variance, and Standard Deviation

  • Expected Value (Mean): Average outcome of a random variable.
  • Variance: Measure of how much values differ from the mean.
  • Standard Deviation:
    • Square root of variance.
    • Indicates the spread of data around the mean.
    • Useful for understanding the uncertainty in probability distributions.
import numpy as np
# Sample random variable data
random_variable = np.array([1, 2, 3, 4, 5])
# Expected Value (Mean)
expected_value = np.mean(random_variable)
# Variance
variance = np.var(random_variable)
# Standard Deviation
std_dev = np.std(random_variable)
print(f"Expected Value (Mean): {expected_value}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")

Why is it important? In probability distributions, standard deviation quantifies the uncertainty:

  • Low standard deviation: Data points are close to the mean (narrow spread).
  • High standard deviation: Data points are widely spread (greater uncertainty).

Conditional Probability

The probability of an event A given that event B has occurred.

  • Formula: \( P(A|B) = \frac{P(A \cap B)}{P(B)} \)
  • Key Concepts:
    • Joint Probability: \( P(A \cap B) \): The likelihood of A and B happening together.
    • Marginal Probability: \( P(B) \): The likelihood of event B happening.
  • Application in ML:
    • Naive Bayes classifier
    • Bayesian networks
    • Predictive models with probabilistic outputs
  • Assumptions in ML:
    • Naive Bayes assumes conditional independence among features.
    • Bayesian networks capture conditional dependencies.
# Example of Conditional Probability
# P(A|B) = P(A and B) / P(B)
p_a_and_b = 0.3
p_b = 0.6
p_a_given_b = p_a_and_b / p_b
print(f"P(A|B): {p_a_given_b}")

Joint and Marginal Probabilities

  • Joint Probability: The probability of two events occurring simultaneously.
    • \( P(A \cap B) \): Probability of both A and B.
    • Example: Probability of rain and carrying an umbrella.
  • Marginal Probability: The probability of a single event occurring, irrespective of others.
    • \( P(B) \): Probability of B happening.
    • Example: Probability of rain regardless of carrying an umbrella.
Figure: sample from a multivariate normal distribution
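
A small numerical sketch of the rain/umbrella example (hypothetical probabilities):

import numpy as np
# Hypothetical joint distribution P(rain, umbrella) as a 2x2 table
#                 umbrella=yes  umbrella=no
joint = np.array([[0.25,        0.05],    # rain=yes
                  [0.15,        0.55]])   # rain=no
p_rain = joint[0].sum()            # marginal P(rain) = 0.30
p_umbrella = joint[:, 0].sum()     # marginal P(umbrella) = 0.40
p_rain_and_umbrella = joint[0, 0]  # joint P(rain and umbrella) = 0.25
p_rain_given_umbrella = p_rain_and_umbrella / p_umbrella   # conditional = 0.625
print(p_rain, p_umbrella, p_rain_given_umbrella)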

Bayes' Theorem

  • Formula: \( P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \)
  • Key Terms:
    • \( P(A) \): Prior probability (before observing B).
    • \( P(B|A) \): Likelihood of B given A.
    • \( P(A|B) \): Posterior probability (updated belief after observing B).
  • Use Cases:
    • Updating beliefs with new evidence
    • Spam filtering
    • Medical diagnosis
  • Intuition: Think of Bayes' theorem like updating your belief about the weather (event A) after looking at the sky (evidence B).
# Bayes' Theorem Example
def bayes_theorem(p_a, p_b_given_a, p_b):
    return (p_b_given_a * p_a) / p_b
# Inputs
p_a = 0.2  # Prior probability of A
p_b_given_a = 0.8  # Likelihood of B given A
p_b = 0.5  # Marginal probability of B
# Posterior
p_a_given_b = bayes_theorem(p_a, p_b_given_a, p_b)
print(f"P(A|B): {p_a_given_b}")

Naive Bayes Assumptions

While Naive Bayes is simple and effective, it relies on certain assumptions:

  • Conditional Independence:
    • Assumes features are conditionally independent given the target label.
    • Rarely holds true in real-world datasets (e.g., word dependencies in text).
  • Class Prior Accuracy:
    • Depends on accurate prior probabilities (\( P(A) \)) for each class.
    • Biased or imbalanced data can lead to poor performance.
  • Sensitivity to Feature Representation:
    • Performance depends on appropriate feature engineering.
    • Examples: Word frequencies in text, categorical encoding in structured data.
  • Key Insight:
    • Naive Bayes performs surprisingly well for high-dimensional data like text classification, even when the independence assumption does not hold, thanks to averaging effects across features.

Entropy and Information Gain

  • Entropy: A measure of uncertainty or randomness.
    • Formula: \( H(X) = -\sum P(x) \log_2 P(x) \)
    • Example: High entropy for a fair coin flip (50-50), low entropy for a biased coin (90-10).
  • Information Gain: Reduction in entropy after splitting the data.
  • Use:
    • Decision Trees:
      • Entropy measures the uncertainty in a dataset.
      • Information gain guides feature selection for splits.
    • Clustering:
      • Entropy measures the purity of clusters (e.g., in k-means).
      • Helps evaluate cluster quality during initialization or refinement.
    • Uncertainty in Predictions:
      • Entropy measures model confidence in classification tasks.
      • Used in probabilistic outputs like softmax in neural networks.
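
A minimal sketch of entropy and information gain for a perfectly separable split (hypothetical labels):

import numpy as np
def entropy(labels):
    # H(X) = -sum P(x) log2 P(x), estimated from label frequencies
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
parent = np.array([0, 0, 0, 1, 1, 1])                 # 50-50 labels: entropy = 1 bit
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])
info_gain = entropy(parent) \
    - (len(left) / len(parent)) * entropy(left) \
    - (len(right) / len(parent)) * entropy(right)
print(entropy(parent), info_gain)                     # 1.0 and 1.0 for this perfect split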

Confidence Intervals

A confidence interval quantifies the range within which a parameter lies with a certain probability.

  • Helps understand prediction reliability.
  • Widely used in regression and probabilistic models.
# Confidence Interval Example
import numpy as np
import scipy.stats as stats
data = [1, 2, 3, 4, 5]
mean = np.mean(data)
# Standard error uses the sample standard deviation (ddof=1)
conf_interval = stats.norm.interval(0.95, loc=mean, scale=np.std(data, ddof=1)/np.sqrt(len(data)))
print(f"95% Confidence Interval: {conf_interval}")

Key Mathematical Foundations for ML

Revisiting these topics will help you better understand machine learning concepts and algorithms:

  • Linear Algebra:
    • Vectors, matrices, and matrix operations
    • Eigenvalues and eigenvectors
    • Applications in dimensionality reduction (e.g., PCA)
  • Calculus:
    • Derivatives and gradients
    • Optimization techniques (e.g., gradient descent)
    • Applications in neural networks and backpropagation
  • Probability and Statistics:
    • Probability distributions (normal, binomial)
    • Conditional probability and Bayes' theorem
    • Applications in probabilistic models (e.g., Naive Bayes)

Consider reviewing these areas if they feel unfamiliar. They are integral to ML concepts and algorithms!
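
A small illustration of two of these ideas with NumPy (hypothetical numbers):

import numpy as np
# Linear algebra: eigen-decomposition of a hypothetical covariance matrix (as used by PCA)
cov = np.array([[2.0, 0.8], [0.8, 1.0]])
eigvals, eigvecs = np.linalg.eig(cov)
# Calculus: one gradient-descent step on f(w) = w^2 (gradient = 2w)
w, lr = 3.0, 0.1
w = w - lr * 2 * w        # move against the gradient
print(eigvals, w)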

Key Takeaways

  • Probability quantifies uncertainty and is central to ML predictions.
  • Random variables and distributions underpin probabilistic models.
  • Bayes' theorem updates beliefs and powers algorithms like Naive Bayes.
  • Entropy measures uncertainty; information gain drives decision tree splits.

Workflow and Data Preprocessing Techniques

Figure: ML workflow diagram

Data Collection

  • Sources:
    • Databases
    • APIs
    • Web scraping
  • Considerations:
    • Data quality
    • Volume and variety
    • Legal and ethical issues

Data Preprocessing

Ensuring data quality is critical to model performance.

  • Data cleaning
  • Preparation for analysis
  • Improving accuracy

Data Cleaning

  • Tasks:
    • Handling missing values
    • Removing duplicates
    • Correcting errors
  • Tools: Pandas functions like dropna(), fillna(), duplicated()

Handling Missing Values

  • Identify Missing Data: Use isnull() and sum() in Pandas.
  • Strategies:
    • Deletion: Listwise (drop rows), Pairwise (drop specific values)
    • Imputation: Mean/Median/Mode replacement, Forward/Backward fill, Interpolation

Missing Values - Code Example

# Identify missing values
missing_values = data.isnull().sum()
# Drop rows with missing values
data_clean = data.dropna()
# Impute missing values with mean
data['column'] = data['column'].fillna(data['column'].mean())
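
The fill and interpolation strategies listed above can be sketched as follows (hypothetical column):

import pandas as pd
import numpy as np
data = pd.DataFrame({'column': [1.0, np.nan, 3.0, np.nan, 5.0]})
data_ffill = data['column'].ffill()          # forward fill
data_bfill = data['column'].bfill()          # backward fill
data_interp = data['column'].interpolate()   # linear interpolation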

Removing Duplicates

  • Why? Duplicates can skew analysis and inflate model performance.
  • Tools: Use duplicated() and drop_duplicates() in Pandas.
# Identify duplicates
duplicates = data.duplicated()
# Remove duplicates
data_clean = data.drop_duplicates()

Feature Scaling

Scaling ensures features contribute equally to the model.

  • Normalization (Min-Max Scaling): Rescales features to [0, 1].
    • Useful when features have bounded ranges (e.g., pixel intensities, neural network inputs).
  • Standardization (Z-score Scaling): Centers features around mean 0 with standard deviation 1.
    • Useful for algorithms that assume centered, comparable-scale features (e.g., SVMs, logistic regression, PCA).

Feature Scaling - Code Example

from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Normalization
scaler = MinMaxScaler()
data_normalized = scaler.fit_transform(data)
# Standardization
scaler = StandardScaler()
data_standardized = scaler.fit_transform(data)

Encoding Categorical Variables

Categorical data must be converted to numerical format for machine learning algorithms.

  • Label Encoding: Assigns a unique number to each category.
  • One-Hot Encoding: Creates binary columns for each category.

Encoding - Code Example

# One-Hot Encoding with Pandas
data_encoded = pd.get_dummies(data, columns=['categorical_column'])
# Label Encoding with Scikit-learn
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['category_encoded'] = le.fit_transform(data['categorical_column'])

Algorithm cheat sheet

Figure: algorithm cheat sheet

Feature Engineering

What is Feature Engineering?

Feature engineering involves creating, transforming, or selecting features to improve model performance.

  • Improves model accuracy
  • Reduces complexity

Feature Engineering Techniques

  • Feature Creation: Creating new features based on domain knowledge
    (e.g., total_price = quantity * unit_price).
  • Feature Transformation: Applying transformations to handle skewed data (e.g., log transformation).
  • Feature Selection: Removing irrelevant or redundant features.
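
A brief sketch of feature creation and a log transformation with Pandas and NumPy (hypothetical columns):

import numpy as np
import pandas as pd
df = pd.DataFrame({'quantity': [2, 5, 1], 'unit_price': [10.0, 3.5, 120.0]})
# Feature creation from domain knowledge
df['total_price'] = df['quantity'] * df['unit_price']
# Feature transformation: log1p reduces the skew of long-tailed values
df['log_total_price'] = np.log1p(df['total_price'])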

Feature Engineering - Examples

  • Datetime Features: Extract day, month, year, or weekday from timestamps.
  • Text Data: Convert text to numerical vectors using techniques like TF-IDF.
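
A minimal sketch of both ideas (hypothetical timestamps and documents; TF-IDF via scikit-learn):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Datetime features
events = pd.DataFrame({'timestamp': pd.to_datetime(['2024-01-15', '2024-06-03'])})
events['month'] = events['timestamp'].dt.month
events['weekday'] = events['timestamp'].dt.weekday
# Text to numerical vectors with TF-IDF
docs = ['machine learning is fun', 'learning from data']
tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.shape)   # (number of documents, vocabulary size)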

Data Visualization Techniques

Importance of Data Visualization

Data visualization helps to understand data distributions, detect patterns, and spot anomalies.

  • Understand data distributions
  • Identify patterns and trends
  • Detect outliers

Common Visualization Plots

  • Histogram: Shows frequency distribution of a variable.
  • Scatter Plot: Visualizes the relationship between two numerical variables.
  • Box Plot: Displays summary statistics and outliers. Also called a box-and-whisker plot.
Figure: examples of a histogram, a scatter plot, and a box plot

Visualization Libraries

  • Matplotlib: Basic plotting library.
  • Seaborn: Built on Matplotlib with enhanced features for complex visualizations.

Histogram - Code Example

# Histogram using Matplotlib
import matplotlib.pyplot as plt
plt.hist(data['numerical_column'], bins=30)
plt.title('Histogram of Numerical Column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Scatter Plot - Code Example

# Scatter Plot using Seaborn
import seaborn as sns
sns.scatterplot(x='feature1', y='feature2', data=data)
plt.title('Feature1 vs. Feature2')
plt.show()

Box Plot - Code Example

# Box Plot using Seaborn
sns.boxplot(x='categorical_column', y='numerical_column', data=data)
plt.title('Numerical Column by Category')
plt.show()
Figure: box plot with details

Conclusion

Key Takeaways

  • Machine Learning: Core principles include learning from data, making predictions, and improving with experience.
  • Types of ML Tasks: Supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction), and reinforcement learning.
  • Tools: Mastery of libraries like NumPy, Pandas, and Scikit-learn is critical for data manipulation, preprocessing, and model building.
  • Probability in ML: Probability fundamentals are essential for understanding uncertainty, distributions, and algorithms like Naive Bayes.
  • Data Preprocessing: Techniques such as cleaning, scaling, encoding, and feature engineering significantly impact model performance.
  • Visualization: Effective data visualization with libraries like Matplotlib and Seaborn aids in understanding data patterns and distributions.
  • Workflow: A structured ML workflow—from problem definition to deployment—ensures efficiency and scalability.
  • Real-world Applications: ML impacts diverse domains such as healthcare, finance, e-commerce, and transportation.

Resources and Further Reading

  • Books:
    • Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka et al.
    • The Hundred-Page Machine Learning Book by Andriy Burkov
  • Online Tutorials: Pandas documentation, Scikit-learn tutorials
  • Documentation: Official library documentation

Glossary

General Concepts

  • Machine Learning (ML): A branch of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.
  • Artificial Intelligence (AI): The simulation of human intelligence in machines that are programmed to think and learn.
  • Deep Learning: A subset of ML that uses neural networks with many layers (deep neural networks) to model and solve complex problems.
  • Supervised Learning: A type of ML where the model is trained on labeled data to predict outputs for new inputs.
  • Unsupervised Learning: A type of ML that deals with unlabeled data to identify hidden patterns or structures.
  • Reinforcement Learning: A learning paradigm where an agent learns by interacting with its environment and receiving feedback in the form of rewards or penalties.
  • Model: A mathematical representation or algorithm trained on data to make predictions or decisions.
  • Algorithm: A set of rules or processes followed in problem-solving or computations, such as gradient descent or decision trees.
  • Standard Deviation: A measure of the spread of data around the mean, calculated as the square root of the variance. It quantifies uncertainty in probability distributions and is widely used in data analysis and ML.
  • Bias: Systematic error that skews results in one direction, often due to flawed assumptions.
  • Variance: The variability of model predictions for different datasets, contributing to overfitting if too high.

Data and Features

  • Dataset: A collection of data used to train and evaluate ML models.
  • Sample: A single data point or instance from a dataset used for analysis, training, or evaluation of a machine learning model. Samples collectively form the dataset.
  • Feature: An individual measurable property or characteristic of a data point used as input to an ML model.
  • Label: The output variable in supervised learning that the model tries to predict.
  • Feature Engineering: The process of selecting, transforming, and creating features from raw data to improve model performance.
  • Feature Scaling: Techniques to standardize the range of features, such as normalization or standardization.
  • Training Set: A subset of the dataset used to train the model.
  • Test Set: A subset of the dataset used to evaluate the model's performance.
  • Validation Set: A subset of the dataset used during training to tune model parameters and prevent overfitting.
  • Data Augmentation: Techniques to increase the size of a dataset by generating new data points based on existing data.
  • Outlier: A data point that significantly deviates from the rest of the dataset, potentially affecting analysis and model performance.

Model Evaluation

  • Accuracy: The ratio of correctly predicted instances to the total instances in the dataset.
  • Precision: The ratio of true positives to the sum of true positives and false positives.
  • Recall: The ratio of true positives to the sum of true positives and false negatives.
  • F1 Score: The harmonic mean of precision and recall.
  • Confusion Matrix: A table used to evaluate the performance of a classification algorithm, showing true positives, true negatives, false positives, and false negatives.
  • ROC Curve: A graphical representation of a model's performance across different thresholds.
  • AUC (Area Under the Curve): The area under the ROC curve, representing the model's ability to distinguish between classes.
  • Cross-Validation: A technique to assess the model's performance by splitting the data into training and testing sets multiple times.

Optimization and Training

  • Gradient Descent: An optimization algorithm used to minimize the loss function by updating model parameters iteratively.
  • Gradient Vanishing: A problem where gradients become too small during backpropagation, slowing or stopping learning in deep networks.
  • Gradient Exploding: A problem where gradients become excessively large, leading to unstable training.
  • Loss Function: A function that measures the difference between the predicted outputs and the true labels.
  • Learning Rate: A hyperparameter that determines the step size in the gradient descent algorithm.
  • Overfitting: When a model learns the training data too well, including noise, leading to poor generalization.
  • Underfitting: When a model is too simple and fails to capture the underlying patterns in the data.
  • Regularization: Techniques like L1 or L2 to prevent overfitting by adding a penalty to the loss function.
  • Epoch: One complete pass through the entire training dataset.
  • Batch Size: The number of samples processed before the model's internal parameters are updated.
  • Early Stopping: A method to stop training when the performance on the validation set stops improving.

Algorithms and Models

  • Linear Regression: A supervised learning algorithm for predicting continuous outputs by fitting a linear relationship between input and output.
  • Logistic Regression: A supervised learning algorithm for binary classification problems.
  • Decision Tree: A tree-like model used for classification or regression tasks.
  • Random Forest: An ensemble method using multiple decision trees to improve performance and reduce overfitting.
  • Support Vector Machine (SVM): A supervised learning algorithm that separates data into classes using a hyperplane.
  • K-Nearest Neighbors (KNN): A simple algorithm that classifies data points based on the majority class of their k-nearest neighbors.
  • K-Means Clustering: An unsupervised learning algorithm that partitions data into k clusters.
  • Principal Component Analysis (PCA): A dimensionality reduction technique that transforms data into a lower-dimensional space.
  • Neural Network: A set of algorithms modeled after the human brain, consisting of layers of interconnected nodes (neurons).
  • Gaussian Distribution (Normal Distribution): A common probability distribution with a bell-shaped curve, characterized by its mean and standard deviation.

Deep Learning Specific Terms

  • Activation Function: Functions like ReLU or sigmoid that determine the output of a neuron in a neural network.
  • Backpropagation: A method for training neural networks by calculating the gradient of the loss function with respect to weights.
  • Convolutional Neural Network (CNN): A type of neural network designed for image data.
  • Recurrent Neural Network (RNN): A type of neural network designed for sequential data like time series or text.
  • Dropout: A regularization technique where randomly selected neurons are ignored during training.
  • Batch Normalization: A technique to stabilize and accelerate the training of deep neural networks.

Advanced Topics

  • Transfer Learning: A technique where a pre-trained model is adapted to a new but similar task.
  • Ensemble Learning: Combining multiple models to improve overall performance.
  • Bayesian Networks: Probabilistic graphical models representing variables and their dependencies.
  • Markov Decision Process (MDP): A mathematical framework for modeling decision-making in environments with stochastic outcomes.
  • Autoencoder: A neural network used to learn efficient representations of data, typically for dimensionality reduction.
  • Generative Adversarial Network (GAN): A framework where two networks (generator and discriminator) compete to improve each other's performance.
  • Attention Mechanism: A technique in neural networks that focuses on the most relevant parts of the input.

Practical Terms

  • Hyperparameter Tuning: The process of finding the optimal settings for a model's hyperparameters.
  • Pipeline: A sequence of data preprocessing and model training steps.
  • Exploratory Data Analysis (EDA): The process of analyzing datasets to summarize their main characteristics.
  • Reproducibility: The ability to consistently reproduce the same results using the same methodology and data.
  • Explainability: Techniques and methods to make ML model predictions interpretable and understandable.
  • Data Imputation: Techniques for replacing missing values with estimates like the mean, median, or predicted values.