Machine Learning (ML) is a subset of artificial intelligence that focuses on enabling computers to learn from data and improve over time without explicit programming.
It helps to understand where machine learning fits within the evolution of computing and AI: the field has evolved from basic algorithms to sophisticated models that shape modern technology.
Machine learning can have profound social impacts, and key considerations include fairness, bias, privacy, and accountability. Addressing these issues is essential for deploying ML responsibly and building trust in AI systems.
While powerful, ML has inherent challenges, such as its dependence on large amounts of quality data, susceptibility to bias, and limited interpretability. Recognizing these limitations is crucial for effectively using ML in real-world applications.
The main learning paradigms and task types are (a short scikit-learn sketch follows this list):
Supervised learning: learning from labeled data to make predictions.
Unsupervised learning: discovering patterns in unlabeled data.
Reinforcement learning: learning by interacting with an environment to maximize cumulative rewards.
Classification: assigning inputs to predefined categories.
Regression: predicting a continuous numerical value.
Clustering: grouping similar data points without predefined labels.
Recommendation: predicting user preferences to recommend items.
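To make the supervised and unsupervised paradigms concrete, here is a minimal sketch; the iris dataset, LogisticRegression, and KMeans are choices made for illustration only, not prescribed by the course.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
# Load a small labeled dataset
X, y = load_iris(return_X_y=True)
# Supervised learning (classification): fit on labeled data, then predict
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted class of the first sample:", clf.predict(X[:1]))
# Unsupervised learning (clustering): find structure without using y
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print("Cluster assignments of the first five samples:", clusters[:5])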
NumPy (Numerical Python): The foundation for machine learning in Python. Its core data structure is the ndarray (N-dimensional array).
import numpy as np
# Create feature matrix and labels
X = np.array(
[[1, 2, 3], # sample 1
[4, 5, 6], # sample 2
[7, 8, 9]] # sample 3
) # Feature matrix (3 samples, 3 features -> 9 elements)
y = np.array([0, 1, 1]) # Labels
# Convert types (common in ML preprocessing)
X = X.astype(float) # Convert to float for ML algorithms
# [[1. 2. 3.]
# [4. 5. 6.]
# [7. 8. 9.]]
⚠️ All the code examples of this course can be found here.
import numpy as np
# Matrix multiplication (common in neural networks)
weights = np.random.randn(3, 2)
layer_output = np.dot(X, weights)
# Feature scaling
X_normalized = (X - X.mean(axis=0)) / X.std(axis=0)
# Reshape into mini-batches (the number of samples must be divisible by batch_size)
batch_size = 2
X_big = np.arange(12.0).reshape(4, 3)         # 4 samples, 3 features
X_batches = X_big.reshape(-1, batch_size, 3)  # shape: (2, 2, 3)
Useful mathematical functions: np.exp() for softmax, np.log() for cross-entropy, plus np.mean() and np.sum() for aggregating losses.
import numpy as np
# Softmax activation
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
# Binary cross-entropy calculation
def binary_cross_entropy(y_true, y_pred):
    return -np.mean(y_true * np.log(y_pred) +
                    (1 - y_true) * np.log(1 - y_pred))
Pandas: Essential for ML data preprocessing and feature engineering
import pandas as pd
import numpy as np
# Load and prepare ML dataset
df = pd.DataFrame({
'feature1': [1, 2, np.nan, 4],
'feature2': ['A', 'B', 'A', 'C'],
'target': [0, 1, 1, 0]
})
# Handle missing values (assign the result instead of using chained inplace fillna)
df['feature1'] = df['feature1'].fillna(df['feature1'].mean())
# Encode categorical variables
df_encoded = pd.get_dummies(df, columns=['feature2'])
import pandas as pd
from sklearn.model_selection import train_test_split
# Feature engineering example (both columns must be numerical)
df['interaction'] = df['feature1'] * df['feature2']
# Stratified train/test split (DataFrame.sample has no stratify option;
# use scikit-learn's train_test_split, which needs enough samples per class)
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['target'])
# Statistical features
df['rolling_mean'] = df['feature1'].rolling(window=3).mean()
import pandas as pd
import numpy as np
# Convert DataFrame to NumPy arrays
X = df_encoded.drop('target', axis=1).to_numpy()
y = df_encoded['target'].to_numpy()
# Final scaling for ML
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Ready for ML algorithms
print("Feature matrix shape:", X_scaled.shape)
print("Target vector shape:", y.shape)
Use select_dtypes to filter columns by data type:
import pandas as pd
# Create a sample DataFrame
data = pd.DataFrame({
'numerical': [1, 2, 3],
'categorical': ['A', 'B', 'C'],
'boolean': [True, False, True]
})
# Select only numerical columns
numerical_data = data.select_dtypes(include=['number'])
# Select only categorical columns
categorical_data = data.select_dtypes(include=['object'])
print("Numerical Columns:\n", numerical_data)
print("Categorical Columns:\n", categorical_data)
Why? Useful for applying operations to specific types of data (e.g., scaling numerical features).
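For instance, here is a minimal sketch that scales only the numerical columns; it reuses the sample DataFrame above, and StandardScaler is an illustrative choice, not the only option.
import pandas as pd
from sklearn.preprocessing import StandardScaler
data = pd.DataFrame({
    'numerical': [1, 2, 3],
    'categorical': ['A', 'B', 'C'],
    'boolean': [True, False, True]
})
# Scale only the numerical columns, leaving the other columns untouched
num_cols = data.select_dtypes(include=['number']).columns
data[num_cols] = StandardScaler().fit_transform(data[num_cols])
print(data)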
median(): Compute the median (middle value).
std(): Calculate the standard deviation (a measure of spread).
np.abs(): Compute absolute values of numerical data.
import pandas as pd
import numpy as np
# Sample data
data = pd.DataFrame({'values': [-10, 20, -30, 40, -50]})
# Median
median_value = data['values'].median()
# Standard deviation
std_value = data['values'].std()
# Absolute values
absolute_values = np.abs(data['values'])
print("Median:", median_value)
print("Standard Deviation:", std_value)
print("Absolute Values:\n", absolute_values)
Why? These functions are essential for understanding data distributions and normalizing values.
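As one illustration (the two-standard-deviation threshold below is an arbitrary choice for this sketch), these functions can normalize values and flag outliers:
import pandas as pd
import numpy as np
data = pd.DataFrame({'values': [-10, 20, -30, 40, -50]})
# Center on the median and scale by the standard deviation
normalized = (data['values'] - data['values'].median()) / data['values'].std()
# Flag values whose absolute deviation from the median exceeds two standard deviations
outliers = np.abs(data['values'] - data['values'].median()) > 2 * data['values'].std()
print("Normalized:\n", normalized)
print("Outlier flags:\n", outliers)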
Use df.copy() to create a true (deep) copy of a DataFrame:
import pandas as pd
# Sample DataFrame
data = pd.DataFrame({'values': [1, 2, 3]})
# Plain assignment creates a reference, not a copy (still linked to the original)
shallow_copy = data
# Deep copy (independent of original)
deep_copy = data.copy()
# Modify original
data.loc[0, 'values'] = 999
print("Original:\n", data)
print("Shallow Copy:\n", shallow_copy) # Changes with original
print("Deep Copy:\n", deep_copy) # Stays unchanged
Why? Use df.copy() to avoid unintended modifications to the original DataFrame.
Use pd.concat to combine DataFrames vertically or horizontally:
import pandas as pd
# Sample DataFrames
data1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
data2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Concatenate vertically (default)
vertical_concat = pd.concat([data1, data2])
# Concatenate horizontally
horizontal_concat = pd.concat([data1, data2], axis=1)
print("Vertical Concatenation:\n", vertical_concat)
print("Horizontal Concatenation:\n", horizontal_concat)
Why? pd.concat is ideal for combining datasets during preprocessing.
Probabilities quantify the uncertainty in predictions.
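For example, many classifiers output predicted probabilities rather than hard labels; this is a minimal sketch in which the toy data and the logistic regression model are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
# Toy binary classification data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
# predict_proba returns the probability of each class, i.e. the model's uncertainty
print(clf.predict_proba([[1.5]]))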
A random variable is a variable whose value is subject to variations due to chance.
import numpy as np
# Sample random variable data
random_variable = np.array([1, 2, 3, 4, 5])
# Expected Value (Mean)
expected_value = np.mean(random_variable)
# Variance
variance = np.var(random_variable)
# Standard Deviation
std_dev = np.std(random_variable)
print(f"Expected Value (Mean): {expected_value}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")
Why is it important? In probability distributions, the standard deviation quantifies the uncertainty: a small standard deviation means values cluster tightly around the mean, while a large one means individual outcomes are far less predictable.
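As a small illustration (the two normal distributions below are arbitrary choices for this sketch), the same mean can come with very different spreads:
import numpy as np
rng = np.random.default_rng(0)
# Two distributions with the same mean but different standard deviations
narrow = rng.normal(loc=0.0, scale=1.0, size=10_000)
wide = rng.normal(loc=0.0, scale=5.0, size=10_000)
# A larger standard deviation means any single draw is more uncertain
print(f"Narrow: mean={narrow.mean():.2f}, std={narrow.std():.2f}")
print(f"Wide:   mean={wide.mean():.2f}, std={wide.std():.2f}")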
Conditional probability: the probability of an event A given that event B has occurred.
# Example of Conditional Probability
# P(A|B) = P(A and B) / P(B)
p_a_and_b = 0.3
p_b = 0.6
p_a_given_b = p_a_and_b / p_b
print(f"P(A|B): {p_a_given_b}")
# Bayes' Theorem Example
def bayes_theorem(p_a, p_b_given_a, p_b):
    return (p_b_given_a * p_a) / p_b
# Inputs
p_a = 0.2 # Prior probability of A
p_b_given_a = 0.8 # Likelihood of B given A
p_b = 0.5 # Marginal probability of B
# Posterior
p_a_given_b = bayes_theorem(p_a, p_b_given_a, p_b)
print(f"P(A|B): {p_a_given_b}")
While Naive Bayes is simple and effective, it relies on certain assumptions, most notably that the features are conditionally independent given the class.
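As a quick illustration (GaussianNB and the iris dataset are picked here purely for demonstration, not prescribed by the course):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Gaussian Naive Bayes treats features as conditionally independent given the class
model = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))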
A confidence interval quantifies the range within which a parameter lies with a certain probability.
# Confidence Interval Example
import numpy as np
import scipy.stats as stats
data = [1, 2, 3, 4, 5]
mean = np.mean(data)
# Use the sample standard deviation (ddof=1) to estimate the standard error
conf_interval = stats.norm.interval(0.95, loc=mean, scale=np.std(data, ddof=1)/np.sqrt(len(data)))
print(f"95% Confidence Interval: {conf_interval}")
Revisiting these topics will help you better understand machine learning concepts and algorithms. Consider reviewing any areas that feel unfamiliar; they underpin the methods covered throughout this course.
Ensuring data quality is critical to model performance.
Handle missing values and duplicates with Pandas methods such as dropna(), fillna(), and duplicated(); detect missing values with isnull() and sum().
# Identify missing values
missing_values = data.isnull().sum()
# Drop rows with missing values
data_clean = data.dropna()
# Impute missing values with mean
data['column'] = data['column'].fillna(data['column'].mean())
Use duplicated() and drop_duplicates() in Pandas.
# Identify duplicates
duplicates = data.duplicated()
# Remove duplicates
data_clean = data.drop_duplicates()
Scaling ensures features contribute equally to the model.
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Normalization
scaler = MinMaxScaler()
data_normalized = scaler.fit_transform(data)
# Standardization
scaler = StandardScaler()
data_standardized = scaler.fit_transform(data)
Categorical data must be converted to numerical format for machine learning algorithms.
# One-Hot Encoding with Pandas
data_encoded = pd.get_dummies(data, columns=['categorical_column'])
# Label Encoding with Scikit-learn
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['category_encoded'] = le.fit_transform(data['categorical_column'])
Feature engineering involves creating, transforming, or selecting features to improve model performance (e.g., total_price = quantity * unit_price), as in the sketch below.
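For example, a minimal sketch (the quantity and unit_price columns are made-up illustration data):
import pandas as pd
orders = pd.DataFrame({
    'quantity': [2, 5, 1],
    'unit_price': [9.99, 3.50, 20.00]
})
# Derive a new feature from existing columns
orders['total_price'] = orders['quantity'] * orders['unit_price']
print(orders)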
Data visualization helps to understand data distributions, detect patterns, and spot anomalies.
# Histogram using Matplotlib
import matplotlib.pyplot as plt
plt.hist(data['numerical_column'], bins=30)
plt.title('Histogram of Numerical Column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
# Scatter Plot using Seaborn
import seaborn as sns
sns.scatterplot(x='feature1', y='feature2', data=data)
plt.title('Feature1 vs. Feature2')
plt.show()
# Box Plot using Seaborn
sns.boxplot(x='categorical_column', y='numerical_column', data=data)
plt.title('Numerical Column by Category')
plt.show()