PCA Explained: A Visual Guide to Dimensionality Reduction

Principal Component Analysis (PCA) is a widely used machine learning technique for dimensionality reduction. It draws on linear algebra and statistics to transform a dataset into a simpler, lower-dimensional form. It reduces the number of variables (or features) while preserving as much of the variance in the data as possible. Some information is usually discarded along the way; PCA chooses what to discard so that the loss of variance is as small as possible.

This is crucial when dealing with high-dimensional data, like images, where each pixel can be considered a feature. Let's walk through the process of applying PCA to a dataset of dog images, highlighting the key mathematical operations along the way.

About the Data: The data for this guide was sourced from Kaggle's Animal Faces (AFHQ) dataset. We are using a subset of 85 dog images, which have been converted to grayscale and resized to a uniform 64x64 pixel format. The full project code is available on GitHub.

Step 1: Loading the Data

First, we need to load our dataset. Each image is a matrix of pixel values.

Action: Load the images from the directory and print the dataset size.

import sys
sys.path.append('..')
import numpy as np
from src.util import load_images
images = load_images("../data/dogs-for-pca-processed/pca/")
height,width = images[0].shape
print(f"Dataset has {len(images)} images of size {height}x{width}")

import matplotlib.pyplot as plt
plt.imshow(images[0], cmap='gray')

Output:

Dataset has 85 images of size 64x64

Visualization: First image in the dataset

Step 2: Flattening the Image Data

Machine learning models typically expect data in a 2D matrix format: (number of samples, number of features). We "flatten" each 64x64 image into a single 1x4096 row vector. This transforms our list of images into a single data matrix.

Action: Reshape the image data.

images_flattened = np.asarray(images).reshape(len(images), -1)  # works for a list or an array of images
print(f"Flattened images shape: {images_flattened.shape}")

Output:

Flattened images shape: (85, 4096)
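To make the flattening step concrete, here is a minimal sketch on a made-up stack of tiny images. The name toy_images and its 2x4 "image" size are assumptions for illustration only; they are not part of the project code. It shows that flattening is reversible, which is what lets us reshape rows back into images later.

```python
import numpy as np

# Hypothetical stand-in for the image stack: 3 tiny "images" of 2x4 pixels.
toy_images = np.arange(24).reshape(3, 2, 4)

# Flatten: one row per image, one column per pixel (row-major order).
toy_flat = toy_images.reshape(len(toy_images), -1)
print(toy_flat.shape)  # (3, 8)

# Un-flatten a single row to recover the original 2D image.
restored = toy_flat[1].reshape(2, 4)
print(np.array_equal(restored, toy_images[1]))  # True
```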

Step 3: Centering the Data

This is a critical statistical step. To analyze the variance, we first need to shift the data so that the mean of each feature (pixel) is zero. This process is called centering the data. It ensures our analysis focuses on how the pixel values vary from the average dog face, not on the average brightness itself.

Action: Calculate the mean for each pixel column and subtract it from every image.

def centered(data):
    """
    Centers the data by subtracting the mean for each column (feature)
    Args:
        data: Input data matrix (n_samples x n_features)
    Returns:
        centered_data: Data with mean removed from each feature
    """    
    mean = np.mean(data, axis=0)
    centered_data = data - mean
    return centered_data

X = centered(images_flattened)
plt.imshow(X[0].reshape(64, 64), cmap='gray')

Visualization: Centered image data reconstruction
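A quick way to check that centering worked is to confirm that every column mean is (numerically) zero afterwards. The sketch below does this on random data; the array toy and its shape are assumptions made for the example, not the dog dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 5 samples, 3 features, each feature with a nonzero mean.
toy = rng.normal(loc=[10.0, -2.0, 0.5], scale=1.0, size=(5, 3))

# The mean has shape (3,), so NumPy broadcasts the subtraction over all rows.
toy_mean = toy.mean(axis=0)
toy_centered = toy - toy_mean

col_means = toy_centered.mean(axis=0)
print(np.allclose(col_means, 0.0))  # True
```

Keeping the mean vector around (here toy_mean) is useful later: adding it back to a reconstruction returns the data to its original scale.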

Step 4: Calculate the Covariance Matrix

The covariance matrix is the heart of PCA. It's a large (4096 x 4096) matrix that quantifies the relationships between all pairs of pixels. A positive value means two pixels tend to get brighter together, while a negative value means one gets brighter as the other gets darker.

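Matrix Operation: This is calculated using a Matrix Transpose (.T) and a Dot Product (np.dot). For centered data, X.T @ X computes, for every pair of pixels, the sum of the products of their deviations from the mean. Dividing by n_samples - 1 turns these sums into sample covariances; the diagonal entries are the variances of the individual pixels.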
Action: Compute the covariance matrix from our centered data X.

def get_covmat(X):
    """Calculates the covariance matrix"""
    n_samples = X.shape[0]
    cov_matrix = np.dot(X.T, X) / (n_samples - 1)
    return cov_matrix

cov_mat = get_covmat(X)
print(f"Covariance matrix shape: {cov_mat.shape}")

Output:

Covariance matrix shape: (4096, 4096)
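As a sanity check, the same formula can be compared against NumPy's built-in np.cov on a small random matrix. This is only a sketch on assumed toy data (20 samples, 4 features); the comparison itself relies on np.cov using the same n - 1 normalization by default.

```python
import numpy as np

rng = np.random.default_rng(1)
toy = rng.normal(size=(20, 4))  # hypothetical: 20 samples, 4 features
toy_centered = toy - toy.mean(axis=0)

# Same formula as get_covmat above.
cov_manual = toy_centered.T @ toy_centered / (toy.shape[0] - 1)

# np.cov treats rows as variables by default, so pass rowvar=False.
cov_numpy = np.cov(toy, rowvar=False)

print(cov_manual.shape)                       # (4, 4)
print(np.allclose(cov_manual, cov_numpy))     # True
print(np.allclose(cov_manual, cov_manual.T))  # True (the matrix is symmetric)
```

The symmetry is what makes np.linalg.eigh, which assumes a symmetric matrix, a suitable choice in the next step.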

Step 5: Generating Eigenvalues and Eigenvectors

This is where the linear algebra magic happens. We perform "eigen-decomposition" on the covariance matrix.

Eigenvectors: These are the Principal Components. Think of them as the fundamental directions or "axes" of variation in the data. They point in the directions where the dog faces differ the most.
Eigenvalues: Each eigenvalue is a number that tells us the amount of variance captured by its corresponding eigenvector. A large eigenvalue means its eigenvector is a very important direction of variance.
Action: Calculate the eigenvalues and eigenvectors and sort them from largest to smallest.

import numpy as np
eigenvalues, eigenvectors = np.linalg.eigh(cov_mat)
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]
print(f"10 largest eigenvalues: {eigenvalues[:10]}")

Output:

10 largest eigenvalues: [51.456944 37.91145 21.08248 10.063868 9.286018 6.3022118 5.748278 5.1187453 4.921622 4.715191]
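Eigenvalues are easier to interpret as fractions of the total variance. The sketch below computes this "explained variance ratio" and the smallest number of components needed to reach 95% on made-up data; the array toy, the per-feature scales, and the 95% threshold are all assumptions chosen for illustration. For the dog data, note that 85 centered images span at most 84 dimensions, so at most 84 of the 4096 eigenvalues can be nonzero.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 50 samples, 6 features with very different spreads.
toy = rng.normal(size=(50, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
toy_centered = toy - toy.mean(axis=0)
toy_cov = toy_centered.T @ toy_centered / (toy.shape[0] - 1)

vals, vecs = np.linalg.eigh(toy_cov)    # eigh returns eigenvalues in ascending order
vals, vecs = vals[::-1], vecs[:, ::-1]  # reorder to descending

explained = vals / vals.sum()     # fraction of total variance per component
cumulative = np.cumsum(explained)

# Smallest k whose components together explain at least 95% of the variance.
k95 = int(np.searchsorted(cumulative, 0.95)) + 1
print(np.round(explained, 3))
print(f"Components needed for 95% variance: {k95}")
```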

Step 6: Visualizing the Principal Components

We can reshape the eigenvectors back into 64x64 images to see what these "directions of variance" look like. These "eigendogs" represent the most common patterns and features in the dog faces.

Action: Reshape and plot the first 16 eigenvectors.

import matplotlib.pyplot as plt
fig, ax = plt.subplots(4, 4, figsize=(20, 20))
for i in range(4):
    for j in range(4):
        ax[i,j].imshow(eigenvectors[:, i*4 + j].reshape(64,64), cmap='gray')
        ax[i,j].set_title(f"Component number {i*4 + j + 1}")
        ax[i,j].axis('off')

Visualization: Top 16 Eigendogs

These ghostly images are the top 16 'eigendogs'. Think of them not as actual dogs, but as the fundamental building blocks or visual ingredients of a dog's face. Component 1 captures the most significant pattern, Component 2 the next, and so on. Every dog in our dataset can be described as a unique combination of these essential features.
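The "building blocks" idea can be checked numerically. In this sketch on assumed random data (not the dog images), the eigenvectors form an orthonormal set, and every centered sample is recovered exactly as a weighted sum of them when all components are used.

```python
import numpy as np

rng = np.random.default_rng(3)
toy = rng.normal(size=(30, 5))  # hypothetical: 30 samples, 5 features
toy_centered = toy - toy.mean(axis=0)
toy_cov = toy_centered.T @ toy_centered / (toy.shape[0] - 1)
vals, vecs = np.linalg.eigh(toy_cov)
vals, vecs = vals[::-1], vecs[:, ::-1]

# Columns of vecs are orthonormal: unit length and mutually perpendicular.
gram = vecs.T @ vecs
print(np.allclose(gram, np.eye(5)))  # True

# Each sample's weights are its dot products with the components ...
weights = toy_centered @ vecs
# ... and the weighted sum of all components rebuilds the sample exactly.
rebuilt = weights @ vecs.T
print(np.allclose(rebuilt, toy_centered))  # True
```

One caveat when looking at eigenvector images: if v is an eigenvector, so is -v, so a component may appear with light and dark regions swapped without changing what it represents.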

Step 7: Applying PCA for Dimensionality Reduction

To reduce dimensions, we project our original data onto the new coordinate system defined by our top 'k' eigenvectors.

Matrix Operation: This projection is a Dot Product. We multiply our centered data matrix X by the matrix of our top k eigenvectors.
Action: Project the data onto the first 2 principal components.

def apply_pca(X,eigenvecs,k):
    """
    Perform dimensionality reduction using PCA
    Args:
        X: Centered data matrix (n_observations x n_variables)
        eigenvecs: Matrix of eigenvectors (each column is one eigenvector)
        k: Number of principal components to use
    Returns:
        Xred: Reduced dimensional data (n_observations x k)
    """
    return X @ eigenvecs[:, :k]

Xred2 = apply_pca(X, eigenvectors, 2)
print(f'Xred2 shape: {Xred2.shape}')

Output:

Xred2 shape: (85, 2)
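A useful property of the projected data is that its coordinates are uncorrelated, and the variance along each new axis equals the corresponding eigenvalue. The following sketch verifies this on assumed toy data with deliberately correlated features; the names toy and toy_red are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: mixing random columns produces correlated features.
toy = rng.normal(size=(40, 5)) @ rng.normal(size=(5, 5))
toy_centered = toy - toy.mean(axis=0)
toy_cov = toy_centered.T @ toy_centered / (toy.shape[0] - 1)
vals, vecs = np.linalg.eigh(toy_cov)
vals, vecs = vals[::-1], vecs[:, ::-1]

k = 2
toy_red = toy_centered @ vecs[:, :k]  # same operation as apply_pca
print(toy_red.shape)                   # (40, 2)

# Covariance of the projected data is diagonal, with the top-k eigenvalues
# on the diagonal.
red_cov = np.cov(toy_red, rowvar=False)
print(np.allclose(red_cov, np.diag(vals[:k])))  # True
```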

Step 8: Reconstructing Images with Different Components

To reconstruct an image, we project it back from the low-dimensional PCA space to the original 4096-dimensional space.

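Matrix Operation: This involves a Dot Product with the Transpose of our eigenvector matrix, which maps the k coordinates back into pixel space. The projection is reversed exactly only when all components are kept; with fewer, the result is an approximation. Because X was centered, the result is a reconstruction of the centered image, and adding the per-pixel mean back would restore the original brightness.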
Action: Reconstruct an image using 1, 5, 10, 25, and 50 components.

def reconstruct_pca(Xred, eigenvecs):
    # Reconstructs the data from its reduced form
    return Xred.dot(eigenvecs[:, :Xred.shape[1]].T)

def quick_reconstruction_demo(X, eigenvectors, images, img_idx=0):
    """
    Compare image reconstruction quality with different numbers of PCA components
    Args:
        X: Centered data matrix
        eigenvectors: PCA eigenvectors
        images: Original images for comparison
        img_idx: Which image to reconstruct (default: 0)
    """
    components = [1, 5, 10, 25, 50]  # Different component counts to test
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    axes = axes.flatten()
    # Show original image
    axes[0].imshow(images[img_idx], cmap='gray')
    axes[0].set_title('Original')
    axes[0].axis('off')
    # Show reconstructions with different component counts
    for i, n_comp in enumerate(components):
        Xred = apply_pca(X, eigenvectors, n_comp)
        Xrec = reconstruct_pca(Xred, eigenvectors)

        axes[i+1].imshow(Xrec[img_idx].reshape(height, width), cmap='gray')
        axes[i+1].set_title(f'{n_comp} components')
        axes[i+1].axis('off')
    plt.tight_layout()

quick_reconstruction_demo(X, eigenvectors, images, 15)

Visualization: Image Reconstructions

As you can see, with just 1 component, we get a blurry outline. By the time we use 50 components, the reconstruction is nearly identical to the original, demonstrating how PCA captures the most important features first.
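The visual impression can be backed up with a number: the mean squared reconstruction error for each component count. The sketch below is an illustration on assumed toy data, and the helper reconstruct is hypothetical (it is not the project's reconstruct_pca). Unlike the demo above, it adds the mean back so the reconstruction is compared with the uncentered data.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical data: 40 samples, 8 correlated features, shifted away from zero.
toy = rng.normal(size=(40, 8)) @ rng.normal(size=(8, 8)) + 3.0
toy_mean = toy.mean(axis=0)
toy_centered = toy - toy_mean
toy_cov = toy_centered.T @ toy_centered / (toy.shape[0] - 1)
vals, vecs = np.linalg.eigh(toy_cov)
vals, vecs = vals[::-1], vecs[:, ::-1]

def reconstruct(k):
    """Project onto the top-k components, map back, and restore the mean."""
    top_k = vecs[:, :k]
    reduced = toy_centered @ top_k
    return reduced @ top_k.T + toy_mean

# Mean squared error per entry for k = 1 .. 8 components.
errors = [float(np.mean((toy - reconstruct(k)) ** 2)) for k in range(1, 9)]
print(np.round(errors, 4))
```

The error can only go down as components are added, and it reaches zero (up to floating-point rounding) when all components are used.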

Step 9: Identifying Similar Dogs

By representing dogs in a reduced PCA space, we can find similarities based on their core features. The number of components used is critical for accuracy.

Action: Find the closest pairs of dogs using a low-dimensional space (1 component) and a higher-dimensional space (30 components).

def find_similar_dogs(X, eigenvectors, images, components=30, n_pairs=3):
    """
    Find and display the most similar pairs of dogs, using Euclidean distance in PCA space
    Args:
        X: Centered data matrix
        eigenvectors: PCA eigenvectors
        images: Original dog images
        components: Number of principal components to use for similarity (default: 30)
        n_pairs: Number of similar pairs to find and display (default: 3)
    """
    Xred_high = apply_pca(X, eigenvectors, components)
    pairs = []
    for i in range(len(Xred_high)):
        for j in range(i+1, len(Xred_high)):
            dist = np.linalg.norm(Xred_high[i] - Xred_high[j])
            pairs.append((i, j, dist))
    similar_pairs = sorted(pairs, key=lambda x: x[2])[:n_pairs]
    # Plot results
    fig, axes = plt.subplots(n_pairs, 2, figsize=(8, 3*n_pairs))
    if n_pairs == 1:
        axes = axes.reshape(1, -1)
    for idx, (i, j, dist) in enumerate(similar_pairs):
        axes[idx, 0].imshow(images[i], cmap='gray')
        axes[idx, 0].set_title(f'Dog {i}')
        axes[idx, 0].axis('off')
        axes[idx, 1].imshow(images[j], cmap='gray')
        axes[idx, 1].set_title(f'Dog {j} ({components}D dist: {dist:.2f})')
        axes[idx, 1].axis('off')
    plt.suptitle(f'Similar Dogs ({components}D PCA)', fontsize=14)
    plt.tight_layout()

# Run similarity search with only 1 component
find_similar_dogs(X, eigenvectors, images, components=1)

# Run similarity search with 30 components
find_similar_dogs(X, eigenvectors, images, components=30)

Visualization: Similarity with 1 Principal Component

With only one component, the algorithm struggles to find true similarities, as it might only capture general brightness.

Visualization: Similarity with 30 Principal Components

Using just 30 components provides enough feature information (ear shape, markings) for a much more accurate and visually convincing similarity search.
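The nested loop in find_similar_dogs is fine for 85 images. For larger sets, the same pairwise distances can be computed in one step with NumPy broadcasting. This is a sketch of that alternative on assumed toy coordinates (toy_red stands in for the projected data); it is not the code used to produce the figures above.

```python
import numpy as np

rng = np.random.default_rng(6)
toy_red = rng.normal(size=(6, 3))  # hypothetical: 6 samples in a 3D PCA space

# Pairwise differences via broadcasting: shape (6, 6, 3), then norms: (6, 6).
diff = toy_red[:, None, :] - toy_red[None, :, :]
dist = np.linalg.norm(diff, axis=-1)

# Keep each unordered pair once (i < j) by taking the upper triangle.
i_idx, j_idx = np.triu_indices(len(toy_red), k=1)
order = np.argsort(dist[i_idx, j_idx])

# The three closest pairs as (i, j, distance) tuples.
closest = [
    (int(i_idx[o]), int(j_idx[o]), float(dist[i_idx[o], j_idx[o]]))
    for o in order[:3]
]
print(closest)
```

The intermediate array holds n x n x k numbers, so this trades memory for speed; for very large n a loop or a chunked computation may be the better choice.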

Summary and Conclusion

In summary, the PCA process follows these key steps:

Prepare the Data: Flatten and center the data by subtracting the mean.
Compute Covariance: Calculate the covariance matrix to understand feature relationships.
Find Principal Components: Perform eigen-decomposition to find the eigenvectors (principal components) and eigenvalues.
Reduce & Reconstruct: Use the top eigenvectors to transform data to a lower dimension or reconstruct it back to the original space.
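The steps above can be collected into one small function. The helper pca_fit below is a hypothetical sketch written for this summary, not part of the project code, and it is run on random data rather than the dog images.

```python
import numpy as np

def pca_fit(data, k):
    """Hypothetical helper bundling the steps: center, covariance, eigen-decomposition, projection."""
    mean = data.mean(axis=0)
    centered_data = data - mean                                   # prepare the data
    cov = centered_data.T @ centered_data / (data.shape[0] - 1)   # compute covariance
    vals, vecs = np.linalg.eigh(cov)                              # find principal components
    vals, vecs = vals[::-1], vecs[:, ::-1]                        # sort largest first
    components = vecs[:, :k]
    reduced = centered_data @ components                          # reduce
    return mean, components, vals, reduced

rng = np.random.default_rng(7)
toy = rng.normal(size=(25, 6))  # hypothetical: 25 samples, 6 features
mean, components, vals, reduced = pca_fit(toy, k=3)

# Reconstruct an approximation in the original 6-dimensional space.
approx = reduced @ components.T + mean
print(reduced.shape, approx.shape)  # (25, 3) (25, 6)
```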

Conclusion: PCA is a remarkable demonstration of how linear algebra and statistics provide the foundation for powerful machine learning techniques. By transforming our data into a new coordinate system defined by its directions of highest variance, we can effectively reduce complexity, visualize patterns, and extract meaningful features from otherwise intractable high-dimensional datasets.

Glossary of Terms

Mean: A measure of central tendency, calculated as the sum of values divided by the number of values.
Centered Data: Data that has been adjusted so that the mean of each feature column is zero.
Covariance: A measure of how two variables change together.
Eigenvector: A nonzero vector whose direction is unchanged (it is only scaled, and possibly reversed) when a linear transformation is applied to it. In PCA, the eigenvectors of the covariance matrix are the principal components, the directions of maximum variance.
Eigenvalue: A scalar that represents the factor by which an eigenvector is stretched or compressed after a linear transformation. In PCA, it represents the amount of variance captured by that eigenvector.
Matrix Transpose: An operation that flips a matrix over its diagonal, switching the row and column indices.
Dot Product: A mathematical operation that takes two equal-length sequences of numbers and returns a single number: the sum of the products of corresponding entries. In matrix multiplication, each entry of the result is the dot product of a row of the first matrix with a column of the second.

Further Learning and References

For a deeper dive into the mathematical concepts behind PCA, these resources are highly recommended:

Coursera: Mathematics for Machine Learning and Data Science Specialization
YouTube: Essence of linear algebra by 3Blue1Brown
