Binary Classification Using Logistic Regression and MLP

Overview

This project, completed for the Statistical Learning for Data Science course, involved implementing Logistic Regression and a flexible Multi-Layer Perceptron (MLP) from scratch using Python and NumPy — without any deep learning framework. The goal was to deeply understand gradient-based optimization and the impact of architecture choices on classification performance.

[Figure: MLP decision boundary and loss curve]

Results

Logistic Regression on synthetic binary dataset:

| Metric     | Value  |
| ---------- | ------ |
| Precision  | 0.9915 |
| Recall     | 1.0000 |
| F1 Score   | 0.9957 |
| Final Loss | 0.0180 |

MLP — Width Experiment (lr=0.025, epochs=250,000):

| Neurons per Layer | F1 Score | Final Loss |
| ----------------- | -------- | ---------- |
| 3                 | 0.5983   | 0.6342     |
| 6                 | 0.7009   | 0.7484     |
| 20                | 0.8226   | 0.5845     |
| 50                | 0.9077   | 0.6618     |

MLP — Depth Experiment (lr=0.01, epochs=160,000):

| Layers | F1 Score | Final Loss |
| ------ | -------- | ---------- |
| 2      | 0.9552   | 0.0346     |
| 3      | 0.9848   | 0.0127     |
| 6      | 0.9925   | 0.0043     |
| 10     | 1.0000   | 5.11e-06   |

Technical Details

  • Logistic Regression:
    • Forward pass: sigmoid activation on the linear combination of inputs.
    • Loss: binary cross-entropy (with clipping for numerical stability).
    • Optimization: vanilla gradient descent; weights and bias updated per iteration.
  • MLP:
    • Configurable layer sizes (e.g., [2, 20, 20, 1] for a 2-hidden-layer network).
    • Hidden layer activation: ReLU; output activation: sigmoid.
    • Full backpropagation implemented manually via chain rule.
    • Batch gradient descent; tolerance-based early stopping when loss reduction is negligible.
  • Dataset: Synthetic 2D binary classification datasets with controlled overlap; fixed random seeds for reproducibility.
  • Evaluation: Precision, Recall, F1 score, and loss curves; decision boundary visualization with contour plots.
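The forward pass, manual backpropagation, and batch gradient descent described above can be sketched as follows. This is a minimal NumPy illustration, not the project's actual code: the function names, He-style initialization, and hyperparameter defaults are assumptions, though the structure (ReLU hidden layers, sigmoid output, clipped binary cross-entropy) follows the bullets.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    # clip the pre-activation to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def bce(y, p, eps=1e-12):
    # binary cross-entropy with clipping for numerical stability
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def init_params(sizes):
    # sizes like [2, 20, 20, 1] -> two hidden layers of 20 units
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(X, params):
    # keep every layer's activation; backprop needs them
    acts = [X]
    for i, (W, b) in enumerate(params):
        z = acts[-1] @ W + b
        acts.append(sigmoid(z) if i == len(params) - 1 else relu(z))
    return acts

def backward(acts, y, params):
    # manual chain rule; for sigmoid + BCE the output delta is (p - y) / N
    N = len(y)
    delta = (acts[-1] - y.reshape(-1, 1)) / N
    grads = []
    for i in range(len(params) - 1, -1, -1):
        grads.append((acts[i].T @ delta, delta.sum(axis=0)))
        if i > 0:
            # propagate through the layer's weights and the ReLU derivative
            delta = (delta @ params[i][0].T) * (acts[i] > 0)
    return grads[::-1]

def train(X, y, sizes, lr=0.01, epochs=2000):
    # plain batch gradient descent on all parameters
    params = init_params(sizes)
    for _ in range(epochs):
        grads = backward(forward(X, params), y, params)
        params = [(W - lr * dW, b - lr * db)
                  for (W, b), (dW, db) in zip(params, grads)]
    return params
```

Note that `sizes=[2, 1]` makes the same code collapse to logistic regression: a single linear layer through the sigmoid, trained with the identical BCE gradient, which is why the two models share so much machinery in practice.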

Challenges

  • Vanishing gradients with deep networks: Observed that very deep networks sometimes failed to converge with sigmoid activations in hidden layers; resolved by switching hidden activations to ReLU.
  • Learning rate sensitivity: High learning rates (e.g., lr=0.3) caused the loss to diverge rather than converge. Empirically tuned learning rates per architecture; recorded the number of iterations for loss to drop to 30% of its initial value as a diagnostic.
  • Loss oscillation in MLP training: Unlike logistic regression, MLP loss sometimes fluctuated throughout training without reaching a clean plateau. Removing the tolerance-based early stopping and monitoring the full curve helped distinguish genuine convergence from oscillation.
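One way to make the tolerance-based stopping rule robust to the oscillation described above is to compare losses over a window of recent checks rather than between consecutive iterations. The sketch below illustrates the idea; `tol` and `patience` are hypothetical values, not the project's exact settings.

```python
def should_stop(loss_history, tol=1e-7, patience=5):
    # Stop when the total loss reduction over the last `patience`
    # recorded values falls below `tol`. A window makes oscillation
    # visible: consecutive-step differences can look large even when
    # no net progress is being made.
    if len(loss_history) <= patience:
        return False
    return loss_history[-1 - patience] - loss_history[-1] < tol
```

An oscillating loss that drifts nowhere will trigger the stop (its windowed reduction is near zero), while a slowly but steadily decreasing loss will not, which is exactly the distinction that monitoring the full curve provides.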

Reflection and Insights

Implementing backpropagation by hand — rather than relying on autograd — forced a precise understanding of how gradients flow through a network and why architectural choices matter. The depth experiment aligned with theoretical expectations: deeper networks can represent more complex functions, but require careful learning rate selection and may suffer from instability. The width experiment confirmed that wider layers can compensate for shallower depth when the task requires capturing many parallel low-level features.

A particularly interesting observation was that in high-dimensional loss landscapes, saddle points become more common than local minima — meaning that with sufficient width, gradient descent rarely gets permanently stuck.

Team and Role

  • Solo project: Both models fully implemented and analyzed independently.

https://liferli.com/2023/10/15/projects/logistic-mlp/

Author: Zhiling Li
Posted on: 2023-10-15
Updated on: 2026-02-27
