Statistical Learning for Data Science
Overview
This project series was completed as part of the Statistical Learning for Data Science course at Southern University of Science and Technology. The work covered two major tasks in medical image classification for fundus lesion diagnosis: using a pre-trained deep network as a frozen feature extractor for traditional classifiers, and fine-tuning the same network end-to-end. The project explored how feature representations from pre-trained deep networks can be combined with classical classifiers to achieve high accuracy at reduced computational cost.
Results
- Task 1 — Pre-trained Feature Extraction: Used ResNet18 as a frozen feature extractor; downstream classifiers (Logistic Regression, KNN, SVM) achieved 100% accuracy on the test set, demonstrating the quality of ResNet18’s learned representations.
- Task 2 — Fine-tuned ResNet18: Fine-tuned ResNet18 end-to-end on the 3-class fundus dataset, converging in ~3 epochs with 100% test accuracy and near-perfect AUC across all classes.
- Bonus — Custom CNN: Designed a lightweight CNN from scratch in PyTorch, achieving 99.53% accuracy in 155 s of training time vs. 348 s for ResNet18, demonstrating a favorable speed-accuracy trade-off.
- Extension — 7-class Classification: Extended the fine-tuned ResNet18 to a 7-class problem; all classes achieved AUC = 1.00, validating the method’s scalability.
Technical Details
- Dataset: Fundus lesion images categorized into 3 (and later 7) classes; standard preprocessing with resize, normalization, and contrast adjustment.
- Hybrid Pipeline:
- ResNet18 (pre-trained on ImageNet) used as a frozen backbone; its 1000-dimensional final-layer output served as the feature vector.
- Traditional classifiers (Logistic Regression, KNN, SVM, MLP) trained on the extracted features using sklearn.
- Custom CNN Architecture:
- Convolutional channels: [16, 32, 64], kernel size 3×3, max pooling with stride 2.
- Grayscale edge-detected preprocessing (Canny, Gaussian blur) to reduce input redundancy.
- Fully connected MLP head for multi-class output.
- Training Setup: SGD optimizer (lr=0.001, momentum=0.9), cross-entropy loss, 3–5 epochs.
- Evaluation: Accuracy, ROC curves, and AUC per class; all reported in the final report.
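The architecture and training details above can be sketched in PyTorch as follows. The input resolution (224×224), hidden width of the MLP head (128), and dummy data are assumptions for illustration; the channel progression, kernel size, pooling, optimizer settings, and loss match the description.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Lightweight CNN sketch: channels [16, 32, 64], 3x3 kernels,
    2x2 max pooling with stride 2, and an MLP head. Single-channel
    input matches the grayscale edge-detected preprocessing."""
    def __init__(self, num_classes: int = 3, in_size: int = 224):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in (16, 32, 64):
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        feat_dim = 64 * (in_size // 8) ** 2  # three stride-2 pools halve H and W three times
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_dim, 128), nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

model = SmallCNN(num_classes=3)
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on dummy grayscale images
x = torch.randn(4, 1, 224, 224)
y = torch.randint(0, 3, (4,))
loss = loss_fn(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```

In a real run this step would loop over a `DataLoader` for the 3–5 epochs described above.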
Challenges
- Speed vs. accuracy trade-off: The custom CNN was significantly faster (2.2×) but slightly less accurate than ResNet18. The gap was attributed to the simpler convolutional stack and the grayscale conversion, which discards color information.
- Feature quality vs. training cost: Frozen ResNet18 features were so discriminative that even linear classifiers achieved perfect accuracy, raising the question of when fine-tuning is truly necessary.
- 7-class generalization: Extending to a harder 7-class scenario required careful dataset balancing and preprocessing to maintain generalization.
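The per-class AUC figures reported above follow the standard one-vs-rest scheme, which sklearn computes directly. A minimal sketch on synthetic scores (the random labels and Dirichlet-sampled softmax outputs are placeholders, not project data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Hypothetical 3-class outputs: true labels and softmax-like score rows
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=100)
y_score = rng.dirichlet(np.ones(3), size=100)  # each row sums to 1

# One-vs-rest AUC computed separately for each class
y_bin = label_binarize(y_true, classes=[0, 1, 2])
per_class_auc = [roc_auc_score(y_bin[:, k], y_score[:, k]) for k in range(3)]

# Macro-averaged multi-class AUC in one call
macro_auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")
```

Extending to the 7-class setting only changes the `classes` list and the score matrix width.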
Reflection and Insights
This project reinforced a key principle in applied machine learning: strong pre-trained feature representations can often substitute for expensive end-to-end training, especially when labeled data is limited. The hybrid approach — deep features paired with classical classifiers — offers a practical and interpretable alternative to black-box deep models in medical contexts. Designing the custom CNN from scratch also deepened understanding of how architectural choices (depth, width, pooling strategy) affect both accuracy and training efficiency.
Team and Role
- Team: Collaborated with two teammates on methodology design, experiments, and report writing.
- My Role: Led the custom CNN design and preprocessing pipeline; contributed to the hybrid pipeline experiments and analysis of width/depth trade-offs.