GitHub - ezylopx5/AmazonMLHackathon: Amazon ML Challenge 2025 — Smart Product Pricing A multimodal machine learning solution combining text, numerical, and image features to predict product prices using advanced feature engineering and ensemble modeling (XGBoost, LightGBM, CatBoost).

🧠 Amazon ML Challenge 2025 – Smart Product Pricing (Multimodal Price Prediction)

A comprehensive machine learning pipeline developed for the Amazon ML Challenge 2025, focused on predicting e-commerce product prices using a multimodal dataset (text + numerical + image features). This repository contains all code, data-handling scripts, and our final ensemble-based approach achieving a 57.77 SMAPE on the leaderboard.

⸻

🚀 Project Overview

The challenge required predicting product prices from multiple data types: • 📝 Text features: product titles + descriptions • 🔢 Numerical features: specs, dimensions, ratings • 🏷️ Categorical features: brand, category, subcategory • 🖼️ Image features: extracted using ResNet (when available)

Even with significant data challenges (≈ 99.8 % missing images, high price variance), the solution emphasizes: • Robust preprocessing • Advanced feature engineering • Model ensembling • Efficient memory & computation management

⸻

🧩 Solution Architecture

1️⃣ Feature Engineering Pipeline

Component Techniques Used Text Processing TF-IDF vectorization of titles & descriptions Numerical Features Log/sqrt transforms, polynomial features Image Features ResNet-based embeddings (zero-filled when missing) Categorical Handling Target encoding, one-hot for frequent brands/categories Scaling RobustScaler to handle outliers Imputation Median/mode + category-based defaults

⸻

2️⃣ Advanced Feature Creation • Price interaction ratios and polynomial combinations • K-Means clustering for similarity-based grouping (k = 5, 10, 20) • PCA components to reduce high-dimensional features • Per-category statistics (mean, std, median price) • Target encoding with smoothing

⸻

3️⃣ Model Ensemble Strategy

A weighted ensemble of three gradient-boosting regressors:

Model Key Config Notes 🧠 XGBoost 1000 trees, lr = 0.05, max_depth = 8 Main baseline ⚡ LightGBM 1000 trees, lr = 0.05, boosting_type = ‘gbdt’ Faster convergence 🐈 CatBoost 1000 iters, lr = 0.05, depth = 8 Best single-model performer

Ensemble weighting: inverse SMAPE per model Final post-processing: clipping (0.01 ≤ price ≤ 50 000) + rounding

⸻

📊 Cross-Validation and Evaluation • 7-fold cross-validation (stratified by price bins) • Metric: Symmetric Mean Absolute Percentage Error (SMAPE) • Validation consistency: common folds across all models

Final Leaderboard SMAPE: 57.77 %

⸻

⚙️ Implementation Details

🧮 Feature Selection 1. Mutual Information → Top 300 features 2. F-Score selection → Refine to 250 most informative 3. Correlation filter → drop > 0.95 correlated features

⚡ Performance Optimizations • Memory-optimized datatypes (float32) • Batch processing & feature caching • Parallelized feature extraction • Early stopping + grid search hyperparameter tuning

⸻

📁 Repository Structure

amazon-ml-2025/ ├── src/ │ ├── main.py # Main training entry │ ├── data.py # Data loading, preprocessing │ ├── features.py # Feature engineering utilities │ ├── train_ultra.py # Model training pipeline │ ├── utils.py # Helper functions │ └── config.py # Configuration settings ├── scripts/ │ ├── create_submission.py # Final CSV generation │ ├── validate.py # Validation analysis │ └── quick_submission.py # Rapid local testing ├── dataset/ │ ├── train.csv / test.csv │ ├── sample_test.csv │ └── images/ ├── output/ │ ├── checkpoints/ logs/ validation/ └── submissions/

⸻

🧠 Key Learnings

💡 From this project I learned to: • Systematically analyze and decompose a real-world ML problem • Design a complete workflow — from EDA to model validation • Handle multimodal and imbalanced data • Identify and fix bottlenecks during training • Build ensemble systems that generalize better

⸻

🤝 Acknowledgments

Big thanks to my teammates for their support throughout the hackathon, and to the Amazon ML Challenge 2025 organizers for creating such a learning-rich experience!

⸻

🧰 Tech Stack • Python 3.9 + Pandas + NumPy + Scikit-learn • XGBoost | LightGBM | CatBoost • Matplotlib / Seaborn (for analysis) • PyTorch (ResNet image embeddings)

⸻

📜 License

This project is released under the MIT License.

⸻

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
configs		configs
dataset		dataset
output		output
scripts		scripts
src		src
submissions		submissions
.DS_Store		.DS_Store
Leaderboard-AmazonML-Hackathon.jpg		Leaderboard-AmazonML-Hackathon.jpg
README.md		README.md
SOLUTION_APPROACH_DOCUMENT.md		SOLUTION_APPROACH_DOCUMENT.md
baseline_submission.py		baseline_submission.py
create_submission.py		create_submission.py
create_submission_package.py		create_submission_package.py
pipeline.py		pipeline.py
preflight_check.py		preflight_check.py
requirements.txt		requirements.txt
setup.py		setup.py
validate.py		validate.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages