Welcome to Mission Impostible, a team of emerging data scientists from around the globe who are boldly navigating the world of analytics, while quietly battling an all-too-familiar foe: Imposter Syndrome.
The name? A cheeky nod to how we feel sometimes. But while we may be new to the game, we're not backing down. We're collaborating, learning, and growing together, one dataset at a time, and we're on a mission to tackle real-world problems with data.
- Problem Background: Returns in E-commerce
- Returns Prediction: Graph-Based vs Tabular E-Commerce Modeling
- Modeling Approaches: Why Two Methods?
- Dataset Documentation
- Summary Table
- E-commerce Product Return Analysis: Data Preparation
- Folder Structure
- Data Exploration & Insights
- Predictive Modeling & Results
- Key Takeaways
- Team of Imposters
- License
Product returns are a major challenge for online retailers, costing companies billions annually, not just in lost sales but also in reverse logistics, inventory disruption, fraud, and environmental waste. Managing returns impacts profitability and customer satisfaction, requiring robust data-driven solutions.
This project compares two advanced approaches to predicting product returns in e-commerce:
- Graph-based modeling (ASOS Returns Prediction)
- Tabular data modeling (TheLook E-Commerce)
Our goal: Understand and predict product returns using customer, product, and transaction features, empowering smarter business decisions.
Predicting returns is crucial for e-commerce logistics and customer experience. We explored:
**Tabular modeling (TheLook E-Commerce)** (a minimal sketch follows this list):

- Each row = one purchase (features: product, customer, price, shipping, etc.)
- Models: Logistic Regression, XGBoost
- Strengths: Simple, interpretable, fits business rules
- Limitations: Treats transactions independently
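A minimal sketch of this tabular setup, assuming a pandas/scikit-learn stack; the toy DataFrame, column names, and `returned` label below are illustrative placeholders, not the project's actual schema:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy per-purchase table: one row per order item with a binary return label.
df = pd.DataFrame({
    "product_category": ["Jeans", "Tops", "Jeans", "Shoes", "Tops", "Shoes", "Jeans", "Tops"],
    "country": ["UK", "DE", "UK", "US", "DE", "UK", "US", "DE"],
    "price": [49.9, 19.9, 59.9, 89.0, 24.9, 74.5, 54.0, 22.0],
    "discount_pct": [0.0, 0.25, 0.10, 0.0, 0.30, 0.05, 0.15, 0.0],
    "returned": [1, 0, 1, 0, 1, 0, 1, 0],
})

X, y = df.drop(columns="returned"), df["returned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# One-hot encode categoricals, scale numerics, then fit an interpretable baseline.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["product_category", "country"]),
    ("num", StandardScaler(), ["price", "discount_pct"]),
])
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(model.predict_proba(X_test)[:, 1])  # predicted return probabilities
```

Swapping the final estimator for an XGBoost classifier gives the advanced tabular variant listed above while keeping the same preprocessing.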
**Graph-based modeling (ASOS Returns Prediction)** (a minimal sketch follows this list):

- Nodes: Customers & products
- Edges: Purchases (labeled as returned/not)
- Model: Graph Neural Networks (GNNs)
- Strengths: Captures shared patterns, e.g., return-prone users or products
- Limitations: No time data, only includes customers with returns
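A minimal sketch of edge-level return prediction on such a graph, assuming PyTorch Geometric (`torch_geometric`); the toy graph, feature sizes, and the `ReturnGNN` module are illustrative, not the project's actual architecture:

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import SAGEConv
from torch_geometric.utils import to_undirected

# Toy bipartite graph: nodes 0-2 are customers, nodes 3-5 are products.
edge_index = torch.tensor([[0, 0, 1, 2, 2],
                           [3, 4, 4, 3, 5]], dtype=torch.long)   # purchase edges
edge_label = torch.tensor([1, 0, 0, 1, 0], dtype=torch.float)    # 1 = returned
x = torch.randn(6, 8)                                            # placeholder node features
data = Data(x=x, edge_index=edge_index)

class ReturnGNN(torch.nn.Module):
    """Two message-passing layers, then a score for each purchase edge."""
    def __init__(self, in_dim, hidden):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, hidden)
        self.edge_mlp = torch.nn.Linear(2 * hidden, 1)

    def forward(self, x, mp_edge_index, target_edge_index):
        h = F.relu(self.conv1(x, mp_edge_index))
        h = self.conv2(h, mp_edge_index)
        src, dst = target_edge_index
        # Concatenate customer and product embeddings for each purchase edge.
        return self.edge_mlp(torch.cat([h[src], h[dst]], dim=-1)).squeeze(-1)

mp_edges = to_undirected(data.edge_index)  # let messages flow both ways
model = ReturnGNN(in_dim=8, hidden=16)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(100):
    opt.zero_grad()
    logits = model(data.x, mp_edges, data.edge_index)
    loss = F.binary_cross_entropy_with_logits(logits, edge_label)
    loss.backward()
    opt.step()
```

Because node embeddings are shared across edges, a return-prone customer or product shifts the prediction for every purchase it touches, which is the kind of pattern a purely row-wise tabular model cannot capture.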
**ASOS Returns Prediction (graph dataset)**

- Source: OSF
- Data: Edge list with return labels, anonymized features
- Notes: No timestamps, only customers with returns
- How to get: Download from OSF
**TheLook E-Commerce (tabular dataset)**

- Source: Kaggle
- Files: `order_items.csv`, `products.csv`, `users.csv`, `distribution_centers.csv`
- Notes: Some missing timestamps, synthetic PII
- How to get: Download from Kaggle
| Model Type | Good For                  | Not Good For                        |
|------------|---------------------------|-------------------------------------|
| Tabular    | Easy to use & interpret   | Finding relationships               |
| Graph      | Complex pattern detection | Needs dense data, harder to explain |
Both methods provide valuable insights; use them together for best results.
We engineered modeling-ready datasets for both approaches:
**ASOS (graph) preparation**

- Input: `.p` files (event, customer, product nodes)
- Process: Merge, clean, rename columns
- Output: `asos_merged_training.csv`
- How to run: Use `01_data_preparation.ipynb` in Google Colab (a sketch of the merge step follows this list)
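A hedged sketch of the merge step, assuming the `.p` files are pandas-readable pickles; the file names, join keys, and `is_returned` label below are placeholders for illustration only:

```python
import pandas as pd

# Placeholder file names; substitute the actual event/customer/product .p files.
events = pd.read_pickle("events.p")        # one row per purchase (edge)
customers = pd.read_pickle("customers.p")  # customer node attributes
products = pd.read_pickle("products.p")    # product node attributes

# Join the purchase rows with both endpoint attribute tables, then tidy up.
merged = (
    events
    .merge(customers, on="customer_id", how="left")
    .merge(products, on="product_id", how="left")
    .rename(columns=str.lower)               # normalise column names
    .dropna(subset=["is_returned"])          # assumed label column
)

merged.to_csv("asos_merged_training.csv", index=False)
```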
**TheLook (tabular) preparation**

- Input: Four CSVs (orders, products, users, centers)
- Process: Clean, engineer features (return flag, discount %, basket size, tenure, shipping latency)
- Output: `thelook_returns_features.csv`
- How to run: Use `theLookdata_preparation.ipynb` in Jupyter/VS Code (a feature-engineering sketch follows this list)
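A sketch of the feature engineering described above, assuming TheLook-style columns (`status`, `sale_price`, `retail_price`, order and account timestamps); the notebook's actual column names and rules may differ:

```python
import pandas as pd

order_items = pd.read_csv("order_items.csv", parse_dates=["created_at", "shipped_at"])
products = pd.read_csv("products.csv")
users = pd.read_csv("users.csv", parse_dates=["created_at"])

df = (order_items
      .merge(products, left_on="product_id", right_on="id", suffixes=("", "_product"))
      .merge(users, left_on="user_id", right_on="id", suffixes=("", "_user")))

# Return flag: assumes the item status column marks returns as "Returned".
df["is_returned"] = (df["status"] == "Returned").astype(int)

# Discount %: how far the sale price sits below the listed retail price.
df["discount_pct"] = 1 - df["sale_price"] / df["retail_price"]

# Basket size: number of items in the same order.
df["basket_size"] = df.groupby("order_id")["product_id"].transform("count")

# Tenure: days between account creation and the purchase.
df["tenure_days"] = (df["created_at"] - df["created_at_user"]).dt.days

# Shipping latency: days between ordering and shipping.
df["shipping_latency_days"] = (df["shipped_at"] - df["created_at"]).dt.days

df.to_csv("thelook_returns_features.csv", index=False)
```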
```
.
├── 0_domain_study
├── 1_datasets
├── 2_data_preparation
├── 3_data_exploration
├── 4_data_analysis
├── 5_communication_strategy
├── 6_final_presentation
├── collaboration
├── notes
├── .github
├── .vscode
├── LICENSE
├── README.md
└── ... (other configuration files)
```
**ASOS (graph dataset)**

- Return rates: Most purchases are not returned; most customers return only once or twice.
- Key drivers: Product type, gender, country, age.
- Imbalance: Most records are "not returned".
- Visuals: Return frequencies, rates by demographic, product, geography.
**TheLook (tabular dataset)**

- Return share: ~10% of items returned.
- Key drivers: Season, product category, country, distribution center.
- Numeric features: Weak linear correlation with returns, so non-linear models are needed.
- Visuals: Feature distributions, return rates by group (a minimal group-by sketch follows this list).
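A minimal group-by sketch of how such return-rate views can be produced, assuming the feature table and column names from the preparation sketch above:

```python
import pandas as pd

df = pd.read_csv("thelook_returns_features.csv")

# Overall return share (roughly 10% in this dataset).
print("overall return rate:", df["is_returned"].mean())

# Return rate by product category and by country, sorted to surface hotspots.
by_category = df.groupby("category")["is_returned"].mean().sort_values(ascending=False)
by_country = df.groupby("country")["is_returned"].mean().sort_values(ascending=False)
print(by_category.head(10))
print(by_country.head(10))
```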
**ASOS (graph dataset)** (a baseline sketch follows this list):

- Models: Logistic Regression, Random Forest, GNNs
- Top features: Customer/product return history, product type, country
- Accuracy: ~75% (Random Forest)
- Limitations: No return reasons, only historical patterns
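A hedged sketch of the kind of Random Forest baseline summarised above, run on the merged ASOS table; the `is_returned` label and the one-hot encoding step are assumptions for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("asos_merged_training.csv")  # output of the preparation step

# Assumed schema: a binary is_returned label plus categorical/numeric features.
# In practice, drop raw id columns before encoding.
y = df["is_returned"]
X = pd.get_dummies(df.drop(columns=["is_returned"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=300, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, rf.predict(X_test)))

# Which columns drive predictions (e.g., return history, product type, country)?
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```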
**TheLook (tabular dataset)** (an evaluation sketch follows this list):

- Models: Logistic Regression (baseline), XGBoost (advanced)
- Best ROC-AUC: 0.655 (XGBoost)
- Top features: Product category, discount %, basket size, tenure
- Insights: High error in certain categories/geographies; numeric features alone not enough
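A sketch of how an ROC-AUC score like the one above can be computed with XGBoost on the engineered features; the `xgboost` dependency, the feature subset, and the column names (taken from the preparation sketch) are assumptions:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("thelook_returns_features.csv")

# Assumed feature subset; the notebook's exact feature list may differ.
features = ["discount_pct", "basket_size", "tenure_days", "shipping_latency_days"]
X, y = df[features], df["is_returned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(
    n_estimators=400, max_depth=4, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8, eval_metric="logloss",
)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, proba))
```

ROC-AUC is reported rather than plain accuracy because the classes are imbalanced (most items are not returned), so a probability-ranking metric is more informative.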
- Graph and tabular models each reveal unique patterns.
- Product category, customer history, and geography are strong predictors.
- Combining both approaches can improve return prediction and business strategy.
Team of Imposters:
© 2025 | Mission Impostible - E-commerce Product Return Prediction Project