Retail businesses thrive on their ability to anticipate demand. Understanding sales patterns allows companies to optimize inventory, allocate resources efficiently, and maximize revenue.
But how do we accurately forecast future sales?
This research is based on a Kaggle competition focused on forecasting store sales using historical data (2013-2017) from Corporación Favorita, a large grocery retailer in Ecuador.
We compare traditional regression methods with machine learning techniques like XGBoost to uncover what drives store performance.
If you are interested in our more in-depth analysis, you can find it by following this link.
Importance of Store Sales Forecasting
Sales forecasting is more than just predicting future numbers—it’s about making data-driven business decisions that can significantly impact operations. Accurate predictions help:
- Prevent Overstocks & Shortages – Avoid tying up capital in excessive inventory or missing out on sales due to understocking.
- Optimize Staffing & Operations – Schedule employees effectively based on demand fluctuations.
- Plan Marketing & Promotions – Identify peak sales periods and allocate promotional budgets more effectively.
- Improve Supply Chain Management – Ensure that products are available in the right locations at the right time.
However, forecasting sales is complex due to multiple influencing factors—seasonality, holidays, store locations, promotions, and even external economic conditions like oil prices.
Our Approach: Best Machine Learning Model
We tested different forecasting approaches to compare their effectiveness:
Linear Regression
A classic statistical model that establishes relationships between independent variables (like holidays, promotions, and past sales) and the target variable (future sales).
While simple and interpretable, it assumes linear relationships, which can limit accuracy.
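To make the idea concrete, here is a minimal sketch of ordinary least squares with a single feature. The numbers are made up for illustration and are not taken from the competition data, and the function name `fit_simple_ols` is our own. The real model used many features, but the principle of fitting a straight-line relationship is the same.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Fit y ~ intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up toy data: number of items on promotion vs. units sold.
items_on_promo = [0, 1, 2, 3, 4, 5]
units_sold = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

intercept, slope = fit_simple_ols(items_on_promo, units_sold)

# Forecast for a day with 6 items on promotion.
forecast = intercept + slope * 6
print(f"intercept={intercept:.2f}, slope={slope:.2f}, forecast={forecast:.2f}")
```

The slope is easy to read (each extra promoted item adds a fixed number of units), which is the interpretability benefit. The same straight line is also the limitation: it cannot bend to follow a demand curve that flattens or spikes.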
XGBoost (Extreme Gradient Boosting)
A powerful ensemble learning method that improves predictions by iteratively learning from previous mistakes. It uses decision trees as its base learners, combining them sequentially to improve the model’s performance. Unlike linear models, it captures complex patterns and interactions between variables.
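The "learning from previous mistakes" loop can be sketched in a few lines of plain Python. This is a simplified illustration of gradient boosting with one-split trees (stumps) and squared error, not XGBoost itself, which adds regularization and many other refinements. The data, function names, and parameter values below are all hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that best splits the residuals.

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_boosted_stumps(x, y, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [
            pi + learning_rate * predict_stump(stump, xi)
            for xi, pi in zip(x, predictions)
        ]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict_boosted(model, xi):
    correction = sum(predict_stump(s, xi) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * correction


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up toy data with a non-linear bump that a straight line cannot follow.
day_index = list(range(1, 11))
sales = [5, 5, 5, 20, 20, 20, 8, 8, 8, 8]

model = fit_boosted_stumps(day_index, sales)
boosted_mse = mse(sales, [predict_boosted(model, d) for d in day_index])
baseline_mse = mse(sales, [model["base"]] * len(sales))
print(f"baseline MSE={baseline_mse:.2f}, boosted MSE={boosted_mse:.4f}")
```

Each round only has to correct what the previous rounds got wrong, which is how a sequence of very simple trees can add up to a flexible model.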
Key Findings & Insights from our Analysis
Sales and Holidays Observation
Surprisingly, despite a weak direct correlation (0.0140) between holidays and sales, our feature importance analysis revealed that holidays played a dominant role in sales prediction, especially in the linear regression model.
This apparent contradiction arises because a correlation coefficient only captures the overall linear relationship between two variables, while a regression model weighs holidays alongside the other features.
This suggests that while holidays may not have a strong uniform impact on sales overall, they play a critical role in predicting demand for specific product categories.
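A small, made-up example shows how this can happen. In the toy data below (not from the competition; the family names and numbers are purely illustrative), holidays lift sales for one product family and depress them for another. The pooled correlation is zero even though the holiday flag is highly informative within each family.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sx = sqrt(sum((a - x_bar) ** 2 for a in x))
    sy = sqrt(sum((b - y_bar) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up rows: (product family, is_holiday, units sold).
rows = [
    ("BEVERAGES", 0, 100), ("BEVERAGES", 0, 102), ("BEVERAGES", 0, 98),
    ("BEVERAGES", 1, 150), ("BEVERAGES", 1, 148), ("BEVERAGES", 1, 152),
    ("CLEANING", 0, 100), ("CLEANING", 0, 101), ("CLEANING", 0, 99),
    ("CLEANING", 1, 50), ("CLEANING", 1, 52), ("CLEANING", 1, 48),
]


def holiday_sales_correlation(selected_rows):
    holiday = [r[1] for r in selected_rows]
    units = [r[2] for r in selected_rows]
    return pearson(holiday, units)


pooled = holiday_sales_correlation(rows)
beverages = holiday_sales_correlation([r for r in rows if r[0] == "BEVERAGES"])
cleaning = holiday_sales_correlation([r for r in rows if r[0] == "CLEANING"])

print(f"pooled={pooled:.3f}, beverages={beverages:.3f}, cleaning={cleaning:.3f}")
```

A model that sees both the holiday flag and the product family can exploit these opposing effects, while a single pooled correlation averages them away.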
Forecasting Result
The predictions revealed seasonal trends, with higher sales observed around public sector salary payment dates and significant fluctuations following major holidays and economic events. Notably, the 2016 earthquake had a residual impact on sales patterns, highlighting the importance of external shocks in retail forecasting.
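A salary-date effect like this is typically exposed to a model as a simple calendar flag. The sketch below assumes the rule given in the competition's data notes, that public sector wages are paid on the 15th and on the last day of each month. The function name `is_payday` is hypothetical, and we are not claiming this is exactly how the feature was built in our pipeline.

```python
import calendar
from datetime import date


def is_payday(d):
    """Return True if d is the 15th or the last day of its month."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return d.day == 15 or d.day == last_day


# A few example dates, including a leap-year February.
examples = [date(2016, 2, 28), date(2016, 2, 29), date(2017, 1, 15)]
flags = {d.isoformat(): is_payday(d) for d in examples}
print(flags)
```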
Comparing Machine Learning Models
When evaluating performance, XGBoost far outperformed Linear Regression:
XGBoost delivered much higher R² scores (closer to 1) and lower errors (MAE and RMSE), and it captured trends more effectively. This suggests that, on this dataset, tree-based models handle complex relationships better than linear models.
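For readers unfamiliar with these metrics, here is a minimal sketch of how MAE, RMSE, and R² are computed. The actual and predicted values are made up for illustration and are not results from our models.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error: average size of the miss, in sales units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: like MAE, but penalizes large misses more."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Share of the variance in actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
actual_sales = [10, 20, 30, 40]
predicted_sales = [12, 18, 33, 39]

print(f"MAE={mae(actual_sales, predicted_sales):.3f}")
print(f"RMSE={rmse(actual_sales, predicted_sales):.3f}")
print(f"R2={r_squared(actual_sales, predicted_sales):.3f}")
```

Lower MAE and RMSE mean smaller forecast errors, while an R² closer to 1 means the model explains more of the variation in sales.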
Feature Importance Reveals Key Sales Drivers
Our feature importance analysis uncovered the most influential factors behind sales:
🔹 Store & Product Family – Different stores and product families exhibited varying demand patterns.
🔹 Past Sales Trends (Lags & Rolling Averages) – Recent sales data, including 7-day and 30-day averages, were strong predictors.
🔹 Promotions & Holidays – Promotional events boosted sales, while holidays influenced demand differently across stores and product categories.
🔹 External Factors – Macroeconomic indicators such as oil prices, along with seasonal trends, played a role. Economic policies, political events, and natural disasters also contributed to sales fluctuations.
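The lag and rolling-average features from the list above can be sketched as follows. This is a minimal illustration on made-up daily sales. The helper names are hypothetical, and we assume each rolling average uses only days before the one being predicted, so the feature does not leak the value it is meant to forecast.

```python
def lag(series, k):
    """Value from k days earlier; None where there is not enough history."""
    return ([None] * k + list(series))[: len(series)]


def rolling_mean_past(series, window):
    """Mean of the previous `window` days, excluding the current day."""
    out = []
    for i in range(len(series)):
        if i < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(series[i - window:i]) / window)
    return out


# Made-up daily unit sales for one store and product family.
daily_sales = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19]

lag_1 = lag(daily_sales, 1)
rolling_7 = rolling_mean_past(daily_sales, 7)

for day, (sales, l1, r7) in enumerate(zip(daily_sales, lag_1, rolling_7)):
    print(day, sales, l1, r7)
```

A 30-day average works the same way with `window=30`, given a long enough history.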
Final Thoughts: What This Means for Businesses
📊 Data-Driven Decision Making is Key
Retailers can no longer rely on intuition alone. Advanced forecasting models like XGBoost empower businesses to make strategic decisions with confidence.
🔍 Machine Learning Provides a Competitive Edge
Traditional models have their place, but machine learning unlocks deeper insights by capturing non-linear relationships. Implementing such models can improve demand forecasting and overall efficiency.
💡 Holidays & Promotions Must Be Planned Carefully
Despite the weak direct correlation between holidays and sales, regression models highlight the cumulative impact of holidays and promotions on sales. Businesses should analyze these factors together when planning their marketing strategies.
What’s Next?
👉 Want to leverage data analysis for smarter business decisions? DataHen can help you turn data into actionable insights. Contact us today!