I just wrapped up an intensive project with WorldQuant University, where I built an end-to-end predictive pipeline to forecast air quality (PM2.5 levels) in Nairobi.
Unlike standard tabular datasets, time-series data requires an entirely different engineering and modeling mindset. Here is a breakdown of the technical milestones and core takeaways from this project:
1. The Data Pipeline (NoSQL to Pandas)
Before modeling, you have to ingest the data. I worked with semi-structured air quality sensor data stored in a MongoDB database.
Connected with pymongo and queried the relevant collection for the PM2.5 readings.
Processed and flattened nested JSON structures into a structured Pandas DataFrame.
Handled irregular sensor frequencies by resampling data into a fixed hourly index and addressing missing values to maintain continuity.
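To make those steps concrete, here is a minimal sketch of the flatten-and-resample part. The documents are hypothetical stand-ins for what a pymongo query might return; the field names ("timestamp", "metadata", "P2") and the forward-fill strategy are assumptions for illustration, not the project's actual schema or imputation choice.

```python
import pandas as pd

# Hypothetical documents, shaped like the result of a pymongo
# `collection.find(...)` call. Field names are assumed for illustration.
docs = [
    {"timestamp": "2018-09-01T00:02:00", "metadata": {"site": 29}, "P2": 12.0},
    {"timestamp": "2018-09-01T00:31:00", "metadata": {"site": 29}, "P2": 14.0},
    {"timestamp": "2018-09-01T02:10:00", "metadata": {"site": 29}, "P2": 9.0},
]


def to_hourly_series(documents, value_field="P2"):
    """Flatten nested documents and resample readings to a fixed hourly index."""
    # json_normalize expands nested keys into columns such as "metadata.site".
    df = pd.json_normalize(documents)
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = df.set_index("timestamp").sort_index()

    # Average all readings inside each hour, then forward-fill empty hours.
    # Forward-fill is one assumed way to handle gaps; others exist.
    hourly = df[value_field].resample("1h").mean().ffill()
    return hourly.rename("PM2.5")


hourly = to_hourly_series(docs)
print(hourly)
```

The 01:00 hour has no readings in this toy input, so it inherits the previous hour's mean. Whether that is acceptable depends on how long the real gaps are.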
2. Feature Engineering Through "Lags"
In time-series work, you often have no external features, so the past becomes your input. By shifting the series values by one hour against the timestamp index, I engineered "lag features" (using the prior hour's air quality to predict the next hour). This reframing turns a raw sequence into a supervised learning matrix of inputs and targets.
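A small sketch of that idea, using made-up hourly values. The helper name `make_lag_frame` and the column names are hypothetical; the core of it is just `Series.shift`.

```python
import pandas as pd


def make_lag_frame(y, n_lags=1):
    """Build a supervised-learning frame: target "y" plus its previous values."""
    df = y.to_frame(name="y")
    for k in range(1, n_lags + 1):
        # shift(k) moves values k rows down, so each row sees the value k hours ago.
        df[f"y_lag{k}"] = df["y"].shift(k)
    # The first n_lags rows have no complete history, so drop them.
    return df.dropna()


# Made-up hourly readings, only for illustration.
idx = pd.date_range("2018-09-01", periods=5, freq="1h")
y = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0], index=idx)

frame = make_lag_frame(y, n_lags=2)
print(frame)
```

Each remaining row pairs a target hour with the readings from the one and two hours before it, which is the shape a regression model expects.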
3. Deciphering the Math (ACF and PACF)
To determine how many historical lags the Autoregressive (AR) model actually needed, I used:
Autocorrelation Function (ACF): To measure the total correlation between current and past data points.
Partial Autocorrelation Function (PACF): To strip away the indirect influence of intermediate lags and isolate the direct relationship between a specific past hour and the present.
4. Why Traditional Validation Fails (Enter WFV)
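One of my biggest takeaways was why standard random train-test splits and shuffled K-Fold Cross-Validation do not work for time-series. Shuffling puts observations from the future into the training set, so the model gets scored on past values using information it would never have in deployment. That is a form of "data leakage," and it makes the scores look better than they really are.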
Instead, I implemented Walk-Forward Validation (WFV). This mimics real-world deployment: the model predicts the next time step, tests against the actual result, folds that real result into its training history, and moves forward one step at a time.
Building this with statsmodels, pymongo, and scikit-learn has completely reshaped how I approach chronological data.