Report - Microbusiness density prediction team.
Just as a recap, our dataset primarily contains the activity index, industry and commerce dataset, census dataset, and the original microbusiness density dataset.
For boosting methods, we constructed the following features: lag features, basic features, encoding features, and rank features.
We then built features based on their neighbors. The k-NN model is based on several attributes, such as census data, microbusiness densities data, and their corresponding changes. The k-NN features improved our training loss and validation loss. The validation loss for LightGBM decreased by around 3% to approximately 2.3, showing that capturing neighbors as features for predictions is highly effective.
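The neighbor features can be sketched roughly as follows. This is a minimal pure-NumPy illustration, not the team's actual code: `knn_density_features` is a hypothetical helper, and the real model used more attributes and a tuned k.

```python
import numpy as np

def knn_density_features(attrs, densities, k=3):
    """For each county, average the microbusiness densities of its
    k nearest neighbors in attribute space (census features, etc.)."""
    # Pairwise Euclidean distances between county attribute vectors.
    diff = attrs[:, None, :] - attrs[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)          # exclude the county itself
    nbrs = np.argsort(dist, axis=1)[:, :k]  # indices of k nearest neighbors
    return densities[nbrs].mean(axis=1)     # mean neighbor density per county

# Toy example: 5 counties, 2 census attributes each.
attrs = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [0., 0.5]])
dens = np.array([1.0, 2.0, 10.0, 12.0, 1.5])
feat = knn_density_features(attrs, dens, k=2)
```

The resulting column can be appended to the boosting feature matrix alongside the lag and encoding features.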
We used SMAPE (Symmetric Mean Absolute Percentage Error) as our time series evaluation metric. It is commonly used for time series tasks and penalizes underpredictions more than overpredictions. Another commonly used percentage-based metric is Mean Absolute Percentage Error (MAPE). However, MAPE has its limitations:
1. It cannot be used if there are zero or close-to-zero values, as division by zero or small values will tend to infinity.
2. Forecasts that are too low will have a percentage error that cannot exceed 100%, but for forecasts that are too high, there is no upper limit to the percentage error. This means that the evaluation metric will systematically select a method whose forecasts are low.
In contrast to MAPE, SMAPE is bounded both below (at 0%) and above (at 200%). The log-transformed accuracy ratio, an alternative to MAPE, actually has a similar shape to SMAPE.
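The asymmetry between the two metrics is easy to verify numerically; here is a minimal sketch (not the competition's exact scoring code):

```python
import numpy as np

def smape(actual, forecast):
    # SMAPE in percent: bounded in [0, 200].
    return 100 * np.mean(2 * np.abs(forecast - actual)
                         / (np.abs(actual) + np.abs(forecast)))

def mape(actual, forecast):
    # MAPE in percent: unbounded above for over-forecasts.
    return 100 * np.mean(np.abs(forecast - actual) / np.abs(actual))

actual = np.array([100.0])
under = np.array([50.0])    # forecast 50 units too low
over = np.array([150.0])    # forecast 50 units too high

# MAPE treats both 50-unit misses the same (50% each), while SMAPE
# penalizes the under-forecast more:
# smape(actual, under) ≈ 66.7, smape(actual, over) ≈ 40.0
```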
In our task, when considering Type 1 and Type 2 errors for prediction, we would rather optimize for Type 2 errors to penalize our predictions for microbusiness densities that are far below true values. From a resource allocation perspective, underpredicting microbusiness densities can be harmful to those doing business since they won't receive the necessary resources. SMAPE penalizes underpredictions more than log-transformed MAPE, which is why we chose it.
Previously, we identified some key features contributing to our training. Some examples include the percentage of households with broadband access and the percentage of the population aged 25+ with a college degree. These features suggest that some economic indicators may be correlated with microbusiness densities, so we added more external datasets to our model.
Specifically, we added the following features: ... With these features added, the loss improved by another 3%.
Improving the model solely through feature engineering is challenging. What about engineering our target? We already transformed our target to the log difference. Can we do more?
The answer lies in outliers. For time series tasks, models are very sensitive to outliers. There are many outliers in our target across timestamps, especially for counties with lower populations. Some of the smallest counties have less than 1,000 people, and if one person suddenly decides to start a microbusiness, the density can change drastically. So, we applied smoothing to our target as follows: ...
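As an illustration of the kind of target smoothing described, here is a minimal Hampel-style filter sketch. The team's actual transformation is elided above; `smooth_target`, the window size, and the threshold are assumptions for this example.

```python
import numpy as np

def smooth_target(series, window=5, n_sigmas=2.0):
    """Replace points that deviate strongly from a rolling median
    with the median itself (a simple Hampel-style filter)."""
    s = series.copy().astype(float)
    half = window // 2
    for i in range(len(s)):
        lo, hi = max(0, i - half), min(len(s), i + half + 1)
        med = np.median(s[lo:hi])
        mad = np.median(np.abs(s[lo:hi] - med))  # robust spread estimate
        if mad > 0 and abs(s[i] - med) > n_sigmas * 1.4826 * mad:
            s[i] = med
    return s

# A small-county series where one new business causes a density spike.
series = np.array([3.0, 3.1, 3.0, 9.5, 3.2, 3.1, 3.0])
smoothed = smooth_target(series)
```

The spike at index 3 is pulled back to the local median while the rest of the series is left untouched.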
This outlier detection and smoothing significantly improved our model, with the SMAPE loss improving by around 30%. This demonstrates the importance of smoothing in some time series tasks. Smoothing is applied to the data for all models moving forward, as will also be shown in the other models.
After applying some tricks, here are the feature importances from LightGBM and XGBoost. With only the top 30 features, the model performs as well as it does with the full dataset. As you can see: ...
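Selecting the top 30 features by importance can be sketched as follows; the importance scores here are made up and merely stand in for what `feature_importances_` from LightGBM or XGBoost would return:

```python
import numpy as np

# Hypothetical importance scores, standing in for a trained
# gradient-boosting model's feature_importances_ output.
feature_names = np.array([f"f{i}" for i in range(100)])
importances = np.random.RandomState(0).gamma(1.0, size=100)

top_k = 30
top_idx = np.argsort(importances)[::-1][:top_k]  # highest scores first
top_features = feature_names[top_idx]
```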
First, we aim to find a lower-dimensional representation of the distances between targets:
t-SNE: This method models pairwise similarities between data points in both the high-dimensional and low-dimensional spaces using probabilistic distributions. The loss function for t-SNE is the KL divergence between the pairwise similarities in high-dimensional space (p) and low-dimensional space (q). Pairwise similarities for high dimensions are modeled using Gaussian distributions, while the low-dimensional pairwise similarities are modeled using t-distributions with one degree of freedom.
Here is a clustering algorithm called DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise.
Our first approach is to use t-SNE to find the optimal low-dimensional representation of pairwise distances of the target and then perform clustering based on DBSCAN. However, the projection does not seem to be consistent. So, let's explore some other dimension reduction techniques.
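A minimal sketch of this t-SNE-then-DBSCAN pipeline on synthetic data; the perplexity, `eps`, and `min_samples` values are illustrative assumptions, not the team's settings:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(42)
# Two synthetic groups of county target trajectories (20 counties each,
# 12 monthly values per county).
targets = np.vstack([rng.normal(0, 0.1, (20, 12)),
                     rng.normal(1, 0.1, (20, 12))])

# Project the 12-dimensional trajectories to 2-D with t-SNE ...
emb = TSNE(n_components=2, perplexity=5, random_state=42,
           init="random").fit_transform(targets)

# ... then cluster the low-dimensional embedding with DBSCAN.
labels = DBSCAN(eps=3.0, min_samples=3).fit_predict(emb)
```

Because t-SNE's objective is non-convex and its initialization here is random, repeated runs give different embeddings, which is exactly the inconsistency noted above.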
This is UMAP (Uniform Manifold Approximation and Projection), which relies on Riemannian geometry of the manifold and algebraic topology. It uses fuzzy simplicial sets to approximate the underlying structure of the data. This method captures both local and global structures by considering the distances between data points in the high-dimensional space and building a topological representation of the data.
The loss function for UMAP is the cross-entropy between the pairwise similarities in the high-dimensional space (P) and the low-dimensional space (Q). To model the local and global structure of the data, high-dimensional pairwise similarities are based on the distance metric, and low-dimensional pairwise similarities are based on the negative exponential of the distance in the embedding space.
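The cross-entropy objective described above, with p_ij the high-dimensional and q_ij the low-dimensional pairwise similarities, is in its standard form:

```latex
C = \sum_{i \neq j} \left[ \, p_{ij} \log \frac{p_{ij}}{q_{ij}}
    + (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}} \, \right]
```

The first term pulls genuinely similar points together (local structure), while the second pushes dissimilar points apart (global structure).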
We chose to use UMAP mainly because it is better at preserving the global structure of the data, while t-SNE focuses more on local structure. UMAP is more consistent due to its deterministic initialization compared to the random initialization in t-SNE, and it scales better.
So, our second approach finds neighbors based on clusters from UMAP. The clusters appear reasonable, and in fact there is significant overlap with the k-NN neighbors from the previous approach.
After finding the neighbors, we can use these neighbors to construct graphs for graph neural networks. For each county, the counties' longitude and latitude are appended for distance calculation. Considering each county as the source, the destinations are the neighbors, and the weights are the normalized distances based on their geographical data.
Since we are predicting microbusiness density (MD) at a monthly level, a graph is generated for each month in the training data.
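The per-month graph construction might look roughly like this; `build_month_graph` is a hypothetical helper, and the simple Euclidean distance on longitude/latitude stands in for whatever geographic distance the team actually used:

```python
import numpy as np

def build_month_graph(lon, lat, neighbors):
    """Build a weighted edge list for one month: each county is a source,
    its neighbor counties are destinations, and the edge weights are the
    per-source-normalized geographic distances."""
    edges, weights = [], []
    for src, nbrs in neighbors.items():
        d = np.sqrt((lon[nbrs] - lon[src]) ** 2 + (lat[nbrs] - lat[src]) ** 2)
        w = d / d.sum()                     # normalize per source county
        for dst, wt in zip(nbrs, w):
            edges.append((src, dst))
            weights.append(wt)
    return edges, np.array(weights)

# Toy coordinates for 4 counties and neighbor lists for 2 of them.
lon = np.array([0.0, 0.5, 1.0, 5.0])
lat = np.array([0.0, 0.5, 0.0, 5.0])
neighbors = {0: np.array([1, 2]), 1: np.array([0, 2])}
edges, weights = build_month_graph(lon, lat, neighbors)
```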
Here is our model architecture: it consists of 1, 2, 3... layers. On the right is our training process, where early stopping is employed, and the validation loss is actually very low, resulting in a performance of SMAPE around 0.9. The reason the model performs well with neighbors is due to graph convolution.
Graph convolution is an effective technique for processing graph-structured data. In a graph convolutional network (GCN), node features are updated based on the features of their neighbors, allowing the model to learn a rich and expressive representation of the nodes in the graph. This is especially useful when dealing with spatial data, as it captures the local dependencies and relationships between neighboring nodes.
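A single GCN layer with symmetric degree normalization (the standard Kipf–Welling formulation, sketched here in NumPy rather than the team's actual framework code):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W),
    i.e. each node aggregates degree-normalized neighbor features,
    then applies a learned linear map and a ReLU."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# 3 counties in a line graph, 2 input features, identity weights.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.array([[1., 0.], [0., 1.], [1., 1.]])
W = np.eye(2)
H_new = gcn_layer(A, H, W)
```

Stacking such layers lets information propagate across multi-hop county neighborhoods.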
In our case, the model performs well because the graph convolution layers efficiently capture and learn the spatial relationships between neighboring counties. This leads to high model performance; however, during training we had to use a very small batch size, which constrained performance.
GCN generalizes the idea of convolutions from grid data to graph data. In contrast, there are 1D CNNs, which are particularly effective for time series analysis. They employ convolutional layers with filters that slide along the input data, capturing local patterns in the sequence. These filters are able to automatically detect and learn features, extracting meaningful information from the data without requiring manual feature engineering.
On the right is an example of applying different types of filters. Intuitively, some filters can smooth the time series data, which is very effective for our MD prediction.
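For instance, a moving-average kernel, one of the filters a 1D convolution can represent, smooths local spikes in the series (toy values for illustration):

```python
import numpy as np

# A moving-average filter is one kind of kernel a 1D convolution
# layer can learn; applied to a noisy series it damps local spikes.
series = np.array([3.0, 3.2, 2.9, 8.0, 3.1, 3.0, 3.2])
kernel = np.ones(3) / 3.0                       # length-3 smoothing filter
smoothed = np.convolve(series, kernel, mode="valid")
```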
For 1D CNN, we are using only MD as features, with more explanation to be provided later. Here is how we prepare data for 1D CNN:
...
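One common way to window a univariate series for a 1D CNN (the team's actual preparation is elided above; `make_windows` and the lag count are assumptions) is:

```python
import numpy as np

def make_windows(series, n_lags):
    """Turn a univariate series into (samples, n_lags, 1) inputs and
    next-step targets suitable for a 1D CNN."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X[..., None], y   # trailing axis is the single input channel

series = np.arange(10, dtype=float)
X, y = make_windows(series, n_lags=4)
```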
After the convolution step, we need a layer that captures sequential dependencies. The candidate models are LSTM and GRU. Both are types of recurrent neural networks (RNNs) designed to handle sequential data, and both mitigate the vanishing and exploding gradient problems.
We have decided to use a GRU layer for the following reasons:
1. Computational efficiency: GRUs offer similar performance compared to LSTMs but are more computationally efficient. This is because LSTMs have three gates (input, forget, and output), while GRUs have only two (update and reset).
2. Fewer parameters: GRUs have fewer parameters than LSTMs, which reduces the risk of overfitting and speeds up training.
3. Shorter-term dependencies: In many cases, microbusiness densities are more dependent on recent values than on those from, say, two years ago. GRUs are better suited for capturing short-term dependencies because they have a simpler gating mechanism than LSTMs, which allows them to adapt more quickly to recent changes in the data. This is especially useful for our task, where the focus is on capturing the most relevant information from the recent past.
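The parameter difference in reason 2 is easy to see by counting cell weights (exact counts vary slightly by framework, e.g. some implementations use two bias vectors per unit; the dimensions below are illustrative):

```python
def rnn_param_count(input_dim, hidden_dim, n_gate_units):
    """Weights for an RNN cell: each gate/candidate unit has an input
    matrix, a recurrent matrix, and one bias vector."""
    per_unit = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return n_gate_units * per_unit

# LSTM: 3 gates + cell candidate = 4 unit blocks; GRU: 2 gates + candidate = 3.
lstm = rnn_param_count(input_dim=16, hidden_dim=64, n_gate_units=4)
gru = rnn_param_count(input_dim=16, hidden_dim=64, n_gate_units=3)
# The GRU cell uses 25% fewer parameters than the LSTM cell here.
```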
Here is the architecture of our models.
Note that the GRU layer is placed after the 1D convolutions. The combination of 1D CNN and GRU layers takes advantage of the strengths of both architectures: the 1D CNN captures local patterns and features in the data, while the GRU layer models the long-range temporal relationships. This combination yields a richer representation of the time series data, leading to better predictions.
The 1D CNN + GRU model provides similar performance to the GNN, and it is important to note that we are only using microbusiness densities for prediction. This model architecture has potential for a wide range of time series tasks, not just for MD prediction.