We developed a Random Forest algorithm to automatically detect air quality conditions in parts of the United States, along with a Tableau dashboard to visualize the detections and assess their impact.
This tool is important because it enables researchers and key decision-makers to easily assess the scope and impact of air quality conditions in specific regions of the United States. For instance, we designed our solution to show that states with moderate air quality conditions have high population density, a high number of automobile registrations, and a high number of individuals at risk of chronic lower respiratory diseases.
The solution was demonstrated using an interactive Tableau dashboard. The first interface visualizes the air quality category of each of the states we incorporated. Based on the recommended breakpoints for the 24-hour average air quality index (AQI), the model's predictions fall into three categories: good, moderate, and unhealthy. Upon clicking on a state, one can view the forecast of air quality conditions for that state, produced by the Random Forest algorithm.
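To illustrate how predictions are bucketed for the dashboard, here is a minimal sketch in Python. The exact thresholds used in the project are not stated above; the values below follow the commonly used AQI convention (0–50 good, 51–100 moderate, above 100 unhealthy) and should be treated as an assumption.

```python
# Hypothetical helper that maps a predicted 24-hour average AQI value to one
# of the three categories shown on the dashboard. The breakpoints are an
# assumption based on the standard AQI convention, not confirmed project values.
def aqi_category(aqi: float) -> str:
    if aqi <= 50:
        return "good"
    elif aqi <= 100:
        return "moderate"
    else:
        return "unhealthy"


predictions = [32.5, 78.1, 142.0]                # example model outputs
print([aqi_category(p) for p in predictions])    # ['good', 'moderate', 'unhealthy']
```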

Furthermore, the second interface shows the six states with the worst air quality conditions in 2019. Users can observe that states with a high AQI also tend to have a large number of registered vehicles and a higher incidence of chronic obstructive pulmonary disease (COPD).

We hope that, with further development, this tool can provide relevant information to different groups, both as an archive of past data and as a tool that leverages that data to predict future trends. We hope that other phenomena will be included in the tool as time goes on.
Our team took on this challenge because we were inspired by the idea of building a tool that could potentially save many lives just by automatically analyzing data from a variety of sources and putting this analysis into the hands of key decision-makers, as well as the general public.
Our approach to solving the problem involved investigating several machine learning models to automate the detection of hazards, building a dashboard to visualize the detections, and incorporating ancillary data to show the scope and impact of the detected hazards.
To develop the machine learning model, we utilized popular Python libraries such as scikit-learn and TensorFlow, which allowed us to explore the data, construct new features, and evaluate several machine learning models. Throughout the hackathon we tried the following models: Linear Regression, Support Vector Regression, Random Forest Regression, Gradient Boosting Regression, XGBoost Regression, K-Nearest Neighbours Regression, a Bidirectional LSTM network, a Bidirectional GRU network, a Multilayer Perceptron network, and a 1-D Convolutional network. We utilized Google Colab to train and evaluate these models, using the R² value as the primary evaluation metric. After many attempts at optimizing the models, our highest-performing model was the Random Forest Regression model, which achieved an R² of 0.56.
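A minimal sketch of how the best-performing model was trained and scored with scikit-learn is shown below. The feature matrix here is random stand-in data; in the real pipeline, the features come from the air quality dataset and the ancillary open data described elsewhere, and the hyperparameters are placeholders rather than the values actually used.

```python
# Sketch of the Random Forest training and R^2 evaluation loop.
# X and y below are synthetic placeholders for the real features and AQI target.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))                    # placeholder feature matrix
y = rng.normal(loc=60, scale=25, size=1000)       # placeholder AQI target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# R^2 on held-out data was the primary metric; our best run reached about 0.56.
print("R2:", r2_score(y_test, model.predict(X_test)))
```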
We utilized Tableau to create a dashboard for visualizing the detections. Since our model is not highly accurate, we demonstrated the impact of the solution by visualizing an aggregation of the detections. We also included ancillary data such as the population density of U.S. states (2019) and an estimate of the forest cover in each state. These ancillary data are intended as supplementary information for assessing the impact of the detections.
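A hypothetical version of the aggregation step that prepares the dashboard extract is sketched below: per-station predictions are averaged per state and joined with the ancillary open data before being handed to Tableau. Column names, file names, and the sample values are illustrative assumptions, not the project's actual schema.

```python
# Aggregate station-level predictions to state level and attach ancillary data
# (population density, forest cover) for the Tableau dashboard. Illustrative only.
import pandas as pd

predictions = pd.DataFrame({
    "state": ["CA", "CA", "TX"],
    "predicted_aqi": [82.0, 75.5, 48.3],
})
ancillary = pd.DataFrame({
    "state": ["CA", "TX"],
    "population_density_2019": [253.7, 112.8],
    "forest_cover_pct": [32.7, 37.3],
})

state_summary = (
    predictions.groupby("state", as_index=False)["predicted_aqi"].mean()
    .merge(ancillary, on="state", how="left")
)
state_summary.to_csv("dashboard_extract.csv", index=False)  # consumed by Tableau
```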
The main problems we faced while working on this challenge relate to the data itself. Firstly, none of our team members had domain expertise in geological and meteorological data, so we did not have a clear idea of the importance of the dataset's features or how to combine them into more meaningful ones. Secondly, the data had an inconsistent temporal resolution (data points are not consistently sampled every hour from each station), which was especially difficult for our machine learning models to handle. Finally, we also struggled to decide which ancillary data to include in the visualization tool to create a more impactful solution.
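One way to mitigate the inconsistent temporal resolution is to resample each station's readings onto a regular hourly grid and interpolate only short gaps; the sketch below shows this approach with pandas. The column names and gap limit are assumptions for illustration, not a description of the preprocessing we actually shipped.

```python
# Resample irregularly timed station readings onto an hourly grid.
# Column names (timestamp, station_id, aqi) are assumed, not the dataset's real schema.
import pandas as pd

readings = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2019-07-01 00:00", "2019-07-01 00:40", "2019-07-01 03:15"]
    ),
    "station_id": ["S1", "S1", "S1"],
    "aqi": [42.0, 45.0, 51.0],
})

hourly = (
    readings.set_index("timestamp")
    .groupby("station_id")["aqi"]
    .resample("1h")
    .mean()                    # average readings that fall within the same hour
    .interpolate(limit=2)      # fill short gaps; longer gaps remain missing
    .reset_index()
)
print(hourly)
```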
Despite these challenges, our team put together a tool that we believe will become increasingly useful with further development.
In analyzing the problem, we used the space agency's air quality data. This data was used, in addition to other open data (see below), to train and evaluate several machine learning algorithms. Parts of the data were also used in the visualization tool.