Decomposing Chennai’s Air Pollution Data

Introduction

Air quality has become a major cause of concern in Indian cities. Among IQAir’s 10 worst polluted cities, 6 belong to India (based on fine particulate matter concentration (PM 2.5)).¹ In this article, I attempt to decompose the hourly PM 2.5 concentration of Chennai between 2019 and 2021, to uncover the overall trend, monthly variations, and levels of pollution on holidays. This decomposition would enable us to get a preliminary understanding of pollution dynamics in Chennai.

For this article, I use data on PM 2.5 concentration measured at the United States Consulate in Chennai. The US consulate is located in the heart of Chennai, in Anna Salai (erstwhile St. Thomas Mount Road), an arterial road that connects the south of the city with the north.

Data Pre-Processing

I retain hourly data of PM 2.5 concentration that have been tagged as ‘valid’ according to quality assessment available with the data. All other data points are treated as missing, and are linearly interpolated (as in Zheng et al. (2019)). The hourly concentration data is then used to compute the ‘NowCast’ concentration, which is a weighted average of concentrations in the past 12 hours, to smoothen the data.² The ‘NowCast’ hourly concentration of PM 2.5 concentration is used as the basis for this study.

The daily average of the ‘NowCast’ concentration is presented in the figure below.

The Daily Average of NowCast PM 2.5 Concentration in Chennai between 2019 and 2021. The light blue shading represents the bootstrapped 95% confidence interval.

It is clear from the data that there are a few extreme outliers, however, the PM 2.5 concentration largely ranges between 0 and 50 micrograms per cubic meter. This is largely in the healthy/good to moderate classification of the Air Quality Index, occasionally reaching unhealthy levels.³ From the graph above, it is hard to identify the overall trend across the three years, and more importantly the level of pollution during holidays.

The Gaussian Process Model

In order to better understand the dynamics in pollution levels in Chennai, I use a Gaussian Process Regression to decompose the data into a trend component, monthly component, and special days (i.e. holidays) component. Details of the Gaussian Process model used can be found in the appendix.

First, we explore the results from the ‘trend’ component of the Gaussian Regression. In the figure below, I plot the ‘trend’ component from the Gaussian Regression (in red), over the hourly NowCast concentration data. The data has several extreme outliers at the hourly level, making it a little hard to understand the trend.

The Trend Component from the Gaussian Process overlaid on the Hourly NowCast PM 2.5 Concentration Data

To get a clearer picture, I remove the outliers from the graph. The resulting graph shows the trend in better detail.

The Trend Component from the Gaussian Process overlaid on the Hourly NowCast PM 2.5 Concentration Data — A Closer Look

The trend component appears to fit the data well, further it shows important trends caused due to the COVID-19 lockdowns. The peak lockdown in May 2020 is clearly seen in the trend. Further, around May 2021, we can see the impact of the lockdowns enforced to curtail the second wave of the virus.

Largely, the levels of pollution do tend to track the intensity of lockdowns enforced. The table below provides the various Government Orders (G.O.) passed by the Tamil Nadu Government with regard to lockdowns. The intensity tags provided in the table are qualitative evaluations of the restrictions provided in the G.O.s (along with some evidence from the number of cases during the period).

Table detailing the various lockdown notifications issued under the Disaster Management Act of 2005 by the Tamil Nadu State Government. Source: https://tnsdma.tn.gov.in/pages/view/covid-19-lockdown-gos

Next, looking at the monthly variations in the level of pollution, we do not see a very strong pronounced trend per se, again due to the presence of extreme outliers. When excluding the extreme outliers from the graph, we see that the months of January, November, and December have slightly higher levels of pollution, perhaps due to the relatively colder and moist climate that prevails during the period.

The Coefficients for the Month Indicator Variables in the Gaussian Regression, overlaid on the Hourly NowCast PM 2.5 Concentration Data (by month). The Mean Line represents the Overall Mean of Hourly Concentration between 2019 and 2021
The Coefficients for the Month Indicator Variables in the Gaussian Regression, overlaid on the Hourly NowCast PM 2.5 Concentration Data (by month) — Closer Look. The Mean Line represents the Overall Mean of Hourly Concentration between 2019 and 2021

Finally, when looking at the ‘special days’ (i.e. Tamil Nadu Government holidays)⁴, we clearly see that Bhogi and Deepavali form the peaks.

The Coefficients for the Special Days in the Gaussian Regression- The Mean Line represents the Overall Mean of Hourly Concentration between 2019 and 2021

However, it is important to note that their additional contribution is within 150 micrograms per cubic meter, meaning that it could be argued that their contribution alone is not sufficient to push the air quality to very unhealthy levels. However, it is important not to read too much into the results, as firstly the US Consulate is not located in a residential area. One would expect the levels of pollution during Deepavali/Bhogi to be higher in residential areas, where people burst firecrackers or burn old objects. Further, the data covers 2019–2021, of which two years — 2020 and 2021, were years when there was a ‘lockdown’ in place (albeit a little relaxed). Hence, the results cannot be taken as the norm.

Republic day has the lowest level of pollution, perhaps due to the reduced traffic on the day. Republic day is often associated with high levels of security and road blockages, perhaps leading to the lower levels of traffic and pollution.

Irrespective, it is clear that individual contribution of the festivals may not be sufficient to push pollution to very unhealthy levels, however, festivals along with usual contributors of pollution such as road traffic could potentially push pollution to very unhealthy or hazardous levels.

Conclusion

The data for years prior to 2018 is rather patchy, requiring a lot of interpolation (which may not be statistically tenable), hence the results in this article are based on data restricted to the three years between 2019 and 2021. It is important to note that these were particularly noisy years owing to the COVID-19 pandemic. With a longer timeseries we would be able to better identify the trends, and also greatly finetune our estimates.

Addendum — Closer Look at Bhogi and Deepavali

The results, specifically for the level of pollution during Deepavali/Bhogi, might seem counterintuitive. Hence, I did some more diagnostics. I simulated the data based on the posterior distribution of parameters estimated from the Gaussian Process Regression , and plotted 500 samples from the simulation (in blue) on top of the data (in black). The model performs rather well in tracking the data and also covers local peaks and troughs. However, as expected, the model fails to reach the extreme outliers on days like Deepavali and Bhogi. From my opinion, there is very little we can do to improve the model prediction, given the sparsity of data and the extreme levels of volatility in the levels of pollution on these days.

Posterior Predictive Simulation (500 Draws) plotted in Blue on top of the hourly PM 2.5 Concentration Data (standardized) in Black.

Note that in 2019, PM 2.5 pollution levels throughout the day on Bhogi was within 140 micrograms per cubic meter (4.63 times the standard deviation — roughly the level beyond which the model fails to predict). However in 2020, the levels of pollution for 5 hours on Bhogi day exceeded 140 micrograms per cubic meter. In 2021, only 3 hours on Bhogi day exceeded 140 micrograms per cubic meter. The sheer volatility in the occurrence of these extreme values will make it hard for any model to predict such levels. This could be perhaps resolved with data for additional years.

Likewise, 12 hours in 2019 on Deepavali day exceeded 140 micrograms per cubic meter. Whereas, in 2020, there was just an hour during Deepavali day when the pollution level exceeded 140 micrograms per cubic meter (owing to the lockdown). Further in 2021, 4 hours on Deepavali day exceeded 140 micrograms per cubic meter. Again we see substantial volatility across the years, as we saw with Bhogi.

This volatility in extreme pollution levels on Bhogi and Deepavali along with the overall reduced level of pollution during a majority of the day (perhaps due to the lockdown) is why the results show a lower level of addition to pollution on these days than one would expect.

Having said that, there is no doubt that these extreme values are only seen in a few hours of the day (except for Deepavali in 2019). The rarity of these extreme values would pull down the daily average. Hence, when looking at the day in entirety across the three year period, it would be safe to argue that the conclusions drawn in this article are robust, at least for the period considered (2019–2021).

Errata

There was an error in the NowCast code, which when corrected leads to slightly different results. While a vast majority of the results still prevail, the results for the contribution of Deepavali has increased from around 70 ug/m3 to around 102 ug/m3. While this does not overturn any of the main conclusions, it does invalidate the finding that Bhogi was worse in terms of air quality than Deepavali between 2019 and 2021.

Appendix — Technical Details of the Gaussian Process Regression

I use a simple Gaussian Process regression for the hourly data. The trend component of the regression comes from a mean zero gaussian process with a squared exponential (SE) kernel. For the monthly component and special days, I use indicator variables each multiplied by a coefficient. The sum of the two components gives the overall model.

To determine the parameters, I use a Bayesian approach, sampling from the posterior distribution. I use a standard normal prior for the magnitude parameter in the SE kernel and use a normal prior with mean zero and standard deviation 0.1 for the length parameter in the SE kernel. The coefficients for the indicator variables all have standard normal priors. I sample using the Hamiltonian Markov-Chain Monte Carlo algorithm (No U-Turn Sampler).

Solin and Särkkä (2020) provide a very helpful decomposition of the SE kernel (a stationary kernel) as an infinite sum that contains the spectral density of the SE kernel (evaluated at the square root of Eigen-values of the negative Laplacian operator over a compact domain) and the Eigen-functions of the negative Laplacian operator. The very helpful aspect of this decomposition is that the length and magnitude parameters of the SE kernel do not enter the Eigen-functions, and are only present in the spectral density. Further, it reduces the number of dimensions to just the number of basis functions (instead of the number of observations in the data). This makes inversion a much simpler affair during parameter training, substantially hastening the time taken to estimate the parameters.

Solin and Särkkä (2020) paper was used as the basis in another study by Riutort-Mayol et al. (2020), a more practical paper that uses the decomposition. I repurposed codes from Riutort-Mayol et al. (2020), specifically for the basis functions for my implementation of the Gaussian Process. I use 265 basis functions over a compact subset of the real domain (which ranges between -1.3 times the maximum ‘x’ value (i.e. time label) and +1.3 times the maximum ‘x’ value). Aki Vehtari’s notebook on the topic was also very helpful.

I have made a modification to the article since the first version of this article was published on May 8, 2022. I have re-estimated the regression with a larger number of basis functions (from 100 to 265), to get a better approximation — the length scale of SE kernel was rather small, necessitating a larger number of basis functions for a closer approximation. Irrespective, the conclusions are largely unchanged. The trend component has more ‘squiggles’ in the revised results, owing to the refinement in the length scale of the SE kernel, but the overall trend seems to largely persist (in terms of peaks and troughs). The coefficients of the indicator functions are also largely unchanged and the conclusions made in the original version prevail. There were some errors in the cutoffs for the AQI (for PM 2.5 concentration), I have corrected the issues in the updated version of the article.

While I tried to follow the iterative process suggested in Riutort-Mayol et al. (2020), I had to take larger steps in terms of adding basis functions due to computational constraints. When trying to increase the number of basis functions from 265 to 370, the Markov Chains for the posteriors were no longer convergent. The convergence diagnostic statistic, ‘R Hat’, which compares pooled variance of the posterior-draw chains with the within-chain variances, is much greater than 1 for the length parameter (it was 130.2). Ideally, convergence would imply this ratio to have a value of 1. Hence I stopped, iterating beyond 370 and present the results for 265 basis functions. It has to be recognized that the volatility in the data is quite high (informally squiggles), despite smoothing using the NowCast method. Hence, it requires a higher number of basis functions, for closer approximation. Having said that, the conclusions in the article appear to be rather robust.

Footnotes

[1] IQAir uses historical data between 2017 and 2021.

[2] I use the algorithm as described in the R documentation (https://cran.r-project.org/web/packages/PWFSLSmoke/vignettes/NowCast.html). However, since the formula for computing weights involves dividing by the minimum concentration (which could be zero), I treat data points with zero concentration as missing, hence the minimum is guaranteed to be non-zero.

[3] See: https://www.airnow.gov/aqi/aqi-basics/

[4] Source: https://www.tn.gov.in/holiday/2019, https://www.tn.gov.in/holiday/2020, https://www.tn.gov.in/holiday/2021

References

Riutort-Mayol, G., Bürkner, P. C., Andersen, M. R., Solin, A., & Vehtari, A. (2020). Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. arXiv preprint arXiv:2004.11408. Link: https://arxiv.org/abs/2004.11408

Solin, A., & Särkkä, S. (2020). Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing, 30(2), 419–446. Link: https://arxiv.org/abs/1401.5508

Zheng, T., Bergin, M. H., Sutaria, R., Tripathi, S. N., Caldow, R., & Carlson, D. E. (2019). Gaussian process regression model for dynamically calibrating a wireless low-cost particulate matter sensor network in Delhi. Atmospheric Measurement Techniques, 12(9), 5161–5181.
Link: https://amt.copernicus.org/preprints/amt-2019-55/amt-2019-55.pdf

--

--

--

Applied Econometrician

Love podcasts or audiobooks? Learn on the go with our new app.

Open source data analytics as a Digital Public Good

Annotation Tools Collection (aka Awesome Annotations)

Why Discount Future Rewards In Reinforcement Learning?

The Fascinating Science and Art of Decision Boards

Analyzing Spain COVID-19 positive cases using Python

Recreating W.E.B Du Bois’s Data Portraits

Cyber Crime: Machine Learning, Confusion Matrix Role in the Cyber Security.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Akshay Natteri

Akshay Natteri

Applied Econometrician

More from Medium

Satellite Imagery Analysis 101: Handling TIFF Files

EDA WITH PANDAS PROFILING

Full Set Of a Computer

On The Spatial Model Of Public Opinion, Part One: Exploring Graphical Options

Data Visualization in a loop using Seaborn and Matplotlib