Are Indian Sample Surveys Faulty?

Akshay Natteri
Jul 23, 2023

Introduction

In the past few weeks, there has been a raging debate about the quality of India’s household surveys. For those new to the term, ‘household surveys’ are elaborate questionnaires on key socio-economic topics, administered to a scientifically drawn sample of households (such that they closely represent the population).¹ Household surveys are a hallmark of modern statistics: they provide an extremely cost-effective way of accurately estimating key metrics such as the poverty rate, the unemployment rate, and average wages. They are of particular importance to India. The legendary statistician P. C. Mahalanobis pioneered large-scale sample surveys in the early 1950s to help design India’s 5-year Economic Plans.² Household surveys have been at the heart of Indian policy making ever since.

The debate stems from a recent article in The New Indian Express by Prof. Shamika Ravi. In it, Prof. Ravi argues that India’s sample surveys (such as the National Sample Survey, the Periodic Labour Force Survey, and the National Family Health Survey) employ an archaic and faulty sampling design that systematically under-represents the urban population. This, she argues, biases key metrics such as the poverty rate, average wages, and the unemployment rate, masking the true impact of India’s economic growth.

Prof. Ravi’s views were strongly contested by other eminent voices in the policy sphere, such as Dr. Pronab Sen, Dr. Amitabh Kundu, and Dr. P. C. Mohanan. The debate sparked widespread discussion on social media platforms, including Twitter. Rebutting her critics, Prof. Ravi shared a seminal paper by Prof. Bruce D. Meyer, Dr. Wallace K. C. Mok, and Prof. James X. Sullivan titled “Household Surveys in Crisis”, published in the Journal of Economic Perspectives in 2015.

Prof. Ravi’s tweet on July 17, 2023.

This paper means the world to me, as it was among the first papers I ever read as a student of economics. One of my earliest tasks as a research assistant under Dr. Wallace K. C. Mok was to read it. I found the topic so interesting that I later went on to pursue a master’s degree at the University of Chicago, where I had the opportunity to complete an apprenticeship under Prof. Meyer. Most professionals in India almost instantly discount my experience as a research assistant in Hong Kong and the US, claiming that my work with US data has no relevance to India. For once, I find my research experience to be of some relevance to an ongoing debate in India.

During the course of my apprenticeship, I worked on assessing the quality of the American Housing Survey under the supervision of Prof. Meyer. I learnt one big lesson from that exercise: in the United States, administrative data is often more reliable than survey data. In India, it is quite the opposite.

Benchmarking Household Survey Data in the United States

Household surveys in the United States are often benchmarked against administrative data.³ For example, in Meyer, Mok, and Sullivan (2015), the authors compare the aggregate amount of benefits (such as Aid to Families with Dependent Children and Supplemental Security Income) estimated from major American surveys with administrative data on disbursals (see Table 1 of Meyer, Mok, and Sullivan (2015)). They find that the underreporting of benefits has increased over time, across multiple household surveys.

Further, the United States also produces the National Income and Product Accounts (NIPA) which provides crucial information on the aggregate wages and consumer expenditure of the nation. NIPA draws on a wide range of robust datasets. The aggregate of earnings, for example, is estimated from the Quarterly Census of Employment and Wages (perhaps the most comprehensive source for wage data, covering about 95% of U.S. jobs).⁴

The NIPA aggregates can hence serve as a benchmark for household survey estimates. Corinth, Meyer, and Wu (2022) do just that.⁵ Benchmarking earnings against NIPA, they find an increase in the underreporting of earnings between 1995 and 2016 in the Current Population Survey — Annual Social and Economic Supplement (a key household survey used to measure poverty, among other things, in the US) (see Table 1 of Corinth, Meyer, and Wu (2022)).

Reliable aggregates such as NIPA do not exist in India, which makes it particularly hard to measure the extent of underreporting/overreporting of key metrics such as income/poverty.

Benchmarking Household Survey Data in India

There are, however, some cases in India where administrative data exists. A prime example is the Mahatma Gandhi National Rural Employment Guarantee Act (MNREGA). MNREGA is a program that guarantees 100 days of employment to rural citizens.⁶ Eligible citizens apply for a ‘job card’; once the card is issued, they can be deployed for up to 100 days on small rural projects (such as the construction of wells, the dredging of river beds, etc.). The Act was passed in 2005 in the hope of providing at least a basic livelihood for the rural citizens of India. Information on job cards, payments, and projects is uploaded to a Management Information System (MIS). The MIS is the basis for MNREGA’s key administrative data.

A study by Usami and Rawal (2012)⁷ compared administrative data on the number of MNREGA cards issued with estimates from the 66th Round of the National Sample Survey (abbreviated as NSS, a key household survey in India). The administrative data indicated that over 116 million households had been issued cards, while the NSS estimate was only around 56.5 million households. This difference between the household survey estimate and the administrative data is alarming. While the definition of a household in the NSS differs from that in MNREGA, the authors note that this still does not account for such a wide deviation. Does this mean, then, that as in the US there is widespread underreporting of benefits? The answer is in fact no.

Unlike in the US, the problem in India is that administrative data is highly unreliable. The MNREGA program was plagued by corruption, chief of which was the creation of fake cards. Often, village heads would create ghost cards (using non-existent addresses and names) to claim wages that they could then divert to their own pockets.⁸ In one absurd example, a job card in Madhya Pradesh was created using a photo of Deepika Padukone (a popular actress and model).

In fact, the corruption in MNREGA was a major bone of contention between the then ruling Congress government and the then opposition Bharatiya Janata Party (BJP).

Finance Minister Smt. Nirmala Sitharaman made a speech in the Rajya Sabha about the problem of ghost accounts in MNREGA on February 11, 2022.

All this led to an audit of the program by the Comptroller and Auditor General of India (Report №6 of 2013, Union Government). The report produced state-wise statistics on the average number of households that were provided employment under the program between 2009–10 and 2011–12. The statistics were sourced from the program’s MIS.

I compared the average number of households employed under MNREGA in the CAG report with estimates from the 66th and 68th Rounds of the NSS (see the table below). The results are staggering. The NSS estimate is almost 10 million households lower than the administrative statistics. What do we infer from this? Is the NSS systematically underestimating the impact of MNREGA? Is the NSS’ sampling design faulty? The Congress government would certainly have argued that.

A comparison of the estimated average number of households employed under MNREGA from the NSS with the numbers produced in CAG Report №6 of 2013, Union Government. It is important to note that a year runs from July to June in the NSS, while a year runs from April to March in the CAG data. This slight difference in the definition of a year still fails to explain such a massive deviation between the estimates.

The answer is that the NSS estimate is actually closer to the true number; it is the administrative data that is plagued by issues such as ghost accounts. In fact, when the BJP government took office, a thorough audit of the administrative data was performed and around 10 million fake accounts were removed.⁹ The number of fake accounts is of the same order of magnitude as the deviation between the administrative numbers and the NSS estimates.
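This order-of-magnitude reasoning can be made explicit with a short sketch. The first two figures below are purely hypothetical placeholders, not the actual CAG or NSS numbers; only the roughly 10 million fake-account figure comes from the audit described above:

```python
# Illustrative comparison of an administrative (MIS) count against a survey
# estimate, in millions of households. The first two figures are hypothetical
# placeholders; only the fake-account figure (~10 million) reflects the
# audit described above (footnote 9).
admin_households = 50.0        # programme MIS average (hypothetical)
nss_households = 40.0          # NSS design-based estimate (hypothetical)
fake_accounts_removed = 10.0   # ghost cards later purged in the audit (approx.)

deviation = admin_households - nss_households

# If the gap between the MIS count and the survey estimate is of the same
# order of magnitude as the ghost accounts later removed, the administrative
# data, not the survey, is the more likely source of the error.
same_order = 0.1 <= deviation / fake_accounts_removed <= 10
print(f"Admin - NSS deviation: {deviation:.1f} million households")
print(f"Consistent with ghost-account audit: {same_order}")
```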

Further, in states like Andhra Pradesh, where the government was particularly slow in updating the MIS with the latest data, we observe that the NSS estimates are higher than the numbers in the CAG report. This, again, supports the argument that much of the deviation is due to poor administrative data.¹⁰

Long story short, MNREGA is one example where the sample survey got it right while the administrative data was plagued with issues.

Do Indian Sample Surveys perform well in Urban metrics?

The MNREGA example, albeit startling, is still restricted to the rural population. Prof. Shamika Ravi’s main concern was the underrepresentation of the urban population in Indian household surveys.

To test this, I compare estimates of the number of employees in the organized manufacturing sector from two surveys, the Periodic Labour Force Survey (PLFS) and the Annual Survey of Industries (ASI). The PLFS is a survey of households, while the ASI is a survey of factories. Their sampling strategies are very different. The PLFS samples (in the first of two stages) from the list of villages in the 2011 Population Census and from Urban Frame Survey blocks (small blocks of an urban area). The ASI, on the other hand, primarily samples (in the first of two stages) from the list of all factories registered under certain sections¹¹ of The Factories Act, 1948.¹²

Despite having wildly different sampling approaches, the estimates of the number of employees in the organized manufacturing sector are strikingly similar.

A comparison between the number of employees in the organized manufacturing sector, as estimated from the ASI vs. the PLFS. To ensure comparability, in the PLFS I only keep those working in the manufacturing sector (NIC codes between 10101 and 32909) in organizations with 10 or more workers. In the ASI, I restrict the NIC codes as in the PLFS, to only the core manufacturing sector. Note that the PLFS year runs from July to June, while the ASI year runs from April to March. The COVID-19 lockdown was enforced in March 2020; this could be the reason behind the higher discrepancy between the PLFS and the ASI in 2019–20. Note also that the PLFS only provides sampling design-based control totals, which should be used to compute ratios such as the Labour Force Participation Rate; these ratios then have to be applied to population projections to arrive at population totals. Given the dearth of essential administrative benchmarks, I perform this analysis only as a rough comparison. The conclusions about the comparability of the estimates still hold.

It is important to note that a majority of factories (around 55%) are in urban areas. Despite that, the PLFS and the ASI produce very similar estimates.
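The PLFS restriction described in the note above (core manufacturing NIC codes, organizations with 10 or more workers, weighted totals) can be sketched as follows. The record layout and field names here are my own assumptions for illustration, not the actual PLFS unit-level variable names:

```python
# Toy PLFS-style worker records. Each record carries a 5-digit NIC-2008
# industry code, the size of the worker's enterprise, and a survey weight.
# (Field names and values are illustrative, not actual PLFS variables.)
workers = [
    {"nic_code": 10712, "enterprise_size": 25, "weight": 1200.0},  # manufacturing, 10+ workers
    {"nic_code": 10712, "enterprise_size": 4,  "weight": 800.0},   # too small: unorganized
    {"nic_code": 47110, "enterprise_size": 50, "weight": 950.0},   # retail trade, not manufacturing
]

def in_organized_manufacturing(w):
    # Keep core manufacturing NIC codes (10101-32909) in units with
    # 10 or more workers, mirroring the restriction described above.
    return 10101 <= w["nic_code"] <= 32909 and w["enterprise_size"] >= 10

# The weighted sum over qualifying workers gives the survey estimate of
# employment in the organized manufacturing sector.
estimate = sum(w["weight"] for w in workers if in_organized_manufacturing(w))
print(f"Estimated organized-manufacturing employment: {estimate:,.0f}")
```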

If two surveys employing entirely independent methods on fundamentally different units (i.e., households vs. factories) arrive at very similar estimates of a metric, it strongly suggests that both surveys are robust and capture reality, albeit with minor inaccuracies.

This again shows that Indian sample surveys are robust, at least for basic urban and rural metrics.

The problem with Indian survey data is not so much that of robustness as it is that of frequency. Data delayed is data denied. New tools and technologies need to be leveraged to update data in a more frequent and robust manner. This is what India needs urgently.

Conclusion

In the analysis above, I have attempted to show how India’s sample surveys are the last bastion of hope, given the pathetic quality of administrative data. Administrative data in India is often riddled with problems stemming either from corruption (as in the case of MNREGA) or from gross mismanagement. This is especially true in key social sectors. A recent discussion paper of the Ministry of Statistics and Programme Implementation states that “the quality of administrative data in both the health and education sectors have dropped alarmingly”.¹³ Further, most administrative data completely misses the unorganized sector, which contributes around half the nation’s GDP. Hence, blindly porting solutions from the US (such as linking survey microdata with administrative microdata) may not do India any good at this stage.

Having said that, this doesn’t mean there is no need for change. The rapid pace at which the economy is digitizing provides a bright beacon of hope for further improving our statistics. Given that even the unorganized sector in India uses digital solutions such as the Unified Payments Interface, online banking, and Aadhaar (the unique identity system), it should be possible to effectively collect anonymized data from these digital footprints without impinging on anyone’s privacy.

However, to fully unleash the potential of digital footprints to enhance our statistics, there needs to be a willingness to accept more non-standard data sources into the Government’s data arsenal. While the survey methods are undoubtedly of high quality and may continue to play a key role, what India needs is a Swiss Army knife of new data sources to go with them: sources that are frequently updated and can provide a more up-to-date picture of the Indian economy.

Apart from investing in new data sources, it is also essential for the Government to invest in the current set of administrative data. There is a dire need to validate administrative data and introduce clear protocols for its collection, verification, and management. While investing in data quality audits, data management systems, and data handling protocols may not be glamorous, this is what India needs to focus on first. This has to be complemented with a complete digitization of all key internal and citizen-facing processes of the Government (at all levels).

If we are serious about our ambitions of becoming the third-largest economy by the end of this decade, we need to start investing in data to inform policy making. Without that, India would only be shooting in the dark, hoping to make it as the third-largest economy by pure chance.

Small Side note

We owe much of the success of household surveys to one person: P. C. Mahalanobis. He was among the best statisticians of his time and played a tremendous role in building institutions such as the Indian Statistical Institute and the National Sample Survey Organization.

Here is some interesting information about him:

Footnotes

1: Wright, J. D. (2015). International Encyclopedia of the Social & Behavioral Sciences. Accessible at: https://www.sciencedirect.com/topics/social-sciences/household-survey

2: Official website of the Indian Statistical Institute, Kolkata: https://www.isical.ac.in/~repro/history/public/notepage/NSSO-F.html

3: The Ministry of Statistics and Programme Implementation defines administrative data as “data sets collected by government institutions or agencies for tax, benefit or public administration purposes”. See footnote 13 for a full citation of the Ministry’s report.

4: See https://www.bea.gov/resources/methodologies/nipa-handbook/pdf/all-chapters.pdf and https://www.bls.gov/cew/

5: Corinth, K., Meyer, B. D., & Wu, D. (2022, May). The Change in Poverty from 1995 to 2016 among Single-Parent Families. AEA Papers and Proceedings, 112, 345–350. American Economic Association. Accessible at: https://www.aeaweb.org/articles?id=10.1257/pandp.20221043

6: For the full Act, see: https://rural.nic.in/sites/default/files/nrega/Library/Books/1_MGNREGA_Act.pdf

7: Usami, Y., & Rawal, V. (2012). Some Aspects of the Implementation of India’s Employment Guarantee. Review of Agrarian Studies, 2(2369–2021–101). Accessible at: https://ras.org.in/some_aspects_of_performance_of_mgnrega

8: See page 81 of Usami and Rawal (2012).

9: See Hindustan Times Article dated April 09, 2017. Accessible at: https://www.hindustantimes.com/india-news/fund-leakage-nearly-a-crore-fake-job-cards-struck-off-from-mgnrega-scheme/story-JpsHg1k0mKNE5BNxKF4aKL.html

10: See page 87 of Usami and Rawal (2012). Further, 83% of households in MNREGA have only one card (computed from NSS 68). This makes the discrepancy between the number of cards and the number of households much narrower.

11: The sections are 2(m)(i) and 2(m)(ii).

12: See the User Manual for the ASI accessible at: https://www.ecostat.kerala.gov.in/storage/documents/10.pdf

13: Administrative Data: Issues, Concerns and Prospects. An Indian Perspective. March 2021. Ministry of Statistics and Programme Implementation, Government of India. Accessible at: https://mospi.gov.in/sites/default/files/Administrative%20Data%20v1.0.pdf
