The government of India and the various departments under it generate vast amounts of data every day, much of which is shared with the general public. In spite of this, Indian researchers and companies working in data science and artificial intelligence (AI) face a big handicap: data. More precisely, the scarcity of well-labelled, feature-rich local datasets, which is also one of the biggest hurdles India will have to cross to get ahead in the AI race.
Avik Sarkar heads the Data Analytics Cell at NITI Aayog, the Indian government’s policy think-tank. Sarkar and his team are working towards democratising access to datasets from all quarters of the government. In this conversation with FactorDaily, Sarkar talks about the big data plan he and his team are working on and how the recently announced National Data and Analytics Platform (NDAP) by NITI Aayog is shaping up. Edited excerpts:
How did your tryst with data science begin?
I did my Bachelor’s in statistics and Master’s in applied statistics and informatics from IIT-Bombay. Later, I did my PhD in statistical computer science. I have always worked at the intersection of data, statistics and information technology. One of the interesting projects I did during my studies was while interning at NASA in 2004. I was working on shuttle launch reports, doing text mining to identify recurrent anomalies across shuttle launches.
Could you give a brief bio about yourself and what you did before joining NITI Aayog?
Before joining NITI Aayog, I spent 15-16 years in the corporate sector working on various aspects of data — more on data analytics, data science, big data, and AI. This was primarily for private companies and corporations whose main objectives were to use data to gain competitive advantage, streamline processes, improve profits, etc.
In my last role at Accenture Consulting in Singapore, a lot of the projects were with the Singapore government and that’s where I got a very good feel of what working with government meant. Especially given the fact that the Singapore government is very forward-looking and is trying to use analytics and data science in every aspect to make citizen services better.
What are NITI Aayog’s big plans for data science and AI in India?
One thing to understand is that the government is usually a latecomer when it comes to data adoption or data for policymaking in India. If you look at telecom companies or banks, they are aggressive in adopting data for their benefit. We have been relying primarily on surveys. But today, surveys are only one part of it. You also have a lot of administrative records. For instance, the number of people going to a hospital, which gives a view of the disease trend in a city… that is administrative data. Then, there is also big data. For instance, looking at sales of retail stores across the country to understand how people’s product choices have changed over the years. I think India is still very much reliant on the first approach.
What are the various approaches to looking into data and how is this shaping India’s data journey?
There are three approaches. One is the survey, the next is administrative data, and the third is mostly private data in huge quantities, which we call big data. We have been very reliant on the first approach, surveys, till now. We are only now taking baby steps towards using the other two. That is the goal of NITI Aayog, which we have mentioned in the ‘Strategy for New India @ 75’ document.
Currently, there is no system in which we can get detailed data at the granular level, say, from a village or town up to the centre level. The idea is to link granular data, measured for administrative purposes, say, how many people are going to school, at the village or district level, and then aggregate that up from the district to the state and then to the centre level, so that we know how many people went to school today. See, if I ask you that question right now, it is very difficult to find out how many people attended school today. If the attendance records were kept digitally and linked at different levels, this would be possible.
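[To make the roll-up concrete, here is a minimal Python sketch of the kind of aggregation Sarkar describes, where the same digital attendance records feed a district, state and national view. The states, districts and figures are invented for illustration, not drawn from any government system.]

```python
# A sketch of the village -> district -> state -> centre roll-up of digital
# attendance records. All names and numbers here are invented.
import pandas as pd

records = pd.DataFrame({
    "state":    ["UP", "UP", "UP", "MH"],
    "district": ["Agra", "Agra", "Mathura", "Pune"],
    "village":  ["V1", "V2", "V3", "V4"],
    "enrolled": [120, 80, 150, 200],
    "present":  [100, 70, 120, 190],
})

# Each administrative level sees its own aggregate of the same records.
district_view = records.groupby(["state", "district"])[["enrolled", "present"]].sum()
state_view = records.groupby("state")[["enrolled", "present"]].sum()
national = records[["enrolled", "present"]].sum()

print(f"Attendance today (national): {national['present'] / national['enrolled']:.0%}")
```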
A large part of this is ensuring that all the records are stored digitally at various points, so that administrators have access to this data at the lower levels. The data will then flow up from the lower levels to the higher levels, and at the higher levels, once we see a pattern across states, we can spot the anomalies and attend to them. Especially when you have large amounts of data from across the country to compare. For example, you could see that a chapter on, say, science has been taught in 10 days in certain states, whereas in other places it is taking one month. In the places where it is taking one month, we can find the reason and make the required intervention in real time. Often what happens in government scenarios is that interventions are made only when we know from learning outcomes that students have failed, or in cases like healthcare, when people have got a disease and died. Then it becomes part of a statistic that gets reported.
What can we do to intervene in real time — that is our objective. While something in the process is taking its first step towards deterioration, can we identify it, stop it, and take the right remedial action at that point? That is where I think a lot of the administrative data and the framework models that we’ve proposed will come in to help.
How important is data science for governance and how do you plan to implement it in India?
Data is something that is available across ministries. For example, a good step has been taken towards electronic health records. Health records in big hospital chains are in digitised format, but if you go to a government hospital, the records are not. So first of all, the health records have to be digitised. These can then be shared after masking a lot of the identifiable details, so a person accessing the data would not know the name of the person, but would know that there is a person in so-and-so age group who had so-and-so symptoms and was given this medication. Privacy is also taken care of in these cases. Andhra Pradesh is an example of a state that has done a very good job of this.
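[A minimal sketch of the masking idea, assuming a simple tabular record: direct identifiers are dropped and exact age is generalised into bands, so a reader sees only “a person in so-and-so age group with so-and-so symptoms”. The field names are hypothetical, not from any actual health-records schema.]

```python
# A sketch of masking identifiable details in a health record: drop direct
# identifiers and generalise exact age into an age band. Field names are
# hypothetical, not from any real electronic health records schema.
import pandas as pd

def mask_health_records(df: pd.DataFrame) -> pd.DataFrame:
    """Remove identifiers; keep age band, symptoms and medication."""
    out = df.drop(columns=["name", "patient_id"])      # direct identifiers
    out["age_group"] = pd.cut(out.pop("age"),
                              bins=[0, 18, 40, 60, 120],
                              labels=["0-18", "19-40", "41-60", "60+"])
    return out

raw = pd.DataFrame({"name": ["A. Kumar"], "patient_id": [101], "age": [46],
                    "symptoms": ["fever"], "medication": ["drug X"]})
print(mask_health_records(raw))  # age shows up only as the band "41-60"
```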
While we were writing our strategy document, we visited a lot of states to understand what they are doing. In a lot of the states, you would’ve heard of the CM’s (Chief Minister’s) dashboard, which is basically a simple version of what we are talking about: a very similar concept where data from various aspects is fed into the dashboard in real time. For example, it can be the number of lights that were on today or on a particular date. A lot of the smart cities are working on that, and they have integrated command-and-control dashboards.
Some of these things have to be entered manually, as not everything is fed automatically to the system. For instance, for government employees we have a biometric-based attendance system, so someone can look across the offices and departments in all of Delhi to see which offices had good attendance today and which didn’t, and these systems act in real time. Having this sort of a system at the grassroots level is what we are looking at. That can then be linked at the state level and from the state to the centre. From a village to the state level, they might get a lot of information, but when it goes from a state to the centre level, we need not get such detailed information about a particular state, because we are looking at very macro numbers. So we might get a very high-level view.
We have also been discussing what granularity each of these datasets needs when transferring from one end to another, because it is not that we want all the monitoring done at the centre. Some of the monitoring happens in the states too. What is important at the centre is overall high-level monitoring and how one state is doing in comparison with other states. Is it improving? Is it declining?
The Aspirational Districts Programme is doing something at this level, where we have a dashboard. But in this particular case, data is not fed in real time; instead, a survey is done every three to six months. There is some amount of administrative data that the district collector puts in the dashboard. And based on the dashboard, there is a website called Champions of Change where you can see the impact of this and the current position of that district in the ranking across seven different pillars like health, education, etc.
We are not talking about one project in our plan but multiple ones, and there is a lot of common information and knowledge-sharing that happens between them. We are doing a lot of indices for states, ranking states on various aspects. As I see it, there are three types of data — survey data, administrative data, and big data — and we want to be able to use and leverage all of them in real time. The state indexing is the first step. The Aspirational Districts Programme is the next one and is a bit more advanced. The electronic health records plan is more advanced still; it is in the pilot phase and yet to be fully realised. Then there is the electronic student records plan, which is in the pilot phase across a few states.
This is only a part of the journey. The individual projects and systems need to connect to each other, from the individual level to the state level to the central level. It is also about educating people that this is not for scrutiny or auditing but something for their good. Getting that message across becomes very, very important, because people often think that the moment you share data, it will be used to scrutinise you.
You had spoken about the National Data and Analytics Platform (NDAP). What is it, what will it do and how does it fit into the plan?
When we talk about having all this data aggregated and in standardised formats, we want to host it on the National Data and Analytics Platform at NITI Aayog. It is not NITI Aayog’s data; it is data that any ministry has on its portal, which we will pull in and put in a single place. That is the objective of NDAP. Some might publish the data as a web page, some as a PDF, some in other formats, so it becomes very difficult for people to access all those differently formatted datasets in a single place. This would be like a search engine for all government data: a single place where it is standardised, downloadable and comparable. Ease of use is also very important. This will be available to all citizens: general access, no restricted access. Researchers and companies can also have access to it.
Currently, the request for proposal (RFP) for the NDAP is out on our website and we are in the process of selecting the vendor who will build, host and maintain the system for the next five years. Since it is a very big project, it will require a big team of developers and scientists to work on it… Once the vendor is selected, it should take six to nine months to get it up and running.
When there already is a data.gov.in initiative, why has NITI Aayog decided to embark on the NDAP project? How exactly will NDAP be different?
Data.gov.in is a very good initiative, but when it comes to datasets, ministries put data there by choice. They also put a lot more information on their own websites. Some of them have put data on data.gov.in, but there is no compulsion to do so. Some of the datasets were put there for two or three years and have been discontinued now. We are picking data directly from the ministries’ websites — that will be our primary source — and converting it into a standardised format. Also, there is a lot of data on state government websites that is not captured at all, and we will be pulling data from there as well.
Other than just data and datasets from various departments and states, what else will NDAP offer?
Analytics is a big part of NDAP. There will be some basic analysis to start with, and users will also have access to self-service analytics. For instance, if a user wants to compare two things and do some correlation or regression, maybe a neural network or a decision tree is required, but as a user, you don’t need to worry about that. Or if you want to find interesting patterns in the datasets, you might need to run machine-learning algorithms. That will also run in the backend. The user will not be exposed to all these backend functions; they will see a simple analysis or comparison as the end user. A lot of heavy analytics will happen in the backend that until now was available only to researchers who had domain knowledge and access to high-end tools. Now, this will be made available to the end user at the click of a button. The whole objective is to democratise data and have democratised, data-led discussion and analytics in the country.
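[As an illustration of such self-service analytics, here is a minimal Python sketch in which the user asks a simple comparison question and the backend silently chooses the statistical method. The function name, columns and sample data are invented; NDAP’s actual interface is not public.]

```python
# A sketch of "self-service analytics": the user asks a simple question
# (compare two indicators); the backend picks the statistical method.
# compare_indicators and the sample data are illustrative only.
import pandas as pd
from scipy import stats

def compare_indicators(df: pd.DataFrame, x: str, y: str) -> dict:
    """Return a plain-language comparison of two indicator columns."""
    clean = df[[x, y]].dropna()                    # ignore missing values
    r, _ = stats.pearsonr(clean[x], clean[y])      # method chosen in the backend
    slope, *_ = stats.linregress(clean[x], clean[y])
    return {
        "relationship": "positive" if r > 0 else "negative",
        "strength": round(abs(r), 2),              # 0 = none, 1 = perfect
        "trend": f"{y} changes by ~{slope:.2f} per unit of {x}",
    }

# The end user sees only this simple summary, not the machinery behind it.
df = pd.DataFrame({"literacy_rate": [61, 74, 82, 94],
                   "school_attendance": [70, 78, 85, 96]})
print(compare_indicators(df, "literacy_rate", "school_attendance"))
```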
NDAP will ensure that the datasets are updated and maintained on the platform. As and when the data is updated on the source portals, we will pull it, convert it to the required standard, and store it on the platform. We will pull the data from the sources rather than wait for someone to push the data onto NDAP.
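[A minimal sketch of this pull-based update model, assuming a source that publishes a CSV file: the platform fetches the file periodically, detects changes via a checksum, and standardises column names before storing. The URL and column handling are hypothetical.]

```python
# A sketch of the pull model: periodically fetch a source file, detect
# changes via a checksum, and standardise it before storing.
# The URL and column handling are hypothetical, not NDAP's actual pipeline.
import hashlib
from io import StringIO
from typing import Optional

import pandas as pd
import requests

SOURCE_URL = "https://example.gov.in/health/hospital_visits.csv"  # hypothetical
_last_checksum: Optional[str] = None

def pull_and_standardise() -> Optional[pd.DataFrame]:
    """Fetch the source dataset; return a standardised frame only if it changed."""
    global _last_checksum
    resp = requests.get(SOURCE_URL, timeout=30)
    resp.raise_for_status()
    checksum = hashlib.sha256(resp.content).hexdigest()
    if checksum == _last_checksum:         # source unchanged, nothing to update
        return None
    _last_checksum = checksum
    df = pd.read_csv(StringIO(resp.text))
    # Normalise column names so downstream comparisons work across sources.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df
```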
NITI Aayog does not own the data it wants to make available. How do you intend to ensure that there is a constant flow of updated rich data from other departments?
For now, they (ministries, departments, states) will be putting information on their websites and we will be pulling it in from there. But there is also a plan that over time, ministries will give data to us via APIs. In the current plan, we have to perform double work: data is first uploaded on the source website, and then we download it, standardise it and re-upload it onto the platform. Ideally, we would want ministry-to-ministry information-sharing via APIs, which would save us a lot of effort, but that is a plan for the future.
[API is short for application programming interface: code that allows one app or service to exchange data with an operating system, a service, or another app.]
There are already some APIs available. But today, if you look at 1,000 data points, only 20 or 30 have APIs. We have done a background study on that over the last six months. Our first intention was to use APIs for data transfer, but we saw that if we had to wait for APIs, we would be waiting for many years. So we decided that for now we will go with this method and integrate the APIs as they become available.
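[One way such a hybrid could look, sketched in Python under the assumption of a per-source registry: use an API where one exists, otherwise fall back to the published file. The registry entries and endpoints are invented for illustration.]

```python
# A sketch of the interim strategy: prefer an API where a source has one,
# otherwise fall back to the published file. The registry and endpoints
# below are invented for illustration.
from io import StringIO

import pandas as pd
import requests

SOURCES = [
    # Today only a small fraction of sources expose APIs (20-30 of ~1,000).
    {"name": "enrolment", "api": "https://api.example.gov.in/v1/enrolment", "file": None},
    {"name": "rainfall", "api": None, "file": "https://example.gov.in/imd/rainfall.csv"},
]

def ingest(source: dict) -> pd.DataFrame:
    """Pull one dataset via its API if available, else from the published file."""
    if source["api"]:
        rows = requests.get(source["api"], timeout=30).json()
        return pd.DataFrame(rows)            # API already returns structured rows
    resp = requests.get(source["file"], timeout=30)
    resp.raise_for_status()
    return pd.read_csv(StringIO(resp.text))  # file still needs parsing

frames = {s["name"]: ingest(s) for s in SOURCES}
```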
Does your plan with data end with just making it available to researchers and companies working in the data science and AI space or does it go beyond? What is the plan?
We want to make the data available to the startup community and the AI community, so they get access to the data and can develop apps on top of it. That is one of the additional objectives of sharing this data: to grow the AI ecosystem.
What are the kinds of datasets that you will be making available?
In the RFP, we have listed about 100 data sources. The primary focus areas include healthcare, agriculture, education and finance. But there is a longer list of 1,000-plus data sources, all of them government data sources. The government is publishing these datasets somewhere or the other; it is just that everybody might not know about it. Democratising that access is one of our goals.
How important do you think data science and a data culture are for improving governance, and do you see that happening in India with your project?
The NDAP data is also for policymaking. A lot of government departments will be using it, and we will also be using it internally a lot. The Aspirational Districts Programme is something along these lines.
What is the biggest challenge you are facing with the project and how are you tackling it?
Data sourcing is one of the big challenges. We have been trying to list out the data sources. These data sources are in various formats, and the lingo across them varies a lot. For example, the population of a certain place, the number of people living there, can be referred to in multiple ways.
One might say number of people, one might say population, one might say headcount. So there are various lingos, and because we are expecting data from plain textual and PDF documents that are made for human consumption, it is not easy for a machine to consume that directly. That is one of the biggest challenges for this project: ensuring standardisation across datasets.
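[One common fix for this vocabulary problem is a synonym map onto a canonical schema. Here is a minimal Python sketch; the vocabulary and canonical names are illustrative, not NDAP’s actual standard.]

```python
# A sketch of the terminology problem: different sources call the same
# field "population", "number of people", or "headcount". A synonym map
# renames them all to one canonical schema. Vocabulary is illustrative.
import pandas as pd

CANONICAL = {
    "population": "population",
    "number of people": "population",
    "headcount": "population",
    "no. of persons": "population",
}

def standardise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename known synonyms to their canonical column names."""
    renames = {c: CANONICAL[c.strip().lower()]
               for c in df.columns if c.strip().lower() in CANONICAL}
    return df.rename(columns=renames)

# Usage: two differently-labelled source tables become comparable.
a = standardise_columns(pd.DataFrame({"Headcount": [1200]}))
b = standardise_columns(pd.DataFrame({"Number of People": [950]}))
assert list(a.columns) == list(b.columns) == ["population"]
```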