How to Digitize the Whole of Russia: Big Data and Predictive Modeling

Aim

Train neural networks on the collected dataset and set up a predictive analytics system that forecasts possible disease outbreaks on a federal scale. To do this, configure and deploy a document scanning and recognition module and a module for analyzing the digitized data.

Context

Marketing Logic is taking part in developing the analytical complex for the All-Russian monitoring of collective immunity against infections, run by the N.F. Gamaleya Research Center of Epidemiology and Microbiology of the Ministry of Health of the Russian Federation. The complex collects, stores, and processes large volumes of information about how well the population of the Russian Federation is protected against infectious diseases. It also prepares statistical reports and builds models and forecasts using predictive analytics. The Marketing Logic team handles the entire technical side of the project. This description contains only data that the NDA permits us to publish.

Key indicators

The number of recognized documents is more than 40 million.

Coverage area is the whole territory of Russia.

Solution

Officially, the Ministry of Health project is called "Monitor-Bio". It is a software and hardware complex for processing and geoinformation statistical analysis of data from the all-Russian seroepidemiological monitoring of collective immunity to infections. In simple terms, it is a system for monitoring outbreaks and the spread of infectious diseases, and for predicting them.

The project is based on big data: millions of questionnaires collected throughout the country, combined with population size and density, the availability of roads, and infrastructure, down to the traditions and way of life of particular regions. As in business, accurate forecasts need as much data as possible so that non-obvious dependencies between variables can be identified.

For us, the greatest technical challenge in this project was the accuracy of questionnaire recognition. The Action.Docs module already existed, and we had used it to recognize passport data, policies, and Russian-language handwritten text. The difficulty of the Ministry of Health project was that, in addition to simple questionnaire data, the documents contained disease names and doctors' notes written in Latin. The algorithm either failed to recognize these words or tried to fit them to similar-looking transliterated Russian or English words. Fortunately, an error like this is too obvious to go unnoticed for long, and it was fixed very quickly, but we still spent more resources on validation and additional verification than we had initially expected. Otherwise, the work went fairly smoothly and this is a "problem-free" case: processing 40 million documents is hard, even technically, but the mechanics and algorithms differ little from those of smaller projects. A small digression: when someone says that a project is small and can therefore be done many times faster, that is no reason to take their word for it. Often it takes the same amount of time, or only slightly less.
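The kind of additional verification mentioned above can be illustrated with a small sketch. This is not the method used in the project, which the source does not describe. It is a minimal example, assuming recognized words are checked against a reference vocabulary of Latin terms with fuzzy matching from Python's standard `difflib` module. The vocabulary, the function name `validate_token`, and the similarity cutoff are all hypothetical.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of Latin terms. A real list would come from
# a medical reference source, not from this example.
LATIN_TERMS = ["varicella", "tuberculosis", "hepatitis", "encephalitis", "influenza"]


def validate_token(token, vocabulary=LATIN_TERMS, cutoff=0.8):
    """Check one recognized word against the vocabulary.

    Returns (term, needs_review):
    - an exact or close match returns the vocabulary term and False;
    - no sufficiently close match returns the original token and True,
      so that a person can verify it manually.
    """
    candidate = token.strip().lower()
    if candidate in vocabulary:
        return candidate, False
    # get_close_matches ranks vocabulary entries by similarity ratio
    # and keeps only those at or above the cutoff (an assumed value).
    matches = get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    if matches:
        return matches[0], False
    return token, True


if __name__ == "__main__":
    for word in ["Influenza", "hepatltis", "xyzzy"]:
        print(word, validate_token(word))
```

Words that fail the check are routed to manual review rather than silently "fitted" to the nearest known word, which is the failure mode described above.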

The next stage of the Monitor-Bio project was data processing. The NDA only allows us to list the main data types: past illnesses, vaccinations, and geodata. The list of infections and diseases contains more than 20 items, from influenza and chickenpox to tuberculosis, hepatitis, encephalitis, and meningococcal infection. Using the entire dataset (plus information provided by other agencies, as well as Marketing Logic's own data mart), we trained neural networks for several months to estimate the rate of infection spread in each region and city of the country, depending on various variables.

This case study is largely about how big data and its diversity help to build complex forecasts and save thousands of lives. It also illustrates how important "related" information is. For example, the way of life (the habit of exchanging kisses when meeting), a developed network of roads, traffic and ... accelerate the spread of infection by ... times. Business has a similar story: network development, customer behavior, purchasing decisions, and employee performance are influenced by thousands of factors, and the more of them are analyzed, the higher the accuracy of the forecast or decision. Where a condition cannot be represented by arrays of data, it becomes a coefficient that also increases accuracy.
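The idea of turning a condition into a coefficient can be sketched as follows. The source does not say how the coefficients are applied, so this example assumes the simplest option: a data-driven base rate is multiplied by one coefficient per qualitative condition. The function name `adjusted_forecast`, the condition names, and all numeric values are hypothetical and are not figures from the project.

```python
from math import prod


def adjusted_forecast(base_rate, coefficients):
    """Scale a data-driven base rate by coefficients for qualitative conditions.

    `coefficients` maps a condition name to a multiplier. A value above 1
    speeds up the estimated spread, a value below 1 slows it down.
    An empty mapping leaves the base rate unchanged.
    """
    return base_rate * prod(coefficients.values())


if __name__ == "__main__":
    # Hypothetical values chosen only to show the arithmetic.
    region_coefficients = {
        "close_contact_greetings": 1.2,
        "dense_road_network": 1.1,
    }
    print(adjusted_forecast(0.05, region_coefficients))
```

A multiplicative form keeps each condition independent of the others, so a new factor can be added without retraining the underlying model. Whether that independence holds in practice is a modeling decision.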

Result

Processing a large amount of data, digitizing it, and enriching it with other, non-obvious layers of information made it possible to build a system that generates high-accuracy forecasts. In conclusion, we offer two useful ideas: first, almost always use as much data as possible; second, do not limit the types of data the system may analyze. Thanks to its high performance, AI sees many more relationships in the data than human experts can.

CV - computer vision