COVID-19: Proving that business logic & domain knowledge is king in Data Analytics & Data Science

James Sutton
4 min read · Apr 12, 2020

The COVID pandemic has brought to light several opportunities with healthcare data in the United States. The industry is relying on sources like Johns Hopkins University CSSE, WHO, Our World in Data, COVID Tracking Project, and several other web scraping initiatives to consolidate data from each independent state for analytical purposes. Each state has its own data teams or analysts sourcing from independent labs and private companies, and publishing their own insights locally. Some states are publishing more detailed data than others, including total hospital admissions, ICU patients, patients on ventilators, and total tests administered.

In March the White House issued a call to action to the Data/Tech community for help reining in data and to develop tools that will help the science community answer high-priority questions. Kaggle launched a COVID open research dataset challenge. Many data, analytics, and BI companies started publishing insightful content showing views of growth rates, death rates, business impact trends, and even cell phone geo movements to help any interested party see the trends by state, and more recently county.

Most of the popular published insights are reporting positive cases and deaths. Over this past week, some media outlets and analysts have begun to show optimism when presenting decreases in growth rates, which has sparked many discussions on when it’s appropriate to open businesses back up and relax quarantine policies.

And when you see charts like this one, there are definitely reasons to believe the optimism:

*Chart made by Sutton Analytics LLC, powered by Looker based on Johns Hopkins CSSE data

The United States positive case count grew by only 6% yesterday. That’s good, right?

Should we listen to Tomi Lahren and start the reopening of America?

Here’s the problem: these insights leave out the domain of the healthcare industry. Who can access tests, and how? How many labs are analyzing results? How many results can each lab process in a day? How do those results get reported and distributed to local and state governments, and then aggregated? Without understanding those things, I don’t believe that these insights alone are enough to make such impactful decisions.

The reality is that nobody knows how many cases of COVID-19 there are in the United States, or how fast it’s spreading. It’s also hard to even know how many people are dying due to COVID-19, unless they get admitted into a hospital, get tested, and it gets added to their chart. All we really know is what’s getting captured with current testing processes.

And based on testing data being published by the COVID Tracking Project, new positive results are directly correlated with the number of tests administered.

*Chart made by Sutton Analytics LLC, powered by Looker from COVID Tracking Project data

This is a very different view from the chart earlier showing a decline in growth rate. The number of total test results per day has stayed in the 130–160k range over the past week, and the number of new positive cases has stayed in the 28–35k range: a consistent 21–22% positive rate.
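To make the arithmetic behind that positive rate explicit, here is a minimal sketch. The function name is my own, and pairing the low ends and high ends of the two ranges is an illustrative assumption, not actual daily records from the COVID Tracking Project.

```python
def positive_rate(new_positives, total_tests):
    """Share of a day's reported test results that came back positive."""
    if total_tests <= 0:
        raise ValueError("total_tests must be positive")
    return new_positives / total_tests


# Endpoints of the ranges quoted above (illustrative pairing, not real daily rows).
low = positive_rate(28_000, 130_000)
high = positive_rate(35_000, 160_000)

print(f"{low:.1%} to {high:.1%}")  # prints "21.5% to 21.9%"
```

If the positive rate barely moves while test volume stays flat, the daily positive count mostly reflects how many tests were run.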

The daily growth rate is declining because the denominator of the calculation (total cases) gets bigger every day, while the numerator (new positive cases) is held roughly flat by the number of tests being administered.
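A quick way to see this effect is to hold daily new cases constant and watch the growth rate fall anyway. The sketch below assumes growth rate is computed as new cases divided by the prior cumulative total; the starting total of 500,000 and the flat 30,000 new cases per day are hypothetical round numbers chosen for illustration, not figures from either data source.

```python
def daily_growth_rates(starting_total, new_cases_per_day, days):
    """Growth rate per day when the number of new cases never changes.

    Growth rate is taken as new cases / cumulative total before that day.
    """
    rates = []
    total = starting_total
    for _ in range(days):
        rates.append(new_cases_per_day / total)
        total += new_cases_per_day  # the denominator keeps growing
    return rates


# Hypothetical inputs: 500k cumulative cases, a flat 30k new positives per day.
rates = daily_growth_rates(starting_total=500_000, new_cases_per_day=30_000, days=7)

for day, rate in enumerate(rates, start=1):
    print(f"day {day}: {rate:.2%}")
```

Every day adds the same number of cases, yet the printed rate shrinks each day. A falling growth rate on its own therefore does not show that the spread is slowing.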

All over the United States there are people who feel sick and want to be tested, but don’t qualify under the current restrictions and test availability. The healthcare industry is balancing costs with patient health, so if a positive test result does not change the treatment plan, what is the benefit of administering a $1500 test? Logically that makes sense, but if the CDC, local governments, and other organizations are trying to use this data to understand the outbreak and make decisions like relaxing stay-at-home orders, then it’s imperative that more tests are performed. The alternative is to not invest in testing, but then also not use metrics like growth rate as an indication of the success of social distancing measures.

I’m optimistic that the country’s response in the past few weeks has led to significantly fewer cases and deaths, and a lighter healthcare burden, compared to the alternative. However, due to data collection (testing) problems, I don’t think we know enough yet to quantify the impact or say that we’re at the peak and will start to decline.

This all highlights a major theme in data, analytics, and data science: business logic and domain knowledge will always be king in the space. Forecasting models, ML algorithms, and visualization tools have a very important place in the industry. However, their success is 100% reliant on people who understand the business processes, how the data is collected and what it means, what KPIs to measure, and what follow-up questions to ask when presented with data insights or forecast models to ensure the next decision is the right one.
