Releases: COVID19Tracking/covid-tracking-data
TACO 2021-05-31
TACO release for 2021-05-31
There are 3 variants for this release:
- Official (taco_official.zip file)
- Complete (taco_complete.zip file)
- Research (taco_research.zip file)
States COVID-19 Time Series History 20210531
States COVID-19 Time Series History 2021-05-31
This release includes the raw time series data fetched by The COVID Tracking Project from states that provide such data directly (through data portals, CSV/excel files, etc.).
Description
This is a snapshot of the CTP States COVID-19 Time Series history dataset, taken on June 1, 2021 and including all data up to May 31, 2021.
This dataset includes the full time series for case, test and death metrics, fetched daily from states that provide such data.
This is an append-only dataset: when a time series is fetched, it is tagged with the date on which it was fetched, and that data is never overwritten. The next day, when the time series for the same metric is fetched again, it is tagged with a different fetch timestamp. This allows us to examine how daily values change as states amend previously reported values.
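To make the append-only layout concrete, here is a minimal sketch that follows one data point across several fetches. The column names are the ones used in this release, but the rows are invented for illustration and are not real data.

```python
import io

import pandas as pd

# Invented sample rows in the avocado CSV layout (values are not real data).
sample = io.StringIO(
    "state,date_used,timestamp,fetch_timestamp,positiveTestsViral\n"
    "WA,Specimen Collection,2021-02-01,2021-02-02,100\n"
    "WA,Specimen Collection,2021-02-01,2021-02-03,130\n"
    "WA,Specimen Collection,2021-02-01,2021-02-04,135\n"
    "WA,Specimen Collection,2021-02-02,2021-02-03,90\n"
    "WA,Specimen Collection,2021-02-02,2021-02-04,120\n"
)
df = pd.read_csv(sample, parse_dates=["timestamp", "fetch_timestamp"])

# All fetched versions of the value for a single day, ordered by fetch date.
one_day = (
    df[df["timestamp"] == pd.Timestamp("2021-02-01")]
    .sort_values("fetch_timestamp")
    .set_index("fetch_timestamp")["positiveTestsViral"]
)

# Change between consecutive fetches of that same day.
revisions = one_day.diff().dropna()
print(revisions)
```

Because each fetch adds rows instead of replacing them, the same timestamp appears once per fetch, and the differences between fetches show how much the state revised that day.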
The data is tagged and organized into the same field names used by CTP APIs.
Content
This release comes in 2 variants: statescovid19.zip, with a single CSV file containing all data for all states, and statescovid19_by_state.zip, with the same data broken down into one file per state.
The files are listed below. Avocado was the internal codename for the project of snapshotting historic data.
- avocado_schema.sql: DB schema for the avocado table
- avocado_complete.csv or {state}_avocado_complete.csv (per-state files): data for the avocado table
We stored the data in a relational database, and the schema is provided in avocado_schema.sql, but the database is not required to use the data. The data is in avocado_complete.csv (or the individual state file), which can be processed with any library or tool that supports CSV (pandas, uploading to BigQuery, etc.).
CSV fields are:
state -- 2 letter state abbreviation (e.g., MA)
date_used -- string representing the dating scheme (e.g., Specimen Collection)
timestamp -- date this data point refers to
fetch_timestamp -- date this data point was fetched on
date -- CTP-style string date (e.g., 20200513)
-- The rest of the fields are the same as CTP API
positive
positiveCasesViral
probableCases
death
deathConfirmed
deathProbable
total
totalTestsAntibody
positiveTestsAntibody
negativeTestsAntibody
totalTestsViral
positiveTestsViral
negativeTestsViral
totalTestEncountersViral
totalTestsAntigen
positiveTestsAntigen
negativeTestsAntigen
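As a small illustration of the CTP-style date field, the sketch below converts between the string form and a Python date. The helper names are hypothetical and not part of any CTP tooling. If a CSV reader loads the column as an integer, convert it to a string first.

```python
from datetime import date, datetime


def parse_ctp_date(value: str) -> date:
    """Parse a CTP-style date string such as '20200513'."""
    return datetime.strptime(value, "%Y%m%d").date()


def format_ctp_date(day: date) -> str:
    """Format a date as a CTP-style string."""
    return day.strftime("%Y%m%d")


print(parse_ctp_date("20200513"))
print(format_ctp_date(date(2020, 5, 13)))
```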
Each metric value is tagged with its state, timestamp, date_used and fetch_timestamp.
- date_used is the dating scheme that defines the metric. For testing, common dating schemes are Specimen Collection and Test Result; for cases, common dating schemes are Specimen Collection, Test Result and Illness Onset; and for death, a common dating scheme is Death, specifying date of death.
- timestamp is the state-assigned timestamp of this data point.
- fetch_timestamp is the timestamp when we collected the data from the state. For each fetch_timestamp we'll have the entire time series as it was fetched on that day.
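Since every fetch_timestamp carries a complete series, a common first step is to pick the most recent fetch for one state and dating scheme. The following is a minimal sketch under that assumption; latest_series is a hypothetical helper, and the sample rows are invented.

```python
import io

import pandas as pd


def latest_series(df, state, date_used, metric):
    """Return the most recently fetched time series for one metric."""
    subset = df[(df["state"] == state) & (df["date_used"] == date_used)]
    newest = subset[subset["fetch_timestamp"] == subset["fetch_timestamp"].max()]
    return newest.set_index("timestamp")[metric].sort_index()


# Invented sample rows (values are not real data).
sample = io.StringIO(
    "state,date_used,timestamp,fetch_timestamp,death\n"
    "OH,Death,2020-12-01,2021-05-30,10\n"
    "OH,Death,2020-12-02,2021-05-30,21\n"
    "OH,Death,2020-12-01,2021-05-31,11\n"
    "OH,Death,2020-12-02,2021-05-31,25\n"
)
df = pd.read_csv(sample, parse_dates=["timestamp", "fetch_timestamp"])

series = latest_series(df, "OH", "Death", "death")
print(series)
```

Filtering on date_used before taking the maximum fetch date keeps series with different dating schemes from being mixed together.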
Processing
The processing that went into the metrics presented here was minimal:
- Mapping each state's metric names to CTP names
- Calculating cumulative sums when the state provides only daily values. This is a limitation for states that provide only daily numbers without the beginning of the time series (e.g., ID), since the cumulative sum then omits the earlier values
- Cleanup of dates before 2019, which are likely data-entry mistakes, by reporting them on 2020-01-01 (e.g., MO)
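The last two steps can be sketched as follows. This is an illustration of the idea only, not the project's actual pipeline: the daily counts are invented, and how CTP handled several values landing on the same date is not stated here, so the sketch simply keeps them as separate entries.

```python
from datetime import date
from itertools import accumulate

# Invented daily counts, as a state might publish them (not real data).
daily = [
    (date(2018, 3, 7), 2),  # implausible date, likely a data-entry mistake
    (date(2020, 3, 1), 5),
    (date(2020, 3, 2), 7),
]

FLOOR = date(2019, 1, 1)     # dates before this are treated as mistakes
FALLBACK = date(2020, 1, 1)  # where such values are reported instead

# Move dates before 2019 to 2020-01-01, then order by date.
cleaned = sorted((FALLBACK if d < FLOOR else d, n) for d, n in daily)

# Turn the daily values into a cumulative series.
dates = [d for d, _ in cleaned]
cumulative = list(zip(dates, accumulate(n for _, n in cleaned)))
print(cumulative)
```

If the state's daily series starts partway through the pandemic, the running total starts from zero at that point, which is the limitation noted above.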
Examples
Test Results
The number of daily tests and test results is a metric that updates continuously because of different lab reporting schedules, reporting delays, and processing times (the time it takes to get test results).
We can use this data to show the continuous updates to daily testing numbers. This example uses PCR testing in Washington state.
from datetime import datetime

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from matplotlib import cm
import numpy as np
import pandas as pd

df = pd.read_csv('wa_avocado_complete.csv', parse_dates=['fetch_timestamp', 'timestamp'])

# Use only the tests (copy so that adding a column does not write to a slice of df)
tests = df[df['date_used'] == 'Specimen Collection'].copy()
tests['tests'] = tests['positiveTestsViral'] + tests['negativeTestsViral']

# Look at the data for February 2021, as reported in February and March of 2021
in_window = ((tests['fetch_timestamp'] >= datetime(2021, 2, 1)) &
             (tests['fetch_timestamp'] < datetime(2021, 4, 1)))
tests = (
    tests[in_window]
    .pivot_table(index='timestamp', columns='fetch_timestamp', values='tests')
    .loc[datetime(2021, 2, 1):datetime(2021, 3, 1)]
)

# Animate the results
class LineAnimation:
    def __init__(self, ax, data):
        # Ten lines, from dark (most recent fetch) to light (oldest fetch)
        self.lines = [ax.plot([], [], color=cm.Blues_r(np.linspace(0, 1, 10)[l]))[0] for l in range(10)]
        self.ax = ax
        self.data = data
        self.x = data.index
        # Set up plot parameters
        self.ax.set_xlim(self.data.index.min(), self.data.index.max())
        self.ax.set_ylim(4000000, 6000000)
        self.fetch_timestamp = ax.text(0.05, 0.9, '', transform=ax.transAxes)

    def __call__(self, i):
        # Fill the lines with the current fetch and up to nine earlier ones
        for line_index, col_index in enumerate(range(i, max(i - 10, -1), -1)):
            self.lines[line_index].set_data(self.x, self.data.iloc[:, col_index])
        self.fetch_timestamp.set_text(str(self.data.columns[i]))
        # With blit=True, every artist that changes must be returned
        return (*self.lines, self.fetch_timestamp)

fig, ax = plt.subplots(figsize=(21, 9))
all_these_lines = FuncAnimation(fig, LineAnimation(ax, tests), frames=len(tests.columns),
                                interval=240, repeat=False, blit=True)
all_these_lines.save('wa_february_2021_testing.gif')
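The same pivot layout can also be summarized numerically. The sketch below, on invented numbers, compares the first value fetched for each day with the latest one; the "completeness" ratio is an illustrative measure chosen here, not a CTP metric.

```python
import io

import pandas as pd

# Invented sample rows (values are not real data).
sample = io.StringIO(
    "timestamp,fetch_timestamp,tests\n"
    "2021-02-01,2021-02-02,100\n"
    "2021-02-01,2021-02-03,130\n"
    "2021-02-01,2021-02-04,135\n"
    "2021-02-02,2021-02-03,90\n"
    "2021-02-02,2021-02-04,120\n"
)
df = pd.read_csv(sample, parse_dates=["timestamp", "fetch_timestamp"])

# Rows: the day the value refers to; columns: the day it was fetched.
table = df.pivot_table(index="timestamp", columns="fetch_timestamp", values="tests")

# First and latest fetched value per day; bfill/ffill skip fetches
# in which a given day was not yet (or no longer) present.
first_seen = table.bfill(axis=1).iloc[:, 0]
latest = table.ffill(axis=1).iloc[:, -1]

# Share of the latest value that was already reported at first fetch.
completeness = (first_seen / latest).round(2)
print(completeness)
```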
Death Reporting
Accurate death reporting takes time, and the real-time data states report is always incomplete.
We can compare this preliminary data, reported by states and collected by the COVID Tracking Project, to the revised data states publish later.
from datetime import datetime

import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib.dates as mdates
import pandas as pd

# Cumulative deaths as reported day by day and collected by CTP
ctp_df = pd.read_csv(
    'https://api.covidtracking.com/v1/states/oh/daily.csv',
    parse_dates=['date'], index_col='date', usecols=['date', 'death'])

latest_df = pd.read_csv(
    'oh_avocado_complete.csv',
    parse_dates=['timestamp', 'fetch_timestamp'],
    usecols=['fetch_timestamp', 'timestamp', 'death', 'date_used'])

# Get only the most recent time series for death by day of death
is_latest = latest_df['fetch_timestamp'] == latest_df['fetch_timestamp'].max()
is_death = latest_df['date_used'] == 'Death'
latest_df = (
    latest_df[is_latest & is_death]
    .drop(columns=['fetch_timestamp', 'date_used'])
    .set_index('timestamp')
)

# Concat the two series, sort by date so diff() runs in chronological order,
# look at the daily diff, and use only 2020 data
df = pd.concat([latest_df, ctp_df], axis=1).sort_index().diff().loc[:datetime(2021, 1, 1)]
df.columns = ['Latest', 'Reported']

fig, ax = plt.subplots(figsize=(21, 9))
ax.bar(df.index, df['Reported'], width=1, linewidth=0, color=cm.Blues(0.3),
       label='Reported by the state and collected by CTP')
ax.plot(df.index, df['Latest'], lw=3, color=cm.Blues(0.8),
        label='Death by date of death (Most Recent)')
ax.set_title('Daily COVID19 deaths in Ohio in 2020', ha='center', size=22)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'))
ax.legend(loc='upper left')
plt.margins(x=0)
plt.savefig('oh_deaths.png')
Other Resources
States COVID-19 Time Series History 20210430
States COVID-19 Time Series History 2021-04-30
This release includes the raw time series data fetched by The COVID Tracking Project from states that provide such data directly (through data portals, CSV/excel files, etc.).
Description
This is a snapshot of the CTP States COVID-19 Time Series history dataset, taken on April 30, 2021.
This dataset includes the full time series for case, test and death metrics, fetched daily from states that provide such data.
This is an append-only dataset: when a time series is fetched, it is tagged with the date on which it was fetched, and that data is never overwritten. The next day, when the time series for the same metric is fetched again, it is tagged with a different fetch timestamp. This allows us to examine how daily values change as states amend previously reported values.
The data is tagged and organized into the same field names used by CTP APIs.
Content
The release consists of one file, statescovid19.tar.gz (avocado was the codename for this project), containing the following files:
- avocado_schema.sql
- avocado_complete.csv
We stored the data in a relational database, and the schema is provided in avocado_schema.sql, but the database is not required to use the data. The data is in avocado_complete.csv, which can be processed with any library or tool that supports CSV (pandas, uploading to BigQuery, etc.).
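For this release the CSV sits inside a tar.gz archive. The sketch below streams rows from it with the standard library only. The exact path of the CSV inside the archive is not documented here, so the hypothetical helper read_csv_member matches on the file name suffix as an assumption.

```python
import csv
import io
import os
import tarfile


def read_csv_member(archive, name="avocado_complete.csv"):
    """Yield rows (as dicts) from the first member whose path ends with `name`."""
    with tarfile.open(archive, mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(name):
                raw = tar.extractfile(member)
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from csv.DictReader(text)
                return
    raise FileNotFoundError(name)


ARCHIVE = "statescovid19.tar.gz"  # the downloaded release file
if os.path.exists(ARCHIVE):
    first = next(read_csv_member(ARCHIVE))
    print(first["state"], first["timestamp"], first["fetch_timestamp"])
```

Reading through the archive this way avoids unpacking it to disk; extracting the file and passing it to pandas works equally well.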
CSV fields are:
state -- 2 letter state abbreviation (e.g., MA)
date_used -- string representing the dating scheme (e.g., Specimen Collection)
timestamp -- date this data point refers to
fetch_timestamp -- date this data point was fetched on
date -- CTP-style string date (e.g., 20200513)
-- The rest of the fields are the same as CTP API
positive
positiveCasesViral
probableCases
death
deathConfirmed
deathProbable
total
totalTestsAntibody
positiveTestsAntibody
negativeTestsAntibody
totalTestsViral
positiveTestsViral
negativeTestsViral
totalTestEncountersViral
totalTestsAntigen
positiveTestsAntigen
negativeTestsAntigen
Each metric value is tagged with its state, timestamp, date_used and fetch_timestamp.
- date_used is the dating scheme that defines the metric. For testing, common dating schemes are Specimen Collection and Test Result; for cases, common dating schemes are Specimen Collection, Test Result and Illness Onset; and for death, a common dating scheme is Death, specifying date of death.
- timestamp is the state-assigned timestamp of this data point.
- fetch_timestamp is the timestamp when we collected the data from the state. For each fetch_timestamp we'll have the entire time series as it was fetched on that day.
Processing
The processing that went into the metrics presented here was minimal:
- Mapping each state's metric names to CTP names
- Calculating cumulative sums when the state provides only daily values. This is a limitation for states that provide only daily numbers without the beginning of the time series (e.g., ID), since the cumulative sum then omits the earlier values
- Cleanup of dates before 2019, which are likely data-entry mistakes, by reporting them on 2020-01-01 (e.g., MO)
TACO official 2021-04-30
taco_20210430 Updating public spreadsheet CSV backups
TACO complete 2021-04-30
taco Updating public spreadsheet CSV backups