Main

Owing to the current lack of fast and reliable testing, one of the greatest challenges in preventing transmission of SARS-CoV-2 is quickly identifying, tracing and isolating cases before they can further spread the infection to susceptible individuals. As regions across the United States start implementing measures to reopen businesses, schools and other activities, many rely on current screening practices for COVID-19, which typically include a combination of symptom and travel-related survey questions and temperature measurements. However, this method is likely to miss pre-symptomatic or asymptomatic cases, which make up ~40–45% of those infected with SARS-CoV-2 and who can still be infectious1,2. An elevated temperature (>100 °F (>37.8 °C)) is not as common as frequently believed, being present in only 12% of individuals who tested positive for COVID-193 and just 31% of patients hospitalized with COVID-19 (at the time of admission)4.

Smartwatches and activity trackers, which are now worn by one in five Americans5, can improve our ability to objectively characterize each individual’s unique baseline for resting heart rate6, sleep7 and activity and can therefore be used to identify subtle changes in that user’s data that may indicate that they are coming down with a viral illness. Previous research from our group has shown that this method, when aggregated at the population level, can significantly improve real-time predictions for influenza-like illness8. Consequently, we created a prospective app-based research platform, called DETECT (Digital Engagement and Tracking for Early Control and Treatment), where individuals can share their sensor data, self-reported symptoms, diagnoses and electronic health record data with the aim of improving our ability to identify and track individual- and population-level viral illnesses, including COVID-19.

A previously reported study that captured symptom data from over 18,000 SARS-CoV-2-tested individuals via a smartphone-based app found that symptoms helped to distinguish between individuals with and without COVID-191. The aim of this study was to investigate whether adding individual changes in sensor data to symptom data improves our ability to distinguish COVID-19-positive from COVID-19-negative cases among participants who self-reported symptoms.

Between 25 March and 7 June 2020, our research study enrolled 30,529 individuals, with representation from every state in the United States. Among the consented individuals, 62.0% were female and 12.8% were 65 years of age or older. Of the participants, 78.4% connected a Fitbit device to the study app, 31.2% connected data from Apple HealthKit and 8.1% connected data from Google Fit (an individual could connect to more than one platform). In addition, 3,811 participants (12.5%) reported at least one symptom; of those, 54 also reported testing positive for COVID-19 and 279 reported testing negative. The numbers of days of data per data type and data aggregator system are reported in Table 1, and the distribution of symptoms among symptomatic individuals who were or were not tested for COVID-19 is shown in Fig. 1.

Table 1 Participants’ characteristics and device data
Fig. 1: Frequency of symptoms among participants.

Participants who reported at least one symptom were divided into three groups: those who tested positive for COVID-19, those who tested negative and those who were not tested. The frequencies of the indicated symptoms in each of these three groups are shown. P values of a two-sided Fisher's exact test applied to COVID-19-positive (54 individual subjects) and COVID-19-negative (279 individual subjects) participants are reported. Symptoms with a significant difference (P < 0.05) are marked with an asterisk.

A minority (30.3%) of symptomatic participants who were tested for COVID-19 had a resting heart rate (RHR) more than two standard deviations above their average baseline value during the symptomatic period. On its own, the change in RHR (Table 1) did not significantly discriminate between COVID-19-positive and COVID-19-negative participants using the RHRMetric (area under the curve (AUC) of 0.52 (interquartile range (IQR): 0.41–0.64)) (Fig. 2a).

Fig. 2: Prediction of COVID-19 from self-reported symptoms and sensor data.

a–f, Receiver operating characteristic curves (ROCs) for the discrimination between COVID-19-positive (54 individuals) and COVID-19-negative (279 individuals) cases based on the available data: RHR data (a); sleep data (b); activity data (c); all available sensor data (d); symptoms only (e); symptoms with sensor data (f). Models are based on a single decision threshold. Median values and 95% confidence intervals (CIs) for sensitivity (SE), specificity (SP), positive predictive value (PPV) and negative predictive value (NPV) are reported, considering the point on the ROC with the highest average value of sensitivity and specificity. Error bars represent 95% CIs. P values from a one-sided Mann–Whitney U test are reported.

Sleep and activity did show a significant difference between the two groups (Table 1), with an AUC of 0.68 (0.57–0.79) for the SleepMetric (Fig. 2b) and 0.69 (0.61–0.77) for the ActivityMetric (Fig. 2c), indicating that the sleep and activity of COVID-19-positive participants were affected significantly more than those of COVID-19-negative participants. The sleep and activity metrics were weakly negatively correlated (correlation coefficient of −0.28, P < 0.01).

To evaluate the contribution of all the data types commonly available through personal devices, we combined the RHR, sleep and activity metrics into a single metric (SensorMetric, Fig. 2d). This improved overall performance relative to the three individual sensor metrics, with an AUC of 0.72 (0.64–0.80).

We also considered a model based only on self-reported symptoms, along with age and sex (SymptomMetric, Fig. 2e). Relative to the previously published model1, we measured a slightly lower AUC of 0.71 (0.63–0.79).

When participant-reported symptoms and sensor metrics were jointly considered in the analysis (OverallMetric, Fig. 2f), performance improved significantly (P < 0.01) relative to either alone, with an AUC of 0.80 (0.73–0.86).

Discussion

Our results show that individual changes in physiological measures captured by most smartwatches and activity trackers are able to significantly improve the distinction between symptomatic individuals with and without a diagnosis of COVID-19 beyond symptoms alone. Although encouraging, these results are based on a relatively small sample of participants.

This work builds on our earlier retrospective analysis demonstrating the potential for consumer sensors to identify individuals with influenza-like illness, which has subsequently been replicated in a similar analysis of over 1.3 million wearable users in China for predicting COVID-198,9. In response to the COVID-19 pandemic, a number of prospective studies, led by device manufacturers and/or academic institutions, including DETECT, have accelerated deployment to allow interested individuals to voluntarily share their sensor and clinical data to help address the global crisis10,11,12,13,14. The largest of these efforts, Corona-Datenspende, was developed by the Robert Koch Institut in Germany and has enrolled over 500,000 volunteers15.

As different individuals experience a wide range of symptomatic and biological responses to infection with SARS-CoV-2, it is likely that their measurable physiological changes will also vary16,17,18. For that reason, it is possible that biometric changes may be more valuable in identifying those at highest risk for decompensation rather than just a dichotomous distinction in infection status. Because of the limited testing in the United States, especially early in the spread of the COVID-19 pandemic, individuals with more severe symptoms may have been more likely to be tested. In fact, the majority of symptomatic participants in our study did not undergo testing. However, using the optimal tradeoff of sensitivity and specificity on the ROC, we would predict that, of the 3,478 symptomatic participants who did not undergo diagnostic testing, 1,061 would have tested positive. Consequently, the ability to differentiate between COVID-19-positive and COVID-19-negative cases based on symptoms and sensor data may change over time as testing increases, and as other upper respiratory illnesses such as seasonal influenza increase this fall.

The early identification of symptomatic and pre-symptomatic infected individuals would be especially valuable as transmission is common and people may potentially be even more infectious during this period19,20,21. Even when individuals have no symptoms, there is evidence that the majority have lung injury (according to computed tomography (CT) scans), and a large number have abnormalities in inflammatory markers, blood cell counts and liver enzymes18,22,23,24. As the depth and diversity of data types from personal sensors continue to expand—such as heart rate variability (HRV), respiratory rate, temperature, oxygen saturation and even continuous blood pressure, cardiac output and systemic vascular resistance—the ability to detect subtle individual changes in response to early infectious insults will potentially improve and enable the identification of individuals without symptoms.

In the past, the normality of a specific biometric parameter, such as RHR, duration of nightly sleep or daily activity, was based on population norms. For example, a normal RHR is generally considered to be anything between ~60 and 100 b.p.m. However, recent work examining individual daily RHRs over two years found that each person's RHR is relatively consistent, fluctuating by a median of only 3 b.p.m. from week to week6. Across individuals, by contrast, a normal RHR can differ by as much as 70 b.p.m. (ranging from 40 to 109 b.p.m.). The potential value of identifying important changes in an individual's RHR as an early marker of COVID-19 infection is suggested by the description of 5,700 patients hospitalized with COVID-194: at the time of admission, a greater percentage of individuals had a heart rate of >100 b.p.m. (43.1%) than had a fever (30.7%). Similarly, work in primate models of other viral and bacterial infections found that a significant increase in heart rate can be detected ~2 days before a fever25.

Just as individuals have heart rate patterns that are unique to them, the same is true for sleep patterns. Although population norms for sleep duration have been defined by one-time survey data26, longitudinal analysis of daily sleep over several years supports much greater variation in what is normal for a specific individual7. Recognizing what is normal for an individual enables much earlier detection of deviations from that normal.

A strategy of test, trace and isolate has played a central role in helping control the spread of COVID-19. However, testing comes with many challenges, including the enormous logistical and cost hurdles of recurrently testing asymptomatic individuals. In addition, testing in a population with very low prevalence can lead to a high proportion of false positive cases. A refined predictive model, based on personal sensors, could enable an early, individualized testing strategy to improve performance and lower costs. Early testing may make the use of a contact tracing app more effective by identifying positive cases in advance and allowing for early isolation.

DETECT and similar studies also represent the transition of research from a dependence on brick-and-mortar research centers to a remote, direct-to-participant approach now possible through a range of digital technologies, including an ever-expanding collection of sensors, the application of machine learning to massive datasets, and the ubiquitous connectivity that enables rapid two-way communication 24/727,28. The promise of digital technologies is that their evolution will continue to bring us closer to identifying the best combination of measures and associated algorithms for identifying infection with SARS-CoV-2 or other pathogens. However, it is equally critical to develop and continuously improve an engaging digital platform that provides value to participants and researchers. This has proven extremely challenging: a recent analysis of eight digital research programs involving 100,000 participants reported a median retention of only 5.5 days29. Digital trials such as DETECT also come with unique challenges in assuring privacy and security, which can only be addressed by effectively informing participants before consent, storing the data with the appropriate level of security and providing access to the data only for research purposes30. App-based contact tracing, which is not part of DETECT, is an especially sensitive and ethically complicated use of digital technology to address the pandemic31.

Our analyses depend entirely on participant-reported symptoms and testing results, as well as the biometric data from their personal devices. Although this differs from the historically more common direct collection of information in a controlled laboratory setting or via electronic health records, previous work has confirmed the value and accuracy of such data beyond what is captured during routine care32,33,34. Additionally, individuals who own a smartwatch or activity tracker and have access to COVID-19 diagnostic testing are unlikely to be representative of the general population, and this sample may exclude those most affected by COVID-19. Although a recent survey found no racial or ethnic variation in smartwatch or activity tracker usage (23%, 26% and 21% for Black, Hispanic and White individuals, respectively), the lowest percentages of users were identified among those with the lowest annual earnings (12%), the lowest educational attainment (15%) and those over age 50 (17%)5. In the future, if the value of wearable devices for improving individual health is confirmed, this gap in usage will need to be proactively addressed to assure health equity. The decreasing cost of these devices, some now less than US$35, will help lower the financial barriers to accomplishing this. Finally, in the early version of the DETECT app we were not able to track the duration or trajectory of individual symptoms, care received or eventual outcomes.

These results suggest that sensor data can incrementally improve symptom-only-based models to differentiate between COVID-19-positive and COVID-19-negative symptomatic individuals, with the potential to enhance our ability to identify a cluster before more spread occurs. Such a passive monitoring strategy may be complementary to virus testing, which is generally a one-off, or infrequent, sampling assay.

Methods

Study population

Any person living in the United States aged 18 years or older is eligible to participate in the DETECT study by downloading the iOS or Android research app, MyDataHelps. After consenting into the study, participants are asked to share their personal device data (including historical data collected prior to enrollment), report symptoms and diagnostic test results, and connect their electronic health records. Participants can opt to share as much or as little data as they like. Data can be pulled in via a direct application programming interface (API) connection with Fitbit devices, or from any device connected through the Apple HealthKit or Google Fit data aggregators. Participants were recruited via the study website (www.detectstudy.org), media reports and outreach from our partners at Fitbit, Walgreens, CVS/Aetna and others.

Ethical considerations

The protocol for this study was reviewed and approved by the Scripps Office for the Protection of Research Subjects (IRB-20-7531). All individuals participating in the study provided informed consent electronically.

Statistical analysis

Only participants with self-reported symptoms and COVID-19 test results were considered in this analysis. For each participant, two sets of data were extracted: the baseline data, which included signals spanning from 21 days to 7 days before the reported start date of symptoms, and the test data, which included signals spanning from the first day of symptoms to 7 days after symptom onset. Three types of data from personal sensors were considered: daily resting heart rate (DailyRHR), sleep duration in minutes (DailySleep) and activity based on daily total step count (DailyActivity). The daily resting heart rate is calculated by the specific device35. The total amount of sleep for a given day was the total period of sleep between 12 noon of the current day and 12 noon of the next day. When multiple devices from the same individual provided the same information, Fitbit device data were prioritized, for consistency. Overlapping data were combined minute by minute before being aggregated for the whole day.
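For illustration, a minimal pandas sketch of this windowing, assuming a per-participant DataFrame with a `date` column plus daily sensor columns and the participant's reported symptom-onset date (the column names are illustrative, not the study's actual schema):

```python
import pandas as pd

def split_windows(daily: pd.DataFrame, symptom_onset: pd.Timestamp):
    """Split one participant's daily records into baseline and test windows.

    Baseline: 21 to 7 days before the reported start date of symptoms.
    Test: the first day of symptoms to 7 days after symptom onset.
    """
    days_from_onset = (daily["date"] - symptom_onset).dt.days
    baseline = daily[(days_from_onset >= -21) & (days_from_onset <= -7)]
    test = daily[(days_from_onset >= 0) & (days_from_onset <= 7)]
    return baseline, test
```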

A single baseline value per individual was extracted for each data type by considering the median value over the individual’s baseline data. This value is representative of a participant’s ‘normal’ before the reported symptoms. The baseline value was compared to the test data as follows:

$$\mathrm{RHRMetric} = \frac{\max\left(\mathrm{DailyRHR}\left[\mathrm{test\ data}\right]\right) - \mathrm{median}\left(\mathrm{DailyRHR}\left[\mathrm{baseline\ data}\right]\right)}{4.00}$$
$$\mathrm{SleepMetric} = \frac{\mathrm{mean}\left(\mathrm{DailySleep}\left[\mathrm{test\ data}\right]\right) - \mathrm{median}\left(\mathrm{DailySleep}\left[\mathrm{baseline\ data}\right]\right)}{56.06}$$
$$\mathrm{ActivityMetric} = \frac{\mathrm{mean}\left(\mathrm{DailyActivity}\left[\mathrm{test\ data}\right]\right) - \mathrm{median}\left(\mathrm{DailyActivity}\left[\mathrm{baseline\ data}\right]\right)}{2{,}489.85}$$
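As a sketch, these per-participant metrics could be computed from the baseline and test windows above as follows (column names are illustrative; the divisors are the IQR-normalization constants described below):

```python
import pandas as pd

def sensor_metrics(baseline: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Change-from-baseline metrics for one participant, scaled by the
    normalization constants given in the equations above."""
    return {
        "RHRMetric": (test["rhr"].max() - baseline["rhr"].median()) / 4.00,
        "SleepMetric": (test["sleep_minutes"].mean()
                        - baseline["sleep_minutes"].median()) / 56.06,
        "ActivityMetric": (test["steps"].mean()
                           - baseline["steps"].median()) / 2489.85,  # 2,489.85
    }
```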

Values were normalized to have a unitary IQR using normalization parameters calculated on all data recorded. For all these metrics, values close to zero indicate small variations from baseline values. This allows us to focus on intra-individual changes, which are minimally affected by the inter-individual variability due to the specific sensor’s hardware and estimation algorithms. For the metric based on symptoms only, we adapted the results from the study by Menni et al.1 to our available data:

$$\begin{array}{l}\mathrm{SymptomMetric} = -1.32 - \left(0.01 \times \mathrm{age}\right) + \left(0.44 \times \mathrm{gender}\left(\mathrm{male} = 1;\ \mathrm{female} = 0\right)\right) \\ \quad + \left(1.75 \times \mathrm{DecreaseInTasteSmell}\right) + \left(0.31 \times \mathrm{Cough}\right) + \left(0.49 \times \mathrm{Fatigue}\right)\end{array}$$

The multivariate logistic regression model from Menni et al. combined symptoms, age and gender to predict infection. The parameters were optimized by the authors on a large dataset including over 2 million people, 18,401 of whom had undergone a COVID-19 test.
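Written out as code, the adapted score is a simple linear combination of the coefficients reported above (the function name and the boolean encoding of symptoms are illustrative):

```python
def symptom_metric(age: float, male: bool, decrease_in_taste_smell: bool,
                   cough: bool, fatigue: bool) -> float:
    """Symptom-only score adapted from the Menni et al. logistic model."""
    return (-1.32
            - 0.01 * age
            + 0.44 * int(male)
            + 1.75 * int(decrease_in_taste_smell)
            + 0.31 * int(cough)
            + 0.49 * int(fatigue))
```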

A simple manual metric aggregation strategy without optimization was used to enable a clear understanding of the benefits provided when data from multiple sources were considered together. The aggregated metrics were

$${\mathrm{SensorMetric}} = {\mathrm{RHRMetric}}/10 + {\mathrm{SleepMetric}} - {\mathrm{ActivityMetric}}$$
$${\mathrm{OverallMetric}} = {\mathrm{SensorMetric}} + {\mathrm{SymptomMetric}}$$
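Combining the pieces, a sketch of the two aggregated scores for one participant, assuming the illustrative helper functions defined above:

```python
def aggregate_metrics(m: dict, symptom_score: float) -> dict:
    """Manual, non-optimized aggregation of the individual metrics."""
    sensor = m["RHRMetric"] / 10 + m["SleepMetric"] - m["ActivityMetric"]
    return {"SensorMetric": sensor, "OverallMetric": sensor + symptom_score}
```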

The main outcomes are ROC curves for each of the proposed metrics. The curves are obtained by considering a binary classification task between participants self-reported as COVID-19-positive and COVID-19-negative. The models are based on a single decision threshold, which is compared directly to the metric values, with the aim of minimizing overfitting while providing a fair comparison. Confidence intervals, reported at a confidence level of 95%, are estimated using a bootstrap method by repeatedly sampling the dataset with replacement. The sampling is performed in a stratified manner; that is, the balance of the classes is maintained across all experiments. Values for sensitivity (SE), specificity (SP), positive predictive value (PPV) and negative predictive value (NPV) were also calculated (Fig. 2). SE and SP are defined as the fraction of positive and negative individuals correctly classified, respectively, while PPV and NPV are the fraction of individuals predicted as positive and negative, respectively, that are correctly classified. These values are based on the point on the ROC with the optimal tradeoff between sensitivity and specificity, which may vary depending on the shape of the curve. For each metric analyzed, we applied a one-sided Mann–Whitney U test with the alternative hypothesis that the distribution of the positive class is stochastically greater than that of the negative class. All statistical tests were evaluated using the Python package scipy version 1.5.2. The comparison metric used to assess overall performance was the AUC of the ROC.
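A minimal sketch of this evaluation for a single metric follows, assuming one score and one binary test result per participant. It uses scikit-learn for the ROC computation, which is an assumption beyond the scipy usage stated above; the bootstrap scheme, test and operating-point choice mirror the description but are not the study's exact code.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate_metric(scores, labels, n_boot=1000, seed=0):
    """ROC analysis for one metric: AUC with a stratified bootstrap CI,
    a one-sided Mann-Whitney U test and the single-threshold operating
    point with the best sensitivity/specificity tradeoff."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    rng = np.random.default_rng(seed)
    pos, neg = np.where(labels == 1)[0], np.where(labels == 0)[0]

    # Stratified bootstrap: resample positives and negatives separately
    # so the class balance is preserved in every replicate.
    boot_aucs = []
    for _ in range(n_boot):
        idx = np.concatenate([rng.choice(pos, len(pos), replace=True),
                              rng.choice(neg, len(neg), replace=True)])
        boot_aucs.append(roc_auc_score(labels[idx], scores[idx]))
    auc_ci = np.percentile(boot_aucs, [2.5, 97.5])

    # One-sided test: positive-class scores stochastically greater.
    _, p_value = mannwhitneyu(scores[pos], scores[neg], alternative="greater")

    # Operating point maximizing the mean of sensitivity and specificity.
    fpr, tpr, thresholds = roc_curve(labels, scores)
    best = np.argmax((tpr + (1 - fpr)) / 2)
    return {"auc": roc_auc_score(labels, scores), "auc_ci_95": auc_ci,
            "p_value": p_value, "sensitivity": tpr[best],
            "specificity": 1 - fpr[best], "threshold": thresholds[best]}
```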

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.