Clinical Epidemiology, Volume 14

Unraveling COVID-19: A Large-Scale Characterization of 4.5 Million COVID-19 Cases Using CHARYBDIS



Kristin Kostka,1,2 Talita Duarte-Salles,3 Albert Prats-Uribe,4 Anthony G Sena,5,6 Andrea Pistillo,3 Sara Khalid,4 Lana YH Lai,7 Asieh Golozar,8,9 Thamir M Alshammari,10 Dalia M Dawoud,11 Fredrik Nyberg,12 Adam B Wilcox,13,14 Alan Andryc,5 Andrew Williams,15 Anna Ostropolets,16 Carlos Areia,17 Chi Young Jung,18 Christopher A Harle,19 Christian G Reich,1,2 Clair Blacketer,5,6 Daniel R Morales,20 David A Dorr,21 Edward Burn,3,4 Elena Roel,3,22 Eng Hooi Tan,4 Evan Minty,23 Frank DeFalco,5 Gabriel de Maeztu,24 Gigi Lipori,19 Hiba Alghoul,25 Hong Zhu,26 Jason A Thomas,13 Jiang Bian,19 Jimyung Park,27 Jordi Martínez Roldán,28 Jose D Posada,29 Juan M Banda,30 Juan P Horcajada,31 Julianna Kohler,32 Karishma Shah,33 Karthik Natarajan,16,34 Kristine E Lynch,35,36 Li Liu,37 Lisa M Schilling,38 Martina Recalde,3,22 Matthew Spotnitz,14 Mengchun Gong,39 Michael E Matheny,40,41 Neus Valveny,42 Nicole G Weiskopf,21 Nigam Shah,29 Osaid Alser,43 Paula Casajust,42 Rae Woong Park,27,44 Robert Schuff,21 Sarah Seager,1 Scott L DuVall,35,36 Seng Chan You,45 Seokyoung Song,46 Sergio Fernández-Bertolín,3 Stephen Fortin,5 Tanja Magoc,19 Thomas Falconer,16 Vignesh Subbian,47 Vojtech Huser,48 Waheed-Ul-Rahman Ahmed,33,49 William Carter,38 Yin Guan,50 Yankuic Galvan,19 Xing He,19 Peter R Rijnbeek,6 George Hripcsak,16,34 Patrick B Ryan,5,16 Marc A Suchard,51 Daniel Prieto-Alhambra4

1IQVIA, Cambridge, MA, USA; 2OHDSI Center at The Roux Institute, Northeastern University, Portland, ME, USA; 3Fundació Institut Universitari per a la recerca a l’Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain; 4Centre for Statistics in Medicine, NDORMS, University of Oxford, Oxford, UK; 5Janssen Research & Development, Titusville, NJ, USA; 6Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands; 7School of Medical Sciences, University of Manchester, Manchester, UK; 8Regeneron Pharmaceuticals, Tarrytown, NY, USA; 9Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA; 10College of Pharmacy, Riyadh Elm University, Riyadh, Saudi Arabia; 11National Institute for Health and Care Excellence, London, UK; 12School of Public Health and Community Medicine, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden; 13Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA; 14University of Washington Medicine, Seattle, WA, USA; 15Tufts Institute for Clinical Research and Health Policy Studies, Boston, MA, USA; 16Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA; 17Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, UK; 18Division of Respiratory and Critical Care Medicine, Department of Internal Medicine, Daegu Catholic University Medical Center, Daegu, South Korea; 19University of Florida Health, Gainesville, FL, USA; 20Division of Population Health and Genomics, University of Dundee, Dundee, UK; 21Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA; 22Universitat Autònoma de Barcelona, Barcelona, Spain; 23O’Brien Institute for Public Health, Faculty of Medicine, University of Calgary, Calgary, Canada; 24IOMED, Barcelona, Spain; 25Faculty of Medicine, Islamic University of Gaza, Gaza, Palestine; 26Nanfang Hospital, Southern Medical University, Guangzhou, People’s Republic of China; 27Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, South Korea; 28Director of Innovation and Digital Transformation, Hospital del Mar, Barcelona, Spain; 29Department of Medicine, School of Medicine, Stanford University, Redwood City, CA, USA; 30Georgia State University, Department of Computer Science, Atlanta, GA, USA; 31Department of Infectious Diseases, Hospital del Mar, Institut Hospital del Mar d’Investigació Mèdica (IMIM), Universitat Autònoma de Barcelona, Universitat Pompeu Fabra, Barcelona, Spain; 32United States Agency for International Development, Washington, DC, USA; 33Botnar Research Centre, NDORMS, University of Oxford, Oxford, UK; 34New York-Presbyterian Hospital, New York, NY, USA; 35VA Informatics and Computing Infrastructure, VA Salt Lake City Health Care System, Salt Lake City, UT, USA; 36Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT, USA; 37Biomedical Big Data Center, Nanfang Hospital, Southern Medical University, Guangzhou, People’s Republic of China; 38Data Science to Patient Value Program, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA; 39Institute of Health Management, Southern Medical University, Guangzhou, People’s Republic of China; 40Tennessee Valley Healthcare System, Veterans Affairs Medical Center, Nashville, TN, USA; 41Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; 42Real-World Evidence, TFS, Barcelona, Spain; 43Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA; 44Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, South Korea; 45Department of Preventive Medicine, Yonsei University College of Medicine, Seoul, South Korea; 46Department of Anesthesiology and Pain Medicine, Catholic University of Daegu, School of Medicine, Daegu, South Korea; 47College of Engineering, The University of Arizona, Tucson, AZ, USA; 48National Library of Medicine, National Institutes of Health, Bethesda, MD, USA; 49College of Medicine and Health, University of Exeter, St Luke’s Campus, Exeter, UK; 50DHC Technologies Co. Ltd., Beijing, People’s Republic of China; 51Departments of Biostatistics, Computational Medicine, and Human Genetics, University of California, Los Angeles, CA, USA

Correspondence: Daniel Prieto-Alhambra, Botnar Research Centre, Windmill Road, Oxford, OX37LD, UK, Email [email protected]

Purpose: Routinely collected real world data (RWD) have great utility in aiding the novel coronavirus disease (COVID-19) pandemic response. Here we present the international Observational Health Data Sciences and Informatics (OHDSI) Characterizing Health Associated Risks and Your Baseline Disease In SARS-COV-2 (CHARYBDIS) framework for standardisation and analysis of COVID-19 RWD.
Patients and Methods: We conducted a descriptive retrospective database study using a federated network of data partners in the United States, Europe (the Netherlands, Spain, the UK, Germany, France and Italy) and Asia (South Korea and China). The study protocol and analytical package were released on 11th June 2020 and are iteratively updated via GitHub. We identified three non-mutually exclusive cohorts of 4,537,153 individuals with a clinical COVID-19 diagnosis or positive test, 886,193 hospitalized with COVID-19, and 113,627 hospitalized with COVID-19 requiring intensive services.
Results: We aggregated over 22,000 unique characteristics describing patients with COVID-19. All comorbidities, symptoms, medications, and outcomes are described by cohort in aggregate counts and are readily available online. Globally, we observed similarities in the USA and Europe: more women diagnosed than men but more men hospitalized than women, most diagnosed cases between 25 and 60 years of age versus most hospitalized cases between 60 and 80 years of age. South Korea differed with more women than men hospitalized. Common comorbidities included type 2 diabetes, hypertension, chronic kidney disease and heart disease. Common presenting symptoms were dyspnea, cough and fever. Symptom data availability was more common in hospitalized cohorts than diagnosed.
Conclusion: We constructed a global, multi-centre view to describe trends in COVID-19 progression, management and evolution over time. By characterising baseline variability in patients and geography, our work provides critical context that may otherwise be misconstrued as data quality issues. This is important as we perform studies on adverse events of special interest in COVID-19 vaccine surveillance.

Keywords: OHDSI, OMOP CDM, descriptive epidemiology, real world data, real world evidence, open science

Introduction

The World Health Organization (WHO) declared the coronavirus disease 2019 (COVID-19) pandemic on 11 March 2020 after 118,000 reported cases in over 110 countries.5 By the end of 2021, the number of COVID-19 cases increased to over 278 million cases globally, and the death toll exceeded 5 million.6 Thousands of publications have attempted to aid our scientific understanding of this public health emergency.7,8

Characterisation studies, also called descriptive epidemiology, provide important context for our understanding of disease by describing the basic attributes of who gets sick and in what setting. The initial body of COVID-19 characterisation work gave researchers information on how starkly the novel coronavirus differed from flu-like illnesses: patients were more often male, younger, and had fewer concurrent comorbidities and less documented prior medication use.9

Routinely collected real world data (RWD) can be a powerful asset for understanding an evolving pandemic response.1,2 Each data source provides novel information, be it the geographic variability of COVID-19, the impact of varying government strategies to contain spread, or the evolution of treatment protocols. Given the extensive heterogeneity in public health strategies and clinical care across the world,10 a large, repeated multi-center study that describes disease across locations, practices, and populations while holding the data analysis constant would go far in determining which factors drive observed differences.

RWD networks are vital in helping to understand the magnitude of the problem and in developing possible mitigation strategies, both globally and locally.11,12 Here we present the response to the COVID-19 pandemic of the global Observational Health Data Sciences and Informatics (OHDSI) community, an international open-science initiative of more than 3500 collaborators from 34 countries.3 Founded in 2015, the OHDSI data network enabled a rapid baseline understanding of COVID-19 in emerging hotspots (United States of America [USA], Spain and South Korea).9 Our work evolved into a systematic framework for analysing and reporting COVID-19 RWD that we call Characterizing Health Associated Risks, and Your Baseline Disease In SARS-COV-2 (CHARYBDIS).

CHARYBDIS offers multiple insights into COVID-19 clinical presentations, management and progression. Herein we aim to describe baseline demographics, clinical characteristics, treatments received, and outcomes among individuals diagnosed and hospitalized with COVID-19 in actual practice settings in nine countries from three continents. These data reflect an international community of research collaborators who are working to advance retrospective database research in RWD for COVID-19. Our body of research is a freely available, foundational result set that can provide benchmarks for how COVID-19 manifests over time, including its inevitable evolution as we roll out additional vaccines and treatments.

Methods

Study Design, Setting and Data Sources

We conducted a descriptive retrospective database study using a federated network of data partners in the USA, Europe (the Netherlands, Spain, the UK, Germany, France and Italy) and Asia (South Korea and China). Each data partner mapped their source system to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM).13–15 The use of a CDM ensured shared conventions, including consistent representation of clinical terms across coding systems. We assessed the plausibility, conformance and completeness of each contributing database using a common data quality tool for repeated assessment and monitoring the adherence to conventions across the network.16,17 We ensured technical reproducibility by using the same package of analytical code for all contributing data partners.18

The study protocol and analytical package were released on 11 June 2020 and iterative updates have continued to be released via GitHub: https://github.com/ohdsi-studies/Covid19CharacterizationCharybdis.4 Twenty-three real world healthcare databases contributed to the CHARYBDIS study (Supplementary Table 1). Contributing institutes ranged from major academic medical centers to small community hospitals across three continents. Data capture ranged from December 2019 to as recently as January 2021 (site-specific dates in Supplementary Table 1). Prior to performing these analyses, all the data partners obtained Institutional Review Board (IRB) or equivalent governance approval. Each data partner executed the study package locally on their OMOP CDM. Only aggregate results from each database were publicly shared. Minimum cell sizes were determined by institutional protocols. All data partners consented to the external sharing of the result set on data.ohdsi.org.
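As a rough illustration of what sharing only aggregate results under a minimum cell size can look like, the sketch below masks small non-zero counts before export. The function name, the example counts, and the threshold passed in are hypothetical; the study states only that each institution set its own minimum cell size, and the actual study package may handle suppression differently.

```python
from typing import Dict, Optional


def suppress_small_cells(
    counts: Dict[str, int], min_cell_count: int
) -> Dict[str, Optional[int]]:
    """Mask non-zero counts below min_cell_count (hypothetical helper).

    Zero counts are kept as-is here; whether zeros should also be masked
    is a site-level policy choice and an assumption of this sketch.
    """
    masked: Dict[str, Optional[int]] = {}
    for characteristic, n in counts.items():
        if 0 < n < min_cell_count:
            masked[characteristic] = None  # suppressed before sharing
        else:
            masked[characteristic] = n
    return masked


# Illustrative counts only, not study results.
shared = suppress_small_cells({"dyspnea": 120, "anosmia": 3, "hiv": 0}, 5)
```

Masking at the site, before any result leaves the data partner, is what lets a federated network publish results without exposing patient-level records.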

Study Population and Follow-Up

We focused on three non-mutually exclusive COVID-19 cohorts: i) diagnosed with COVID-19 (a positive SARS-CoV-2 laboratory test or a clinical diagnosis code documenting COVID-19, with the earliest event serving as the index date); ii) hospitalized with COVID-19; and iii) hospitalized with COVID-19 and requiring intensive services. Due to variability in access to diagnostic testing, we specifically looked for the presence of a PCR or antigen laboratory test OR the use of clinical diagnosis codes documenting COVID-19 presentation.19 The codes used to identify cohorts and more detail on the definitions of the above cohorts can be found in Supplementary Table 2. These cohorts were generated both with a requirement of at least 365 days of data availability prior to the index date, and without any requirement for prior observation time. Databases created specifically for COVID-19 tracking may be unable to support extensive lookback periods; we therefore used multiple definitions to ensure inclusiveness in our approach. Cohorts were followed from their cohort-specific index date to the earliest of death, end of the observation period, or 30 days post-index.
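The index-date and follow-up rules above can be sketched as follows. This is a minimal illustration rather than the study package's logic: the function and parameter names are hypothetical, and each patient is reduced to a few dates instead of OMOP CDM records.

```python
from datetime import date, timedelta
from typing import Optional


def index_date(
    first_positive_test: Optional[date], first_diagnosis: Optional[date]
) -> Optional[date]:
    """Earliest of a positive test or a diagnosis code; None if neither exists."""
    candidates = [d for d in (first_positive_test, first_diagnosis) if d is not None]
    return min(candidates) if candidates else None


def follow_up_end(
    index: date,
    observation_end: date,
    death: Optional[date] = None,
    max_days: int = 30,
) -> date:
    """Earliest of death, end of observation, or max_days after index."""
    ends = [observation_end, index + timedelta(days=max_days)]
    if death is not None:
        ends.append(death)
    return min(ends)


def has_prior_observation(
    observation_start: date, index: date, required_days: int = 365
) -> bool:
    """True if at least required_days of data precede the index date."""
    return (index - observation_start).days >= required_days


# Hypothetical patient: diagnosis code recorded a week before the positive test.
idx = index_date(date(2020, 4, 10), date(2020, 4, 3))
end = follow_up_end(idx, observation_end=date(2020, 12, 31))
```

Running each cohort once with `required_days=365` and once with `required_days=0` mirrors the two cohort variants described above.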

Stratifications

Each cohort was analyzed as an overall study population and stratified by additional available characteristics, including follow-up time, socio-demographics, baseline comorbidities, pregnancy status (yes/no), and flu-like symptom episodes (yes/no). Detailed definitions of each stratification are available in Supplementary Table 2.

Baseline Characteristics, Symptoms, Medication Use and Outcomes of Interest

Information on socio-demographics was identified at or before baseline (index date). All conditions, symptoms and medications were identified and described at four different time intervals (1 year prior, 30 days prior, at index and up to 30 days after index). The definition of each symptom and outcome is provided in Supplementary Table 2.
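One way to picture the four time intervals is as day offsets relative to the index date. In the sketch below the exact window boundaries (for example, whether the prior windows overlap, or whether day 0 counts towards the post-index window) are assumptions made for illustration; the authoritative definitions are in the study protocol and Supplementary Table 2.

```python
from datetime import date
from typing import Dict, List, Tuple

# Day offsets relative to the index date (day 0).
# These boundaries are assumed for illustration, not taken from the protocol.
WINDOWS: Dict[str, Tuple[int, int]] = {
    "1 year prior": (-365, -1),
    "30 days prior": (-30, -1),
    "at index": (0, 0),
    "30 days after index": (0, 30),
}


def windows_for_event(index: date, event: date) -> List[str]:
    """Return every window (possibly several) that contains the event date."""
    offset = (event - index).days
    return [name for name, (start, end) in WINDOWS.items() if start <= offset <= end]


# Hypothetical record: a condition coded ten days before the index date.
hits = windows_for_event(date(2020, 4, 3), date(2020, 3, 24))
```

Because the assumed windows overlap, a single record can contribute to more than one interval, so counts should not be summed across windows.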

Statistical Analysis

We built this analysis using the Health Analytics Data-to-Evidence Suite (HADES), a set of open source R packages for large scale analytics.20 Proportions, standard deviations (SD), and standardized mean differences (SMD) within each subgroup were tabulated as pre-specified in our study protocol. This analysis was descriptive in nature, with the explicit intention of building an initial, repeatable framework for estimating the prevalence of disease. Only cohorts or stratified sub-cohorts with a minimum sample size of 140 subjects were characterized. This cut-off was deemed necessary to estimate with sufficient precision the prevalence of a previous condition or the 30-day risk of an outcome affecting ≥10% of the study population. SMDs were plotted in Manhattan-style plots, a type of scatter plot designed to visualize large data with a distribution of higher-magnitude values. Scatter plots were also created to compare the described conditions, symptoms and demographics of patients diagnosed (Y axis) to those hospitalized (X axis) with COVID-19.
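To make two of these quantities concrete, the sketch below computes an SMD for a binary characteristic and the half-width of a confidence interval around a proportion. Both formulas are conventional choices assumed here, not taken from the study protocol: the SMD uses the usual pooled-variance form for two proportions, and the precision check uses a 95% Wald interval. Under that assumption, a 10% prevalence in 140 subjects is estimated to within roughly ±5 percentage points, which is one plausible reading of the 140-subject cut-off.

```python
import math


def smd_binary(p1: float, p2: float) -> float:
    """Standardized mean difference between two proportions.

    Conventional pooled-variance form; the study package may differ.
    """
    pooled_sd = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    if pooled_sd == 0:
        # Both proportions are exactly 0 or 1.
        return 0.0 if p1 == p2 else math.copysign(math.inf, p1 - p2)
    return (p1 - p2) / pooled_sd


def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a Wald confidence interval for a proportion (z=1.96 ~ 95%)."""
    return z * math.sqrt(p * (1 - p) / n)


# Illustrative inputs only: 30% prevalence in one cohort versus 10% in another.
example_smd = smd_binary(0.30, 0.10)
# Precision at the cut-off: 10% prevalence in a cohort of 140 subjects.
example_half_width = wald_half_width(0.10, 140)
```

An SMD has no units, so characteristics with very different prevalences can share one axis in a Manhattan-style plot.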

Results

Patient Characteristics

Overall, we identified three non-mutually exclusive cohorts of 4,537,153 individuals with a clinical COVID-19 diagnosis or positive test, 886,193 hospitalized with COVID-19, and 113,627 hospitalized with COVID-19 requiring intensive services (Figure 1). Of these, the cohorts restricted to patients with at least 365 days of observation before index comprised 3,279,518 individuals with a clinical COVID-19 diagnosis or positive laboratory test, 636,810 hospitalized with COVID-19, and 63,636 hospitalized with COVID-19 requiring intensive services (Supplementary Tables 3 and 4).

Figure 1 COVID-19 cases across the OHDSI COVID-19 network.

Geographic Distribution

The USA data partners contributed 96% of the diagnosed COVID-19 cohort, including the single largest diagnosed cohort, from IQVIA Open Claims (n=2,785,812). Europe contributed 4% of the diagnosed cohort, with the largest regional diagnosed cohort coming from SIDIAP-Spain (n=124,305). Asia contributed less than 1% of the diagnosed cohort, with the largest regional diagnosed cohort contributed by Daegu Catholic University Medical Center (n=599).

Demographic Distribution

In the USA, the proportion of diagnosed cases generally decreased with age, with most diagnosed cases falling within the 25 to 60 year age group. The proportions of cases hospitalized and requiring intensive services increased with age, with the highest proportions of both in the 60 to 80 year age group (Figure 2). A slightly higher proportion of women than men were diagnosed, but a greater proportion of men than women were hospitalized (and, where available, required intensive services) in the USA databases. In Europe, databases captured diagnosed or hospitalized cohorts but had limited information on intensive services. European databases capturing hospitalized cases (HMAR, HM-Hospitales, SIDIAP, and SIDIAP-H) showed a similar trend to the USA databases, in that a higher proportion of men than women were hospitalized (Supplementary Figure 1). Unlike the USA and European databases, the South Korean database (HIRA) had a higher proportion of women among hospitalized cases. Age-wise trends in the European and Asian databases were similar to those in the USA databases, in that the bulk of the diagnosed cases were in the 25 to 60 year age group, whilst the majority of the hospitalized cases were in the 60 to 80 year age group (Supplementary Figure 1).

Figure 2 Distribution of diagnosed, hospitalized and requiring intensive services COVID-19 cases by age and sex across the OHDSI COVID-19 network in the United States.

Abbreviations: diag, diagnosed; hosp, hospitalized; i.s., hospitalized and requiring intensive services; CU-AMC-HDC, U of Colorado Anschutz Medical Campus Health Data Compass; CUIMC, Columbia University Irving Medical Center; IQVIAHospitalCDM, IQVIA Hospital Charge Data Master; OHSU, Oregon Health and Science University; OPTUM-EHR, Optum© de-identified Electronic Health Record Dataset; OPTUM-SES, Optum® De-Identified Clinformatics® Data Mart Database – Socio-Economic Status (SES); STARR-OMOP, Stanford Medicine Research Data Repository; TRDW, Tufts MC Research Data Warehouse; UWM-CRD, UW Medicine COVID Research Dataset; VA-OMOP, Department of Veterans Affairs.

Notes: In each subplot, the x-axis represents what proportion of all women (left) and all men (right) fall in each age category. No prior observation period was required in the cohorts shown in this figure. Cohorts must include ≥140 people to be reported in this analysis.

Comorbidities

Overall, the proportions of patients with type 2 diabetes mellitus, hypertension, chronic kidney disease, end stage renal disease, heart disease, malignant neoplasm, obesity, dementia, autoimmune conditions, chronic obstructive pulmonary disease (COPD), and asthma were higher in the hospitalized cohort than in the diagnosed cohort (Tables 1 and 2). Data on tuberculosis, human immunodeficiency virus (HIV), and hepatitis C infections were sparse, and where available the proportions were generally low (≤1%). In the US databases, the proportion of pregnant women was generally higher in the hospitalized cohort than in the diagnosed cohort, but not so in two European databases (HM and SIDIAP). The remaining five European databases and one of the Asian databases had data on pregnant women only in the hospitalized cohort, where the proportion was < 2%.

Table 1 Characteristics of Persons with a COVID-19 Diagnosis or SARS-CoV-2 Positive Test Across the OHDSI COVID-19 Network*

Table 2 Characteristics of Persons Hospitalized with a COVID-19 Diagnosis or SARS-CoV-2 Positive Test Across the OHDSI COVID-19 Network*

Other Analyses

Dyspnea, cough, and fever were the most common symptoms in diagnosed and hospitalized cohorts globally (Supplementary Table 5). Where recorded, the proportions of dyspnea and malaise/fatigue were consistently higher in the hospitalized cohort than in the diagnosed cohort. Anosmia/hyposmia/dysgeusia was present in less than 1% of individuals in all but one database and was more common in the diagnosed than the hospitalized cohorts (Supplementary Table 6).

We further described a total of 19,222 conditions and 2973 medications registered during the year prior to the index date (Supplementary Figure 2). The same information is also described for 30 days prior to the index date, at index date, or during the first 30 days after index date (Supplementary Tables 46). The full result set of comorbidities, presenting symptoms, medications and outcomes is reported by cohort in aggregate counts and is available on an interactive website: https://data.ohdsi.org/Covid19CharacterizationCharybdis/.

Discussion

CHARYBDIS is the world’s largest open science aggregate result set aimed at describing the baseline demographics, clinical characteristics, treatments received, and outcomes among individuals diagnosed and hospitalized with COVID-19. To accomplish this, we aggregated over 22,000 unique characteristics, creating a multi-centre view to describe trends in COVID-19 progression, management and evolution over time. Globally, we observed similarities between the USA and Europe in gender (more women diagnosed than men but more men hospitalized than women) and age (most diagnosed cases between 25–60 years of age versus most hospitalized cases between 60–80 years of age) distributions. Similar to previous studies, we observed that South Korea differed, with more women than men hospitalized. We found similarities in comorbidities and presenting symptoms. The large, diverse sample also allows for the identification of populations of great interest, including children and adolescents,25 pregnant women,26 patients with a history of cancer,27 patients with a history of autoimmune disorders,28 and patterns of drug utilization in COVID-19 treatment,21 all of which were the focus of additional in-depth investigations.

Summary of Key Findings

We described characteristics of 4,537,153 individuals with a clinical COVID-19 diagnosis or positive test, 886,193 hospitalized with COVID-19, and 113,627 hospitalized with COVID-19 requiring intensive services from 9 countries. Up to 22,200 unique aggregate characteristics have been produced across databases, all of which are publicly available on an accompanying website. The evidence framework is a method for systematically understanding cohort-level differences in COVID-19 across regions and across points in the pandemic. In the months since we started this effort, our network has already aided rapid studies of coagulopathy and of adverse events of special interest for COVID-19 vaccines to inform regulatory bodies.22 This research community can be a public health utility to guide 1) better patient characterization and stratification, 2) identification of gaps in knowledge/evidence, and 3) generation of hypotheses for future research.

Comparison to Other Multi-Centre COVID-19 Consortia

We began our deep phenotyping work through an initial investigation of persons hospitalized with COVID-19 compared to prior flu seasons in our global federated network.9

The National COVID Cohort Collaborative (N3C) is an NIH NCATS-funded initiative collecting and centralizing patient-level data to study patterns in COVID-19 patients.23 This effort has over 80 participating institutions contributing 4.5M COVID-19 patients to date to a centralized harmonized repository. The consortium has enabled many US institutions to adopt common data models in COVID-19 research. 4CE is another multi-site data-sharing collaborative of 342 hospitals in the US and in Europe, utilizing i2b2 or OMOP data models.24 The hospitalization cohorts presented by 4CE remain smaller than those of CHARYBDIS, with only 36,447 hospitalized patients with COVID-19 as of August 2020.24 Even when adjusting for cohort overlap, our work to date with CHARYBDIS is nearly triple the diagnosis and double the hospitalized cohorts represented in prior research. Our results also have more international representation across the cascade of hotspots over the course of the pandemic’s spread. As we continue our research, we are working with researchers to create inpatient-outpatient linkages and understand COVID-19 patient trajectories across care settings.

Study Strengths

Our study has several strengths. This study is unique in its approach to characterizing COVID-19 cases across an international network of healthcare systems with varied policies enacted to combat this pandemic. This allows better understanding of the implications of the pandemic for different countries and regions, in the context of an international comparison. In particular, it provides visibility into the variability of patient characteristics across healthcare settings. This study draws on the most comprehensive federated network of healthcare sites in the world, creating the single largest cohort study on diagnosed and hospitalized COVID-19 cases to date. The large, diverse sample allows for extensive investigation of subgroups of interest. CHARYBDIS is the framework for additional in-depth investigations on children and adolescents,25 pregnant women,26 patients with a history of cancer,27 patients with a history of autoimmune disorders,21 or patterns of drug utilization in COVID-19 treatment.21 The result set is so large that hundreds of combinations of subgroups of interest remain unreported. There is significant opportunity for this framework to inform additional research.

Study Limitations

We recognize there are limitations in our approach. First, this study is descriptive in nature. Further analyses are needed to utilize these findings in clinical application. The observed differences between groups (eg diagnosed versus hospitalized) should therefore not be interpreted as causal effects without further statistical scrutiny. Answering causal questions is especially difficult in COVID-19 because of the varying processes by which patients were screened, tested, admitted, and treated; the critical importance of knowing the exact timing of treatments and outcomes in severe cases; and the lack of appropriate comparison groups. Simple multivariable models by themselves will not sufficiently address bias for multiple questions and were purposely not applied here. This study was carried out using data recorded in routine clinical practice and based on electronic health records (EHRs) and/or claims data. The analysed data are therefore expected to be incomplete in some respects and may have erroneous entries, leading to potential misclassification. We have selectively reported database-specific outcomes to minimise the impact of incompleteness. We are aware that this may mean the network assembled is not inherently valuable for every follow-on analysis, as each data partner may have different elements missing. Hospital encounter data may not capture outcomes experienced in outpatient settings. Our EHR partners rely on structured data and may be missing key findings from clinical notes. Additionally, the under-reporting of symptoms observed in these data is a key finding of this study, and should be taken into consideration in previous and future similar reports from “real world” cohorts. Differential reporting in different databases is likely a function of differential coding practice as well as of variability in disease severity, with milder/less symptomatic cases more likely presenting in outpatient and primary care EHR, and more severe ones in hospital databases. In addition, the current result submissions are skewed towards data from the initial wave of COVID-19 cases. Further analysis using this network requires stratification by calendar month. Lastly, we currently lack data partners in low to middle income countries and recognize these data are lacking representation of some of the hardest hit areas in the world (eg Brazil, India). As data are accumulated over time, future updates of the results will provide the opportunity to study more recent cohorts of COVID-19 patients, who seem to have a better prognosis overall compared to those diagnosed in the first half of the pandemic.

Conclusion

We constructed a global, multi-centre view to describe trends in COVID-19 progression, management and evolution over time. By characterising baseline variability in demographics across geography, our work provides critical context for the reliability of the insights we generate. In retrospective database studies, one can struggle to identify whether heterogeneity occurs because of patient variability or because of the variability in source systems we use to capture patient data. Here we use a network of retrospective databases standardised to the same data model, adhering to a shared ontology and data quality processes. Our study provides a comprehensive view into the first year of the pandemic at a scale unlike most retrospective research. Our work sheds light on the natural history of millions of COVID-19 patients from the USA, 6 European countries and 2 Asian countries. This framework is open source and available for re-use, enabling a repeatable, reproducible method to capture the evolving natural history of this novel coronavirus, and can be extended to other diseases of international interest. We believe it is critically important to repeat and reproduce the findings we produce in real world studies. Leveraging this global federated network to corroborate single center findings can provide context to national database findings in the presence of regional variability in COVID-19 management, including vaccine rollout and treatments.

Transparency Declaration

Lead authors affirm that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.

Data Sharing Statement

Analyses were performed locally in compliance with all applicable data privacy laws. Although the underlying identified patient data is not readily available to be shared, authors contributing to this paper have direct access to the data sources used in this study. All results (eg aggregate statistics, not presented at a patient-level with redactions for minimum cell count) are available for public inquiry. These results are inclusive of site-identifiers by contributing data sources to enable interrogation of each contributing site. All analytic code and result sets are made available at: https://github.com/ohdsi-studies/Covid19CharacterizationCharybdis.

Ethical Approval

All the data partners received Institutional Review Board (IRB) approval or exemption. STARR-OMOP had approval from IRB Panel #8 (RB-53248) registered to Leland Stanford Junior University under the Stanford Human Research Protection Program (HRPP). The use of VA data was reviewed by the Department of Veterans Affairs Central IRB, was determined to meet the criteria for exemption under Exemption Category 4(3), and approved for Waiver of HIPAA Authorization. The research was approved by the Columbia University Institutional Review Board as an OHDSI network study. The use of SIDIAP was approved by the Clinical Research Ethics Committee of the IDIAPJGol (project code: 20/070-PCV). The use of HMAR was approved by the Parc de Salut Mar Clinical Research Ethics Committee. The use of CPRD was approved by the Independent Scientific Advisory Committee (ISAC) (protocol number 20_059RA2). This study is approved by the University of Florida IRB under protocol IRB202100175. Some databases used (HealthVerity, Premier, IQVIA Open Claims, Optum EHR, and Optum SES) in these analyses are commercially available, syndicated data assets that are licensed by contributing authors for observational research. These assets are de-identified commercially available data products that could be purchased and licensed by any researcher. The collection and de-identification of these data assets is a process that is commercial intellectual property and not privileged to the data licensees and the co-authors on this study. Licensees of these data have signed Data Use Agreements with the data vendors which detail the usage protocols for running retrospective research on these databases. All analyses performed in this study were in accordance with Data Use Agreement terms as specified by the data owners. 
As these data are deemed commercial assets, there is no Institutional Review Board applicable to the usage and dissemination of these result sets or required registration of the protocol with additional ethics oversight. Compliance with Data Use Agreement terms, which stipulate how these data can be used and for what purpose, is sufficient for the licensing commercial entities. Further inquiry related to the governance oversight of these assets can be made with the respective commercial entities: HealthVerity (healthverity.com), Premier (premierinc.com), IQVIA (iqvia.com) and Optum (optum.com). At no point in the course of this study were the authors of this study exposed to identified patient-level data. All result sets represent aggregate, de-identified data that are represented at a minimum cell size of >5 to reduce potential for re-identification. Furthermore, the New England Institutional Review Board of Janssen Research & Development (Raritan, NJ) has determined that studies conducted on licensed copies of Premier, Optum EHR, Optum SES and HealthVerity are exempt from study-specific IRB review, as these studies do not qualify as human subjects research.

Acknowledgments

We would like to acknowledge the patients who suffered from or died of this devastating disease, and their families and caregivers. We would also like to thank the social workers and healthcare professionals involved in the management of COVID-19 during these challenging times, from primary care to intensive care units. We also thank the database curation teams around the world including the COVIDMAR Group (R.Güerri, J.Villar, L.Sorlí, M.Montero, S.Gómez-Zorrilla, I.López-Montesinos, M.Arenas-Miras, J.Gómez-Junyent, I.Arrieta, E.Sendra, S.Castañeda, E.Letang, I.Pelegrín, A.Rial, J.Rodríguez, C.Gimenez, J.Soldado, E.García). Kristin Kostka and Talita Duarte-Salles are co-first authors for this study. Marc A Suchard and Daniel Prieto-Alhambra are co-senior authors for this study.

Author Contributions

KK, TDS, APU, AGS, AP, LL, PC, EB, VH, FN, SK, JK, AG, MAS, PR, GH, MS, AO, SD, MM, LMS, OA, CA, HA, KaS, WurA, JMB, NV, GdM, TMA, PJR, DPA contributed to the conceptualization and design of the study. KK, TDS, APU, AGS, AP, LL, PC, EB, VH, FN, SK, AG, MAS, PR, GH, MS, AO, SD, MM, LMS, NV, GdM, PJR, DPA contributed to the analysis phase of the study. KK, TDS, APU, AGS, AP, PC, SFB, EB, JAT, ABW, SK, PRR, GH, TF, KN, AA, SF, NS, JoP, AW, KL, WC, CB, FD, CR, SGY, JyP, RWP, SS, CYJ, HZ, LiL, MG, YG, YZ, PJR, DPA, DavidD, RS, NW, XH, TM, CH, GL, JB, JMR, JPH, YanG are data owners and contributed to the extract-transform-load of their data to the OMOP CDM and the analytical execution of the study package within their local environments. KK, TDS, APU, AGS, AP, LL, PC, EB, SK, MR, ER, AG, JK, MAS, PR, GH, DD, VS, TMA, EHT, EM, MAS, PJR, DPA were critical to drafting the manuscript and to the overall interpretation of results. All authors made a significant contribution to the work reported, whether that is in the conception, study design, execution, acquisition of data, analysis and interpretation, or in all these areas; took part in drafting, revising or critically reviewing the article; gave final approval of the version to be published; have agreed on the journal to which the article has been submitted; and agree to be accountable for all aspects of the work.

Funding

The European Health Data & Evidence Network has received funding from the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement No 806968. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and EFPIA. This research received partial support from the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC), US National Institutes of Health, US Department of Veterans Affairs, the Health Department from the Generalitat de Catalunya with a grant for research projects on SARS-CoV-2 and COVID-19 disease organized by the Direcció General de Recerca i Innovació en Salut, Janssen Research & Development, IQVIA, TFS and IOMED. The University of Oxford received funding related to this work from the Bill & Melinda Gates Foundation (Investment ID INV-016201 and INV-019257). This study was supported by National Key Research & Development Program of China (Project No.2018YFC0116901). TFS received funding related to this work from the University of Oxford. OHSU received support from Gates Foundation, INV-016910 and the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, through Grant Award Number UL1TR002369. The University of Washington received a grant related to this work from the Bill & Melinda Gates Foundation (INV-016910). No funders had a direct role in this study. The views and opinions expressed are those of the authors and do not necessarily reflect those of the Clinician Scientist Award programme, NIHR, Department of Veterans Affairs or the United States Government, NHS, National Institute for Health and Care Excellence (NICE) or the Department of Health, England. 
The Ajou University received funding related to this work from the Bill & Melinda Gates Foundation (Investment ID INV-016284), from the Bio Industrial Strategic Technology Development Program (20003883), funded by the Ministry of Trade, Industry & Energy, and from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute, funded by the Ministry of Health & Welfare, Republic of Korea (HR16C0001).

Disclosure

Ms. Kostka was an employee of IQVIA during the conduct of this study and received grant funding from the NIH NCATS National COVID Cohort Collaborative and the Bill and Melinda Gates Foundation. Mr. Sena is an employee and holds stock at Janssen Research & Development, a Johnson and Johnson family of companies. Dr. Golozar reports personal fees from Regeneron Pharmaceuticals, outside the submitted work. She is a full-time employee at Regeneron Pharmaceuticals. This work was not conducted at Regeneron Pharmaceuticals. Dr. Nyberg was an employee of AstraZeneca until 2019 and holds some shares. Dr. Wilcox reports grants from Bill and Melinda Gates Foundation, grants from National Institutes of Health, during the conduct of the study. Mr. Andryc is an employee of Janssen Research & Development, a subsidiary of Johnson & Johnson. Dr. Reich is an employee of IQVIA. Dr. Blacketer reports she is an employee and holds stock at Janssen Research & Development, a Johnson and Johnson family of companies. Dr. Morales is supported by a Wellcome Trust Clinical Research Development Fellowship (Grant 214588/Z/18/Z) and reports grants from Chief Scientist Office (CSO), grants from Health Data Research UK (HDR-UK), grants from National Institute of Health Research (NIHR), outside the submitted work. Mr. DeFalco reports he is an employee and holds stock at Janssen Research & Development, a Johnson and Johnson family of companies. Mr. Thomas reports grants from Bill and Melinda Gates Foundation (INV-016910), grants from National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, through Grant Award Number UL1TR002369 to his institution, during the conduct of the study. Dr Jiang Bian reports grants from NIH/NIEHS (R21ES032762), during the conduct of the study. Dr. Posada reports grants from National Library of Medicine, during the conduct of the study. Dr. Natarajan reports grants from US NIH, during the conduct of the study. Dr. Matheny reports grants from US NIH, grants from US VA HSR&D, during the conduct of the study. Dr. Weiskopf reports personal fees from Merck, during the conduct of the study and outside the submitted work. Dr. Shah reports grants from National Library of Medicine, during the conduct of the study. Dr. Park reports grants from Ministry of Trade, Industry & Energy, Republic of Korea, grants from Ministry of Health & Welfare, Republic of Korea, grants from Bill & Melinda Gates Foundation, during the conduct of the study. Mr Robert Schuff reports grants from Gates Foundation, grants from NIH-NCATS, during the conduct of the study. Ms. Seager is an employee of IQVIA. Dr. DuVall reports grants from Anolinx, LLC, Astellas Pharma, Inc, AstraZeneca Pharmaceuticals LP, Boehringer Ingelheim International GmbH, Celgene Corporation, Eli Lilly and Company, Genentech Inc., Genomic Health, Inc., Gilead Sciences Inc., GlaxoSmithKline PLC, Innocrin Pharmaceuticals Inc., Janssen Pharmaceuticals, Inc., Kantar Health, Myriad Genetic Laboratories, Inc., Novartis International AG, and Parexel International Corporation through the University of Utah or Western Institute for Veteran Research outside the submitted work. Dr. Fortin is an employee of Janssen R&D, a subsidiary of Johnson and Johnson. Dr. Subbian reports grants from State of Arizona; Arizona Board of Regents, during the conduct of the study; grants from National Science Foundation (grant# 1838745), grants from Agency for Healthcare Research and Quality, grants from National Institutes of Health, outside the submitted work. Dr. Rijnbeek reports grants from Innovative Medicines Initiative, Janssen Research and Development, during the conduct of the study. He also works for a research institute which receives/received unconditional research grants from Yamanouchi, Pfizer-Boehringer Ingelheim, GSK, Amgen, UCB, Novartis, Astra-Zeneca, Chiesi, Janssen Research and Development, none of which relate to the content of this work. Dr. Hripcsak reports grants from US NIH and Janssen Research. Dr. Ryan is an employee of Janssen Research and Development and shareholder of Johnson & Johnson. Dr. Suchard reports grants from US National Institutes of Health, Department of Veterans Affairs, during the conduct of the study; grants and/or personal fees from IQVIA, Janssen Research and Development, US Food and Drug Administration, and Private Health Management, outside the submitted work. Dr. Prieto-Alhambra reports grants, non-financial support, speaker/consultancy services and/or advisory board membership from AMGEN, UCB Biopharma, and Les Laboratoires Servier, outside the submitted work; and Janssen, on behalf of IMI-funded EHDEN and EMIF consortiums, and Synapse Management Partners have supported training programmes organised by DPA’s Department and open for external participants. The views expressed are those of the authors and do not necessarily represent the views or policy of the Department of Veterans Affairs or the United States Government. There are no other relationships or activities that could appear to have influenced the submitted work. The authors report no other conflicts of interest in this work.

References

1. Kent S, Burn E, Dawoud D, et al. Common problems, common data model solutions: evidence generation for health technology assessment. Pharmacoeconomics. 2020;39:275–285. doi:10.1007/s40273-020-00981-9

2. Forrest CB, McTigue KM, Hernandez AF, et al. PCORnet® 2020: current state, accomplishments, and future directions. J Clin Epidemiol. 2021;129:60–67. doi:10.1016/j.jclinepi.2020.09.036

3. Hripcsak G, Duke JD, Shah NH, et al. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Stud Health Technol Inform. 2015;216:574–578.

4. Sena A, Kostka K, Schuemie M, Posada JD, Schuemie M. ohdsi-studies/Covid19CharacterizationCharybdis: Charybdis v1.1.1 - Publication Package. 2020. doi:10.5281/zenodo.4033034.

5. WHO Director-General’s opening remarks at the media briefing on COVID-19 - 11 March 2020; 2021. Available from: https://www.who.int/director-general/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19---11-march-2020.

6. Johns Hopkins Coronavirus Resource Center. COVID-19 map; 2021. Available from: https://coronavirus.jhu.edu/map.html. Accessed March 4, 2022.

7. COVID-19-related medical research: a meta-research and critical appraisal; 2021. Available from: https://www.docwirenews.com/abstracts/covid-19-related-Medical-research-A-meta-research-and-critical-appraisal/.

8. Teixeira da Silva JA, Tsigaris P, Erfanmanesh M. Publishing volumes in major databases related to Covid-19. Scientometrics. 2020;1–12. doi:10.1007/s11192-020-03675-3

9. Burn E, You SC, Sena AG, et al. Deep phenotyping of 34,128 adult patients hospitalized with COVID-19 in an international network study. Nat Commun. 2020;11:5009. doi:10.1038/s41467-020-18849-z

10. Subbian V, Solomonides A, Clarkson M, et al. Ethics and Informatics in the age of COVID-19: challenges and recommendations for public health organization and public policy. J Am Med Inform Assoc. 2020;27. doi:10.1093/jamia/ocaa188

11. Madhavan S, Bastarache L, Brown JS, et al. Use of electronic health records to support a public health response to the COVID-19 pandemic in the United States: a perspective from 15 academic medical centers. J Am Med Inform Assoc. 2020. doi:10.1093/jamia/ocaa287

12. Williamson EJ, Walker AJ, Bhaskaran K, et al. Factors associated with COVID-19-related death using OpenSAFELY. Nature. 2020;584:430–436. doi:10.1038/s41586-020-2521-4

13. Overhage JM, Ryan PB, Reich CG, Hartzema AG, Stang PE. Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc. 2012;19:54–60. doi:10.1136/amiajnl-2011-000376

14. Ryan PB, Madigan D, Stang PE, Overhage JM, Racoosin JA, Hartzema AG. Empirical assessment of methods for risk identification in healthcare data: results from the experiments of the Observational Medical Outcomes Partnership. Stat Med. 2012;31:4401–4415. doi:10.1002/sim.5620

15. Reisinger SJ, Ryan PB, O’Hara DJ, et al. Development and evaluation of a common data model enabling active drug safety surveillance using disparate healthcare databases. J Am Med Inform Assoc. 2010;17:652–662. doi:10.1136/jamia.2009.002477

16. Kahn MG, Callahan TJ, Barnard J, et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. EGEMS (Wash DC). 2016;4:1244. doi:10.13063/2327-9214.1244

17. Observational health data sciences, informatics. Chapter 15 data quality; 2021. Available from: https://ohdsi.github.io/TheBookOfOhdsi/DataQuality.html#data-quality-in-general. Accessed March 4, 2022.

18. Schuemie MJ, Cepeda MS, Suchard MA, et al. How confident are we about observational findings in healthcare: a benchmark study. Harv Data Sci Rev. 2020;2. doi:10.1162/99608f92.147cc28e

19. Kadri SS, Gundrum J, Warner S, et al. Uptake and accuracy of the diagnosis code for COVID-19 among US hospitalizations. JAMA. 2020;324:2553–2554. doi:10.1001/jama.2020.20323

20. HADES. Observational health data sciences and informatics; 2021. Available from: https://ohdsi.github.io/Hades/index.html. Accessed March 4, 2022.

21. Prats-Uribe A, Sena AG, Lai LYH, et al. Use of repurposed and adjuvant drugs in hospital patients with covid-19: multinational network cohort study. BMJ. 2021;373:n1038. doi:10.1136/bmj.n1038

22. Li X, Ostropolets A, Makadia R, et al. Characterising the background incidence rates of adverse events of special interest for covid-19 vaccines in eight countries: multinational network cohort study. BMJ. 2021;373:n1435. doi:10.1136/bmj.n1435

23. Haendel M, Chute C, Gersing K. The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment. J Am Med Inform Assoc. 2020. doi:10.1093/jamia/ocaa196

24. Weber GM, Hong C, Palmer NP, et al.; 4CE Collaborative. International comparisons of harmonized laboratory value trajectories to predict severe COVID-19: leveraging the 4CE collaborative across 342 hospitals and 6 countries: a retrospective cohort study. medRxiv. 2020. doi:10.1101/2020.12.16.20247684

25. Duarte-Salles T, Vizcaya D, Pistillo A, et al. Thirty-day outcomes of children and adolescents with COVID-19: an international experience. Pediatrics. 2021;148(3):e2020042929. doi:10.1542/peds.2020-042929

26. Lai LYH, Golozar A, Sena A, et al. Clinical characteristics, symptoms, management and health outcomes in 8598 pregnant women diagnosed with COVID-19 compared to 27,510 with seasonal influenza in France, Spain and the US: a network cohort analysis. medRxiv. 2020. doi:10.1101/2020.10.29.20222083

27. Roel E, Pistillo A, Recalde M, et al. Characteristics and outcomes of over 300,000 patients with COVID-19 and history of cancer in the United States and Spain. Cancer Epidemiol Biomarkers Prev. 2021;30(10):1884–1894.

28. Tan EH, Sena AG, Prats-Uribe A, et al. COVID-19 in patients with autoimmune diseases: characteristics and outcomes in a multinational network of cohorts across three countries. Rheumatology. 2021;60:SI37–SI50. doi:10.1093/rheumatology/keab250

Creative Commons License © 2022 The Author(s). This work is published and licensed by Dove Medical Press Limited. The full terms of this license are available at https://www.dovepress.com/terms.php and incorporate the Creative Commons Attribution - Non Commercial (unported, v3.0) License. By accessing the work you hereby accept the Terms. Non-commercial uses of the work are permitted without any further permission from Dove Medical Press Limited, provided the work is properly attributed. For permission for commercial use of this work, please see paragraphs 4.2 and 5 of our Terms.