Introduction

The COVID-19 pandemic has caused enormous health, social and economic burdens. SARS-CoV-2, the virus causing the disease, affects multiple body structures and organs. It enters host cells mainly through binding to ACE2, which has been established as a receptor for SARS-CoV-2 as well as for SARS-CoV-1. ACE2 is expressed in multiple tissues, with the highest expression levels reported in the small intestine and the lowest in blood vessels and muscle1. The respiratory tract is an important site of SARS-CoV-2 infection and disease morbidity, which may be explained by the high expression of ACE2 in human epithelium2. Moreover, ACE2 is expressed in the oral mucosa3, and SARS-CoV-2 infection can also cause loss of smell and taste. Human voice generation is a coordinated function of multiple body structures, including the lungs, vocal folds and laryngeal muscles. About a quarter of patients with mild to moderate COVID-19 have been found to have dysphonia, and interestingly, the expression of ACE2 has also been demonstrated in the vocal folds4. Whether a subclinical voice abnormality persists after recovery from SARS-CoV-2 infection is currently unknown. Voice signal analysis is therefore an emerging noninvasive biomarker approach for COVID-19 infection.

Recently, deep learning has achieved breakthroughs in image classification accuracy, due mainly to convolutional neural networks (CNNs). Beyond image classification, CNNs have been widely applied to non-image datasets converted into 2D or 3D representations. For voice classification, successful implementations include singing voice classification5, acoustic scene classification6, and audio event classification7. In the present study, we hypothesized that subtle voice changes could occur after COVID-19 infection. We investigated the presence of subclinical voice feature alterations in COVID-19 patients once the disease had resolved, and the ability of artificial intelligence using a CNN to classify patients based on a past history of COVID-19.

Materials and methods

Study sample

This was a prospective study of 76 post COVID-19 patients seen at the outpatient clinic at Chakri Naruebodindra Medical Institute (CNMI) between May and June 2020. The study was approved by the Faculty of Medicine Ramathibodi Hospital Institutional Review Board. All methods were performed in accordance with the relevant guidelines and regulations. All participants gave written informed consent before participating in the study. All post COVID-19 patients were more than 8 weeks past the onset of symptoms at the time of the study. The exclusion criteria were pregnancy, breastfeeding, uncontrolled hypertension (systolic blood pressure > 160 mmHg or diastolic blood pressure > 100 mmHg), acute myocardial infarction or stroke in the past 6 months, history of substance abuse, neurological disorders, current mental health difficulties, active smoking or having stopped smoking within the past 6 months, alcohol consumption of more than 7 units per week, and a history of speech and/or voice disorder such as apraxia of speech, functional articulation disorder, dysarthria, cleft lip/palate, tongue or teeth abnormality, oral occlusion, laryngeal abnormality, or neurological voice disorders. For controls, 40 healthy individuals with no underlying disease were recruited from back-office staff working at CNMI.

Voice recording

Patients who met the screening criteria were interviewed using a predefined questionnaire to collect demographic data and determine the duration of the disease. Three voice recordings were collected from each participant using a plug-in microphone on a mobile phone: a sustained ‘ah’ sound for 5 s, a Thai polysyllabic sentence selected by a voice specialist for vocal apparatus analysis, and a cough sound. The voice recordings were mono-channel, sampled at 44,100 Hz, with a maximum duration of 30 s. Both the training and testing sets were binary labeled.

Audio preprocessing and train-test split of the dataset

Each voice sample was divided into 100 ms subsamples, and a log-mel spectrogram was computed for each subsample using the Python Librosa package. The dimension of each subsample array was 128 × 32. Each 2D array was then made suitable for downstream learning by adding a channel dimension containing a copy of the original 2D array. Eighty percent of the total voice records were used as the training set and the remainder as the testing set.
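As an illustration of this preprocessing, a minimal sketch using Librosa is given below. The 100 ms subsample length, the 44,100 Hz sampling rate and the 128 × 32 output dimension follow the description above; the hop length (and the use of Librosa's default FFT size) are assumptions chosen so that each subsample yields 32 time frames.

```python
import numpy as np
import librosa

SR = 44_100          # recording sampling rate reported in the study
SUBSAMPLE_SEC = 0.1  # 100 ms subsamples
N_MELS = 128         # mel bands (first spectrogram dimension)
N_FRAMES = 32        # time frames (second spectrogram dimension)

def log_mel_subsamples(path, sr=SR):
    """Return an array of shape (n_subsamples, 128, 32, 1) of log-mel
    spectrograms, one per 100 ms chunk of the recording."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    chunk = int(SUBSAMPLE_SEC * sr)          # 4410 samples per subsample
    hop = chunk // N_FRAMES                  # hop length giving ~32 frames (assumed)
    specs = []
    for start in range(0, len(y) - chunk + 1, chunk):
        seg = y[start:start + chunk]
        mel = librosa.feature.melspectrogram(y=seg, sr=sr,
                                             n_mels=N_MELS, hop_length=hop)
        log_mel = librosa.power_to_db(mel)[:, :N_FRAMES]   # (128, 32)
        specs.append(log_mel[..., np.newaxis])              # add channel dim -> (128, 32, 1)
    return np.stack(specs) if specs else np.empty((0, N_MELS, N_FRAMES, 1))
```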

Neural network architecture, training and cross validation

Building and training of the neural network were performed on TensorFlow version 2 (Google, Mountain View, California, USA). We used the pre-trained VGG19 neural network for transfer learning and model training. VGG19 is a widely used CNN, particularly for image classification and computer vision problems, owing to its deep structure and good performance. For transfer learning and retraining of the VGG19 CNN, the output layer of VGG19 was dropped and two dense layers of 64 and 32 fully connected units, each with batch normalization, were added. A new output layer with one output unit and a sigmoid activation was added. A 2D CNN layer was prepended to the input of the pre-trained VGG19. The input layer of the full transfer learning model was 128 × 32 × 1 in dimension. All layers of the modified VGG19 were frozen except for the last five layers, to make the pre-trained CNN more suitable for the new voice dataset. Three-fold cross validation was used to assess the performance of the trained neural network; each fold comprised 78 training samples and 38 testing samples. We used a binary cross entropy loss function as our study was a binary classification problem. Adam optimization was used for gradient descent with a learning rate of 0.01. Parameters used during training were a batch size of 32, a maximum of 600 training epochs, 20% of the training samples set aside randomly for validation, and the area under the curve on the validation set as the monitored metric.
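A minimal TensorFlow/Keras sketch of the modified VGG19 described above is shown below. The kernel size of the prepended convolutional layer, the ReLU activations after the dense layers, and the use of ImageNet weights are assumptions that are not stated in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(128, 32, 1)):
    """Sketch of the modified VGG19 transfer-learning model."""
    base = tf.keras.applications.VGG19(include_top=False,      # drop the VGG19 output layers
                                       weights="imagenet",      # assumed pre-training weights
                                       input_shape=(128, 32, 3))
    for layer in base.layers[:-5]:                              # only the last five layers trainable
        layer.trainable = False

    inputs = layers.Input(shape=input_shape)                    # 128 x 32 x 1 log-mel input
    x = layers.Conv2D(3, kernel_size=3, padding="same")(inputs) # prepended 2D CNN layer
    x = base(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64)(x)                                     # first added dense layer
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)                            # activation is an assumption
    x = layers.Dense(32)(x)                                     # second added dense layer
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)          # new binary output layer

    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model
```

With the reported hyperparameters, training would then be invoked with model.fit(X_train, y_train, batch_size=32, epochs=600, validation_split=0.2) while monitoring the validation area under the curve.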

Shannon entropy calculation

Shannon entropy of each voice type in all subjects was calculated using the Python AntroPy package.
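The exact AntroPy routine is not specified in the text; as one plausible interpretation, the sketch below computes the spectral entropy, i.e. the Shannon entropy of the normalized power spectrum of a recording.

```python
import librosa
import antropy as ant

def voice_entropy(path, sr=44_100):
    """Hedged sketch: spectral entropy (Shannon entropy of the normalized
    power spectrum) computed with AntroPy; the exact routine used in the
    study is not specified."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    return ant.spectral_entropy(y, sf=sr, method="welch", normalize=True)
```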

Statistical analyses

Data were expressed as mean ± SD unless specified otherwise. Multiple logistic regression models were used for assessing potential associated factors. A p value less than 0.05 was considered statistically significant. All analyses were performed using Stata Statistical Software, Release 12 (StataCorp, College Station, TX, USA).

Results

Clinical characteristics of study participants are shown in Table 1. In this sample, patients with COVID-19 were older and had higher BMI than controls. The proportion of males to females was higher in the COVID-19 group than in the control group. Logistic regression analyses with three-fold cross-validation were used to assess the diagnostic values of clinical characteristics to predict recent COVID-19. A model with clinical characteristics including age, sex, and BMI could predict recent COVID-19 with diagnostic values shown in Table 2.

Table 1 Clinical characteristics of participants with past COVID-19 and controls (mean ± SE).
Table 2 Diagnostic values of clinical characteristics in predicting recent COVID-19.

Examples of the mel-spectrograms of the 3 voice types from a study subject are shown in Fig. 1. Table 3 shows the classification performance of the CNNs using the various voice types. All models were reasonably successful in distinguishing patients with previous COVID-19 from controls. The model using the polysyllabic sentence yielded the highest classification performance (Table 3A–C). The cough sound produced the lowest classification performance, while the ability of the monosyllabic ‘ah’ sound to predict recent COVID-19 was intermediate between the other two vocalization types.

Figure 1. Mel-spectrogram of the 3 voice types from a study subject.

Table 3 Diagnostic performance of convolutional neural networks (CNN) using (A) the polysyllabic sentence ‘Hing-Hoy-Hor-Bin-Ha-Dao-Hang’, (B) the ‘ah’ sound, and (C) the cough sound to classify recent COVID-19.

We further investigated whether the information content of the voices, as measured by Shannon entropy, might in part be responsible for the better performance of the polysyllabic voice. The boxplot of the Shannon entropy of each voice type from all subjects is shown in Fig. 2. The entropy of the polysyllabic voice and that of the ‘ah’ voice were significantly higher than that of the cough voice. The entropy of the polysyllabic voice was significantly lower than that of the ‘ah’ voice, even though it showed better classification performance than the ‘ah’ voice.

Figure 2. Shannon entropy of the 3 voice types.

As the clinical characteristics of participants with and without recent COVID-19 were not well matched, we further used multivariate logistic regression analyses to investigate whether voice could predict recent COVID-19 independently of age, gender and BMI. Clinical characteristics and the values extracted from the CNN for each fold are shown in Table 4. In most of the datasets in the three-fold cross validation, voice characteristics of the polysyllabic sentence as extracted by the CNN were significantly associated with recent COVID-19 independently of age, gender and BMI, as shown in Table 5.

Table 4 Clinical characteristics and features extracted from CNN of each fold.
Table 5 Association of the convolutional neural network’s (CNN) extracted features for various voice types and recent COVID-19 after controlling for age, BMI and sex.
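The regression analyses themselves were performed in Stata (see ‘Statistical analyses’); purely as an illustration, the adjusted model can be sketched in Python with statsmodels, using a synthetic data frame with hypothetical column names (covid, cnn_score, age, bmi, sex).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical synthetic data standing in for the per-participant table:
# the CNN-extracted voice score plus the clinical covariates used for adjustment.
rng = np.random.default_rng(0)
n = 116                                    # 76 post COVID-19 patients + 40 controls
covid = np.r_[np.ones(76), np.zeros(40)]
df = pd.DataFrame({
    "covid": covid,
    "cnn_score": 0.3 * covid + rng.normal(0.3, 0.2, n),  # noisy, overlapping scores
    "age": rng.normal(40, 10, n).round(),
    "bmi": rng.normal(24, 3, n).round(1),
    "sex": rng.choice(["M", "F"], n),
})

# Multivariate logistic regression: CNN voice score adjusted for age, BMI and sex.
model = smf.logit("covid ~ cnn_score + age + bmi + C(sex)", data=df).fit()
print(model.summary())
```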

Discussion

In the present study, we demonstrated that voice features represented by the mel-spectrogram could distinguish patients with recent COVID-19 from controls, particularly with polysyllabic sentences. The results suggest that SARS-CoV-2 may affect tissues involved in voice production well beyond the resolution of the disease. Some unique characteristics of COVID-19, such as loss of smell and taste8, have been described; however, to our knowledge, alterations in voice have been less frequently reported. It is also important to point out that such alteration is subclinical, not obvious to either the patients or healthcare providers. For the loss of smell and taste, early resolution was reported in most patients, but the abnormality can persist in some patients for up to 4 weeks after the onset of symptoms9. Our study showed that subtle changes in voice could be present even 60 days after discharge from hospital. Recently, it has become increasingly recognized that some symptoms of COVID-19 can persist well beyond recovery. Long COVID is characterized by symptoms of fatigue, headache, dyspnea and anosmia; it is more likely with increasing age, higher body mass index and female sex10, and is thought to occur in approximately 10% of infected people11,12. However, how soon and for how long the voice alteration can be detected is currently unknown. Further studies are warranted, particularly to evaluate the presence of voice changes early in the course of the disease, which, if present and specific, could be developed into a screening modality for long COVID.

Our results are in keeping with previous studies reporting perturbation of voice as a manifestation of COVID-19, occurring in up to a quarter of patients with mild to moderate disease13. Several factors could potentially be responsible for the voice changes in COVID-19. Besides being a consequence of inflammation of voice-producing organs, SARS-CoV-2 can potentially affect the vocal cords directly. Enriched expression of ACE2, the receptor for SARS-CoV-2, has been demonstrated in the head and neck regions, particularly in the sinuses, salivary glands, oral cavity epithelial cells and vocal cords14. Moreover, post-viral vagal neuropathy could occur, which can present as persistent shortness of breath despite a normal chest radiograph15.

Current artificial intelligence models can achieve diagnostic performance comparable to that of medical experts in various domains16,17,18. In the present study, we demonstrated that voice features such as the mel-spectrogram can be represented as images and used as inputs for a CNN. For image classification, a number of feature visualizations have been explored to better understand how a CNN sees features in images19. These learned features are usually hard to identify and interpret from a human vision perspective, leading to a lack of understanding of the CNN’s internal working mechanism. Similarly, the features in the mel-spectrogram which distinguish individuals with past COVID-19 from controls in the present study are unclear. This ‘black box’ nature of deep neural networks is one of their shortcomings, and a deep understanding of the features contributing to classification performance is difficult to attain.

There have been many attempts to use voice as a biomarker for diseases including Parkinson’s disease20, heart failure21, and diabetes mellitus22. Currently there is no consensus on which kinds of speech or voice are more suitable for use as voice markers. For example, voice biomarkers for diabetes vary in the literature and include matched fragments of speech23, free speech24 or vowel sounds25. The relative accuracy of using different kinds of human voices for such purposes is currently unclear. However, we demonstrated in the present study that speech utterances of a complex sentence are more accurate for the prediction of previous COVID-19 infection than simple vowels or a cough sound. The underlying basis for this difference is not clear, but it may be related to the higher variation in voice features of more complex sounds, which renders them more effective when used for classification by machine learning methods. To explore this notion, we further analyzed the voice types according to their Shannon entropy. Originating from information theory, Shannon entropy is a measure of the information content of the variable under study26,27. In feature selection methodology for machine learning, almost all information-theoretic approaches are based on Shannon entropy28. Both the polysyllabic and the ‘ah’ sounds in the present study had higher Shannon entropy than the cough sound, which corresponded with their apparently better performance than the cough sound. Moreover, as participants were instructed to produce sustained vowels with continuous phonation over a certain time, discontinuities in the pulmonic airstream of COVID-19-infected participants may lead to sporadic, unintended interruptions of phonation when producing the polysyllabic and the ‘ah’ sounds, as compared with the cough sound29. Interestingly, as far as we know, most of the studies using voice to classify the presence of COVID-19 have utilized cough sounds as the study features30,31,32. It is therefore worthwhile to further explore speech and other voice types which may have higher information content and better classification performance than cough sounds per se. Moreover, it is of note that, regardless of their different accuracies, all 3 voice types produced higher sensitivity than specificity. This would suggest that the practical use of voice to classify past COVID-19 is more appropriate for screening purposes, and that caution should be exercised with negative results as false negative rates could be relatively high.
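For reference, the Shannon entropy of a discrete random variable X with outcome probabilities p(x_i) is the standard information-theoretic quantity

$$H(X) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i),$$

so that higher entropy corresponds to greater information content, consistent with the interpretation above.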

There are some limitations to the present study. First, the sample size was relatively small; however, we used transfer learning with a pre-trained model to mitigate this limitation. Second, baseline characteristics were not well matched across the two participant groups; however, after controlling for the unmatched clinical parameters, the polysyllabic sentence used in this study still effectively distinguished patients with recent COVID-19 from controls. Third, a number of neural network architectures have been suggested for audio classification33,34; however, only the VGG19 CNN was explored in this study. Future studies with a larger sample size, better-matched baseline characteristics between cases and controls, and varied neural network architectures are warranted.

Conclusion

Deep learning is able to detect the subtle change in voice features of COVID-19 patients after recent resolution of the disease.