Lexical Speech Features of Spontaneous Speech in Older Persons With and Without Cognitive Impairment: Reliability Analysis

Background: Speech analysis data are promising digital biomarkers for the early detection of Alzheimer disease. However, despite the importance of this question, very few studies in this area have examined whether older adults produce spontaneous speech with characteristics that are sufficiently consistent to be used as proxy markers of cognitive status.
Objective: This preliminary study seeks to investigate consistency across lexical characteristics of speech in older adults with and without cognitive impairment.
Methods: A total of 39 older adults from a larger, ongoing study (age: mean 81.1, SD 5.9 years) were included. Participants completed neuropsychological testing and both picture description tasks and expository tasks to elicit speech. Participants with T-scores of ≤40 on ≥2 cognitive tests were categorized as having mild cognitive impairment (MCI). Speech features were computed automatically by using Python and the Natural Language Toolkit.
Results: Reliability indices based on mean correlations for picture description tasks and expository tasks were similar in persons with and without MCI (with r ranging from 0.49 to 0.65 within tasks). Intraindividual variability was generally preserved across lexical speech features. Speech rate and filler rate were the most consistent indices for the cognitively intact group, and speech rate was the most consistent for the MCI group.
Conclusions: Our findings suggest that automatically calculated lexical properties of speech are consistent in older adults with varying levels of cognitive impairment. These findings encourage further investigation of the utility of speech analysis and other digital biomarkers for monitoring cognitive status over time.


Use of Digital Biomarkers as a Method for Cognitive Monitoring
Much like monitoring cardiac rhythm through smartwatches, the integration of smart technology into the daily lives of older adults creates new opportunities for the remote monitoring of cognitive function. Researchers have started to use digital biomarkers, which are defined as "objective, quantifiable, physiological, and behavioral data that are collected and measured by means of digital devices, such as embedded environmental sensors, portables, wearables, implantables, or digestibles," to help identify and track symptoms in persons with dementia [1].

Speech Analysis Data as Digital Biomarkers
A growing number of digital biomarkers have been examined in persons with Alzheimer disease and related dementias (ADRD), such as home-based motion sensors and systems that monitor driving performance. Spontaneous speech appears particularly promising, presumably because the declarative memory system that supports some aspects of language [2] changes dramatically in persons with ADRD. Technological advances now allow commonly observed language changes in persons with ADRD (eg, word-finding problems and empty speech) to be automatically computed from transcripts of spontaneous speech, and the resulting indices appear sensitive to early cognitive dysfunction. For example, lexical frequency, which quantifies an individual's ability to access more versus less common words, has been shown to predict current and future cognitive status [3,4]. Other studies suggest that indices from spontaneous speech may be even more sensitive to ADRD than traditional neuropsychological language tests of confrontation naming or semantic fluency [5].

Study Aims
Though such findings are encouraging, many practical questions remain regarding the feasibility of using spontaneous speech analysis to monitor cognitive function. A key concern is the limited investigation of the psychometric properties of speech features. Put simply, whether an individual's spontaneous speech is internally consistent enough to be used as a marker of cognitive function has yet to be determined. Many person- and environment-based factors are known to influence spontaneous speech production (including age, sex, task demands, nativeness, and proficiency, among others [6,7]), and the degree to which a short sample of spontaneous speech reflects an individual's general speech has not been previously examined. This study aims to provide a preliminary examination of the reliability of lexical features calculated from the spontaneous speech produced by older adults. That is, we were interested in determining how much variability or consistency was exhibited within and across these features. In effect, our analysis is analogous to examining the test-retest reliability of a traditional neuropsychological test. We hypothesized that speech features would be consistent both between multiple instances of a similar speech elicitation task and across different types of speech elicitation tasks in persons with and without mild cognitive impairment (MCI). In combination, these analyses provide critical insight into the appropriateness of using spontaneous speech indices to predict cognitive status in older adults.

Participants
Data from 39 participants (female: n=27; age: mean 81.1, SD 5.9; range 69-90 years) with complete data were extracted from a larger, ongoing project [3]. All participants' demographic and medical data were obtained through self-report, and no medical records or neuroimaging studies were available. For inclusion, participants were required to be English speakers and have no reported history of neurological conditions or severe psychiatric conditions. MCI status was determined by using criteria from past studies, namely, scoring ≥1 SD below the normative mean on 2 or more tasks within the same cognitive domain [8]. Following this criterion, 26% (10/39) of the participant sample were classified as having MCI; the remaining 29 participants were classified as cognitively intact. Table 1 presents summary statistics of the demographic and neuropsychological characteristics of the sample.

Ethical Considerations
This study was approved by the Kent State University Institutional Review Board (#20-300), and all procedures were completed in accordance with the ethical standards outlined in the Declaration of Helsinki. Upon entry into the study, all participants completed an informed consent process. Individuals demonstrating intact comprehension of study activities provided written consent, and those with cognitive dysfunction provided assent, with consent provided by a trusted other. Participants were assigned a randomly generated study identification number to protect confidentiality and privacy, and all materials were protected through multiple security measures. At the completion of the study assessment, participants were compensated with a gift card for their time.

Neuropsychological Test Battery
To promote generalizability, participants completed a collection of commonly used neuropsychological tests of global functioning (Modified Mini-Mental State Exam [9]), attention (Digit Span Longest String Forward and Backward [10] and Trail Making Test A [11]), executive function (Trail Making Test B [11] and Frontal Assessment Battery), language (Controlled Oral Word Association Test [12], Animal Naming Test [12], and Boston Naming Test-Short Form [13]), visuospatial skills (Complex Figure Test-Copy [14,15]), and memory (Hopkins Verbal Learning Test-Revised [16] and Complex Figure Test-Delayed Recall [14,15]). Raw test scores were converted to T-scores using normative data to facilitate comparison to past work.

Speech Tasks
Participants completed 3 picture description tasks and 2 expository tasks as part of the study protocol. Speech from these tasks was audio-recorded and then transcribed manually. Picture description tasks included the Cookie Theft task from the Boston Diagnostic Aphasia Exam [17], which depicts 2 children reaching into a cookie jar and a mother washing dishes. The other two pictures were drawn in a similar style, with one showing a man changing a lightbulb [18] and the other showing a kitten in a tree [19]. Expository tasks asked participants to describe an important person in their life (expository task 1) and a meaningful location or place (expository task 2). Importantly, the inclusion of multiple categories of speech prompts (picture description tasks vs expository tasks) allowed us to examine whether different speech features can be reliably elicited across different types of tasks (eg, providing semantic structure in the form of a picture versus requiring memory retrieval and content generation).
A total of 16 lexical and semantic features were calculated from the spontaneous speech generated in each task and were used in the analyses: word count, filler words, empty words, lexical frequency, the type-token ratio, the Honoré statistic, the Brunet index, speech rate, filler rate, definite articles, indefinite articles, pronouns, nouns, verbs, determiners, and content words. These features were chosen based on prior studies and clinical work showing that these properties of speech production are often affected in persons with dementia or MCI [3]. All features were calculated automatically from transcripts of the participants' speech, using Python (version 2.7.17) and the Natural Language Toolkit (version 3.2.1; Bird et al [20]). Table 2 shows the list of speech features and how they were defined; Table 3 shows the between-participant mean values for each linguistic feature that was computed from each speech sample. Part-of-speech features were computed using the Penn Treebank part of speech tags within the Python Natural Language Toolkit module (Bird et al [20]). Lexical frequency was the mean of the log of the frequency of all the words spoken by the participant. Speech rate was computed as words per second, counting all words, nonwords, and partial words the speaker produced divided by the total elapsed time of the speech. Filler rate was computed as words per second, counting all filler words (as defined in Table 2) divided by the total elapsed time of the speech.
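To make the vocabulary-richness features concrete, the following is a minimal sketch of how the type-token ratio, the Brunet index, and the Honoré statistic could be computed from a tokenized transcript. It is not the study's code: the function name `lexical_richness` is hypothetical, the sketch is written for Python 3 with only the standard library (the study used Python 2.7 and the Natural Language Toolkit), and it assumes the commonly cited definitions W = N^(V^-0.165) for the Brunet index and R = 100 log N / (1 - V1/V) for the Honoré statistic, which may differ in detail from the operationalization in Table 2.

```python
import math
from collections import Counter


def lexical_richness(tokens):
    """Return the type-token ratio, Brunet index, and Honoré statistic.

    Assumed definitions (not confirmed against the study's Table 2):
      TTR    = V / N
      Brunet = N ** (V ** -0.165)
      Honoré = 100 * ln(N) / (1 - V1 / V)
    where N = total tokens, V = distinct word types,
    and V1 = word types that occur exactly once.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(t.lower() for t in tokens)
    n = sum(counts.values())
    v = len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    ttr = v / n
    brunet = n ** (v ** -0.165)
    # The Honoré statistic is undefined when every type occurs once (V1 == V);
    # returning infinity here is a choice made for this sketch.
    honore = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")
    return {"ttr": ttr, "brunet": brunet, "honore": honore}


if __name__ == "__main__":
    sample = "the boy is on the stool and the girl is laughing".split()
    print(lexical_richness(sample))
```

Because all three measures depend on sample length, short transcripts of unequal length are not directly comparable on these indices, which is one reason their reliability is worth checking.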

Procedures
Participants completed all neuropsychological tests and speech elicitation tasks during a single study visit that lasted approximately 75 minutes. After providing written informed consent, participants were administered the neuropsychological test battery in a fixed order, under the supervision of a licensed clinical neuropsychologist. The aforementioned spontaneous speech tasks were then completed. The session concluded after participants were provided with a debriefing statement and compensated for their time.

Overview
As several of the speech features were measured on different scales (eg, lexical frequency was computed as number of words per million, parts of speech features were scaled by the total word count, the total number of words was a raw count, etc), the raw values for each speech feature were converted to z-scores to enable interfeature comparisons. The z-scoring of each participant's speech feature values was performed separately for each speech feature, by task (eg, picture description task 1, picture description task 2, expository task 1, etc) and cognitive status group (ie, MCI vs cognitively intact). The z-scored values for each speech feature were then used in the following analyses.
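The standardization step can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the record layout (dictionaries with `participant`, `group`, `task`, `feature`, and `value` keys) and the function name `zscore_by_cell` are hypothetical, and the sample SD (n-1 denominator) is assumed because the text does not say which SD was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscore_by_cell(records):
    """Z-score values separately within each (feature, task, group) cell.

    `records` is a list of dicts with the hypothetical keys
    participant, group, task, feature, and value.
    Returns new dicts with an added "z" key; input records are not modified.
    Each cell needs at least 2 values for the sample SD to be defined.
    """
    cells = defaultdict(list)
    for r in records:
        cells[(r["feature"], r["task"], r["group"])].append(r["value"])

    # Assumption: sample SD (n-1); the paper does not specify.
    stats = {key: (mean(vals), stdev(vals)) for key, vals in cells.items()}

    out = []
    for r in records:
        m, s = stats[(r["feature"], r["task"], r["group"])]
        # A cell with zero spread carries no information; use z = 0 there.
        z = (r["value"] - m) / s if s > 0 else 0.0
        out.append({**r, "z": z})
    return out


if __name__ == "__main__":
    demo = [
        {"participant": "p1", "group": "MCI", "task": "expo1",
         "feature": "word_count", "value": 10.0},
        {"participant": "p2", "group": "MCI", "task": "expo1",
         "feature": "word_count", "value": 20.0},
        {"participant": "p3", "group": "MCI", "task": "expo1",
         "feature": "word_count", "value": 30.0},
    ]
    for row in zscore_by_cell(demo):
        print(row["participant"], row["z"])
```

Standardizing within group as well as within task means that each z-score expresses a participant's standing relative to peers of the same cognitive status on that task, so group-level mean differences are removed before the reliability analyses.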

Intraindividual Variability Across Instances of the Same Speech Task
To assess the degree to which a given speech feature remained consistent for each participant across multiple instances of the same speech elicitation task, pairwise Pearson r correlations were computed between each feature and itself within each task type. Afterward, to examine the influence of cognitive dysfunction on these indices, correlations were computed separately for participants with MCI and cognitively intact participants. For example, a paired correlation was computed, for all participants in the MCI group, between the z-scored word count values for expository task 1 and the z-scored word count values for expository task 2. For the picture description tasks, the correlations were averaged over the three pairwise correlations of picture description tasks (task 1-task 2, task 1-task 3, and task 2-task 3). All averaging of correlation values was performed after the Fisher z transformation of the Pearson r correlation coefficients [21]. After averaging was completed, Fisher z values were back-transformed to Pearson r values for reporting.
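The averaging procedure described above can be illustrated with a short sketch. The function names (`pearson_r`, `mean_correlation`, `mean_pairwise_correlation`) are hypothetical, and the small clipping guard for correlations of exactly ±1 is an addition made for this sketch, not something reported in the study.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_correlation(rs):
    """Average correlations in Fisher z space, then back-transform to r."""
    eps = 1e-12  # guard (assumption): atanh is undefined at exactly +/-1
    zs = [math.atanh(max(min(r, 1 - eps), -1 + eps)) for r in rs]
    return math.tanh(sum(zs) / len(zs))


def mean_pairwise_correlation(task_values):
    """Mean correlation over all task pairs for one feature in one group.

    `task_values` holds one list per task; position i in every list
    belongs to the same participant.
    """
    rs = [pearson_r(a, b) for a, b in combinations(task_values, 2)]
    return mean_correlation(rs)


if __name__ == "__main__":
    # Three picture description tasks, four participants (made-up z-scores).
    tasks = [[1, 2, 3, 4], [2, 1, 4, 3], [1, 3, 2, 4]]
    print(mean_pairwise_correlation(tasks))
```

Averaging in Fisher z space rather than averaging r directly reduces the bias that arises because the sampling distribution of r is skewed when correlations are far from 0.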
To determine whether these mean correlations were significantly larger than what would be expected for any two given measurements of the same linguistic feature, we used resampling methods. Null distributions of correlations were created for each task type by randomly pairing each participant's speech feature values with values for the same speech features from a different, randomly selected participant within the same group (MCI or cognitively intact group). These correlations show how much a participant's value for one feature correlates with a different person's value for the same feature and thus can be used as a baseline for the expected size of within-feature correlations, if there is no additional effect from within-participant reliability. This resampling procedure was repeated 10,000 times for each of the four null distributions (2 task types × 2 groups), which were then used as the distributions against which the true correlation values were compared to compute their P values.
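A minimal sketch of this resampling logic for a single feature and a single pair of tasks is given below. Several details are assumptions, because the text does not specify them: partners are drawn with replacement from the other members of the group, draws that happen to produce a constant partner vector are redrawn, and the P value uses a one-sided comparison with a +1 correction. The function names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def null_correlations(task_a, task_b, n_resamples=10000, seed=0):
    """Null distribution from pairing each participant with someone else.

    Position i in task_a and task_b belongs to the same participant.
    For each resample, participant i's task_a value is paired with the
    task_b value of a different, randomly chosen participant.
    """
    if len(set(task_a)) < 2 or len(set(task_b)) < 2:
        raise ValueError("both tasks need at least two distinct values")
    rng = random.Random(seed)
    n = len(task_a)
    nulls = []
    while len(nulls) < n_resamples:
        partners = [rng.choice([j for j in range(n) if j != i])
                    for i in range(n)]
        paired_b = [task_b[j] for j in partners]
        if len(set(paired_b)) < 2:
            continue  # correlation undefined for a constant vector; redraw
        nulls.append(pearson_r(task_a, paired_b))
    return nulls


def p_value(observed, nulls):
    """One-sided P value: share of null correlations >= observed.

    The +1 correction (an assumption) keeps the estimate above zero.
    """
    return (1 + sum(r >= observed for r in nulls)) / (1 + len(nulls))


if __name__ == "__main__":
    a = [0.1 * i for i in range(10)]
    b = [v + 0.05 * (-1) ** i for i, v in enumerate(a)]
    nulls = null_correlations(a, b, n_resamples=1000, seed=1)
    print(pearson_r(a, b), p_value(pearson_r(a, b), nulls))
```

With 10,000 resamples, the smallest P value this estimator can return is about 0.0001, which is compatible with reporting values as P<.001.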

Intraindividual Variability Across Multiple Speech Tasks
Intraindividual variability was calculated for each speech feature by computing the SD of a participant's z-scores for a given speech feature across all 5 tasks (eg, the SD of a participant's z-transformed word count values across expository task 1, expository task 2, picture description task 1, picture description task 2, and picture description task 3). Weighted averages of the variance of these SDs were then computed as an index of intraindividual variability. These SD values were then averaged over participants for each of the 16 speech features, as shown in the following formula (larger values reflected greater intraindividual variability):
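One form consistent with this description is sketched here. The notation is introduced for illustration only: z_{p,f,t} is the z-score of participant p on feature f in task t, T=5 is the number of tasks, and N is the number of participants in the group. The sample SD (denominator T-1) and the unweighted mean over participants are assumptions; the weighting mentioned above is not specified in the text and is not reproduced.

```latex
\bar{z}_{p,f} = \frac{1}{T}\sum_{t=1}^{T} z_{p,f,t},
\qquad
s_{p,f} = \sqrt{\frac{1}{T-1}\sum_{t=1}^{T}\left(z_{p,f,t}-\bar{z}_{p,f}\right)^{2}},
\qquad
\mathrm{IIV}_{f} = \frac{1}{N}\sum_{p=1}^{N} s_{p,f}
```

Under this form, a participant whose standing relative to peers is identical on all 5 tasks contributes s_{p,f}=0, so IIV_f approaches 0 for a perfectly stable feature.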

Intraindividual Variability Across Instances of the Same Speech Task
In the picture description tasks, the mean within-participant correlation between the 16 speech features and themselves across the three possible pairwise comparisons (task 1-task 2, task 1-task 3, and task 2-task 3) was high (MCI group r: mean 0.6555, SD 0.2867; cognitively intact group r: mean 0.6440, SD 0.2997). The strength of the correlation was not statistically different between the two cognitive status groups (t₃₀=0.4351; P=.66; 95% CI −0.17 to 0.26).
In the expository tasks, the mean within-participant correlation between the speech features and themselves was similarly high for the MCI group (r: mean 0.6101, SD 0.3679) but lower for the cognitively intact group (r: mean 0.4971, SD 0.3586), although this between-group difference did not reach statistical significance (t₃₀=1.363; P=.18; 95% CI −0.09 to 0.45).
We then examined whether these correlations were significantly different from what might be expected between any two given linguistic measures, using the resampling procedure described in the Methods section. The average correlation for each of the null distributions was extremely close to 0 (MCI group picture description task: r=0.0022; cognitively intact group picture description task: r=−0.0002; MCI group expository task: r=0.0004; cognitively intact group expository task: r=0.0002), and all 4 true within-participant correlations were significantly larger than what was expected by chance based on these null distributions (all P values were <.001).
Notably, mean correlations varied substantially across different speech features (Table 4). Some speech features showed consistently strong correlations, suggesting high reliability (such as speech rate, Brunet index, and number and rate of filler words), while others showed lower reliability (such as empty words, definite and indefinite articles, determiners, and pronouns).

Intraindividual Variability Across Multiple Speech Tasks
The amount of variability in each speech feature for each participant additionally varied as a function of speech feature and group (Table 4). The lowest amount of intraindividual variability was exhibited by speech rate and filler rate for the cognitively intact group and by speech rate for the MCI group. The largest amount of intraindividual variability differed somewhat between the MCI and cognitively intact groups; for example, definite and indefinite articles showed high between-participant variability for both groups, whereas empty words showed numerically higher variability for the cognitively intact group and pronouns showed numerically higher variability for the MCI group.

Discussion
Some evidence suggests that there is greater variability in performance on traditional cognitive screening measures (eg, Mini-Mental State Exam, Clock Drawing Test, etc) among persons with MCI [22]. Although such variability itself can be a useful marker of MCI [23], variability can also make results harder to replicate and lower statistical power. Given that spontaneous speech (1) is affected in MCI and (2) may be useful for distinguishing healthy controls from individuals with MCI and ADRD [3,4,24,25], it was therefore important to establish the degree of variability (or stability) of spontaneous speech in individuals with and without MCI. The results from this preliminary study demonstrate that spontaneous speech is generally consistent in both individuals with MCI and cognitively intact older adults, as individuals maintained their lexical-semantic characteristics of speech across multiple tasks. Such findings provide initial evidence that properties of an individual's spontaneous speech are sufficiently "reliable" to be viewed as trait-like features and encourage continued investigation into the validity of speech analysis data as digital biomarkers of cognitive status.
Given the importance of the early detection of cognitive decline, future studies may be enhanced by examining the potential value in using a combination of indices from spontaneous speech to predict cognitive status, not just lexical-semantic features. For example, acoustic-phonetic aspects of speech, such as prosodic measures, pause duration, or loudness, are also impacted by ADRD and can distinguish healthy groups from clinical groups [26,27]. Changes in the syntax and coherence of speech are found in persons with advanced ADRD and can be reliably detected [28,29]. There is also evidence that subtle changes in extrapyramidal function predict incipient MCI and Alzheimer disease [30], and recent technological advances can automatically quantify these changes in short video clips of an individual, suggesting the possibility of extending this work into measuring behavior in video calls or videoconferencing (eg, FaceTime and Zoom) or via mobile apps [31]. It is possible that a combination of multiple speech features and video analysis may prove more sensitive to early cognitive decline than a single category of linguistic features; thus, further work in this area is needed. More research should also be directed at determining the reliability of such features in other neurological disorders for which some aspects of language have been shown to be associated with decline, such as Parkinson disease [32].
Despite encouraging findings, this study is limited in several important ways. The sample size was modest, the analysis was cross-sectional in nature, and we only assessed speech and cognitive function during a single testing session. Although several findings were statistically significant despite the modest sample size, the nonsignificant group difference in intraindividual variability across instances of the same speech task type (expository tasks; P=.18) may have been underpowered due to the small sample. Therefore, future research on the consistency of speech tasks for assessing MCI should ensure sufficient power. Furthermore, prospective studies with larger and more diverse samples are needed to clarify the feasibility of using automated speech analysis (Soroski et al [33] used such analyses in research settings and for at-home monitoring of cognitive function), though several studies on automatic speech analysis have shown such analyses to be promising [5,34,35]. Such findings will provide key insight into the stability of spontaneous speech over longer intervals (eg, weeks to months). It is also possible that the prospective monitoring of speech changes may help to overcome some of the limitations (ie, higher rates of misclassification of cognitive status) found in existing cognitive screening instruments for diverse populations [36,37] and facilitate early identification. This study is also limited in that the effects of depression could not be explored. Future studies should examine the possible contributions of depression and anxiety to spontaneous speech in older adults, given that mental health conditions are common in older adults [38] and that depression may also alter speech content [39] and vocal features [40]. Finally, an important limitation of this study is that participants' cognitive status (MCI and cognitively intact), as well as other potentially relevant medical conditions (eg, depression), was based on a self-report of their history of diagnosed neurological conditions. Detailed information regarding specific etiology was not available or objectively assessed, limiting the strength of our conclusions (including the possibility that MCI was not due to Alzheimer disease). Future studies on the reliability of speech as a marker of MCI should incorporate more comprehensive neurological evaluations (eg, neuroimaging and other biomarkers) to ensure that the assessment of speech reliability is valid.
In summary, our findings suggest that lexical-semantic aspects of spontaneous speech are similarly reliable in older adults with and without MCI. This finding is an essential first step toward the widespread use of speech biomarkers as a low-burden method for cognitive monitoring and the facilitation of the early detection of neurodegeneration in persons at risk for ADRD.

Table 1. Demographic characteristics and neuropsychological test performance of the study sample.

Neuropsychological test performance (footnote c), mean (SD)
The participants were African American, Asian, or Hispanic or Latino.
c With the exception of the Mini-Mental State Exam, of which the results are presented here as raw scores, all neuropsychological test scores were transformed to T-scores based on normative data.
d HVLT: Hopkins Verbal Learning Test.

Table 2. Operationalization of the speech features computed for each spontaneous speech task.
Fillers: Number of filler words (eg, um, uh, and hmm) spoken by the participant; scaled by total word count
Empty words: Number of empty words (eg, thing, place, and stuff); scaled by total word count

Table 3. Mean values for the computed speech features across the five speech tasks for the full sample.

Table 4. Reliability values for the speech features.
a This section reports the mean within-participant correlations between each speech feature and itself for each task type and group. All averaged correlations were converted to Fisher z
b MCI: mild cognitive impairment.
c The MCI group includes persons diagnosed with MCI.
d The cognitively intact group includes persons diagnosed as not having MCI.
e This section reports the SDs of z-scored values for each speech feature computed over all 5 tasks, which were averaged across participants within each group. Larger values reflect more intraindividual variability.