This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Aging, is properly cited. The complete bibliographic information, a link to the original publication on https://aging.jmir.org, as well as this copyright and license information must be included.
Identifying caregiver availability, particularly for patients with dementia or those with a disability, is critical to informing the appropriate care planning by the health systems, hospitals, and providers. This information is not readily available, and there is a paucity of pragmatic approaches to automatically identifying caregiver availability and type.
Our main objective was to use medical notes to assess caregiver availability and type for hospitalized patients with dementia. Our second objective was to identify whether the patient lived at home or resided at an institution.
In this retrospective cohort study, we used 2016-2019 telephone-encounter medical notes from a single institution to develop a rule-based natural language processing (NLP) algorithm to identify the patient’s caregiver availability and place of residence. Using note-level data, we compared the results of the NLP algorithm with human-conducted chart abstraction for both training (749/976, 77%) and test sets (227/976, 23%) for a total of 223 adults aged 65 years and older diagnosed with dementia. Our outcomes included determining whether the patients (1) reside at home or in an institution, (2) have a formal caregiver, and (3) have an informal caregiver.
Test set results indicated that our NLP algorithm had high level of accuracy and reliability for identifying whether patients had an informal caregiver (
This innovative work was the first to use medical notes to pragmatically determine caregiver availability. Our NLP algorithm identified whether hospitalized patients with dementia have a formal or informal caregiver and, to a lesser extent, whether they lived at home or in an institutional setting. There is merit in using NLP to identify caregivers. This study serves as a proof of concept. Future work can use other approaches and further identify caregivers and the extent of their availability.
Clinical practice creates a large amount of structured and unstructured data [
Systematic collection of caregiver information in EMR is a challenging task [
Approximately 6 million older adults in the United States live with dementia. This number is expected to double by 2050 [
In this work, we aimed to provide a proof of concept that NLP can reliably identify caregiver availability and type via medical notes. we examined the following three outcomes: (1) whether a patient lives at home vs in an institution, (2) whether a patient has a formal caregiver (paid), and (3) whether a patient has an informal caregiver (eg, a family member). We hypothesized that using NLP, we would be able to reliably determine each of the above outcomes.
To examine caregiver availability and type of caregivers, if any, for patients diagnosed with dementia, we used medical notes from Michigan Medicine (MM)a large academic medical center in Southeast Michigan. Our initial patient cohort was identified using the International Classification of Disease, 10th revision codes (Table S1 in
There are 60 different types of medical notes in MM. We randomly explored 10 notes from each category to identify the most promising type of notes for this study. Moreover, we sought expert advice from a geriatric MM nurse to identify the most promising medical notes for information about caregivers. Both approaches led us to use
Schematic flow diagram (source: 2016-2019 Michigan Medicine Electronic Medical Records). ED: emergency department.
To accomplish high interrater reliability, 2 team members, a nurse experienced in reading and writing medical notes at MM and a social scientist with no medical background, independently annotated all notes. Discrepancies in annotation were resolved with all team members’ participation. Our research questions included the following:
Does the patient reside at home or in an institution?
Does the patient have a formal (hired) caregiver?
Does the patient have an informal (a family member or a friend) caregiver?
The above research questions were chosen because the place of residence and caregiver availability are interconnected. Furthermore, our caregiver features were not mutually exclusive because it is plausible that the patient has both an informal and formal caregiver simultaneously [
First, we preprocessed the data based on patterns we saw in the training set and then used 2 lexicons to construct a rule-based approach to characterizing each note. We measured the model’s performance in training and test sets, separately, using the
Through our annotation process, we discovered multiple terms that frequently led to false positives. For example, “family medicine” raised a false positive for “family,” and “patient portal” raised a false positive for “patient.” Additionally, some notes contain template sections and subheading phrases such as “Family History,” which would list multiple familial relations that, in this context, would not be caregivers. These words and phrases were removed from medical notes before applying our algorithm (items 3 and 4 in the “Description of Rule-Based Algorithm” below).
Our 2 lexicons were dictionaries of terms used to identify (1) place of residence and (2) type of caregiver, if any. Specific terms (eg, “home,” “atria ann arbor,” and “linden square”) were used to determine the current place of residence. To determine if a patient resides in an institution or at home, we used a list of nursing homes and care facilities in Washtenaw County (obtained from the University of Michigan) and a list of skilled nursing facilities in the state of Michigan (obtained from the Centers for Medicare and Medicaid Services website) [
Our data dictionary can be found in alphabetical order in Table S2 in
Multiple rules were implemented to account for a more complex logic in determining the place of residence and caregiver presence or type. Throughout the algorithm, a 4-word window, rather than a fewer- or more-word window, was used because the 4-word window achieved the best accuracy in the training set.
A lexicon was created that included verbs such as “agreed,” “asked,” or “reported,” suggesting that a patient resides at home if these verbs appeared within a 4-word window of the following terms: “pt,” “patient,” or “patient’s” (Table S3 in
In many cases, an institution’s name appears in the note without any relation to patient’s current place of residence or status of caregiver support. Examples include potential plans when a patient was discharged from the hospital or in discussion with family members for referral to an institution or if a patient was in an institution for subacute rehabilitation, which would not be considered a place of residence or a caregiver. We created a lexicon for institution negation words and searched for them within a 4-word window of each institution’s name. If the specified negation words were found, then the institution was disregarded for that note.
If no terms were found, it was determined that the note had no information available to predict the place of residence or caregiver presence, and all fields were set to zero.
The 13 selection criteria, provided in the annotation guidelines and coded in order in the algorithm, are described below:
Replace “PT” (uppercase) with “physical therapy” in the original medical notes to avoid erroneous pickups of “pt” as an abbreviation of “patient.”
Convert the original text in notes to lowercase letters.
Remove the following patterns in the lowercase notes to avoid false positives: “nurse navigator,” “navigator nurse,” “patient portal,” “patient name,” “relationship to patient” followed by blank spaces with no answer, “e.g., visiting nurse,” “patient & caregiver,” “patient or caregiver,” “patient &/or caregiver,” “family medicine,” “family practice,” “family doctor,” “family physician,” “alone with family,” “verbalizes understanding,” and “verbalized understanding.”
To avoid falsely labeling “family” as informal caregivers, we removed each occurrence of “family history” and all words that follow until a new line character.
Remove all occurrences of “aid” and “aids” in a lowercase note when any of the following shows up: “sleeping aid,” “sleeping aids,” “sleep aid,” “sleep aids,” “hearing aid,” “hearing aids,” “hear aid,” and “hear aids.”
Substitute the following patterns with “patient” to avoid falsely picking up “want” in the proximity of “patient” or “pt”: “want the patient,” “wants the patient,” “wanting the patient,” “want the pt,” “wants the pt,” “wanting the pt,” “want pt,” “wants pt,” and “wanting pt.”
Substitute “pt or ot,” “pt and ot,” or “pt/ot” with “physical therapy and ot” to avoid falsely picking up “pt” as “patient.”
Substitute “e-mail” with “email” before tokenization to avoid “e-mail” being split into “e” and “mail.”
Substitute variants of “patient partner” (eg, “patient’s partner,” “patients partner,” and “patient\ns partner” with “\n” being a new line character) with itself.
Oftentimes, the evidence of a “visiting nurse” may present itself as a variant (eg, “visit from a home care nurse”). To avoid missing such cases, for each sentence containing “nurse,” we searched for variants of “visiting” (eg, “visit”) before the occurrence of “nurse” within the sentence or searched for variants of “visiting” after the occurrence of “nurse” within that sentence and the sentence that follows.
To avoid falsely labeling informal caregiver or home when an institutional caregiver is present, in each lowercase medical note, we removed all occurrences of “patient,” “pt,” “care giver,” “caregiver,” and “guardian” in any sentence that included an institutional n-gram.
If any institutional term in the dictionary showed up in the note, there is evidence of institutional caregiver. To rule out false positives as potential, past, or unapproved institutional caregivers or when the service is for rehabilitation purpose only (eg, “patient discharged from Glacier Hills,” “returned home from Glacier Hills,” and “on the waiting list for Glacier Hills”), we looked for variants of “return from” (eg, “returning from” and “returned from”), “discharged from” (eg, “discharges from” and “discharging from”), “waiting list” (eg, “wait list” and “waitlist”), “cancel,” “decline,” “approve,” require,” “suggest,” “request,” and “rehabilitation” (eg, “rehab”) in the sentence containing an institutional term. If any of the variants shows up in the sentence, all occurrences of the institutional term would be disregarded in the note.
To see whether there is evidence of home self-care (home=1) when an institutional caregiver is absent, we looked for patient verbs within a prespecified (n=4) word window of “patient,” “pt,” or “patient’s.” If there is at least one patient verb within a certain neighborhood, there is evidence that the patient is involved in the decision-making of their own health to some extent. In addition, the occurrence of “relationship to patient:” followed by “patient,” “pt,” or “self” in the note also constitutes evidence of home self-care.
The main reasons for algorithm misclassifications using annotated medical notes as our gold standard will be discussed and summarized.
To test the generalizability of our algorithm in other notes, we examined what percentage of the data dictionary features can be found in other medical notes. None of these notes were annotated. The findings provided some preliminary data for the next step, which is using other medical notes to make the algorithm more generalizable.
This research is approved by Michigan Medicine’s Institutional Review Board (HUM00129193).
We used R package version 3.6.3 (The R Foundation) to develop and test our NLP algorithm.
To examine the generalizability of our algorithm using other medical notes, we measured the percentage of the features defined in our data dictionary in 5 different types of medical notes (patient care conference, pharmacy, psychiatric ED clinician, social work, and student) for our patient cohort (
Descriptive characteristics of individuals included in training and test sets (N=223a).
Characteristics | Values | |
Age at the time of hospital admission, mean (SD) | 77.96 (10.94) | |
|
||
|
Female | 128 (57.4) |
|
Male | 95 (42.6) |
|
||
|
White | 176 (78.9) |
|
Black | 33 (14.8) |
|
Hispanic | 3 (1.4) |
|
Others | 11 (4.9) |
Length of stay in hospital, mean (SD) | 6.78 (6.54) | |
|
||
|
Medicare+private | 103 (46.2) |
|
Medicare+Medicaid | 34 (15.3) |
|
Medicare only | 53 (23.8) |
|
Private only | 6 (2.7) |
|
Others or missing | 27 (12.1) |
Readmitted or died within 30 days after hospital discharge, n (%) | 54 (24.2) |
aNumber of unique individuals in the sample. Each person has one or more “Telephone Encounter” medical notes.
Model performance summary for training and test sets.
Model | Training (N=749) | Test (N=227) | |||||||||
|
Place of residence | Caregiver | Place of residence | Caregiver | |||||||
Home | Institution | Formal | Informal | Home | Institution | Formal | Informal | ||||
|
0.942 | 0.675 | 0.746 | 0.951 | 0.873 | 0.638 | 0.640 | 0.942 | |||
Accuracyb | 0.923 | 0.964 | 0.923 | 0.964 | 0.837 | 0.899 | 0.841 | 0.947 | |||
Sensitivityc | 0.947 | 0.609 | 0.680 | 0.971 | 0.870 | 0.512 | 0.571 | 0.970 | |||
Specificityd | 0.875 | 0.987 | 0.971 | 0.960 | 0.778 | 0.978 | 0.930 | 0.928 |
a
bNumber of observations, both positive and negative, correctly classified. Accuracy = (true positive + true negative) / (true positive + false positive + true negative + false negative).
cAbility of the model to predict a true positive of each category.
dAbility of the model to predict a true negative of each category.
Potential causes of misclassification with explanation and examples.
Cause of error | Example | Explanation |
Incomplete or misspelled names |
“Pt stated that the nurse from Residential had difficulty drawing her blood.” “Medications are managed by staff at Gilbert House.” “Hartland of Ann Arbor” |
Residential is short for “Residential Home Health.” If we add only “Residential” to our data dictionary, the false positive would increase. Gilbert House is not in the dictionary. Formal name is Gilbert Residence. Hartland is not in the dictionary. Formal name is Heartland Health Care Center. |
Past, uncertain, or undecided situations |
Will also need “in Home Care” order. “He shares that he has explored home health agencies (found them to be not suitable to what he is seeking).” “I love that her long-term goal is already established, and Glacier Hills is her final choice.” “Will have a visit nurse in the near future (will be at sister’s house).” |
“Home care” picked up by NLP as formal=1. Falsely picked up “home health” as formal=1. Falsely picked up institution=1 and formal=1. Falsely picked up visiting nurse as formal=1. |
Lack of specificity |
“Spoke with Donna who was caring for Mr. xxx.” “Ellen manages medications using monthly organizer.” “Calling from rehab facility and has some questions regarding wound care.” |
It is not clear whether “Donna” is a formal or informal caregiver. The algorithm picked up formal=1. Algorithm missed Ellen as a formal caregiver (formal=0). In some cases, patient stays in the rehab facilities (institution=1 and formal=1), and in some cases, patient stays at home and goes to the rehab facility (institution=0 and formal=0). Due to this ambiguity, we did not include rehab facility in the dictionary. |
Uncommon abbreviations |
“pt’s dtr” “her dau is working during the day.” |
dtr and dau are short for daughter. They were not listed in the dictionary. |
Results of the natural language processing caregiver algorithm in other medical notes (the results show what percentage of the data dictionary features can be found in other medical notes).
Note type | Count, n | Overall, n (%)a | Resides at home, n (%) | Resides in an institution, n (%) | Formal caregiver, n (%) | Informal caregiver, n (%) |
Telephone encounter | 2000 | 1744 (87.2) | 1326 (66.3) | 426 (21.3) | 704 (35.2) | 1612 (80.6) |
Patient care conference | 2130 | 1825 (85.7) | 1442 (67.7) | 481 (22.6) | 688 (32.3) | 1768 (83.0) |
Pharmacy | 483 | 411 (85.0) | 128 (26.4) | 320 (66.2) | 333 (68.9) | 140 (29.0) |
Psychiatric EDb clinician | 488 | 351 (71.9) | 394 (80.7) | 41 (8.4) | 55 (11.3) | 345 (70.7) |
Social work | 783 | 621 (79.3) | 612 (78.2) | 147 (18.8) | 212 (27.1) | 593 (75.7) |
Student | 1201 | 921 (76.7) | 873 (72.7) | 160 (13.3) | 240 (20.0) | 852 (70.9) |
aThe overall percentage represents the proportion that at least one of the features in our data dictionary was used in the listed medical notes, while the feature-specific percentage indicates the proportion of notes containing information regarding the specific outcome. This was done to test the generalizability of the algorithm in other medical notes for future work.
bED: emergency department.
This project was the first to assess whether medical notes can be used to identify caregiver availability and place of residence. We used a rule-based NLP algorithm on a subset of telephone encounter notes recorded between 2016 and 2019 for patients diagnosed with dementia to determine caregiver availability (formal and informal) and place of residence (home or institution). Our algorithm reliably identified the availability of an informal caregiver (
Hospitals and health systems have made, and continue to make, substantial investments in their EMR systems. Although a systematic collection of salient medical and social data remains a work in progress, successful efforts using NLP algorithm have enabled efficient mining of rich free-text medical notes for various risk assessment or decision-making tools aimed at reducing the occurrences of adverse health events and wasteful spending [
For many older patients, especially people with cognitive decline or disability, information on caregiver availability has numerous applications. For instance, a recent work by Choi et al [
For this study, we used telephone encounter notes, initiated based on phone conversations with patients or their caregivers. Most of these notes are written by nurses based on their conversations with patients or family members of the patient at different time points. Perhaps because telephone encounter notes were based on direct conversations with patients or family members (or other informal caregivers), the algorithm was highly accurate in identifying informal caregivers. Furthermore, since usually informal caregivers are close family members, we had a better data dictionary to identify them in text. On the contrary, considering the vast number of places that offer a range of services from adult day care centers to independent living, perhaps we overfitted the model in training set. Hence, there was a drop in accuracy for the other 3 variables in the test set. More work is needed to detect short- and long-term residential places or paid caregiver organizations or agencies.
Furthermore, in many cases, it was hard to manually—through human interpretation—decipher the notes. Medical notes either have no standard template or the existing templates were not standardized or used irregularly. Various health care professionals (residents, physicians, nurses, social workers, etc) with limited resources and under time pressure write these notes. Hence, nonstandardized abbreviations (ie, “dau” for “daughter”), spelling errors, and incorrect and uncommon names are used regularly. Many of these issues cannot be addressed using off-the-shelf packages or programs. By contrast, although not generalizable, the rule-based NLP algorithm served as a proof of concept for addressing many institution-specific terminologies. we plan to address many of the following limitations in our future work.
Our study had a few noteworthy limitations. First, medical notes are based on unstructured text. We found large variations in the amount and type of information provided [
In this study, we used a rule-based approach to train, test, and develop an NLP algorithm using telephone encounter notes from our institution to identify whether a patient has a formal and informal caregiver and whether the patient resides at home or in an institution at each point in time. Our validated test results show high levels of accuracy and reliability, particularly in identifying whether a patient has an informal caregiver. This information is critical for vulnerable patient populations such as those with dementia. Our algorithm can be used as a stand-alone module or in conjunction with other tools to identify caregiver availability among high-risk patient populations. Future work will focus on making the algorithm more granular and generalizable so it can be used at other institutions.
Supplemental Tables 1-3.
electronic medical record
Michigan Medicine
natural language processing
We would like to thank Victoria Bristley, a registered nurse and clinical informaticist, for her help in annotating the medical notes.
The research reported in this publication was supported by National Institute on Aging of the National Institutes of Health (P30AG066582 and K01AG06361) and the Alzheimer's Association Research Grant (AARG-NTF-20-685960). Sponsors had no role in research design, data collection, data analysis, and research findings.
Since medical notes contain identifiable patient information, data sharing is not applicable. The data dictionary is available in the web-based appendix. The natural language processing algorithm is also available [
EM and JB conducted the design and development of the research strategy. WW, CN, and EM were responsible for algorithm development and data validation. All authors performed the drafting and revising of the manuscript.
None declared.