This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Aging, is properly cited. The complete bibliographic information, a link to the original publication on https://aging.jmir.org, as well as this copyright and license information must be included.
Half of long-term care (LTC) residents are malnourished, leading to increased hospitalization, morbidity, and mortality, and reduced quality of life. Current tracking methods are subjective and time-consuming.
This paper presents the automated food imaging and nutrient intake tracking technology designed for LTC.
A needs assessment was conducted with 21 participating staff across 12 LTC and retirement homes. We created 2 simulated LTC intake data sets comprising modified (664/1039, 63.91% plates) and regular (375/1039, 36.09% plates) texture foods. Overhead red-green-blue-depth images of plated foods were acquired, and foods were segmented using a pretrained food segmentation network. We trained a novel convolutional autoencoder food feature extractor network using an augmented UNIMIB2016 food data set. A meal-specific food classifier was appended to the feature extractor and tested on our simulated LTC food intake data sets. Food intake (percentage) was estimated as the differential volume between classified full portion and leftover plates.
The needs assessment yielded 13 nutrients of interest and requirements for objectivity, repeatability, and accounting for real-world environmental constraints. For 12 meal scenarios with up to 15 classes each, the top-1 classification accuracy was 88.9%, with mean intake error of −0.4 (SD 36.7) mL. Nutrient intake estimation by volume was strongly linearly correlated with nutrient estimates from mass (
The automated food imaging and nutrient intake tracking approach is a deep learning–powered computational nutrient sensing system that appears to be feasible (validated accuracy against the gold-standard weighed food method, positive end user engagement) and may provide a novel means for more accurate and objective tracking of LTC residents’ food intake to support malnutrition prevention strategies.
Malnutrition leads to high morbidity [
However, there is a lack of objective and quantitative tracking methods for food and fluid intake, especially for centralized intake tracking by proxy (ie, multiple staff tracking a set of residents’ intakes). Registered dietitian (RD) referrals are triggered and nutritional support system effectiveness is monitored based on nutritional assessment best practices including unintentional weight loss and usual low intake of food [
Furthermore, there is a lack of trust in current methods because they are known to have poor accuracy and validity [
Automated tools may provide a palatable solution that removes subjectivity and has higher accuracy than human assessors. This may also enable time-efficient measurement of food intake at the energy, macronutrient, and micronutrient levels [
The purpose of this study was to describe the final stage of feasibility testing of the automated food imaging and nutrient intake tracking (AFINI-T) system comprising pixel-wise food classification and nutrient linking through intake prediction, for providing food and nutrient intake estimation with specific feasibility considerations for use in LTC. Our proposed AFINI-T technology measures food intake compared against gold-standard ground truth weighed food records, addresses automatic segmentation with integrated red-green-blue-depth (RGB-D) assessments, was evaluated in both regular texture foods (RTFs) and modified texture foods (MTFs), and describes the valence of the system within the user context.
This study used an iterative action research design, blending mixed methods needs assessment with technical implementation and experimental evaluation.
This study received ethics clearance from the University of Waterloo’s Office of Research Ethics Board (23124).
Insights motivating the technical approach described in this paper were gathered through interviews and workshop discussions with Schlegel Village team members during our previous user study but were not included in that paper [
As described in the study by Pfisterer et al [
Example images in the data sets used for training the convolutional autoencoder (ie, UNIMIB+ [UNIMIB2016 with additional green representation]) [
We used our RTF data set (9 foods; 9 classes; 375 images) and our MTF data set (47 foods; 93 classes; 664 images).
For each food item, 1 full serving was defined by the nutritional label portion size (RTF data set) or the recipe-defined portion size received from the kitchen and was weighed to the nearest 1 g using an Ohaus Valor Scale.
For the RTF data set, in which a serving size was referenced using volume, that volume of food (eg, corn) was weighed, and the mass was used thereafter. As manufacturers supply nutritional information for minerals as percentage of daily value (assuming a 2000-calorie diet), for the RTF data set, minerals were reported similarly. For more details on conversion, refer to Table S1 in
For the MTF data set, we expanded our original MTF data set [
Overview of data set characteristics. The UNIMIB+a data set was used for training and validation [
Data set overview | UNIMIB+ | RTFb | MTFc | RTF+MTF |
Number of images | 1214 | 375 | 664 | 1039 |
Number of samples | N/Ad | 3 | 134 | 137 |
Number of classes | 76 | 9 | 93 | 102 |
Number of foods represented | 76 | 9 | 47 | 56 |
Number of foods with recipes | N/A | 9 | 27 | 36 |
aUNIMIB+: UNIMIB2016 with additional green representation.
bRTF: regular texture food.
cMTF: modified texture food.
dN/A: not applicable.
List of foods in the RTFa and MTFb data sets used for testing the AFINI-Tc system.
Food component | RTF with recipes | MTF with recipes | Additional MTF with segmentations
Cheese tortellini with tomato sauce | Bow-tie pasta with carbonara sauce | Basmati rice
Oatmeal | Macaroni salad | —d
Whole wheat toast | Vegetable rotini | —
Corn | Asian vegetables | Beet and onion salad
Mashed potatoes | Baked polenta with garlic | Cantaloupe chunks
Mixed greens salad | California vegetables | Green beans with pimento
— | Greek salad | Grilled vegetable salad
— | Mango and pineapple | Roasted cauliflower
— | Red potato salad | —
— | Sauteed spinach and kale | —
— | Seasoned green peas | —
— | Stewed rhubarb and berries | —
— | Strawberries and bananas | —
— | Sweet and sour cabbage | —
Meat loaf | Baked basa | Bean and sausage strata
Scrambled egg | Braised beef liver and onions | Grilled lemon and garlic chicken
— | Braised lamb shanks | Pork tourtiere
— | Hot dog wiener | Roast beef with miracle whip
— | Orange ginger chicken | —
— | Salisbury steak and gravy | —
— | Teriyaki meatballs | —
— | Tuna salad | —
Oatmeal cookie | Barley beef soup | Black bean soup
— | Blueberry coffee crumble cake | Broken glass parfait (mixed gelatin)
— | Eggplant parmigiana | Butternut squash soup
— | English trifle | Cranberry spice oatmeal cookie
— | Lemon chicken orzo soup | Lemon meringue pie
— | — | Peach jello
— | — | Pear crumble cake
— | — | Roast beef with miracle whip on whole wheat
— | — | Turkey burger on wheat bun
aRTF: regular texture food.
bMTF: modified texture food.
cAFINI-T: automated food imaging and nutrient intake tracking.
dThere were varying numbers of items in the data sets.
We expanded the UNIMIB2016 data set (1027 tray images; 73 classes) [
Effect of underrepresentation of green foods in the UNIMIB2016 database on decoder output on segmented food from plates. The decoder output from the autoencoder trained on the UNIMIB+ (UNIMIB2016 with additional green representation) data set in the bottom appears less murky and more vibrant, with truer perceived greens than the UNIMIB2016 counterpart in the middle.
The following sections describe how the segmentation strategy was refined compared with our initial work [
System diagram showing the processing pipeline from image acquisition to food classification. EDFN-D: depth-refined encoder-decoder food network; RGB-D: red-green-blue-depth; UNIMIB+: UNIMIB2016 with additional green representation.
Modifications to the training process were made to enhance network performance. We introduced early stopping criteria to halt training and avoid overfitting, yielding a network trained over fewer epochs than an overtrained counterpart; the network outputs a pixel-level image mask labeling each pixel as food or no food, with calibrated depth [
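The early stopping rule can be sketched as a patience counter over validation loss. This is a minimal illustration mirroring the reported criteria (change of loss <0.0001, patience of 5 epochs); the study's actual implementation is not published, and the function name is ours:

```python
def train_with_early_stopping(val_losses, min_delta=1e-4, patience=5):
    """Return the epoch index at which training halts.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if best - loss > min_delta:  # meaningful improvement: reset patience
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # halt here
    return len(val_losses) - 1  # ran to completion
```

In practice, the weights from the best epoch (rather than the halting epoch) would be restored.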
Here, the UNIMIB+ data were used to train the autoencoder. Using the autoencoder’s trained weights, the last layer of the autoencoder (120×160×3) was spliced to use the feature map as a latent feature extractor for classification (refer to
Convolutional autoencoder network for learned feature representation and in the context of classification. (A) The architecture for learning feature representation: an input image is given and the output is a reconstruction of that image. Training minimized the error between input and output images; we used mean squared error loss with Adam optimizer, learning rate of 0.0001, and batch size of 32. The early stop criteria used were change of loss of <0.0001 and patience of 5 epochs. (B) The autoencoder was spliced; weights were frozen; and only a classification layer for nc classes was trained for classification, where nc is the number of food items for meal c. We used categorical cross-entropy (ignoring background pixels) loss, with Adam optimizer and learning rate of 0.1. The early stop criteria used were a change of loss of <1×10−5 and patience of 5 epochs. We used 70%:30% train to validation split of augmented data. The data were augmented by generating 300 images from the full set of plates and applying random flips, rotations, and increased or decreased contrast. The outputs are distinct classes, which were mapped onto the meal-specific classifier (in this example, as ravioli [blue], salad [green], and oatmeal cookie [yellow]). ReLU: rectified linear unit; RGB: red-green-blue.
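The augmentation described in the caption above (300 images generated per meal via random flips, rotations, and contrast changes) might be sketched as follows; the contrast range, image size, and probabilities are illustrative assumptions, not reported parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, rng):
    """One random augmentation pass: flips, a rotation by a multiple of
    90 degrees, and increased or decreased contrast (range assumed)."""
    if rng.random() < 0.5:
        image = image[:, ::-1]            # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :]            # vertical flip
    image = np.rot90(image, k=int(rng.integers(0, 4)))
    contrast = rng.uniform(0.8, 1.2)      # increase or decrease contrast
    return np.clip(image * contrast, 0.0, 1.0)

# Generate 300 augmented images by cycling over the full set of plates
plates = [np.full((8, 8, 3), 0.5)]        # placeholder plate image in [0, 1]
augmented = [augment(plates[i % len(plates)], rng) for i in range(300)]
```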
We report nutrient intake accuracy using the automated system (ie, the automated classification case) to enhance pragmatic feasibility (ie, reduced user input). For this automated approach, we developed a semantic segmentation network with a convolutional autoencoder feature extractor for classification of foods, which was roughly inspired by a highly successful convolutional neural network (CNN), the Visual Geometry Group network [
We trained an autoencoder to be a feature extractor using the UNIMIB+ data set consisting of 1214 images. Data were divided into 70% training and 30% validation. Training was performed using the Adam optimizer with batch size of 32, mean squared error loss, and early stopping (<0.0001 validation loss change) with 5-epoch patience. Only food pixels were used in the loss calculation using the ground truth masks. After training, the convolutional autoencoder network was spliced before the final 1×1 convolution block to produce original resolution 16-channel latent feature vectors. The weights of this network were frozen and used as a feature extractor for the classification training.
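With the extractor frozen, the appended meal-specific classifier reduces at inference time to a per-pixel linear layer over the 16-channel latent features followed by an argmax over the nc menu items, with background pixels ignored. A minimal NumPy sketch (names and shapes are illustrative):

```python
import numpy as np

def classify_pixels(features, weights, bias, food_mask):
    """Per-pixel food classification on frozen latent features.

    features:  (H, W, 16) latent features from the frozen extractor
    weights:   (16, nc) trained classification layer for nc meal foods
    bias:      (nc,) classification layer bias
    food_mask: (H, W) boolean mask from the segmentation stage;
               background pixels are ignored, as in training.
    Returns an (H, W) integer label map (-1 marks background).
    """
    logits = features @ weights + bias      # (H, W, nc)
    labels = np.argmax(logits, axis=-1)     # top-1 class per pixel
    labels[~food_mask] = -1                 # drop background pixels
    return labels
```

Because only this small layer is trained per meal, the classifier is light enough for mobile use and easy to retrain as menus change.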
Given the many food options and the continual planning of new meals, we needed a flexible, modular approach that requires only 1 labeled example per item; the AFINI-T method uses a single full reference portion to classify foods and infer intake. For nutritional intake estimation, we leveraged the homes’ known nutritional information from menu planning software (or supplied by the manufacturer) to link proportional nutrient intake. We assumed that recipes were followed exactly.
Denoting the number of menu items for meal
This step comprised three general stages: (1) determine the relative consumption of each food item compared with a full reference portion, using food volume estimation from the depth maps; (2) compare relative consumption with nutritional information, to infer nutritional intake for each item; and (3) sum the inferred nutritional intake for each item across a plate for estimation of total nutrition consumed during a meal (for MTF, this was across the plate of one food item).
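The 3 stages above can be sketched in Python; the food items and nutrient values below are placeholders for illustration, not the study’s recipes:

```python
# Per-item nutrient content for one full portion (illustrative values)
NUTRIENTS = {
    "mashed potatoes": {"calories": 210.0, "protein": 4.0},
    "meat loaf":       {"calories": 280.0, "protein": 22.0},
}

def nutrient_intake(full_volumes, leftover_volumes, nutrients):
    """Stages 1-3: relative consumption per item from volume differentials,
    scaled by that item's nutrition per full portion, summed over the plate."""
    totals = {}
    for item, full in full_volumes.items():
        # Stage 1: relative consumption vs the full reference portion
        consumed = max(full - leftover_volumes.get(item, 0.0), 0.0) / full
        # Stage 2: map relative consumption to nutrient intake per item
        for nutrient, per_portion in nutrients[item].items():
            # Stage 3: accumulate across the plate
            totals[nutrient] = totals.get(nutrient, 0.0) + consumed * per_portion
    return totals

intake = nutrient_intake(
    {"mashed potatoes": 120.0, "meat loaf": 100.0},  # full portion volumes (mL)
    {"mashed potatoes": 60.0, "meat loaf": 0.0},     # leftover volumes (mL)
    NUTRIENTS,
)
# mashed potatoes: 50% consumed; meat loaf: 100% consumed
```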
Segmentation accuracy was assessed using intersection over union (IOU). Classification accuracy was described using top-1 accuracy and summarized using per-meal classifiers. Bulk intake accuracy (ie, class-agnostic, overall food volume intake) was assessed using mean absolute error (mL) and 3D, % intake error, described in the study by Pfisterer et al [
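The segmentation and bulk intake metrics reduce to short computations; a sketch in NumPy for illustration (the study’s analyses were run in MATLAB):

```python
import numpy as np

def iou(pred, truth):
    """Intersection over union for boolean segmentation masks."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union if union else 1.0

def intake_errors(true_intake_ml, est_intake_ml):
    """Signed mean error and mean absolute error of bulk intake (mL)."""
    err = np.asarray(est_intake_ml, float) - np.asarray(true_intake_ml, float)
    return float(err.mean()), float(np.abs(err).mean())
```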
All data were analyzed using MATLAB 2020b (MathWorks). Linear regression was used to determine the goodness of fit through the degree of correlation with
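The Bland-Altman analysis reduces to the bias (mean difference between measures) and the 95% limits of agreement; a sketch in Python for illustration (the study used MATLAB):

```python
import numpy as np

def bland_altman(mass_based, volume_based):
    """Bias and 95% limits of agreement between two measurement methods."""
    a = np.asarray(mass_based, dtype=float)
    b = np.asarray(volume_based, dtype=float)
    diff = b - a
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))        # sample SD of the differences
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)
```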
Several nutrients of concern in the RTF data set were reported in percentage daily value (ie, calcium, iron, vitamin B6, vitamin C, and zinc). We converted these values to absolute values to match the MTF data set using the 2005 Health Canada reference values for elements and vitamins. Where there was a difference across age, we used the reference for age >70 years; where there was a difference in requirement by sex, we used the average value.
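For illustration, this conversion reduces to a lookup and scale; the reference values below are placeholders standing in for the 2005 Health Canada table (age >70 years, sex-averaged where requirements differ), not verified entries from it:

```python
# Placeholder dietary reference values (mg per day) -- illustrative only
REFERENCE_MG = {"calcium": 1200.0, "iron": 8.0, "zinc": 9.5}

def percent_dv_to_mg(nutrient, percent_dv, reference=REFERENCE_MG):
    """Convert a percentage-daily-value label entry to an absolute mass (mg)."""
    return percent_dv / 100.0 * reference[nutrient]
```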
This study focused on the characterization of changes in volume at the whole plate level for bulk intake estimation, reporting degree of consumption (ie, proportion of food consumed) and nutritional intake estimation using a nutritional lookup table at the food item and whole plate level. Specific needs informed by workshop and interview responses included the following:
The system shall consider evidence-based and practice-relevant priority nutrients (output: 13 nutrients of interest—macronutrients: calories, carbohydrates, fats, fiber, and protein; micronutrients: calcium, iron, sodium, vitamin B6, vitamin C, vitamin D, vitamin K, and zinc).
The system shall support current workflow in which the dietitian is the gatekeeper:
The system shall facilitate automated, objective, intake estimates.
The system shall facilitate dietitian referrals by providing repeatable nutrient-specific intake insights.
The system shall work independently of internet connection.
The system shall incorporate real-world constraints and parameters:
The system shall include a salient feature extractor that can be trained in advance and supports real-time use.
The system shall use a classification method that is light in weight for mobile app use.
The system shall include an easily updatable classifier to account for a priori menu plans considering the time of day and therapeutic diet.
The following quantitative results provide an overview of the AFINI-T system’s food and nutrient intake estimation, including segmentation, classification, volume estimation, bulk intake, and nutrient intake accuracies.
Average segmentation and classification accuracies within and across data setsa.
Data set and meal | Classes (N=102), n | Images (N=1039), n | Segmentation accuracy (IOUb), mean (SD) | Classification accuracy (top 1), %
RTFc | 9 | 375 | 0.929 (0.027) | 93.9
Breakfast | 3 | 125 | 0.944 (0.019) | 93.5
Lunch | 3 | 125 | 0.919 (0.033) | 93.5
Dinner | 3 | 125 | 0.928 (0.019) | 95.1
MTFd | 93 | 664 | 0.879 (0.101) | 88.9
Day 1—lunch | 5 | 25 | 0.841 (0.123) | 89
Day 1—dinner | 15 | 90 | 0.823 (0.099) | 70.2
Day 2—lunch | 12 | 74 | 0.863 (0.118) | 70.6
Day 2—dinner | 12 | 90 | 0.840 (0.122) | 64.9
Day 3—lunch | 10 | 85 | 0.834 (0.132) | 80.4
Day 3—dinner | 15 | 109 | 0.859 (0.100) | 70.4
Day 4—lunch | 9 | 60 | 0.871 (0.113) | 72.2
Day 4—dinner | 10 | 90 | 0.837 (0.107) | 67.8
Day 5—lunch | 5 | 41 | 0.881 (0.117) | 87.8
aThere were no samples for day 5–dinner.
bIOU: intersection over union.
cRTF: regular texture food.
dMTF: modified texture food.
As shown in
Low-density foods pose challenges to depth scanning systems. Here, volume estimation was within tolerance, with a food volume error of 2.5 (SD 9.2) mL, and low-density foods (eg, salad) had the largest food volume error observed for RTF: lunch at −10.1 (SD 22.2) mL. A similar low-density issue appears in the 3D, % absolute intake error of 14.4% (SD 13.1%), which we suspect is owing to the air pocket below some pieces of toast placed at a tangential angle to the plate or stacked with overhang, as shown in
The occlusion conundrum, as demonstrated by stacked toast with an overhang. As volumetric food estimation is based on pixel-wise classification, the pixels of the overhang are assumed to contain toast. This is a limitation to overhead imaging and provides a simplified example of low-density foods (eg, salad) as does rigid toast placed as an inclined plane. This is seen in the depth images; bright parts denote pixels close to the camera (ie, high food pixels). We see a gradient from low to high near the tip with a similar, but less obvious, trend in the third depth image. The depth map range was adjusted to exemplify the toast height.
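The overhead volume estimate that this occlusion affects amounts to integrating per-pixel food height over the segmentation mask. A simplified sketch (the actual system uses calibrated RGB-D depth; names and units here are illustrative):

```python
import numpy as np

def plate_volume_ml(height_mm, food_mask, pixel_area_mm2):
    """Estimate food volume by summing per-pixel heights over food pixels.

    height_mm:      (H, W) height of the food surface above the plate (mm)
    food_mask:      (H, W) boolean segmentation mask
    pixel_area_mm2: real-world area covered by one pixel

    Overhanging food (eg, inclined or stacked toast) inflates this estimate,
    because the air pocket beneath the overhang is counted as food.
    """
    volume_mm3 = height_mm[food_mask].sum() * pixel_area_mm2
    return float(volume_mm3 / 1000.0)  # 1 mL = 1000 mm^3
```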
Bulk intake accuracy within and across data setsa.
Data set and meal | Classes (N=102), n | Images (N=1039), n | Absolute error (food volume; mL), mean (SD) | Absolute error (intake; mL), mean (SD) | Error (intake; mL), mean (SD) | 3D, % absolute intake error, mean (SD) | 3D, % intake error, mean (SD)
RTFb | 9 | 375 | 6.6 (13.6) | 39.9 (39.9) | –7.2 (56) | 13.1 (10.9) | –2.5 (16.8)
Breakfast | 3 | 125 | 3 (4.1) | 17 (14.3) | −15.1 (16.3) | 14.4 (13.1) | −12 (15.3)
Lunch | 3 | 125 | 11 (21.7) | 76.1 (48.5) | 18.1 (88.7) | 13.7 (9) | 7.6 (14.6)
Dinner | 3 | 125 | 6 (6) | 26.5 (14.4) | −24.5 (17.7) | 11.2 (9.9) | −2.9 (14.7)
MTFc | 93 | 664 | 2.1 (3.1) | 6 (5.6) | 4.4 (6.9) | 7.6 (8) | 5.9 (9.4)
Day 1—lunch | 5 | 25 | 1 (1.1) | 3.4 (3.3) | −0.9 (4.7) | 5 (4.6) | −0.3 (6.9)
Day 1—dinner | 15 | 90 | 1.9 (2.9) | 4.1 (3.7) | 2.5 (5) | 7.4 (14.1) | 6.3 (14.6)
Day 2—lunch | 12 | 74 | 2.2 (3.3) | 7.4 (7.3) | 6.1 (8.4) | 6.7 (5.5) | 5.1 (7)
Day 2—dinner | 12 | 90 | 1.2 (1) | 4.6 (4.3) | 2.9 (5.5) | 8.3 (7.7) | 5.5 (9.9)
Day 3—lunch | 10 | 85 | 3.8 (5.1) | 7.6 (6.3) | 5 (8.5) | 11.5 (10) | 10 (11.5)
Day 3—dinner | 15 | 109 | 1.9 (2) | 5.5 (3.8) | 3.9 (5.4) | 6.7 (4.7) | 4.9 (6.6)
Day 4—lunch | 9 | 60 | 1.5 (2.5) | 5.6 (7.5) | 4.8 (8) | 6.3 (3.9) | 5.3 (5.2)
Day 4—dinner | 10 | 90 | 2.1 (1.9) | 6.5 (4.8) | 5.8 (5.6) | 6 (4.7) | 5 (5.8)
Day 5—lunch | 5 | 41 | 3.4 (5.2) | 9.5 (6.7) | 7.8 (8.6) | 9.9 (5.9) | 7.7 (8.7)
RTF+MTF | 102 | 1039 | 3.8 (8.8) | 19.9 (30.8) | –0.4 (36.7) | 9.9 (9.7) | 2.4 (13.6)
aThere were no samples for day 5–dinner.
bRTF: regular texture food.
cMTF: modified texture food.
In
On the basis of the coefficients of determination shown in
Correlation and agreement between mass and volume estimates for determining nutritional intake at the whole plate level across all imaged samples. Left panel depicts the goodness of fit with linear regression and coefficient of determination (r2), and right panel depicts the degree of agreement between measures and bias from the Bland-Altman method. Correlation and agreement between mass and volume estimates of macronutrients are shown in the figure: (A) calories, (B) protein, and (C) fiber. In total, 3 nutrients of interest are shown here for brevity. RMSE: root mean square error.
Now, let us consider the feasibility of theoretical portability and task completion time by comparing the end-to-end AFINI-T system with the current workflow. A requirement identified in the study by Pfisterer et al [
The second benchmark concerns theoretical task completion time, for which we can compare with results from the study by Pfisterer et al [
Summary of length of time required to complete food and fluid intake charting for 1 neighborhood (unit) comprising 16 residents, compared with theoretical AFINI-Ta processing.
Type | Mode time (minutes) | Responses, n (%) | Time range (minutes) | AFINI-T estimate (1-second acquisition) | AFINI-T estimate (10-second acquisition) |
Food (per meal) | 10 to 14 | 3 (33)b | <10 to >25 | 2 minutes 34 seconds | 9 minutes 45 seconds |
Fluid (per meal) | 10 to 14 | 4 (40)c | <10 to 25 | N/Ad | N/A |
Snack (per snack) | <10 | 5 (55)b | <10 to 19 | 52 seconds | 3 minutes 15 seconds |
aAFINI-T: automated food imaging and nutrient intake tracking.
bSample size, n=9.
cSample size, n=10.
dN/A: not applicable.
The AFINI-T method for estimating food intake is in strong agreement and tightly correlated with true intake. Especially for larger intake portions, the AFINI-T method yielded nutrient content accuracy with <5% error. For context, comparisons with current visual assessment methods indicate errors in portion size 56% of the time for immediate estimation and 62% of the time for delayed recording, and state that current methods’ error is too high for accurately identifying at-risk residents [
For the current AFINI-T approach, we show that segmentation of only 1 reference image is required and that even when some pixels are misclassified, there is reasonable robustness in nutrient intake accuracy. These misclassifications tended to occur near the edges of a food segment regardless of data set, which may be from a less uniform representation near the edges either because of higher crumbliness (eg, meat loaf crumbs) or owing to the convolutional kernel extending into the
In the case of frequent nutrient database missing values (eg, vitamin D [
It is challenging to assess how AFINI-T compares with the literature because there are no food
Some accuracy for classification methods based on handcrafted features has been reported in the literature: 85% accuracy for 15 types of produce with minimum distance classifier [
Regarding accuracy reporting for segmentation and classification, these accuracies tend to not be mentioned [
For comparison, the AFINI-T system demonstrated an error of 2.4% across 13 nutrients in 56 categories (102 classes) of food with minimum
First, ground truth volume was assumed to be equivalent to the RGB-D camera assessment. Although we collected ground truth weighed food records, because this study aimed to assess overall feasibility from an accuracy perspective through the lens of end users, we did not account for ground truth volume. We were therefore working under the assumption that AFINI-T’s volume assessment was accurate. Volume validation against a gold-standard ground truth (eg, water displacement) is needed to corroborate the accuracy (although some evidence suggests <3% volume error for the RealSense [
Second, although the plated foods are representative of LTC offerings, intake was physically simulated through incremental plating in the research kitchen by the researchers. Further studies need to be conducted to evaluate the imaging technology in real-world LTC resident food intake.
Future directions include adding an additional stage for automatic food type classification as specific foods rather than arbitrary classes with associated nutritional values (ie, mashed potatoes are classified as mashed potatoes after the initial segmentation step). A human-in-the-loop version, where there is the opportunity to correct all misclassified regions (ie, the best-case scenario), can further improve results, albeit at the expense of manual hands-on time and effort, which needs to be minimized. In addition, improving the algorithms to handle more complex food types (eg, salads or soups in which the food comprises multiple components) and more complex plates of food to address food mixing as seen with mashed potatoes will improve AFINI-T’s ability to assess plates
As observed in
From a translational standpoint, AFINI-T is platformed to provide actionable data-driven insights that can help to inform menu planning by dietitians and director of food services. For example, it can be used to develop recipes that are more nutrient-dense and complement the nutrients in recent past meals. Creating nutrient-dense meals while minimizing cost is a priority in LTC, as there is a fixed allocation of food cost per resident. The raw food allocation in Ontario was CAD $9.54 (US $6.82) per resident per day in 2020 [
AFINI-T is a feasible deep learning–powered computational nutrient sensing system that provides an automated, objective, and efficient alternative for food intake tracking. Novel contributions of this approach include a system with decoupled segmentation, classification, and nutrient estimation for monitoring error propagation and a convolutional autoencoder network for classifying regular and modified texture foods with top-1 accuracy of 88.9% and mean intake error of −0.4 (SD 36.7) mL, with nutritional intake accuracy in strong agreement with the gold-standard weighed food method and good agreement between methods (
This multimedia appendix contains two parts: S1—standardizing nutrient values and S2—nutrient intake accuracies. S1 consists of Table S1, describing the workflow in converting percentage daily values to absolute measurements using Health Canada’s dietary reference intake values [
AFINI-T: automated food imaging and nutrient intake tracking
CNN: convolutional neural network
EDFN: encoder-decoder food network
IOU: intersection over union
LTC: long-term care
MTF: modified texture food
RD: registered dietitian
RGB-D: red-green-blue-depth
RTF: regular texture food
UNIMIB+: UNIMIB2016 with additional green representation
This study was funded by the Natural Sciences and Engineering Research Council of Canada (PGSD3-489611-2016 and PDF-503038-2017), the Canadian Institutes of Health Research (202012MFE-459277), the Ontario Graduate Scholarship, and the Canada Research Chairs program.
The data sets can be made available upon reasonable request.
KP and RA contributed equally to this study. KP conceptualized the system; RA and AW provided additional contributions to system design. KP was the main contributor to experimental design and contributed to algorithmic design. RA was the main contributor to algorithmic design, with additional contributions from KP and AW. AC provided additional support on data collection and preliminary analyses during the initial conceptualization of this project. KP was the main contributor to data analyses; KP and RA conducted data analyses. HK provided clinical nutrition direction and perspective. JB oversaw the user study elements. AW oversaw the project as a whole. KP was the main contributor to writing the manuscript, with additional contributions from RA. All authors reviewed the manuscript.
None declared.