How We Score Apps
Last updated April 21, 2026 · Edited by Margaret Halloran, PhD, RD, LDN
This document is the working rubric every Clinical Nutrition Report ranking, single-app review, and head-to-head comparison is built against. We publish it in full because we believe a 100-point score is only as defensible as the procedure that produced it. If you want to know why we rated MyFitnessPal one way and Cal AI another, this is the document that should answer that question.
Every app on this site is evaluated against six weighted criteria. The weights are fixed across categories so that scores remain comparable, and they are deliberately set to penalize the failure modes that matter most to clinicians: inaccurate calorie estimates and brittle databases. The weights are reviewed annually by the editorial board; the next scheduled review is September 2026.
The 100-point rubric
| Criterion | Weight | What we are measuring |
|---|---|---|
| Accuracy | 25% | Mean absolute percentage error (MAPE) of an app's reported calorie estimates against weighed reference meals. |
| Database quality | 20% | Coverage, verification status, freshness of entries, and resilience against user-submitted noise. |
| AI photo recognition | 20% | Top-1 and top-3 dish identification accuracy, portion-size estimation error, and graceful failure behavior. |
| Macro tracking | 15% | Granularity, protein-leverage support, custom-macro target editing, and per-meal breakdown clarity. |
| User experience | 10% | Speed of common entry workflows, friction-of-correction, accessibility, and absence of dark patterns. |
| Price | 10% | Annual cost normalized against feature parity. We compute "dollars per usable feature" rather than headline price. |
The composite score is the weighted sum, rounded to one decimal place. We do not award fractional points within a criterion; each criterion is scored 0–100 in increments of one. An app cannot score above 100 on any single criterion, and we do not apply curve-grading across an entire ranking.
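As a concrete illustration of that weighted sum, here is a minimal sketch in Python; the per-criterion scores are hypothetical and the function names are ours, not part of any app we review.

```python
# Sketch of the composite calculation described above.
# The per-criterion scores here are hypothetical, not from a real review.
WEIGHTS = {
    "accuracy": 0.25,
    "database_quality": 0.20,
    "ai_photo_recognition": 0.20,
    "macro_tracking": 0.15,
    "user_experience": 0.10,
    "price": 0.10,
}

def composite(scores: dict[str, int]) -> float:
    """Weighted sum of 0-100 criterion scores, rounded to one decimal place."""
    total = sum(WEIGHTS[criterion] * scores[criterion] for criterion in WEIGHTS)
    return round(total, 1)

example = {
    "accuracy": 82, "database_quality": 74, "ai_photo_recognition": 68,
    "macro_tracking": 90, "user_experience": 77, "price": 60,
}
print(composite(example))  # 76.1
```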
How we measure accuracy
Accuracy is the highest-weighted criterion because it is the criterion on which all other claims depend. An app with the best UX in the world cannot recommend a calorie target if it cannot count calories. We measure accuracy by submitting a fixed test battery of weighed reference meals to each app and comparing reported calorie estimates against the laboratory-derived ground truth.
Our reference meal set is built from USDA FoodData Central composition values, with portions weighed on a calibrated kitchen scale (precision 0.1 g). The battery rotates quarterly and currently includes 36 meals stratified across three difficulty tiers:
- Tier 1 (single-ingredient): 12 meals such as one medium banana, 100 g grilled chicken breast, and one large egg. These are gimme points; an app that misses Tier 1 has structural problems.
- Tier 2 (composed plate): 12 meals such as a chicken-and-rice bowl with vegetables, a turkey sandwich on whole-wheat bread, a bowl of oatmeal with berries and almond butter. These test database resolution and portion judgment.
- Tier 3 (mixed dish, hidden ingredients): 12 meals such as lasagna, chicken biryani, vegetable curry, beef chili. These test inferential reasoning about hidden fat, sauce, and cooking-method calorie load.
For each meal, we record the ground-truth kilocalorie value and the value reported by each app, then compute mean absolute percentage error (MAPE) by tier and overall. An app's accuracy score is anchored at 100 - (overall MAPE * 4), capped at 100 and floored at 0. A 5% MAPE earns 80 points; a 15% MAPE earns 40 points; a 25% MAPE or worse earns zero.
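For readers who want the arithmetic spelled out, the sketch below computes MAPE and the anchored accuracy score; the meal values are hypothetical, and the final rounding to a whole point follows the increments-of-one rule stated above.

```python
def mape(truths: list[float], estimates: list[float]) -> float:
    """Mean absolute percentage error, expressed as a percentage."""
    errors = [abs(est - truth) / truth for truth, est in zip(truths, estimates)]
    return 100 * sum(errors) / len(errors)

def accuracy_score(overall_mape: float) -> float:
    """Anchor at 100 - 4 * MAPE, capped at 100 and floored at 0."""
    return max(0.0, min(100.0, 100 - 4 * overall_mape))

# Hypothetical example: an app's reported estimates against three weighed meals.
ground_truth_kcal = [105.0, 165.0, 620.0]
app_reported_kcal = [110.0, 150.0, 540.0]

overall = mape(ground_truth_kcal, app_reported_kcal)  # about 8.9% in this toy example
print(round(accuracy_score(overall)))                 # 64, rounded to a whole point
```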
Where independent published validation studies exist (notably the Consumer Reports 2017 audit, the 2023 Diabetes & Aging Initiative concordance work, and the JAMA Network Open 2024 photo-tracker evaluation), we cross-reference our results against theirs. When our findings diverge from published literature, we say so explicitly in the review.
How we measure database quality
Database quality captures four sub-dimensions, each scored 0–25 and then summed to a 0–100 database score (a worked sketch follows the list):
- Coverage: We submit a 50-item search panel covering supermarket SKUs (Trader Joe's, Whole Foods 365), restaurant chain items (Chipotle, Sweetgreen, Cava), regional dishes (jollof rice, dal makhani, pho), and specialty items (Greek yogurts by brand, brand-specific protein bars). Hits with verified entries score full points; hits with user-submitted-only entries score partial credit.
- Verification: We sample 20 entries per app and check whether the displayed values match the manufacturer label or published USDA value. Apps that allow user submissions but do not flag verification status are penalized.
- Freshness: Restaurant menus rotate. We sample 10 chain-restaurant items and check whether the database reflects current (within six months) menu values, against the chain's published nutrition page.
- Noise resilience: We search for three intentionally ambiguous queries ("pizza", "salad", "smoothie") and score how the app surfaces canonical or branded entries vs. dumping a thousand low-quality user submissions on the first screen.
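The sketch below shows one way the coverage panel's partial credit and the four sub-scores could be tallied; the half-credit weighting for user-submitted-only hits is an illustrative assumption rather than a disclosed constant.

```python
def coverage_subscore(verified_hits: int, user_only_hits: int, panel_size: int = 50) -> float:
    """Coverage out of 25; user-submitted-only hits earn half credit (assumed, for illustration)."""
    credit = verified_hits + 0.5 * user_only_hits
    return 25 * credit / panel_size

def database_quality(coverage: float, verification: float, freshness: float, noise: float) -> float:
    """Each sub-dimension is scored 0-25; the database score is their sum."""
    return coverage + verification + freshness + noise

cov = coverage_subscore(verified_hits=40, user_only_hits=4)             # 21.0
print(database_quality(cov, verification=18, freshness=15, noise=20))   # 74.0
```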
How we measure AI photo recognition
For apps that offer AI photo logging, we score photo recognition on a 100-point sub-scale comprising top-1 dish identification (40 points), top-3 dish identification (20 points), portion-size MAPE (30 points), and graceful failure behavior (10 points).
Our photo battery is 30 plates, captured in three lighting conditions (bright daylight, kitchen overhead, restaurant dim), at three angles (overhead, 45-degree, and side-on), and on three plate sizes. Each plate is logged in the app, and the app's top dish suggestion is compared against the laboratory ground truth. Top-1 match is exact identification of the principal dish; top-3 match means the principal dish appears anywhere in the suggested list. Portion error is the MAPE between the app-estimated portion (in grams or ounces) and the weighed portion.
Graceful failure means the app declines to estimate when confidence is low, or asks the user to confirm portion. Apps that confidently log a single chicken breast as "grilled tofu, 312 kcal" without flagging uncertainty are penalized for poor uncertainty calibration.
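The sketch below shows how the 100-point photo sub-scale could be composed from its four components; the linear mappings from hit rates, portion MAPE, and graceful-failure rate to points are illustrative assumptions, since only the per-component point caps are stated above.

```python
def photo_subscale(top1_rate: float, top3_rate: float,
                   portion_mape: float, graceful_rate: float) -> float:
    """Compose the 100-point photo sub-scale from its four components.

    The rates are fractions of the 30-plate battery; the linear mappings
    below are illustrative assumptions, not disclosed constants.
    """
    top1_points = 40 * top1_rate
    top3_points = 20 * top3_rate
    portion_points = max(0.0, 30 * (1 - portion_mape / 50))  # assumed: 0 points at 50% MAPE
    graceful_points = 10 * graceful_rate
    return round(top1_points + top3_points + portion_points + graceful_points, 1)

print(photo_subscale(top1_rate=0.70, top3_rate=0.90, portion_mape=22.0, graceful_rate=0.5))  # 67.8
```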
Apps without AI photo features are not penalized; the AI photo weight (20%) is redistributed proportionally across the remaining five criteria, and the redistribution is disclosed in the review header.
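The redistribution itself is a proportional rescale of the remaining weights. As a minimal sketch:

```python
WEIGHTS = {
    "accuracy": 0.25, "database_quality": 0.20, "ai_photo_recognition": 0.20,
    "macro_tracking": 0.15, "user_experience": 0.10, "price": 0.10,
}

def redistribute(weights: dict[str, float], dropped: str = "ai_photo_recognition") -> dict[str, float]:
    """Drop one criterion and rescale the rest proportionally so they sum to 1.0."""
    remaining = {k: w for k, w in weights.items() if k != dropped}
    total = sum(remaining.values())
    return {k: w / total for k, w in remaining.items()}

print({k: round(w, 4) for k, w in redistribute(WEIGHTS).items()})
# {'accuracy': 0.3125, 'database_quality': 0.25, 'macro_tracking': 0.1875,
#  'user_experience': 0.125, 'price': 0.125}
```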
How we measure macro tracking
Macro tracking is scored on five sub-dimensions: granularity (does it report carbs, fat, protein, fiber, saturated fat, sugar, sodium, and ideally individual amino acids), customizable target setting (can the user set protein in g/kg or per-pound), per-meal breakdown clarity, training-day vs. rest-day adjustment for athlete users, and ease of macro-target overrides for clinical contexts (e.g., low-FODMAP, GLP-1 protein floors, ketogenic diets).
Apps that lock macro targets behind premium tiers but advertise free macro tracking are explicitly flagged. Apps that hide protein per-meal breakdown (a known issue in tracking-app design that contributes to under-eating protein at breakfast) lose points.
How we measure UX
UX is scored on speed of the four most common workflows (log a single food, log a saved meal, scan a barcode, log a photo), friction-of-correction (how many taps to fix a mis-logged item), accessibility (VoiceOver/TalkBack support, font scaling, color contrast against WCAG 2.2 AA), and absence of dark patterns. Apps that interrupt the food-logging flow with upgrade prompts more than once per session lose points. Apps that hide the cancel button on subscription paywalls lose points. Apps that gamify weight loss with streaks and leaderboards in a way that mirrors known disordered-eating risk patterns are flagged for editorial-review escalation by Lauren Westbrook (see our ED resource page).
How we measure price
We compute the annual cost in U.S. dollars at the most-common upgrade tier (typically the "Premium" or "Plus" tier that unlocks AI photo logging) and divide by the count of materially useful features the app actually delivers (verified database, AI photo logging that meets our 60-point threshold, custom macros, etc.). The resulting "dollars per usable feature" is the basis for the price score.
We deliberately do not score "free" apps as 100 on price. A free app with an ad-loaded UX and a database too thin to log a real meal is not actually free; it is paid for in time and accuracy. The price score reflects value, not headline cost.
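The "dollars per usable feature" arithmetic is straightforward division; the mapping from that ratio to a 0–100 price score is an editorial judgment, so the scaling in the sketch below is an illustrative assumption rather than our published formula.

```python
def dollars_per_feature(annual_cost_usd: float, usable_features: int) -> float:
    """Annual cost at the most-common upgrade tier divided by materially useful features."""
    return annual_cost_usd / max(usable_features, 1)

# Illustrative mapping from the ratio to a 0-100 score for paid tiers; the $40-per-feature
# zero point is an assumption, and free apps are scored on value rather than by this formula.
def price_score(annual_cost_usd: float, usable_features: int, zero_point: float = 40.0) -> float:
    ratio = dollars_per_feature(annual_cost_usd, usable_features)
    return max(0.0, min(100.0, 100 * (1 - ratio / zero_point)))

print(round(price_score(annual_cost_usd=79.99, usable_features=5), 1))  # 60.0
```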
Evidence grading scale
Beyond the per-app rubric, every claim of clinical fact on this site (e.g., "protein at 1.6 g/kg supports lean mass during a deficit") is graded on a four-tier evidence scale, in descending order of weight:
- RCT or meta-analysis of RCTs. Highest-grade evidence. Claims supported at this level are stated without hedging.
- Prospective cohort study. Strong evidence for association; we say "associated with" rather than "causes" at this level.
- Cross-sectional or case-control study. We say "may be associated with" and cite limitations.
- Expert opinion or position statement. Lowest-grade evidence we cite. We name the issuing body (e.g., "the ISSN position stand on protein") and disclose that the consensus is opinion-based.
Animal studies, in vitro work, and observational case reports are not cited as primary evidence for clinical claims. We may reference them when explicitly discussing mechanism, but never to support a clinical recommendation.
Re-test cadence
Apps move. Pricing changes; databases improve; AI models get retrained. Our re-test schedule reflects this:
- Top-5 apps in any active ranking: re-tested quarterly.
- All other apps in active rankings: re-tested semi-annually.
- Single-app reviews not currently in a ranking: re-tested every 12 months at minimum.
- Vendor-announced major releases (e.g., a new AI model rollout): triggers an out-of-cycle re-test within 30 days.
Every page on the site carries a "last reviewed" date in the byline. If you see a date older than the relevant cadence above, please contact us; we treat lapses as a quality issue.
Quality control
Every long-form piece on Clinical Nutrition Report is reviewed by a minimum of three credentialed contributors before publication: the named author, a senior reviewer with subject-matter authority, and the editor-in-chief (Margaret Halloran). Pieces that touch eating-disorder risk patterns receive an additional gating review by Lauren Westbrook, our iaedp-supervised behavioral reviewer; she has authority to block publication on language grounds even when the underlying clinical content is correct.
Citations are independently verified by Theodore Lindqvist (junior writer and fact-checker) before publication. We require every numerical claim to trace to a primary source, and we publish the source list with each piece. If a citation cannot be verified, the claim is removed.
For methodology updates, this page is the system of record. Material changes to the rubric (a weight shift, a new criterion, a redefined sub-dimension) are versioned with a dated note at the top of this page and reflected in our update log. Cosmetic edits are not versioned.
Conflicts of interest
None of the apps reviewed on this site pay for placement, and none of our reviewers hold equity in or receive honoraria from app makers in our ranking universe. Where a contributor has a disclosable relationship (Daniel Okafor's prior ISSN-affiliated honoraria, for example), they recuse from the affected category. See our affiliate disclosure for the current list of relationships.
Questions about this methodology
Questions, corrections, or proposed methodological refinements should go to editor@clinicalnutritionreport.com. We treat reasoned methodological criticism as a contribution to the rubric and credit external contributors in the page version notes when their suggestion is adopted.