Vision, Not Voice
Ambient returned clinicians to the conversation. AR can return them all the way.

Vision, not voice, is the modality I think will be most impactful in our AI era. Audio was the easy, quick win. It's passively available with minimal workflow changes, so it was all upside - absorb what's there and infuse it into the existing user interface. The ambient bull rush was predicated on this. It was impactful (lauded by many as the first real quality-of-life improvement for doctors via tech), but we are now in the phase of wringing out the last drips of gains there.
So now many are chasing that productivity dragon again via two related tangents:
Active voice: Audio in a directional, controlling sense is the logical counterpart to passive consumption. We see Oracle going hard in this direction with their new EHR. But I have some doubts, as voiced last summer, due to repetition, privacy, and latency considerations.
Passive vision: The idea of putting up cameras in exam rooms or inpatient wards and funneling that data into existing user interfaces in the same way that ambient uses audio. This has the barriers of higher capital costs via new devices (cameras) with more limited upside (that sort of video doesn't provide the same density of structured, actionable clinical information). Is the juice worth the squeeze? I do think there are some niches where this may make sense, like inpatient fall detection or surgical video capture.
No, the real unlock isn’t generated from a new input modality. New inputs are just sustaining innovation - pushing results back to the same monitor the clinician was already chained to. Even the most sophisticated ambient product is still just a faster way to populate the existing screens.