IRIS listens to the conversation you're in, transcribes it live, and whispers short, useful cues onto the lenses of your smart glasses — answers, gentle fact-checks, a question worth asking next — without you ever touching a phone. This is the full architecture and a start-to-finish guide to building your own.
Put the glasses on, double-tap the temple, and IRIS starts listening to whatever conversation you're in. As people talk:
A mode is just a swappable system prompt that changes how IRIS behaves — you ship whatever set fits your use case. The reference build has a general conversational mode, a listening-focused mode (leans toward warmth and advice over questions), and a lecture/study mode (a quiet aide that defines technical terms and identifies names/events without ever telling you to talk). Adding a mode is just writing another prompt. The whole thing is single-user, bring-your-own-API-keys, and stores everything locally on the phone plus an optional private cloud vault you control.
IRIS targets the Even Realities G2 — monochrome waveguide display glasses with a
microphone, a 6-axis IMU, and a capacitive temple touchpad. Even ships an Even Hub SDK
(@evenrealities/even_hub_sdk) and an app-store model: you build a web app, pack it into an
.ehpk bundle, and it runs inside the Even Hub host app on the phone.
The mental model that matters: you don't draw pixels on the glasses. You create text containers at fixed coordinates and push string content into them. The host renders the text on the waveguide. So your "UI" on the lenses is a couple of positioned text boxes you update as state changes.
| Call | What it does |
|---|---|
| waitForEvenAppBridge() | Resolves once the host injects the bridge. Everything hangs off the returned object. |
| createStartUpPageContainer(...) | Declares your text containers (id, name, x/y, width/height, whether they capture touch events). |
| textContainerUpgrade(...) | Replaces the content of a container; this is how you "render". |
| audioControl(on) | Starts/stops the mic stream from the glasses. |
| imuControl(on, pace) | Starts/stops the IMU stream at a given rate (we use 1 kHz). |
| onEvenHubEvent(cb) | The firehose: audio PCM chunks, IMU samples, and touchpad/system events all arrive here. |
| getLocalStorage / setLocalStorage | Persistent key/value on the host (survives WebView reloads better than browser localStorage). |
Touchpad events arrive through onEvenHubEvent with an eventType enum. Protobuf omits fields that equal their default (zero) value, and CLICK_EVENT happens to be 0, so a single tap arrives as an event object with no eventType field at all. If you see a known event shape but eventType === undefined, treat it as a click. This one cost real debugging time.
Three tiers:
Everything user-facing is local-first. The app boots from getLocalStorage, and the
cloud is optional — without a proxy URL and keys, IRIS simply won't generate cues, but it still runs.
The glasses mic delivers raw PCM16, mono, 16 kHz chunks through
onEvenHubEvent as event.audioEvent.audioPcm. Two things happen to every chunk
before it hits the transcriber.
Glasses mics are far from the speaker and quiet. A fixed gain either clips loud talkers or leaves quiet ones inaudible, so IRIS runs a one-pole AGC that chases a target RMS with a fast attack (clamp loud spikes in ~75 ms) and a slow release (ramp up quiet rooms over a few seconds), capped at 8×:
const AGC_TARGET_RMS = 6000 // ~18% of int16 max — leaves headroom
const AGC_ATTACK = 0.40 // fast: damp loud spikes
const AGC_RELEASE = 0.02 // slow: lift quiet voices
const AGC_MAX = 8, AGC_MIN = 1
// per chunk: measure rms, chase target, clamp, apply gain to every sample
const target = AGC_TARGET_RMS / rms
const alpha = target < gain ? AGC_ATTACK : AGC_RELEASE
gain += alpha * (target - gain)
gain = clamp(gain, AGC_MIN, AGC_MAX)
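A fuller version of that loop, as a minimal sketch rather than the reference build's code. The constants are the ones listed above; the helper names (rmsOf, agcStep), the choice to hold the gain on an all-zero chunk, and the hard clip to the int16 range are assumptions added for illustration.

```typescript
const AGC_TARGET_RMS = 6000;
const AGC_ATTACK = 0.4;
const AGC_RELEASE = 0.02;
const AGC_MAX = 8;
const AGC_MIN = 1;

const clamp = (v: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, v));

// Root-mean-square level of one PCM16 chunk.
function rmsOf(pcm: Int16Array): number {
  if (pcm.length === 0) return 0;
  let sumSq = 0;
  for (let i = 0; i < pcm.length; i++) sumSq += pcm[i] * pcm[i];
  return Math.sqrt(sumSq / pcm.length);
}

// Processes one chunk: updates the gain, then returns an amplified copy.
function agcStep(pcm: Int16Array, gain: number): { out: Int16Array; gain: number } {
  const rms = rmsOf(pcm);
  if (rms > 0) {
    // Assumption: on digital silence the gain is held instead of chasing infinity.
    const target = AGC_TARGET_RMS / rms;
    const alpha = target < gain ? AGC_ATTACK : AGC_RELEASE; // fast down, slow up
    gain = clamp(gain + alpha * (target - gain), AGC_MIN, AGC_MAX);
  }
  const out = new Int16Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    // Hard-clip to the int16 range so an overshoot cannot wrap around.
    out[i] = clamp(Math.round(pcm[i] * gain), -32768, 32767);
  }
  return { out, gain };
}
```

The caller keeps the returned gain and passes it back in with the next chunk, so the state is a single number per session.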
The same chunks are appended to a rolling 60 s ring buffer, each stored alongside its sample offset and the gain at capture time. When Deepgram returns word-level timestamps with speaker IDs, IRIS slices the exact samples for each speaker back out of this buffer to build voice profiles. (More in Section 8.)
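One way to sketch that buffer, assuming Deepgram's word timestamps are seconds measured from the first sample appended (so a reconnect would need a fresh buffer or an offset). The class name RollingAudio and its methods are hypothetical, and a chunk list stands in for a true circular array to keep the example short.

```typescript
const SAMPLE_RATE = 16000;
const WINDOW_SAMPLES = 60 * SAMPLE_RATE; // rolling 60 s

interface TaggedChunk {
  offset: number; // absolute index of the chunk's first sample
  gain: number;   // AGC gain applied when the chunk was captured
  pcm: Int16Array;
}

class RollingAudio {
  private chunks: TaggedChunk[] = [];
  private total = 0; // samples appended so far

  append(pcm: Int16Array, gain: number): void {
    this.chunks.push({ offset: this.total, gain, pcm });
    this.total += pcm.length;
    // Evict chunks that lie entirely before the 60 s window.
    const cutoff = this.total - WINDOW_SAMPLES;
    while (
      this.chunks.length > 0 &&
      this.chunks[0].offset + this.chunks[0].pcm.length <= cutoff
    ) {
      this.chunks.shift();
    }
  }

  // Returns the samples in [startSec, endSec) that are still buffered.
  slice(startSec: number, endSec: number): Int16Array {
    const from = Math.floor(startSec * SAMPLE_RATE);
    const to = Math.floor(endSec * SAMPLE_RATE);
    const pieces: Int16Array[] = [];
    let length = 0;
    for (const c of this.chunks) {
      const lo = Math.max(from, c.offset);
      const hi = Math.min(to, c.offset + c.pcm.length);
      if (hi > lo) {
        pieces.push(c.pcm.subarray(lo - c.offset, hi - c.offset));
        length += hi - lo;
      }
    }
    const out = new Int16Array(length);
    let pos = 0;
    for (const p of pieces) {
      out.set(p, pos);
      pos += p.length;
    }
    return out;
  }
}
```

Calling slice(word.start, word.end) for every word attributed to one speaker and concatenating the results yields that speaker's audio for profiling.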
Transcription is a direct browser WebSocket to Deepgram (nova-3,
language=multi), not via the proxy — latency matters and the socket carries the key in its
subprotocol. Key parameters:
wss://api.deepgram.com/v1/listen?
model=nova-3 & language=multi & encoding=linear16 & sample_rate=16000 &
punctuate=true & smart_format=true & interim_results=true &
utterance_end_ms=1000 & endpointing=200 & diarize=true
With diarize=true, each word carries a speaker index. IRIS groups
consecutive words by speaker into [S0]: ... [S1]: ... segments. Those raw labels flow
through the whole pipeline and get rewritten to real names ("[Sarah]", "[You]") once a speaker is
identified or the user confirms them.
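The grouping step could look like the sketch below. The word fields (word, punctuated_word, speaker) follow Deepgram's response format; the function name toSegments and the names map are hypothetical.

```typescript
interface DgWord {
  word: string;
  punctuated_word?: string;
  speaker?: number;
}

// Groups consecutive words by speaker into "[S0]: ..." lines.
// `names` maps a speaker index to a confirmed label such as "Sarah" or "You".
function toSegments(words: DgWord[], names: Record<number, string> = {}): string {
  const lines: string[] = [];
  let current: number | null = null;
  for (const w of words) {
    const speaker = w.speaker ?? 0;
    const text = w.punctuated_word ?? w.word;
    if (speaker !== current) {
      const label = names[speaker] ?? `S${speaker}`;
      lines.push(`[${label}]: ${text}`); // speaker changed: start a new segment
      current = speaker;
    } else {
      lines[lines.length - 1] += ` ${text}`; // same speaker: extend the segment
    }
  }
  return lines.join("\n");
}
```

Keeping the raw index as the key means a later rename only touches the names map, not the stored words.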
KeepAlive. Send {"type":"KeepAlive"} periodically or the stream dies mid-conversation.
UtteranceEnd. Over a PA system or in a quiet room, Deepgram sometimes never promotes an interim result to final. Listen for UtteranceEnd and flush the last interim so the transcript doesn't silently stall.
There's also a hallucination filter: on near-silence Deepgram loves to emit "music." or "subscribe", so a tiny regex list drops those.
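A minimal sketch of such a filter. Only the two phrases named above come from the text; the exact patterns, the function name, and the decision to drop empty lines as well are assumptions.

```typescript
// Whole-line matches only, so "I love music." in real speech survives.
const HALLUCINATION_PATTERNS: RegExp[] = [
  /^\W*music\W*$/i,
  /^\W*subscribe\W*$/i,
];

function isLikelyHallucination(line: string): boolean {
  const text = line.trim();
  if (text.length === 0) return true; // nothing worth keeping
  return HALLUCINATION_PATTERNS.some((re) => re.test(text));
}
```

Run it on each finalised line before the line reaches the transcript or the trigger logic.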
In lecture mode, endpointing=0: one continuous speaker, maximum patience, no chopping a lecturer mid-sentence.
This is the heart of the product: deciding when to speak and what to say, fast enough to be useful but rarely enough not to be noise.
Every finalised transcript line runs through a trigger ladder:
A separate silence timer (30 s) fires a gentler "here's a thread you could pull" prompt when the conversation lulls. A minimum of 80 new characters and an 8 s floor between debounced cues stop it from spamming.
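Those thresholds reduce to two small predicates. This is a sketch under assumptions: the GateState shape and function names are hypothetical, and the lull check is kept separate from the debounce gate because the text does not say how the two interact.

```typescript
const MIN_NEW_CHARS = 80;  // new transcript text required since the last cue
const MIN_GAP_MS = 8_000;  // floor between debounced cues
const SILENCE_MS = 30_000; // lull before the gentler prompt

interface GateState {
  lastCueAt: number;      // ms timestamp of the last cue shown
  charsAtLastCue: number; // transcript length when that cue fired
  lastSpeechAt: number;   // ms timestamp of the last finalised line
}

// True when enough new text AND enough time have accumulated.
function shouldFireDebounced(s: GateState, transcriptLength: number, now: number): boolean {
  const newChars = transcriptLength - s.charsAtLastCue;
  return newChars >= MIN_NEW_CHARS && now - s.lastCueAt >= MIN_GAP_MS;
}

// True when nobody has spoken for the silence window.
function isLull(s: GateState, now: number): boolean {
  return now - s.lastSpeechAt >= SILENCE_MS;
}
```

Passing `now` in rather than reading the clock inside keeps both checks easy to replay against recorded sessions while tuning.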
One Claude call per cue (claude-sonnet-4-6, max_tokens: 400). The system
prompt = the active mode's instructions + a compact memory context + any prep notes/documents. The
user message is deliberately structured:
INSIGHTS YOU ALREADY SENT THIS SESSION — do not repeat: ...
NOTE: you've already sent 1 follow-up in the last 4 cues — pick a different type.
LIVE WEB SEARCH RESULTS — use if relevant: ...
Earlier conversation (context only): ...(last ~700 chars)
Most recent (respond to this ONLY): ...(last ~300 chars)
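Assembling that message might look like the following sketch. The CueContext shape, the function name, and the exact split (the last 300 characters as "most recent", the 700 before them as "earlier") are assumptions; the section labels are lightly paraphrased from the layout above.

```typescript
interface CueContext {
  sentInsights: string[]; // cues already shown this session
  repeatNote?: string;    // e.g. a nudge to vary the cue type
  searchResults?: string; // formatted web search results, if any
  transcript: string;     // running transcript
}

const RECENT_CHARS = 300;
const EARLIER_CHARS = 700;

function buildUserMessage(ctx: CueContext): string {
  const recent = ctx.transcript.slice(-RECENT_CHARS);
  const earlier = ctx.transcript.slice(-(RECENT_CHARS + EARLIER_CHARS), -RECENT_CHARS);
  const parts: string[] = [];
  if (ctx.sentInsights.length > 0) {
    parts.push(
      "INSIGHTS YOU ALREADY SENT THIS SESSION (do not repeat):\n" +
        ctx.sentInsights.map((s) => `- ${s}`).join("\n"),
    );
  }
  if (ctx.repeatNote) parts.push(`NOTE: ${ctx.repeatNote}`);
  if (ctx.searchResults) {
    parts.push(`LIVE WEB SEARCH RESULTS (use if relevant):\n${ctx.searchResults}`);
  }
  if (earlier) parts.push(`Earlier conversation (context only): ${earlier}`);
  parts.push(`Most recent (respond to this ONLY): ${recent}`);
  return parts.join("\n\n"); // optional sections are simply omitted when empty
}
```

Putting the most recent text last keeps the part the model must respond to closest to the end of the prompt.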
Claude must return strict JSON: {"type": "...", "content": "..."} where type is one of
answer · followup · explanation · fact-check · thought · advice, or
{"type":"none"} to stay silent. The app strips any markdown fences, parses, drops the cue
if type is none/invalid or it's a near-duplicate of the last six, and only then shows it.
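A sketch of that parse-and-gate step. The cue types come from the list above; the function names and the near-duplicate test (exact match after lowercasing and stripping punctuation) are assumptions, since the text does not say how similarity is measured.

```typescript
const CUE_TYPES = ["answer", "followup", "explanation", "fact-check", "thought", "advice"] as const;
type CueType = (typeof CUE_TYPES)[number];

interface Cue {
  type: CueType;
  content: string;
}

// Built at runtime so this snippet carries no literal code fence of its own.
const FENCE = "`".repeat(3);
const OPEN_FENCE = new RegExp("^" + FENCE + "(?:json)?\\s*", "i");
const CLOSE_FENCE = new RegExp("\\s*" + FENCE + "$");

function stripFences(raw: string): string {
  return raw.trim().replace(OPEN_FENCE, "").replace(CLOSE_FENCE, "").trim();
}

// Crude normalisation used for the duplicate check (an assumption).
function normalise(s: string): string {
  return s.toLowerCase().replace(/[^a-z0-9 ]/g, "").replace(/\s+/g, " ").trim();
}

// Returns a cue to display, or null when the model stayed silent,
// returned something invalid, or repeated one of the last six cues.
function parseCue(raw: string, recent: string[]): Cue | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(stripFences(raw));
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const { type, content } = parsed as { type?: unknown; content?: unknown };
  if (typeof type !== "string" || !(CUE_TYPES as readonly string[]).includes(type)) return null;
  if (typeof content !== "string" || content.trim() === "") return null;
  const key = normalise(content);
  if (recent.slice(-6).some((r) => normalise(r) === key)) return null;
  return { type: type as CueType, content: content.trim() };
}
```

Because {"type":"none"} is not in the allowed list, the silent case falls out of the same validation path as malformed output.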
The system block is sent with prompt caching (cache_control: ephemeral). Every cue after the first is much cheaper and faster.
Most of the cleverness is in not firing: dedupe against the session's own history, cap consecutive follow-ups, suppress everything if the user typed "don't give me answers" into the live context box, and never address a question to the wearer. Cues are always things to say to the other person.
Two text containers are declared at startup: a big main box (576×252) and a thin foot strip. A single effect computes the main string from the current view state and pushes it only when it changed (diffing avoids flicker):
A recording view: a header like ● REC GENERAL with a mic pulse that animates only while audio is actually arriving, plus trailing dots while transcription is live.
A cue popup: a short type tag ([ANS], [ASK?], [!]...) and the content.
A history view: a position counter and tag (3/8 [INFO]...).
All on-glasses control is the temple touchpad, decoded in onEvenHubEvent:
| Gesture | Action |
|---|---|
| Double-tap (idle) | Open mode-select → tap to confirm → starts a session |
| Scroll up/down (idle) | Cycle modes |
| Scroll down (recording) | Open cue history |
| Single tap (popup) | Dismiss |
| Long-press (native) | OS-level exit → app ends the session cleanly |
Single taps are debounced ~350 ms so a double-tap can cancel the pending single — classic click/double-click disambiguation. Watchdogs re-arm the audio stream if it goes quiet for 10 s and when the phone unlocks or the app returns to foreground.
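The tap disambiguation can be sketched as a small decoder. It assumes the host reports single taps and double-taps as distinct events, with the single arriving first; makeTapDecoder and its injectable timer functions are hypothetical names, added so the logic can be tested without real timers.

```typescript
const SINGLE_TAP_DELAY_MS = 350;

type Schedule = (fn: () => void, ms: number) => unknown;
type Cancel = (handle: unknown) => void;

function makeTapDecoder(
  onSingle: () => void,
  onDouble: () => void,
  schedule: Schedule = (fn, ms) => setTimeout(fn, ms),
  cancel: Cancel = (h) => clearTimeout(h as ReturnType<typeof setTimeout>),
) {
  let pending: unknown = null;
  return {
    // Single tap reported: hold it until the double-tap window has closed.
    // A repeated tap inside the window restarts the wait.
    tap(): void {
      if (pending !== null) cancel(pending);
      pending = schedule(() => {
        pending = null;
        onSingle();
      }, SINGLE_TAP_DELAY_MS);
    },
    // Double-tap reported: discard the pending single and act at once.
    doubleTap(): void {
      if (pending !== null) {
        cancel(pending);
        pending = null;
      }
      onDouble();
    },
  };
}
```

The cost is that every single tap feels about 350 ms late, which is the usual trade for never misfiring a dismiss on the way to a double-tap.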
Diarization tells you there are speakers S0/S1/S2; it doesn't tell you which one is the person wearing the glasses. Getting this right matters a lot — the summariser must not attribute the other person's beliefs to you, and cues should respond to them, not you. IRIS fuses three independent signals into one wearer-confidence score.
The three terms are combined with weight redistribution — any missing signal (no enrolled voice, no IMU on this device) drops out and its weight is shared:
wearerConfidence(voice, energy, imu): // each 0..1, or <0 = unavailable
voice → weight 0.45
energy → weight 0.25
imu → weight 0.40
return Σ(value·weight) / Σ(weight present)
A confident score (≥ 0.7, backed by an enrolled voice or a corroborating IMU signal so the loudest stranger is never crowned) auto-labels that speaker as "You" and self-learns — it blends the fresh sample back into the stored profile so detection sharpens over time. Anything left ambiguous at session end surfaces a one-tap "That's me" review modal, pre-selecting the highest-confidence speaker; confirming both relabels the transcript (and regenerates the summary with correct identities) and folds the sample into your profile.
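The fusion and the auto-label gate, as a runnable sketch. The weights and the 0.7 threshold are the ones quoted above; shouldAutoLabel, the IMU_CORROBORATION_MIN value, and the reading of "backed by" as "voice signal available or IMU score above that minimum" are assumptions.

```typescript
const WEIGHTS = { voice: 0.45, energy: 0.25, imu: 0.4 } as const;
const AUTO_LABEL_THRESHOLD = 0.7;
const IMU_CORROBORATION_MIN = 0.5; // hypothetical floor, not from the text

// Each input is 0..1, or negative when the signal is unavailable.
function wearerConfidence(voice: number, energy: number, imu: number): number {
  const terms: Array<[number, number]> = [
    [voice, WEIGHTS.voice],
    [energy, WEIGHTS.energy],
    [imu, WEIGHTS.imu],
  ];
  let weighted = 0;
  let present = 0;
  for (const [value, weight] of terms) {
    if (value < 0) continue; // missing signal: its weight is redistributed
    weighted += value * weight;
    present += weight;
  }
  return present > 0 ? weighted / present : 0;
}

// Energy alone can never crown a speaker, however loud they are.
function shouldAutoLabel(voice: number, energy: number, imu: number): boolean {
  const backed = voice >= 0 || imu >= IMU_CORROBORATION_MIN;
  return backed && wearerConfidence(voice, energy, imu) >= AUTO_LABEL_THRESHOLD;
}
```

Dividing by the sum of the weights actually present is what lets the three weights total more than 1 without skewing the score.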
When a session ends, one Claude call (max_tokens: 2000) extracts a structured object:
title, 2–3 sentence summary, key points (theme → sub-points), action items (todo/calendar/reminder),
and candidate memory items split into people / tasks / events / places / ideas. The system
prompt is strict about speaker attribution ("attributing a statement to the wrong
person is the worst possible error") and about only saving personally relevant facts — never
public figures or general concepts Claude already knows.
Everything persists through the host bridge's key/value store, mirrored to browser
localStorage as a fallback, and re-hydrated on boot.
One small Worker (free tier) sits between the app and the paid APIs. It is intentionally thin. Routes:
| Route | Purpose |
|---|---|
| POST /anthropic | Pass-through to the Claude Messages API (adds version header, CORS, forwards your key). Keeps keys out of cross-origin headaches and lets you swap models server-side. |
| POST /transcribe | Batch STT fallback: wraps raw PCM into a WAV container and calls Deepgram (or Groq Whisper, with an optional translate path for non-English speech) when the live socket can't be used. |
| GET /search | Proxies a Tavily web search for the real-time and named-person triggers. |
| /knowledge-base, /sessions-kb | Read/write the personal KB blobs in R2. |
| /webdav/* | A minimal WebDAV server backed by an R2 bucket, so session notes sync straight into an Obsidian vault via the "Remotely Save" plugin. |
| scheduled() cron | Nightly: reads the last ~30 session notes from R2, strips transcripts, and rebuilds a compact "session digest" the app pulls in as context. |
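The PCM-to-WAV wrapping that the /transcribe route performs is a fixed 44-byte header in front of the samples. A minimal sketch, assuming 16-bit mono at 16 kHz as delivered by the glasses; the function name pcm16ToWav is hypothetical.

```typescript
// Wraps raw little-endian PCM16 samples in a canonical 44-byte WAV header.
function pcm16ToWav(pcm: Int16Array, sampleRate = 16000, channels = 1): Uint8Array {
  const dataBytes = pcm.length * 2;
  const buffer = new ArrayBuffer(44 + dataBytes);
  const view = new DataView(buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true);             // file size minus 8
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                        // fmt chunk size
  view.setUint16(20, 1, true);                         // format 1 = PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * channels * 2, true); // byte rate
  view.setUint16(32, channels * 2, true);              // block align
  view.setUint16(34, 16, true);                        // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataBytes, true);

  for (let i = 0; i < pcm.length; i++) view.setInt16(44 + i * 2, pcm[i], true);
  return new Uint8Array(buffer);
}
```

The resulting bytes can be posted to a batch transcription endpoint as audio/wav.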
A conversation copilot is useless if it dies the moment the phone screen locks. Mobile WebViews get aggressively throttled/suspended in the background, so IRIS layers several keep-alive tricks during a session:
A held navigator.locks request signals "work in progress".
The location permission gives the app a reason to keep running when backgrounded (this is why IRIS requests location).
On visibilitychange back to visible, re-enable the audio + IMU streams and kick a fresh insight in case anything was throttled.
npm create vite@latest iris -- --template react-ts
cd iris
npm i @evenrealities/even_hub_sdk
npm i -D @evenrealities/evenhub-simulator
Add an app.json manifest declaring the package id, entrypoint (index.html),
and the permissions you'll request — g2-microphone, network, and
location (for background survival):
{
"package_id": "com.you.iris",
"name": "IRIS",
"version": "0.1.0",
"entrypoint": "index.html",
"permissions": [
{"name": "g2-microphone", "desc": "Live transcription"},
{"name": "network", "desc": "AI + STT APIs"},
{"name": "location", "desc": "Background operation when locked"}
],
"supported_languages": ["en"]
}
This is the smallest meaningful milestone — prove you can talk to the glasses:
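// Assumes these are named exports of the SDK package.
import {
  waitForEvenAppBridge,
  CreateStartUpPageContainer,
  TextContainerProperty,
} from '@evenrealities/even_hub_sdk'

const bridge = await waitForEvenAppBridge()
await bridge.createStartUpPageContainer(new CreateStartUpPageContainer({
  containerTotalNum: 1,
  textObject: [new TextContainerProperty({
    containerID: 1, containerName: 'main',
    xPosition: 0, yPosition: 0, width: 576, height: 252,
    content: 'Hello from IRIS', isEventCapture: 1, // this container receives touch events
  })],
}))
bridge.onEvenHubEvent(evt => { /* log audio / imu / touch here */ })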
Then wire audioControl(true) and confirm event.audioEvent.audioPcm chunks
arrive when you speak. Render a tap counter to confirm touch events (remember the click=0 protobuf
quirk). Until this works, nothing else matters.
Create a Cloudflare Worker and paste in a proxy with, at minimum, an /anthropic pass-through
(add /search and R2 later). Deploy and copy the URL.
# wrangler.toml
name = "iris-proxy"
main = "worker.js"
compatibility_date = "2024-01-01"
npx wrangler deploy
The worker's /anthropic handler just forwards the body to
https://api.anthropic.com/v1/messages, copying the x-api-key the app sends and
adding CORS. That's it.
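A minimal sketch of such a handler, written in TypeScript (strip the type annotations for a plain worker.js). The wide-open CORS policy and the 404 fallback are assumptions; the upstream URL, the x-api-key pass-through, and the added version header follow the description above.

```typescript
const CORS_HEADERS: Record<string, string> = {
  "Access-Control-Allow-Origin": "*", // assumption: tighten to your app's origin
  "Access-Control-Allow-Methods": "POST, OPTIONS",
  "Access-Control-Allow-Headers": "content-type, x-api-key",
};

const worker = {
  async fetch(request: Request): Promise<Response> {
    const { pathname } = new URL(request.url);

    // Answer CORS preflight without touching the upstream API.
    if (request.method === "OPTIONS") {
      return new Response(null, { status: 204, headers: CORS_HEADERS });
    }

    if (pathname === "/anthropic" && request.method === "POST") {
      const upstream = await fetch("https://api.anthropic.com/v1/messages", {
        method: "POST",
        headers: {
          "content-type": "application/json",
          "anthropic-version": "2023-06-01",
          "x-api-key": request.headers.get("x-api-key") ?? "",
        },
        body: await request.text(), // forward the app's body untouched
      });
      return new Response(upstream.body, {
        status: upstream.status,
        headers: { ...CORS_HEADERS, "content-type": "application/json" },
      });
    }

    return new Response("Not found", { status: 404, headers: CORS_HEADERS });
  },
};

// In the deployed module, finish with: export default worker
```

Because the key travels in the request rather than living in the Worker, this version needs no secrets configured to get started.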
Open a Deepgram WebSocket with the params from Section 5. Feed it AGC-amplified PCM from each audio
event. Render the running transcript in the phone UI first; get diarized [S0]/[S1] labels
showing before you touch the glasses display. Add the KeepAlive ping and reconnect logic early — you'll
need them within the first long test.
Buffer recent transcript, run the trigger ladder (Section 6), and on fire, call your proxy's
/anthropic with a mode system prompt that demands strict {"type","content"}
JSON. Parse, dedupe, and push the content into the lens container as a popup. Tune the debounce until it
feels helpful, not chatty. This is where you'll spend most of your iteration time.
On stop, send the full transcript to a one-shot extraction prompt (title/summary/key points/action
items/memory items). Persist sessions and memory through setLocalStorage. Read a compact
memory slice back into the cue system prompt so it personalises.
Add the proximity-energy term first (free, just divide AGC gain out). Then a voice-enrolment screen and embedding match. Then, if your hardware has an IMU, the vibration correlation — but ship the calibration screen alongside it so you can prove it works. Fall back gracefully when any signal is absent.
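For the proximity-energy term, "dividing the AGC gain out" means recovering each speaker's level as the mic actually heard it. A sketch under assumptions: the SpeakerAudio shape and function name are hypothetical, and scoring each speaker relative to the loudest one is one possible normalisation, not necessarily the reference build's.

```typescript
interface SpeakerAudio {
  speaker: number;  // diarization index
  pcm: Int16Array;  // post-AGC samples attributed to this speaker
  gain: number;     // AGC gain stored with the chunk at capture time
}

// Returns a 0..1 score per speaker: raw (pre-AGC) RMS relative to the loudest.
function energyScores(segments: SpeakerAudio[]): Map<number, number> {
  const acc = new Map<number, { sumSq: number; n: number }>();
  for (const { speaker, pcm, gain } of segments) {
    const a = acc.get(speaker) ?? { sumSq: 0, n: 0 };
    for (let i = 0; i < pcm.length; i++) {
      const raw = pcm[i] / gain; // undo the AGC amplification
      a.sumSq += raw * raw;
    }
    a.n += pcm.length;
    acc.set(speaker, a);
  }

  const rms = new Map<number, number>();
  let loudest = 0;
  acc.forEach(({ sumSq, n }, speaker) => {
    const value = n > 0 ? Math.sqrt(sumSq / n) : 0;
    rms.set(speaker, value);
    loudest = Math.max(loudest, value);
  });

  const scores = new Map<number, number>();
  rms.forEach((value, speaker) => {
    scores.set(speaker, loudest > 0 ? value / loudest : 0);
  });
  return scores;
}
```

The wearer's mouth is closest to the mic, so their raw level should usually dominate; without undoing the gain, the AGC would flatten exactly the difference this signal relies on.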
npm run build # tsc + vite build → dist/
evenhub pack app.json dist --output iris.ehpk
# install the .ehpk via the Even Hub developer flow / simulator
Iterate against the evenhub-simulator for UI/logic, then test the audio/IMU/keep-alive
behaviour on real glasses + real phone, screen-locked, in a real conversation. The gap between simulator
and hardware is exactly the interesting part (mic distance, IMU sensitivity, background throttling).
| Service | Role | Notes |
|---|---|---|
| Anthropic / Claude | All reasoning: cues, extraction, chat | claude-sonnet-4-6. Use prompt caching for the system block. |
| Deepgram | Live streaming STT + diarization | nova-3, language=multi. The latency-critical path. |
| Tavily | Web search for real-time / named-person cues | Optional. Without it, those triggers just answer from Claude's knowledge. |
| Groq | Cheap Whisper STT fallback + optional translate path | Optional. Kicks in when Deepgram is rate-limited. |
| Cloudflare | Proxy + R2 vault + cron | Free tier is plenty for one user. |
That's the whole system. The hard parts aren't any single API call — they're the orchestration: knowing when to stay silent, keeping a flaky background WebView alive, and figuring out which of the voices in the room belongs to the person wearing the glasses. Build those three well and you have IRIS.