Build Guide · v1

IRIS: a real-time conversation copilot for AR glasses

IRIS listens to the conversation you're in, transcribes it live, and whispers short, useful cues onto the lenses of your smart glasses — answers, gentle fact-checks, a question worth asking next — without you ever touching a phone. This is the full architecture and a start-to-finish guide to building your own.

Even Realities G2 · React + Vite · Deepgram · Claude · Cloudflare Workers
1 · What IRIS actually does
2 · The hardware & SDK
3 · System architecture
4 · Audio capture & AGC
5 · Streaming transcription
6 · Generating the cues
7 · Rendering on the lenses
8 · "Which voice is mine?"
9 · Memory, sessions & KB
10 · The Cloudflare proxy
11 · Surviving the background
12 · Build your own (step by step)
13 · Keys, cost & gotchas

1 · What IRIS actually does

Put the glasses on, double-tap the temple, and IRIS starts listening to whatever conversation you're in. As people talk:

A mode is just a swappable system prompt that changes how IRIS behaves — you ship whatever set fits your use case. The reference build has a general conversational mode, a listening-focused mode (leans toward warmth and advice over questions), and a lecture/study mode (a quiet aide that defines technical terms and identifies names/events without ever telling you to talk). Adding a mode is just writing another prompt. The whole thing is single-user, bring-your-own-API-keys, and stores everything locally on the phone plus an optional private cloud vault you control.

IRIS is a companion-app pattern: a normal web app (React) running in a WebView on your phone, talking to the glasses through a bridge SDK. The glasses are a thin display + microphone + IMU + touchpad. All the intelligence runs on the phone and in the cloud.

2 · The hardware & SDK

IRIS targets the Even Realities G2 — monochrome waveguide display glasses with a microphone, a 6-axis IMU, and a capacitive temple touchpad. Even ships an Even Hub SDK (@evenrealities/even_hub_sdk) and an app-store model: you build a web app, pack it into an .ehpk bundle, and it runs inside the Even Hub host app on the phone.

The mental model that matters: you don't draw pixels on the glasses. You create text containers at fixed coordinates and push string content into them. The host renders the text on the waveguide. So your "UI" on the lenses is a couple of positioned text boxes you update as state changes.

The bridge surface you'll use

Call | What it does
waitForEvenAppBridge() | Resolves once the host injects the bridge. Everything hangs off the returned object.
createStartUpPageContainer(...) | Declares your text containers (id, name, x/y, width/height, whether they capture touch events).
textContainerUpgrade(...) | Replaces the content of a container; this is how you "render".
audioControl(on) | Starts/stops the mic stream from the glasses.
imuControl(on, pace) | Starts/stops the IMU stream at a given rate (we use 1 kHz).
onEvenHubEvent(cb) | The firehose: audio PCM chunks, IMU samples, and touchpad/system events all arrive here.
getLocalStorage / setLocalStorage | Persistent key/value on the host (survives WebView reloads better than browser localStorage).

Protobuf gotcha worth knowing up front. Touchpad events arrive in onEvenHubEvent with an eventType enum. Protobuf omits fields that equal their default (zero) value, and CLICK_EVENT happens to be 0, so a single tap arrives as an event object with no eventType field at all. If you see a known event shape but eventType === undefined, treat it as a click. This one cost real debugging time.
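
A minimal sketch of that defensive decode. The `TouchLikeEvent` shape and the helper names are hypothetical (the real SDK payload may nest the field differently); only the fact that the click value is 0 comes from the text above.

```typescript
// Hypothetical shape: the real SDK's field names and nesting may differ.
interface TouchLikeEvent {
  eventType?: number; // omitted on the wire when it equals the protobuf default (0)
}

const CLICK_EVENT = 0; // from the text: the click enum value is 0

// Returns the effective event type, treating a missing field as the zero default.
function resolveEventType(evt: TouchLikeEvent): number {
  return evt.eventType ?? CLICK_EVENT;
}

// Call this only after you've established the payload is a touch event at all.
function isClick(evt: TouchLikeEvent | undefined): boolean {
  return evt !== undefined && resolveEventType(evt) === CLICK_EVENT;
}
```

Using `??` rather than `||` keeps the intent explicit: only a missing field falls back to the default, which is exactly the protobuf rule.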

3 · System architecture

GLASSES (G2)                PHONE (React app in WebView)                   CLOUD
┌───────────┐  audio PCM    ┌───────────────────────────┐  WebSocket   ┌─────────────┐
│  mic ─────┼──────────────▶│ AGC amplify → STT stream  ├─────────────▶│  Deepgram   │
│  IMU ─────┼──────────────▶│ wearer-voice tracker      │◀─ transcript │  (nova-3)   │
│  touchpad─┼──────events──▶│                           │              └─────────────┘
│           │               │ trigger logic ─┐          │              ┌─────────────┐
│  lenses◀──┼──text push────┤ insight engine ┼──────────┼──HTTPS──────▶│ Cloudflare  │
└───────────┘               │ memory + KB    │          │   proxy      │ Worker      │
                            │ session store  ┘          │              │ /anthropic  │
                            └───────────────────────────┘              │ /search     │
                                                                       │ /webdav R2  │
                                                        proxies to ───▶│ Claude      │
                                                                       │ Tavily      │
                                                                       └─────────────┘

Three tiers:

  1. Glasses — dumb I/O: stream mic + IMU up, render text down, emit touch events.
  2. Phone app — the brain. Holds all state, runs the trigger logic, the wearer-voice scorer, the memory store, and orchestrates every API call. Pure React; no backend of its own.
  3. Cloudflare Worker — a thin proxy (not a server with business logic). It exists for three reasons: (a) attach API keys / dodge CORS, (b) do the bits a browser can't do well (wrap PCM into WAV for fallback STT, talk WebDAV), and (c) host an optional private R2 vault + a nightly cron that digests your sessions into a knowledge base.

Everything user-facing is local-first. The app boots from getLocalStorage, and the cloud is optional — without a proxy URL and keys, IRIS simply won't generate cues, but it still runs.

4 · Audio capture & AGC

The glasses mic delivers raw PCM16, mono, 16 kHz chunks through onEvenHubEvent as event.audioEvent.audioPcm. Two things happen to every chunk before it hits the transcriber.

1. Automatic gain control (AGC)

Glasses mics are far from the speaker and quiet. A fixed gain either clips loud talkers or leaves quiet ones inaudible, so IRIS runs a one-pole AGC that chases a target RMS with a fast attack (clamp loud spikes in ~75 ms) and a slow release (ramp up quiet rooms over a few seconds), capped at 8×:

const AGC_TARGET_RMS = 6000   // ~18% of int16 max — leaves headroom
const AGC_ATTACK = 0.40       // fast: damp loud spikes
const AGC_RELEASE = 0.02      // slow: lift quiet voices
const AGC_MAX = 8, AGC_MIN = 1

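// per chunk: measure rms, chase target, clamp, apply gain to every sample
// (the target is clamped too, so a near-silent chunk can't peg the gain in one step)
const target = clamp(AGC_TARGET_RMS / Math.max(rms, 1), AGC_MIN, AGC_MAX)
const alpha  = target < gain ? AGC_ATTACK : AGC_RELEASE
gain += alpha * (target - gain)
gain  = clamp(gain, AGC_MIN, AGC_MAX)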
Keep the gain value around. The current AGC gain is the key to free speaker-proximity detection later (Section 8). Since the wearer sits closest to the mic, their pre-AGC energy is reliably the loudest in the room — but only if you divide the AGC gain back out. Store gain alongside each buffered chunk.
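
As a sketch of how the whole per-chunk step might look end to end: the constants are the ones listed above, while the `applyAgc` and `rmsOf` names, the floor on `rms`, the clamp on the target, and the int16 saturation are assumptions added for illustration, not the reference build's code.

```typescript
const AGC_TARGET_RMS = 6000;
const AGC_ATTACK = 0.4;
const AGC_RELEASE = 0.02;
const AGC_MAX = 8;
const AGC_MIN = 1;

function clamp(x: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, x));
}

// Root-mean-square level of one PCM16 chunk.
function rmsOf(pcm: Int16Array): number {
  if (pcm.length === 0) return 0;
  let sumSq = 0;
  for (let i = 0; i < pcm.length; i++) sumSq += pcm[i] * pcm[i];
  return Math.sqrt(sumSq / pcm.length);
}

// Returns the amplified chunk plus the gain that was applied,
// so the caller can store the gain next to the buffered samples.
function applyAgc(pcm: Int16Array, prevGain: number): { out: Int16Array; gain: number } {
  // Assumption: floor rms at 1 and clamp the target so silence can't produce Infinity.
  const target = clamp(AGC_TARGET_RMS / Math.max(rmsOf(pcm), 1), AGC_MIN, AGC_MAX);
  const alpha = target < prevGain ? AGC_ATTACK : AGC_RELEASE;
  const gain = clamp(prevGain + alpha * (target - prevGain), AGC_MIN, AGC_MAX);

  const out = new Int16Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    // Saturate instead of letting the int16 wrap around on loud samples.
    out[i] = clamp(Math.round(pcm[i] * gain), -32768, 32767);
  }
  return { out, gain };
}
```

Thread the returned `gain` into the next call (`gain = applyAgc(chunk, gain).gain`) and keep a copy with each buffered chunk.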

2. A rolling 60-second PCM buffer

The same chunks are appended to a rolling 60 s ring buffer tagged with sample offsets and the gain at capture time. When Deepgram returns word-level timestamps with speaker IDs, IRIS slices the exact samples for each speaker back out of this buffer to build voice profiles. (More in Section 8.)
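
One way such a buffer could be structured, as a sketch. The `RollingPcmBuffer` class and its method names are hypothetical, and it assumes transcript timestamps are seconds measured from the first sample pushed (a reconnect would need an offset on top of this).

```typescript
interface BufferedChunk {
  startSample: number; // absolute sample offset since capture started
  samples: Int16Array;
  gain: number;        // AGC gain that was applied to this chunk
}

class RollingPcmBuffer {
  private chunks: BufferedChunk[] = [];
  private nextSample = 0;
  private readonly maxSamples: number;
  private readonly sampleRate: number;

  constructor(seconds = 60, sampleRate = 16000) {
    this.sampleRate = sampleRate;
    this.maxSamples = seconds * sampleRate;
  }

  push(samples: Int16Array, gain: number): void {
    this.chunks.push({ startSample: this.nextSample, samples, gain });
    this.nextSample += samples.length;
    // Drop whole chunks that have fallen entirely outside the window.
    const cutoff = this.nextSample - this.maxSamples;
    while (this.chunks.length > 0 &&
           this.chunks[0].startSample + this.chunks[0].samples.length <= cutoff) {
      this.chunks.shift();
    }
  }

  // Returns the buffered pieces overlapping [startSec, endSec), each with its capture gain.
  slice(startSec: number, endSec: number): BufferedChunk[] {
    const from = Math.floor(startSec * this.sampleRate);
    const to = Math.ceil(endSec * this.sampleRate);
    const out: BufferedChunk[] = [];
    for (const c of this.chunks) {
      const end = c.startSample + c.samples.length;
      if (end <= from || c.startSample >= to) continue;
      const lo = Math.max(from, c.startSample) - c.startSample;
      const hi = Math.min(to, end) - c.startSample;
      out.push({ startSample: c.startSample + lo, samples: c.samples.subarray(lo, hi), gain: c.gain });
    }
    return out;
  }
}
```

Returning pieces with their own gain, rather than one concatenated array, keeps the per-chunk gain available for the pre-AGC energy calculation.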

5 · Streaming transcription

Transcription is a direct browser WebSocket to Deepgram (nova-3, language=multi), not via the proxy — latency matters and the socket carries the key in its subprotocol. Key parameters:

wss://api.deepgram.com/v1/listen?
  model=nova-3 & language=multi & encoding=linear16 & sample_rate=16000 &
  punctuate=true & smart_format=true & interim_results=true &
  utterance_end_ms=1000 & endpointing=200 & diarize=true
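
A small sketch of assembling that URL in code rather than by string concatenation. `buildDeepgramUrl` is a hypothetical helper name; the parameter values are the ones listed above.

```typescript
// Builds the streaming URL with the parameters from this section.
function buildDeepgramUrl(): string {
  const params = new URLSearchParams({
    model: "nova-3",
    language: "multi",
    encoding: "linear16",
    sample_rate: "16000",
    punctuate: "true",
    smart_format: "true",
    interim_results: "true",
    utterance_end_ms: "1000",
    endpointing: "200",
    diarize: "true",
  });
  return `wss://api.deepgram.com/v1/listen?${params.toString()}`;
}

// In the browser, the key travels in the subprotocol list because
// WebSocket constructors cannot set request headers:
//
//   const ws = new WebSocket(buildDeepgramUrl(), ["token", apiKey]);
//   ws.binaryType = "arraybuffer";
//   ws.onopen = () => { /* start sending AGC-amplified PCM chunks */ };
```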

Diarization → speaker labels

With diarize=true, each word carries a speaker index. IRIS groups consecutive words by speaker into [S0]: ... [S1]: ... segments. Those raw labels flow through the whole pipeline and get rewritten to real names ("[Sarah]", "[You]") once a speaker is identified or the user confirms them.
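
The grouping and relabelling can be sketched as follows. `DiarizedWord` keeps only the two fields used here, and `groupBySpeaker` and `relabel` are hypothetical helper names.

```typescript
// Assumption: a trimmed-down view of a diarized word from the transcript response.
interface DiarizedWord {
  word: string;
  speaker: number;
}

// Groups consecutive words by speaker into "[S0]: ..." lines.
function groupBySpeaker(words: DiarizedWord[]): string {
  const segments: { speaker: number; text: string[] }[] = [];
  for (const w of words) {
    const last = segments[segments.length - 1];
    if (last && last.speaker === w.speaker) last.text.push(w.word);
    else segments.push({ speaker: w.speaker, text: [w.word] });
  }
  return segments.map(s => `[S${s.speaker}]: ${s.text.join(" ")}`).join("\n");
}

// Rewrites raw labels to names once a speaker is identified; unknown ids stay as-is.
function relabel(transcript: string, names: Record<number, string>): string {
  return transcript.replace(/\[S(\d+)\]/g, (match: string, id: string) => {
    const name = names[Number(id)];
    return name ? `[${name}]` : match;
  });
}
```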

Three reliability tricks

There's also a hallucination filter — Deepgram on near-silence loves to emit "music." or "subscribe", so a tiny regex list drops those.

Lecture mode turns diarization off and sets endpointing=0 — one continuous speaker, maximum patience, no chopping a lecturer mid-sentence.

6 · Generating the cues

This is the heart of the product: deciding when to speak and what to say, fast enough to be useful but rarely enough not to be noise.

When to fire

Every finalised transcript line runs through a trigger ladder:

  1. Author/expert named? (regex like "according to X", "Dr X", "X argues") → fire a web search for that person, then generate.
  2. Real-time question? ("weather", "latest", "price", "score"...) → web search the sentence, then generate.
  3. Info-seeking question? ("what is", "who was", "how do"...) → generate immediately, no search.
  4. None of the above → debounce: schedule a generation 8 s out, resetting the timer on each new line. So normal chatter only triggers a cue once it pauses.

A separate silence timer (30 s) fires a gentler "here's a thread you could pull" prompt when the conversation lulls. A minimum of 80 new characters and an 8 s floor between debounced cues keep it from spamming.
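
The ladder and the debounce gate could be sketched like this. The regexes are illustrative stand-ins built from the examples above, not the reference build's pattern lists, and `classifyLine` and `passesDebounceGate` are hypothetical names.

```typescript
type Trigger = "search-person" | "search-realtime" | "generate-now" | "debounce";

// Illustrative patterns only: the text gives examples, not the full lists.
const AUTHOR_RE = /\b(?:[Aa]ccording to|Dr\.?)\s+[A-Z][\w'-]+|\b[A-Z][\w'-]+ argues\b/;
const REALTIME_RE = /\b(?:weather|latest|price|score)\b/i;
const INFO_RE = /\b(?:what is|who was|how do)\b/i;

// Walks the ladder top to bottom; the first rung that matches wins.
function classifyLine(line: string): Trigger {
  if (AUTHOR_RE.test(line)) return "search-person";
  if (REALTIME_RE.test(line)) return "search-realtime";
  if (INFO_RE.test(line)) return "generate-now";
  return "debounce";
}

// Gate for debounced cues: enough new text AND enough time since the last cue.
function passesDebounceGate(newChars: number, msSinceLastCue: number): boolean {
  return newChars >= 80 && msSinceLastCue >= 8000;
}
```

Order matters: a line like "what is the price of gold" matches both the real-time and the info-seeking rung, and the ladder sends it to search first.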

What to say

One Claude call per cue (claude-sonnet-4-6, max_tokens: 400). The system prompt = the active mode's instructions + a compact memory context + any prep notes/documents. The user message is deliberately structured:

INSIGHTS YOU ALREADY SENT THIS SESSION — do not repeat: ...
NOTE: you've already sent 1 follow-up in the last 4 cues — pick a different type.
LIVE WEB SEARCH RESULTS — use if relevant: ...
Earlier conversation (context only): ...(last ~700 chars)
Most recent (respond to this ONLY): ...(last ~300 chars)

Claude must return strict JSON: {"type": "...", "content": "..."} where type is one of answer · followup · explanation · fact-check · thought · advice, or {"type":"none"} to stay silent. The app strips any markdown fences, parses, drops the cue if type is none/invalid or it's a near-duplicate of the last six, and only then shows it.
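
A sketch of that parse-and-filter step. `parseCue` is a hypothetical name, and the near-duplicate check is reduced here to a normalised exact match against the last six cues; the reference build's similarity test isn't specified, so treat that part as a placeholder.

```typescript
const CUE_TYPES = ["answer", "followup", "explanation", "fact-check", "thought", "advice"] as const;
type CueType = (typeof CUE_TYPES)[number];
interface Cue { type: CueType; content: string; }

// Lowercase and collapse punctuation so trivially different strings compare equal.
function normalise(s: string): string {
  return s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Returns a displayable cue, or null when the model chose silence or the output is unusable.
function parseCue(raw: string, recentCues: string[]): Cue | null {
  // Strip an optional leading/trailing markdown fence (three backticks).
  const stripped = raw.trim().replace(/^`{3}(?:json)?\s*/i, "").replace(/\s*`{3}$/, "");

  let parsed: unknown;
  try {
    parsed = JSON.parse(stripped);
  } catch {
    return null; // not JSON: drop it rather than show raw model text
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const { type, content } = parsed as { type?: unknown; content?: unknown };
  if (typeof type !== "string" || !(CUE_TYPES as readonly string[]).includes(type)) return null;
  if (typeof content !== "string" || content.trim() === "") return null;

  // Placeholder near-duplicate check against the last six cues.
  const key = normalise(content);
  if (recentCues.slice(-6).some(prev => normalise(prev) === key)) return null;

  return { type: type as CueType, content: content.trim() };
}
```

Because `{"type":"none"}` is not in the allowed list, the deliberate "stay silent" response falls out through the same path as an invalid type.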

Prompt caching pays off here. The mode instructions + memory block are large and identical across calls within a session, so they're sent as a cached system block (cache_control: { type: "ephemeral" }). Every cue after the first reads that prefix from the cache, which makes it much cheaper and faster as long as calls keep arriving before the cache entry expires.
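
A sketch of what the request body could look like with the cached block. `buildCueRequest` and its argument names are hypothetical; the model and `max_tokens` values are the ones quoted above.

```typescript
interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

// Builds a Messages API body whose large, stable system prompt is marked cacheable.
function buildCueRequest(modePrompt: string, memoryContext: string, userMessage: string) {
  const system: SystemBlock[] = [
    {
      type: "text",
      text: `${modePrompt}\n\n${memoryContext}`,
      cache_control: { type: "ephemeral" }, // everything up to here is the cached prefix
    },
  ];
  return {
    model: "claude-sonnet-4-6",
    max_tokens: 400,
    system,
    // The per-cue content changes every call, so it stays outside the cached block.
    messages: [{ role: "user" as const, content: userMessage }],
  };
}
```

Keep anything that changes per cue (recent transcript, search results) in the user message; a single changed character in the system block would invalidate the cached prefix.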

Anti-annoyance design

Most of the cleverness is in not firing: dedupe against the session's own history, cap consecutive follow-ups, suppress everything if the user typed "don't give me answers" into the live context box, and never address a question to the wearer — cues are always things to say to the other person.

7 · Rendering on the lenses

Two text containers are declared at startup: a big main box (576×252) and a thin foot strip. A single effect computes the main string from the current view state and pushes it only when it changed (diffing avoids flicker):
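
A minimal sketch of that diff-then-push pattern. `makeLensRenderer` is a hypothetical helper, and the `push` callback is left abstract: in the app it would wrap `textContainerUpgrade(...)`, whose argument shape isn't shown in this guide.

```typescript
type PushFn = (containerID: number, content: string) => void;

// Returns a render function that only calls `push` when a container's text changed.
function makeLensRenderer(push: PushFn) {
  const lastSent = new Map<number, string>();
  return (containerID: number, content: string): boolean => {
    if (lastSent.get(containerID) === content) return false; // unchanged: skip the bridge call
    lastSent.set(containerID, content);
    push(containerID, content);
    return true;
  };
}
```

In a React effect you would compute the string from view state and hand it to the returned function; identical re-renders then cost nothing on the bridge.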

Touchpad as the only input

All on-glasses control is the temple touchpad, decoded in onEvenHubEvent:

Gesture | Action
Double-tap (idle) | Open mode-select → tap to confirm → starts a session
Scroll up/down (idle) | Cycle modes
Scroll down (recording) | Open cue history
Single tap (popup) | Dismiss
Long-press (native) | OS-level exit → app ends the session cleanly

Single taps are debounced ~350 ms so a double-tap can cancel the pending single — classic click/double-click disambiguation. Watchdogs re-arm the audio stream if it goes quiet for 10 s and when the phone unlocks or the app returns to foreground.
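
The click/double-click disambiguation might look like the sketch below. It assumes the host delivers a click event followed by a separate double-click event for a double-tap; the `makeTapDisambiguator` name and the injectable `Scheduler` (there so the logic can be tested without real timers) are additions for illustration.

```typescript
interface Scheduler {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

// Default scheduler backed by real timers.
const realScheduler: Scheduler = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: handle => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

function makeTapDisambiguator(
  onSingle: () => void,
  onDouble: () => void,
  windowMs = 350,
  scheduler: Scheduler = realScheduler,
) {
  let pending: unknown = null;

  const cancelPending = () => {
    if (pending !== null) {
      scheduler.clear(pending);
      pending = null;
    }
  };

  return {
    // A click is held back for windowMs in case a double-click follows.
    click(): void {
      cancelPending();
      pending = scheduler.set(() => {
        pending = null;
        onSingle();
      }, windowMs);
    },
    // A double-click cancels the held single and fires immediately.
    doubleClick(): void {
      cancelPending();
      onDouble();
    },
  };
}
```

The cost of this pattern is that every single tap feels about 350 ms late; that is the price of never misreading a double-tap as a dismiss.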

8 · "Which voice is mine?" — wearer identification

Diarization tells you there are speakers S0/S1/S2; it doesn't tell you which one is the person wearing the glasses. Getting this right matters a lot — the summariser must not attribute the other person's beliefs to you, and cues should respond to them, not you. IRIS fuses three independent signals into one wearer-confidence score.

① Voice timbre

A richer-than-pitch embedding per speaker: median pitch + variance, spectral centroid, a 4-bin spectral envelope (rough formants), zero-crossing rate, and voiced fraction — computed with a hand-rolled FFT + autocorrelation pitch detector. Cosine-ish distance to an enrolled "you" profile gives a similarity in 0..1.

② Proximity energy

The wearer sits closest to the mic, so they're loudest — before AGC. Recover pre-AGC energy by dividing the stored gain out of each speaker's buffered samples. Loudest speaker ≈ wearer. Free, needs no enrolment.
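
A sketch of the proximity term, assuming each buffered piece has already been tagged with a diarized speaker index. `TaggedChunk`, `preAgcRmsBySpeaker`, and `loudestSpeaker` are hypothetical names.

```typescript
interface TaggedChunk {
  samples: Int16Array; // post-AGC samples as buffered
  gain: number;        // AGC gain applied at capture time
  speaker: number;     // diarized speaker index for this stretch of audio
}

// Per-speaker RMS with the AGC gain divided back out.
function preAgcRmsBySpeaker(chunks: TaggedChunk[]): Map<number, number> {
  const acc = new Map<number, { sumSq: number; n: number }>();
  for (const c of chunks) {
    const a = acc.get(c.speaker) ?? { sumSq: 0, n: 0 };
    for (let i = 0; i < c.samples.length; i++) {
      const original = c.samples[i] / c.gain; // undo the amplification
      a.sumSq += original * original;
    }
    a.n += c.samples.length;
    acc.set(c.speaker, a);
  }
  const rms = new Map<number, number>();
  acc.forEach((a, speaker) => rms.set(speaker, a.n > 0 ? Math.sqrt(a.sumSq / a.n) : 0));
  return rms;
}

// Speaker with the highest pre-AGC level, or null if there is no audio.
function loudestSpeaker(rms: Map<number, number>): number | null {
  let best: number | null = null;
  let bestRms = -1;
  rms.forEach((value, speaker) => {
    if (value > bestRms) { bestRms = value; best = speaker; }
  });
  return best;
}
```

Samples that clipped after amplification can't be recovered by the division, so heavily saturated chunks will under-report the original level.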

③ IMU bone conduction

When you speak, your skull/jaw vibrates the frame — the same trick AirPods use. Stream the IMU at 1 kHz, high-pass each axis with an EMA to kill gravity, and correlate the vibration envelope against the audio envelope over a 4 s window. High Pearson correlation = the current talker is you.
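
The signal chain above can be sketched with three small functions. The names, the EMA coefficient, and the frame-based envelope are assumptions for illustration; the 4 s window and sample rates come from the text. Both envelopes need a common frame rate, for example 20 ms frames (320 audio samples at 16 kHz, 20 IMU samples at 1 kHz).

```typescript
// One-pole EMA high-pass: subtract a slowly tracking baseline (gravity) from each sample.
function highPass(x: number[], alpha = 0.01): number[] {
  let baseline = x.length > 0 ? x[0] : 0;
  return x.map(v => {
    baseline += alpha * (v - baseline);
    return v - baseline;
  });
}

// Envelope as mean absolute value over fixed-size frames.
function envelope(x: number[], frame: number): number[] {
  const out: number[] = [];
  for (let i = 0; i + frame <= x.length; i += frame) {
    let sum = 0;
    for (let j = i; j < i + frame; j++) sum += Math.abs(x[j]);
    out.push(sum / frame);
  }
  return out;
}

// Pearson correlation of two equally sampled series; 0 when either is flat.
function pearson(a: number[], b: number[]): number {
  const n = Math.min(a.length, b.length);
  if (n < 2) return 0;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
  meanA /= n;
  meanB /= n;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  const denom = Math.sqrt(varA * varB);
  return denom === 0 ? 0 : cov / denom;
}
```

For a 4 s window you would compute `pearson(envelope(highPass(imuAxis), 20), envelope(audio, 320))` and map the result into 0..1 before it enters the confidence score.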

The three terms are combined with weight redistribution — any missing signal (no enrolled voice, no IMU on this device) drops out and its weight is shared:

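// Each input is 0..1, or negative when that signal is unavailable.
function wearerConfidence(voice, energy, imu) {
  const terms = [[voice, 0.45], [energy, 0.25], [imu, 0.40]].filter(([value]) => value >= 0)
  const weightSum = terms.reduce((sum, [, weight]) => sum + weight, 0)
  // Missing signals drop out; dividing by the weights present redistributes theirs.
  return weightSum === 0 ? 0 : terms.reduce((sum, [value, weight]) => sum + value * weight, 0) / weightSum
}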

A confident score (≥ 0.7, backed by an enrolled voice or a corroborating IMU signal so the loudest stranger is never crowned) auto-labels that speaker as "You" and self-learns — it blends the fresh sample back into the stored profile so detection sharpens over time. Anything left ambiguous at session end surfaces a one-tap "That's me" review modal, pre-selecting the highest-confidence speaker; confirming both relabels the transcript (and regenerates the summary with correct identities) and folds the sample into your profile.

Validate the IMU on real hardware before trusting it. Whether a given frame's IMU is sensitive enough to pick up speech vibration is an empirical question. IRIS ships a Settings calibration screen with self-scaling bars and a peak-held correlation %, so you can watch the score climb while you speak and stay flat while others do. On the G2 it works — but build the test, don't assume.

9 · Memory, sessions & the knowledge base

When a session ends, one Claude call (max_tokens: 2000) extracts a structured object: title, 2–3 sentence summary, key points (theme → sub-points), action items (todo/calendar/reminder), and candidate memory items split into people / tasks / events / places / ideas. The system prompt is strict about speaker attribution ("attributing a statement to the wrong person is the worst possible error") and about only saving personally relevant facts — never public figures or general concepts Claude already knows.

Everything persists through the host bridge's key/value store, mirrored to browser localStorage as a fallback, and re-hydrated on boot.

10 · The Cloudflare Worker proxy

One small Worker (free tier) sits between the app and the paid APIs. It is intentionally thin. Routes:

Route | Purpose
POST /anthropic | Pass-through to the Claude Messages API (adds version header, CORS, forwards your key). Keeps keys out of cross-origin headaches and lets you swap models server-side.
POST /transcribe | Batch STT fallback: wraps raw PCM into a WAV container and calls Deepgram (or Groq Whisper, with an optional translate path for non-English speech) when the live socket can't be used.
GET /search | Proxies a Tavily web search for the real-time and named-person triggers.
/knowledge-base, /sessions-kb | Read/write the personal KB blobs in R2.
/webdav/* | A minimal WebDAV server backed by an R2 bucket, so session notes sync straight into an Obsidian vault via the "Remotely Save" plugin.
scheduled() cron | Nightly: reads the last ~30 session notes from R2, strips transcripts, and rebuilds a compact "session digest" the app pulls in as context.

The proxy holds no secrets of yours beyond an optional vault token: the Claude/Deepgram/Tavily keys are sent per-request from the app, entered by each user in Settings. That's what makes IRIS safely bring-your-own-key.

11 · Surviving the background

A conversation copilot is useless if it dies the moment the phone screen locks. Mobile WebViews get aggressively throttled/suspended in the background, so IRIS layers several keep-alive tricks during a session:

None of these are guaranteed by spec; they're pragmatic and platform-dependent. Test on your actual target phone with the screen locked before promising anyone continuous operation.

12 · Build your own — step by step

Step 0 · What you need

Step 1 · Scaffold the app

npm create vite@latest iris -- --template react-ts
cd iris
npm i @evenrealities/even_hub_sdk
npm i -D @evenrealities/evenhub-simulator

Add an app.json manifest declaring the package id, entrypoint (index.html), and the permissions you'll request — g2-microphone, network, and location (for background survival):

{
  "package_id": "com.you.iris",
  "name": "IRIS",
  "version": "0.1.0",
  "entrypoint": "index.html",
  "permissions": [
    {"name": "g2-microphone", "desc": "Live transcription"},
    {"name": "network", "desc": "AI + STT APIs"},
    {"name": "location", "desc": "Background operation when locked"}
  ],
  "supported_languages": ["en"]
}

Step 2 · Get a "hello lens" rendering

This is the smallest meaningful milestone — prove you can talk to the glasses:

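// Import names are assumed to match the SDK's exports; check the package's typings.
import { waitForEvenAppBridge, CreateStartUpPageContainer, TextContainerProperty } from '@evenrealities/even_hub_sdk'

// Wait for the host to inject the bridge, then declare one full-size text container.
const bridge = await waitForEvenAppBridge()
await bridge.createStartUpPageContainer(new CreateStartUpPageContainer({
  containerTotalNum: 1,
  textObject: [ new TextContainerProperty({
    containerID: 1, containerName: 'main',
    xPosition: 0, yPosition: 0, width: 576, height: 252,
    content: 'Hello from IRIS', isEventCapture: 1,   // this container receives touchpad events
  })],
}))
bridge.onEvenHubEvent(evt => { /* log audio / imu / touch here */ })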

Then wire audioControl(true) and confirm event.audioEvent.audioPcm chunks arrive when you speak. Render a tap counter to confirm touch events (remember the click=0 protobuf quirk). Until this works, nothing else matters.

Step 3 · Deploy the proxy

Create a Cloudflare Worker and paste in a proxy with, at minimum, an /anthropic pass-through (add /search and R2 later). Deploy and copy the URL.

# wrangler.toml
name = "iris-proxy"
main = "worker.js"
compatibility_date = "2024-01-01"

npx wrangler deploy

The worker's /anthropic handler just forwards the body to https://api.anthropic.com/v1/messages, copying the x-api-key the app sends, adding the anthropic-version header, and adding CORS headers to the response. That's it.
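
A minimal sketch of such a handler, not the reference build's worker. The wide-open CORS policy, the 401 on a missing key, and the `worker` constant name are assumptions; in an actual `worker.js` the object would be the module's default export.

```typescript
// Assumption: permissive CORS for a single-user tool. Tighten the origin if you can.
const CORS_HEADERS: Record<string, string> = {
  "Access-Control-Allow-Origin": "*",
  "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
  "Access-Control-Allow-Headers": "content-type, x-api-key",
};

const worker = {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);

    // Answer CORS preflight requests without touching the upstream API.
    if (request.method === "OPTIONS") {
      return new Response(null, { status: 204, headers: CORS_HEADERS });
    }

    if (url.pathname === "/anthropic" && request.method === "POST") {
      const apiKey = request.headers.get("x-api-key");
      if (!apiKey) {
        return new Response("missing x-api-key", { status: 401, headers: CORS_HEADERS });
      }
      // Forward the body untouched; the key comes from the app, never from the Worker.
      const upstream = await fetch("https://api.anthropic.com/v1/messages", {
        method: "POST",
        headers: {
          "content-type": "application/json",
          "x-api-key": apiKey,
          "anthropic-version": "2023-06-01",
        },
        body: await request.text(),
      });
      return new Response(upstream.body, {
        status: upstream.status,
        headers: {
          ...CORS_HEADERS,
          "content-type": upstream.headers.get("content-type") ?? "application/json",
        },
      });
    }

    return new Response("not found", { status: 404, headers: CORS_HEADERS });
  },
};

// In worker.js, finish with:  export default worker
```

Passing the upstream status straight through means rate-limit and auth errors from the API surface in the app unchanged, which makes them much easier to debug from the phone.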

Step 4 · Live transcription

Open a Deepgram WebSocket with the params from Section 5. Feed it AGC-amplified PCM from each audio event. Render the running transcript in the phone UI first; get diarized [S0]/[S1] labels showing before you touch the glasses display. Add the KeepAlive ping and reconnect logic early — you'll need them within the first long test.

Step 5 · The cue loop

Buffer recent transcript, run the trigger ladder (Section 6), and on fire, call your proxy's /anthropic with a mode system prompt that demands strict {"type","content"} JSON. Parse, dedupe, and push the content into the lens container as a popup. Tune the debounce until it feels helpful, not chatty. This is where you'll spend most of your iteration time.

Step 6 · Sessions & memory

On stop, send the full transcript to a one-shot extraction prompt (title/summary/key points/action items/memory items). Persist sessions and memory through setLocalStorage. Read a compact memory slice back into the cue system prompt so it personalises.

Step 7 · Wearer ID (optional but high-value)

Add the proximity-energy term first (free, just divide AGC gain out). Then a voice-enrolment screen and embedding match. Then, if your hardware has an IMU, the vibration correlation — but ship the calibration screen alongside it so you can prove it works. Fall back gracefully when any signal is absent.

Step 8 · Pack & install

npm run build                 # tsc + vite build → dist/
evenhub pack app.json dist --output iris.ehpk
# install the .ehpk via the Even Hub developer flow / simulator

Iterate against the evenhub-simulator for UI/logic, then test the audio/IMU/keep-alive behaviour on real glasses + real phone, screen-locked, in a real conversation. The gap between simulator and hardware is exactly the interesting part (mic distance, IMU sensitivity, background throttling).

13 · Keys, cost & hard-won gotchas

APIs and roughly what each is for

Service | Role | Notes
Anthropic / Claude | All reasoning: cues, extraction, chat | claude-sonnet-4-6. Use prompt caching for the system block.
Deepgram | Live streaming STT + diarization | nova-3, language=multi. The latency-critical path.
Tavily | Web search for real-time / named-person cues | Optional. Without it, those triggers just answer from Claude's knowledge.
Groq | Cheap Whisper STT fallback + optional translate path | Optional. Kicks in when Deepgram is rate-limited.
Cloudflare | Proxy + R2 vault + cron | Free tier is plenty for one user.

Gotchas that will bite you

Security note for whoever you share this with: IRIS is bring-your-own-key. Every user enters their own Anthropic/Deepgram/Tavily keys in Settings; the proxy never stores them. Don't hard-code keys into the bundle, and if you stand up the R2 vault, put a real token on it — it's a private WebDAV endpoint.

That's the whole system. The hard parts aren't any single API call — they're the orchestration: knowing when to stay silent, keeping a flaky background WebView alive, and figuring out which of the voices in the room belongs to the person wearing the glasses. Build those three well and you have IRIS.