Build Guide · v1

IRIS: a real-time conversation copilot for AR glasses

IRIS listens to the conversation you're in, transcribes it live, and whispers short, useful cues onto the lenses of your smart glasses — answers, gentle fact-checks, a question worth asking next — without you ever touching a phone. This is the full architecture and a start-to-finish guide to building your own.

Even Realities G2 · React + Vite · Deepgram · Claude · Cloudflare Workers

1 · What IRIS actually does
2 · The hardware & SDK
3 · System architecture
4 · Audio capture & AGC
5 · Streaming transcription
6 · Generating the cues
7 · Rendering on the lenses
8 · "Which voice is mine?"
9 · Memory, sessions & KB
10 · The Cloudflare proxy
11 · Surviving the background
12 · Build your own (step by step)
13 · Keys, cost & gotchas

1 · What IRIS actually does

Put the glasses on, double-tap the temple, and IRIS starts listening to whatever conversation you're in. As people talk:

Live transcription scrolls in the companion phone app, with each speaker labelled.
A few seconds after someone says something you could respond to, a one- or two-line cue appears on the lenses — an answer to a question, a sharp correction to a confident factual error, a thoughtful follow-up question, or concrete advice.
When the conversation drifts into something time-sensitive ("what's the weather", "who's the new...") IRIS quietly hits a web search first and answers with fresh facts.
When you stop the session, it writes a title, summary, key points and action items, extracts people / tasks / events into a long-term memory, and (optionally) syncs a markdown note to your own cloud vault.

A mode is just a swappable system prompt that changes how IRIS behaves — you ship whatever set fits your use case. The reference build has a general conversational mode, a listening-focused mode (leans toward warmth and advice over questions), and a lecture/study mode (a quiet aide that defines technical terms and identifies names/events without ever telling you to talk). Adding a mode is just writing another prompt. The whole thing is single-user, bring-your-own-API-keys, and stores everything locally on the phone plus an optional private cloud vault you control.

IRIS is a companion-app pattern: a normal web app (React) running in a WebView on your phone, talking to the glasses through a bridge SDK. The glasses are a thin display + microphone + IMU + touchpad. All the intelligence runs on the phone and in the cloud.

2 · The hardware & SDK

IRIS targets the Even Realities G2 — monochrome waveguide display glasses with a microphone, a 6-axis IMU, and a capacitive temple touchpad. Even ships an Even Hub SDK (@evenrealities/even_hub_sdk) and an app-store model: you build a web app, pack it into an .ehpk bundle, and it runs inside the Even Hub host app on the phone.

The mental model that matters: you don't draw pixels on the glasses. You create text containers at fixed coordinates and push string content into them. The host renders the text on the waveguide. So your "UI" on the lenses is a couple of positioned text boxes you update as state changes.

The bridge surface you'll use

Call	What it does
`waitForEvenAppBridge()`	Resolves once the host injects the bridge. Everything hangs off the returned object.
`createStartUpPageContainer(...)`	Declares your text containers (id, name, x/y, width/height, whether they capture touch events).
`textContainerUpgrade(...)`	Replaces the content of a container — this is how you "render".
`audioControl(on)`	Starts/stops the mic stream from the glasses.
`imuControl(on, pace)`	Starts/stops the IMU stream at a given rate (we use 1 kHz).
`onEvenHubEvent(cb)`	The firehose: audio PCM chunks, IMU samples, and touchpad/system events all arrive here.
`getLocalStorage / setLocalStorage`	Persistent key/value on the host (survives WebView reloads better than browser localStorage).

Protobuf gotcha worth knowing up front. Touchpad events arrive in onEvenHubEvent with an eventType enum. Protobuf omits fields that equal their default (zero) value, and CLICK_EVENT happens to be 0 — so a single tap arrives as an event object with no eventType field at all. If you see a known event shape but eventType === undefined, treat it as a click. This one cost real debugging time.
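The zero-default quirk is easy to handle once at the edge of the event handler. The sketch below is illustrative only: CLICK_EVENT = 0 comes from the text above, while the RawTouchEvent shape and its containerID field are assumptions made for the example, not the SDK's documented types.

```typescript
// CLICK_EVENT = 0 is from the guide; everything else here is a hypothetical shape.
const CLICK_EVENT = 0

interface RawTouchEvent {
  eventType?: number    // omitted on the wire when it equals 0 (protobuf default)
  containerID?: number  // assumed field naming the container that captured the touch
}

// Returns the effective event type, treating a missing field as a click.
function resolveEventType(evt: RawTouchEvent): number {
  return evt.eventType ?? CLICK_EVENT
}

function isClick(evt: RawTouchEvent): boolean {
  return resolveEventType(evt) === CLICK_EVENT
}
```

Normalising once means the rest of the gesture code can switch on a number and never has to check for undefined.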

3 · System architecture

GLASSES (G2)                PHONE (React app in WebView)                    CLOUD
┌───────────┐   audio PCM   ┌───────────────────────────┐  WebSocket   ┌─────────────┐
│  mic ─────┼──────────────▶│ AGC amplify → STT stream  ├─────────────▶│  Deepgram   │
│  IMU ─────┼──────────────▶│ wearer-voice tracker      │◀─ transcript │  (nova-3)   │
│  touchpad─┼──────events──▶│                           │              └─────────────┘
│           │               │ trigger logic ─┐          │              ┌─────────────┐
│  lenses◀──┼──text push────┤ insight engine ┼──────────┼──HTTPS──────▶│ Cloudflare  │
└───────────┘               │ memory + KB    │          │   proxy      │  Worker     │
                            │ session store  ┘          │              │ /anthropic  │
                            └───────────────────────────┘              │ /search     │
                                                                       │ /webdav R2  │
                                                        proxies to ───▶│ Claude      │
                                                                       │ Tavily      │
                                                                       └─────────────┘

Three tiers:

Glasses: dumb I/O. Stream mic + IMU up, render text down, emit touch events.
Phone app: the brain. Holds all state, runs the trigger logic, the wearer-voice scorer, the memory store, and orchestrates every API call. Pure React; no backend of its own.
Cloudflare Worker: a thin proxy (not a server with business logic). It exists for three reasons: (a) forward the API keys the app sends and sidestep CORS, (b) do the bits a browser can't do well (wrap PCM into WAV for fallback STT, talk WebDAV), and (c) host an optional private R2 vault + a nightly cron that digests your sessions into a knowledge base.

Everything user-facing is local-first. The app boots from getLocalStorage, and the cloud is optional — without a proxy URL and keys, IRIS simply won't generate cues, but it still runs.

4 · Audio capture & AGC

The glasses mic delivers raw PCM16, mono, 16 kHz chunks through onEvenHubEvent as event.audioEvent.audioPcm. Two things happen to every chunk before it hits the transcriber.

1. Automatic gain control (AGC)

Glasses mics are far from the speaker and quiet. A fixed gain either clips loud talkers or leaves quiet ones inaudible, so IRIS runs a one-pole AGC that chases a target RMS with a fast attack (clamp loud spikes in ~75 ms) and a slow release (ramp up quiet rooms over a few seconds), capped at 8×:

const AGC_TARGET_RMS = 6000   // ~18% of int16 max — leaves headroom
const AGC_ATTACK = 0.40       // fast: damp loud spikes
const AGC_RELEASE = 0.02      // slow: lift quiet voices
const AGC_MAX = 8, AGC_MIN = 1

// per chunk: measure rms, chase target, clamp, apply gain to every sample
const target = AGC_TARGET_RMS / Math.max(rms, 1)   // guard: a silent chunk has rms = 0
const alpha  = target < gain ? AGC_ATTACK : AGC_RELEASE
gain += alpha * (target - gain)
gain  = Math.min(AGC_MAX, Math.max(AGC_MIN, gain))

Keep the gain value around. The current AGC gain is the key to free speaker-proximity detection later (Section 8). Since the wearer sits closest to the mic, their pre-AGC energy is reliably the loudest in the room — but only if you divide the AGC gain back out. Store gain alongside each buffered chunk.
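For reference, the per-chunk step can be written out as a complete function. This is a minimal sketch using the constants above; the helper names (rmsOf, agcStep), the divide-by-zero guard, and the hard clip to int16 are additions for the example rather than details confirmed by the reference build.

```typescript
const AGC_TARGET_RMS = 6000
const AGC_ATTACK = 0.40
const AGC_RELEASE = 0.02
const AGC_MAX = 8
const AGC_MIN = 1

// Root-mean-square level of one PCM16 chunk.
function rmsOf(chunk: Int16Array): number {
  if (chunk.length === 0) return 0
  let sumSq = 0
  for (let i = 0; i < chunk.length; i++) sumSq += chunk[i] * chunk[i]
  return Math.sqrt(sumSq / chunk.length)
}

// One AGC step: returns the amplified chunk plus the gain that was applied,
// so the caller can store the gain next to the buffered audio.
function agcStep(chunk: Int16Array, prevGain: number): { out: Int16Array; gain: number } {
  const target = AGC_TARGET_RMS / Math.max(rmsOf(chunk), 1) // guard against rms = 0
  const alpha = target < prevGain ? AGC_ATTACK : AGC_RELEASE // fast down, slow up
  let gain = prevGain + alpha * (target - prevGain)
  gain = Math.min(AGC_MAX, Math.max(AGC_MIN, gain))
  const out = new Int16Array(chunk.length)
  for (let i = 0; i < chunk.length; i++) {
    const v = Math.round(chunk[i] * gain)
    out[i] = Math.max(-32768, Math.min(32767, v)) // hard-clip to the int16 range
  }
  return { out, gain }
}
```

Call it once per audio event with the previous gain (start at 1). Note that the target itself is not clamped, so on very quiet input the gain can reach the 8× cap within a chunk or two; clamping the target to [AGC_MIN, AGC_MAX] before chasing it is one way to make the release more gradual.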

2. A rolling 60-second PCM buffer

The same chunks are appended to a rolling 60 s ring buffer tagged with sample offsets and the gain at capture time. When Deepgram returns word-level timestamps with speaker IDs, IRIS slices the exact samples for each speaker back out of this buffer to build voice profiles. (More in Section 8.)
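One way to structure that buffer is sketched below. It assumes the stored samples are the post-AGC ones and that transcript timestamps are measured from the first sample pushed; the class and method names are hypothetical, and a reconnect that resets the transcriber's clock would need an extra offset that is not shown.

```typescript
const SAMPLE_RATE = 16000
const BUFFER_SECONDS = 60

interface TaggedChunk {
  startSample: number  // absolute offset of the chunk's first sample
  samples: Int16Array  // post-AGC PCM
  gain: number         // AGC gain that was applied to this chunk
}

class PcmRingBuffer {
  private chunks: TaggedChunk[] = []
  private nextSample = 0
  private maxSamples: number

  constructor(maxSamples: number = SAMPLE_RATE * BUFFER_SECONDS) {
    this.maxSamples = maxSamples
  }

  push(samples: Int16Array, gain: number): void {
    this.chunks.push({ startSample: this.nextSample, samples, gain })
    this.nextSample += samples.length
    // Evict chunks that now lie entirely outside the rolling window.
    const oldest = this.nextSample - this.maxSamples
    while (this.chunks.length > 0 &&
           this.chunks[0].startSample + this.chunks[0].samples.length <= oldest) {
      this.chunks.shift()
    }
  }

  // Samples for [startSec, endSec) with the stored gain divided back out.
  slicePreAgc(startSec: number, endSec: number): Float32Array {
    const from = Math.floor(startSec * SAMPLE_RATE)
    const to = Math.floor(endSec * SAMPLE_RATE)
    const out: number[] = []
    for (const c of this.chunks) {
      const lo = Math.max(from, c.startSample)
      const hi = Math.min(to, c.startSample + c.samples.length)
      for (let s = lo; s < hi; s++) out.push(c.samples[s - c.startSample] / c.gain)
    }
    return Float32Array.from(out)
  }
}
```

Given a word's start and end timestamps, slicePreAgc returns the audio for that word at its original loudness, which is what both the voice profile and the proximity-energy signal need.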

5 · Streaming transcription

Transcription is a direct browser WebSocket to Deepgram (nova-3, language=multi), not via the proxy — latency matters and the socket carries the key in its subprotocol. Key parameters:

wss://api.deepgram.com/v1/listen?
  model=nova-3 & language=multi & encoding=linear16 & sample_rate=16000 &
  punctuate=true & smart_format=true & interim_results=true &
  utterance_end_ms=1000 & endpointing=200 & diarize=true
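Building the URL from a parameter map keeps per-mode overrides tidy. The sketch below uses the parameter values listed above; the function name and the overrides argument are illustrative.

```typescript
// Default query parameters, taken from the list above.
const DEFAULT_PARAMS: Record<string, string> = {
  model: 'nova-3',
  language: 'multi',
  encoding: 'linear16',
  sample_rate: '16000',
  punctuate: 'true',
  smart_format: 'true',
  interim_results: 'true',
  utterance_end_ms: '1000',
  endpointing: '200',
  diarize: 'true',
}

// Builds the streaming URL; overrides replace defaults key by key.
function buildListenUrl(overrides: Record<string, string> = {}): string {
  const params = { ...DEFAULT_PARAMS, ...overrides }
  const query = Object.keys(params)
    .map(k => `${encodeURIComponent(k)}=${encodeURIComponent(params[k])}`)
    .join('&')
  return `wss://api.deepgram.com/v1/listen?${query}`
}

// In the browser the key travels in the WebSocket subprotocol list, e.g.
//   const ws = new WebSocket(buildListenUrl(), ['token', deepgramKey])
// where deepgramKey is whatever the user entered in Settings.
```

A mode that needs different settings can pass just the keys it changes, for example buildListenUrl({ diarize: 'false' }).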

Diarization → speaker labels

With diarize=true, each word carries a speaker index. IRIS groups consecutive words by speaker into [S0]: ... [S1]: ... segments. Those raw labels flow through the whole pipeline and get rewritten to real names ("[Sarah]", "[You]") once a speaker is identified or the user confirms them.
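The grouping step is a single pass over the word list. This sketch assumes word objects with word, punctuated_word, start, end and speaker fields; the function names and the names map are hypothetical.

```typescript
interface DgWord {
  word: string
  punctuated_word?: string
  start: number
  end: number
  speaker?: number
}

interface Segment { speaker: number; text: string; start: number; end: number }

// Merge consecutive words from the same speaker into one segment.
function groupBySpeaker(words: DgWord[]): Segment[] {
  const segments: Segment[] = []
  for (const w of words) {
    const speaker = w.speaker ?? 0
    const text = w.punctuated_word ?? w.word
    const last = segments[segments.length - 1]
    if (last && last.speaker === speaker) {
      last.text += ' ' + text
      last.end = w.end
    } else {
      segments.push({ speaker, text, start: w.start, end: w.end })
    }
  }
  return segments
}

// Render as "[S0]: ..." lines, substituting a real name once one is known.
function formatSegments(segments: Segment[], names: Record<number, string> = {}): string {
  return segments
    .map(s => `[${names[s.speaker] ?? 'S' + s.speaker}]: ${s.text}`)
    .join('\n')
}
```

Keeping start and end on each segment is what lets the audio for a speaker be sliced back out of the PCM buffer later.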

Three reliability tricks

KeepAlive ping every 8 s. Deepgram closes a socket after ~10 s of no audio. During silence you must send {"type":"KeepAlive"} or the stream dies mid-conversation.
Flush interims on UtteranceEnd. Over a PA system or a quiet room Deepgram sometimes never promotes an interim result to final. Listen for UtteranceEnd and flush the last interim so the transcript doesn't silently stall.
Auto-reconnect with a queue. On an unexpected close (code ≠ 1000) reconnect after 1.5 s; buffer outgoing PCM while the socket is down so nothing is lost. Treat 1008/4000/4001 as auth failures and surface them instead of looping.
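The decision logic behind these three tricks can be kept separate from the socket itself, which makes it easy to test. The sketch below is one possible shape: the constants come from the list above, the message fields (type, is_final, channel.alternatives[0].transcript) follow Deepgram's streaming responses, and the class and function names are hypothetical.

```typescript
const KEEPALIVE_INTERVAL_MS = 8000
const KEEPALIVE_MESSAGE = JSON.stringify({ type: 'KeepAlive' })
const RECONNECT_DELAY_MS = 1500
const AUTH_CLOSE_CODES = [1008, 4000, 4001]

type CloseAction = 'done' | 'auth-error' | 'reconnect'

// Decide what to do when the socket closes.
function classifyClose(code: number): CloseAction {
  if (code === 1000) return 'done'
  if (AUTH_CLOSE_CODES.indexOf(code) !== -1) return 'auth-error'
  return 'reconnect'
}

// Buffers outgoing PCM while the socket is down, then drains it in order.
class OutgoingQueue {
  private pending: Int16Array[] = []
  send(chunk: Int16Array, socketOpen: boolean, sendFn: (c: Int16Array) => void): void {
    if (!socketOpen) { this.pending.push(chunk); return }
    for (const queued of this.pending) sendFn(queued)
    this.pending = []
    sendFn(chunk)
  }
  get size(): number { return this.pending.length }
}

interface DgMessage {
  type: string
  is_final?: boolean
  channel?: { alternatives?: Array<{ transcript?: string }> }
}

// Remembers the latest interim and releases it if UtteranceEnd arrives first.
class InterimFlusher {
  private lastInterim = ''
  // Returns text to append to the transcript, or null if there is nothing to add.
  handle(msg: DgMessage): string | null {
    if (msg.type === 'Results') {
      const text = msg.channel?.alternatives?.[0]?.transcript ?? ''
      if (msg.is_final) { this.lastInterim = ''; return text || null }
      this.lastInterim = text
      return null
    }
    if (msg.type === 'UtteranceEnd' && this.lastInterim) {
      const flushed = this.lastInterim
      this.lastInterim = ''
      return flushed
    }
    return null
  }
}
```

In the app, a setInterval at KEEPALIVE_INTERVAL_MS would send KEEPALIVE_MESSAGE while the socket is open, and a 'reconnect' result would schedule a new connection after RECONNECT_DELAY_MS. The queue here is unbounded; a long outage would want a cap.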

There's also a hallucination filter — Deepgram on near-silence loves to emit "music." or "subscribe", so a tiny regex list drops those.

Lecture mode turns diarization off and sets endpointing=0 — one continuous speaker, maximum patience, no chopping a lecturer mid-sentence.

6 · Generating the cues

This is the heart of the product: deciding when to speak and what to say, fast enough to be useful but rarely enough not to be noise.

When to fire

Every finalised transcript line runs through a trigger ladder:

Author/expert named? (regex like "according to X", "Dr X", "X argues") → fire a web search for that person, then generate.
Real-time question? ("weather", "latest", "price", "score"...) → web search the sentence, then generate.
Info-seeking question? ("what is", "who was", "how do"...) → generate immediately, no search.
None of the above → debounce: schedule a generation 8 s out, resetting the timer on each new line. So normal chatter only triggers a cue once it pauses.

A separate silence timer (30 s) fires a gentler "here's a thread you could pull" prompt when the conversation lulls. A minimum of 80 new characters and an 8 s floor between debounced cues stop it from spamming.
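The ladder reduces to a small classifier that returns what to do with a line. The regexes below are illustrative stand-ins built from the example phrases above, not the reference build's actual patterns, and the type and function names are hypothetical.

```typescript
type Trigger =
  | { kind: 'search-person'; query: string }
  | { kind: 'search-realtime'; query: string }
  | { kind: 'generate-now' }
  | { kind: 'debounce'; delayMs: number }

// Illustrative patterns only; a real list would be longer and tuned on transcripts.
const PERSON_RE =
  /\b(?:[Aa]ccording to|Dr\.?)\s+([A-Z][\w'-]+(?:\s+[A-Z][\w'-]+)?)|\b([A-Z][\w'-]+)\s+argues\b/
const REALTIME_RE = /\b(?:weather|latest|price|score)\b/i
const INFO_RE = /\b(?:what is|who was|how do)\b/i

const DEBOUNCE_MS = 8000
const MIN_NEW_CHARS = 80
const MIN_GAP_MS = 8000

// Walk the ladder top to bottom; the first rule that matches wins.
function classifyLine(line: string): Trigger {
  const person = PERSON_RE.exec(line)
  if (person) return { kind: 'search-person', query: (person[1] ?? person[2] ?? '').trim() }
  if (REALTIME_RE.test(line)) return { kind: 'search-realtime', query: line }
  if (INFO_RE.test(line)) return { kind: 'generate-now' }
  return { kind: 'debounce', delayMs: DEBOUNCE_MS }
}

// Gate for debounced cues: enough new text and enough time since the last cue.
function shouldFireDebounced(newChars: number, msSinceLastCue: number): boolean {
  return newChars >= MIN_NEW_CHARS && msSinceLastCue >= MIN_GAP_MS
}
```

The caller owns the timer: on a 'debounce' result it clears any pending timeout and schedules a new one, then checks shouldFireDebounced when the timer fires.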

What to say

One Claude call per cue (claude-sonnet-4-6, max_tokens: 400). The system prompt = the active mode's instructions + a compact memory context + any prep notes/documents. The user message is deliberately structured:

INSIGHTS YOU ALREADY SENT THIS SESSION — do not repeat: ...
NOTE: you've already sent 1 follow-up in the last 4 cues — pick a different type.
LIVE WEB SEARCH RESULTS — use if relevant: ...
Earlier conversation (context only): ...(last ~700 chars)
Most recent (respond to this ONLY): ...(last ~300 chars)

Claude must return strict JSON: {"type": "...", "content": "..."} where type is one of answer · followup · explanation · fact-check · thought · advice, or {"type":"none"} to stay silent. The app strips any markdown fences, parses, drops the cue if type is none/invalid or it's a near-duplicate of the last six, and only then shows it.
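The parse-and-filter step might look like the sketch below. The cue types and the "last six" window come from the text above; the word-overlap (Jaccard) similarity and its 0.8 threshold are assumptions chosen for the example, since the guide does not say how near-duplicates are measured.

```typescript
const CUE_TYPES = ['answer', 'followup', 'explanation', 'fact-check', 'thought', 'advice']

interface Cue { type: string; content: string }

const FENCE = '`'.repeat(3) // built at runtime so this listing contains no literal fence

// Remove a leading/trailing markdown code fence if the model added one.
function stripFences(raw: string): string {
  return raw.trim()
    .replace(new RegExp('^' + FENCE + '(?:json)?\\s*', 'i'), '')
    .replace(new RegExp('\\s*' + FENCE + '$'), '')
    .trim()
}

function wordSet(s: string): Set<string> {
  return new Set(s.toLowerCase().replace(/[^a-z0-9\s]/g, ' ').split(/\s+/).filter(Boolean))
}

// Jaccard similarity over word sets: 1 = same words, 0 = nothing shared.
function similarity(a: string, b: string): number {
  const A = wordSet(a), B = wordSet(b)
  if (A.size === 0 && B.size === 0) return 1
  let shared = 0
  A.forEach(w => { if (B.has(w)) shared++ })
  return shared / (A.size + B.size - shared)
}

// Returns a cue to show, or null if the reply should be dropped.
function parseCue(raw: string, recentContents: string[]): Cue | null {
  let parsed: unknown
  try { parsed = JSON.parse(stripFences(raw)) } catch { return null }
  if (typeof parsed !== 'object' || parsed === null) return null
  const { type, content } = parsed as { type?: unknown; content?: unknown }
  if (typeof type !== 'string' || CUE_TYPES.indexOf(type) === -1) return null // covers "none"
  if (typeof content !== 'string' || content.trim() === '') return null
  const text = content.trim()
  if (recentContents.slice(-6).some(prev => similarity(prev, text) >= 0.8)) return null
  return { type, content: text }
}
```

Returning null for every failure mode keeps the caller simple: a null result means stay silent.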

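Prompt caching pays off here. The mode instructions + memory block are large and identical across calls within a session, so they're sent as a cached system block (cache_control: {"type": "ephemeral"}). Every cue after the first is much cheaper and faster, as long as calls keep arriving before the cache entry expires (five minutes by default, refreshed on each hit).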
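In request terms, caching means sending the system prompt as an array of text blocks and marking the stable block. The sketch below shows the body shape for the Messages API; the function name and the way the mode prompt and memory context are joined are illustrative.

```typescript
interface SystemBlock {
  type: 'text'
  text: string
  cache_control?: { type: 'ephemeral' }
}

// Builds the JSON body for one cue request.
function buildCueRequest(modePrompt: string, memoryContext: string, userMessage: string) {
  const system: SystemBlock[] = [
    {
      type: 'text',
      // Must be byte-identical across calls in a session for the cache to hit.
      text: `${modePrompt}\n\n${memoryContext}`,
      cache_control: { type: 'ephemeral' },
    },
  ]
  return {
    model: 'claude-sonnet-4-6',
    max_tokens: 400,
    system,
    // The per-cue material (recent transcript, search results) goes here, uncached.
    messages: [{ role: 'user' as const, content: userMessage }],
  }
}
```

Anything that changes per cue belongs in the user message. If a timestamp or the latest transcript leaks into the system block, every call becomes a cache miss. Very short system prompts may also fall below the API's minimum cacheable length.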

Anti-annoyance design

Most of the cleverness is in not firing: dedupe against the session's own history, cap consecutive follow-ups, suppress everything if the user typed "don't give me answers" into the live context box, and never address a question to the wearer. A cue is always either a line to say to the other person or a fact for the wearer.

7 · Rendering on the lenses

Two text containers are declared at startup: a big main box (576×252) and a thin foot strip. A single effect computes the main string from the current view state and pushes it only when it changed (diffing avoids flicker):

idle / recording → a status line: ● REC GENERAL with a mic pulse that animates only while audio is actually arriving, plus trailing dots while transcription is live.
popup → a new cue: a short type tag ([ANS], [ASK?], [!]...) and the content.
history → scroll back through earlier cues (3/8 [INFO]...).
mode-select / confirm → a tiny on-lens menu to pick a mode and start a session hands-free.
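The push-only-on-change rule is a few lines of state. In this sketch the push callback stands in for whatever wraps textContainerUpgrade in your app; its (containerID, content) signature is an assumption, as is the class name.

```typescript
type PushFn = (containerID: number, content: string) => void

// Remembers the last string sent to each container and skips identical updates.
class LensRenderer {
  private last = new Map<number, string>()
  private push: PushFn

  constructor(push: PushFn) {
    this.push = push
  }

  // Returns true if the content was actually pushed to the glasses.
  render(containerID: number, content: string): boolean {
    if (this.last.get(containerID) === content) return false
    this.last.set(containerID, content)
    this.push(containerID, content)
    return true
  }
}
```

In React this fits naturally inside a single effect that computes the string for the current view and calls render; the diff absorbs re-renders that did not change what the wearer sees.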

Touchpad as the only input

All on-glasses control is the temple touchpad, decoded in onEvenHubEvent:

Gesture	Action
Double-tap (idle)	Open mode-select → tap to confirm → starts a session
Scroll up/down (idle)	Cycle modes
Scroll down (recording)	Open cue history
Single tap (popup)	Dismiss
Long-press (native)	OS-level exit → app ends the session cleanly

Single taps are debounced ~350 ms so a double-tap can cancel the pending single — classic click/double-click disambiguation. Watchdogs re-arm the audio stream if it goes quiet for 10 s and when the phone unlocks or the app returns to foreground.
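A timer-free way to express the tap disambiguation is a small state holder driven by timestamps. It assumes the host delivers single-tap and double-tap as distinct events, which the guide implies but does not spell out; the class and method names are hypothetical.

```typescript
const SINGLE_TAP_DELAY_MS = 350

class TapDisambiguator {
  private pendingSince: number | null = null

  // A single tap arrived: hold it instead of acting straight away.
  onSingleTap(nowMs: number): void {
    this.pendingSince = nowMs
  }

  // A double-tap arrived: cancel any held single tap and act on the double.
  onDoubleTap(): 'double' {
    this.pendingSince = null
    return 'double'
  }

  // Call periodically; returns 'single' once the hold window has passed.
  poll(nowMs: number): 'single' | null {
    if (this.pendingSince !== null && nowMs - this.pendingSince >= SINGLE_TAP_DELAY_MS) {
      this.pendingSince = null
      return 'single'
    }
    return null
  }
}
```

In the app, poll would typically be replaced by a setTimeout scheduled in onSingleTap and cleared in onDoubleTap; passing the clock in explicitly just makes the logic easy to test.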

8 · "Which voice is mine?" — wearer identification

Diarization tells you there are speakers S0/S1/S2; it doesn't tell you which one is the person wearing the glasses. Getting this right matters a lot — the summariser must not attribute the other person's beliefs to you, and cues should respond to them, not you. IRIS fuses three independent signals into one wearer-confidence score.

① Voice timbre

A richer-than-pitch embedding per speaker: median pitch + variance, spectral centroid, a 4-bin spectral envelope (rough formants), zero-crossing rate, and voiced fraction — computed with a hand-rolled FFT + autocorrelation pitch detector. Cosine-ish distance to an enrolled "you" profile gives a similarity in 0..1.

② Proximity energy

The wearer sits closest to the mic, so they're loudest — before AGC. Recover pre-AGC energy by dividing the stored gain out of each speaker's buffered samples. Loudest speaker ≈ wearer. Free, needs no enrolment.

③ IMU bone conduction

When you speak, your skull/jaw vibrates the frame — the same trick AirPods use. Stream the IMU at 1 kHz, high-pass each axis with an EMA to kill gravity, and correlate the vibration envelope against the audio envelope over a 4 s window. High Pearson correlation = the current talker is you.
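The signal path described here (high-pass, envelope, Pearson correlation) can be sketched with three small functions. The EMA coefficient and the frame-based envelope are assumptions for the example; the guide only specifies an EMA high-pass and a 4 s correlation window.

```typescript
// EMA high-pass: subtract a slowly tracking mean to remove gravity and posture.
function highPass(samples: number[], alpha = 0.01): number[] {
  let mean = samples.length > 0 ? samples[0] : 0
  return samples.map(x => {
    mean += alpha * (x - mean)
    return x - mean
  })
}

// Mean absolute amplitude per non-overlapping frame.
function envelope(samples: number[], frame: number): number[] {
  const out: number[] = []
  for (let i = 0; i + frame <= samples.length; i += frame) {
    let sum = 0
    for (let j = i; j < i + frame; j++) sum += Math.abs(samples[j])
    out.push(sum / frame)
  }
  return out
}

// Pearson correlation coefficient of two series (0 if either is constant).
function pearson(a: number[], b: number[]): number {
  const n = Math.min(a.length, b.length)
  if (n < 2) return 0
  let meanA = 0, meanB = 0
  for (let i = 0; i < n; i++) { meanA += a[i]; meanB += b[i] }
  meanA /= n
  meanB /= n
  let cov = 0, varA = 0, varB = 0
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA, db = b[i] - meanB
    cov += da * db
    varA += da * da
    varB += db * db
  }
  const denom = Math.sqrt(varA * varB)
  return denom === 0 ? 0 : cov / denom
}
```

To line the two streams up, pick frames that cover the same duration: for example 320 samples of 16 kHz audio and 20 samples of 1 kHz IMU are both 20 ms, so a 4 s window yields 200 envelope points per stream to correlate.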

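The three terms are combined with weight redistribution: any missing signal (no enrolled voice, no IMU on this device) drops out and the result is renormalised over the weights that remain. The weights therefore don't need to sum to 1 (as listed they sum to 1.10); only their ratios matter: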

wearerConfidence(voice, energy, imu):   // each 0..1, or <0 = unavailable
  voice  → weight 0.45
  energy → weight 0.25
  imu    → weight 0.40
  return Σ(value·weight) / Σ(weight present)
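Written as TypeScript, the fusion is a weighted mean over whichever signals are present. The weights are the ones listed above; the -1 return for "no signal at all" and the canAutoLabel gate (including its 0.5 threshold for a corroborating IMU signal) are assumptions added for the example.

```typescript
const WEIGHTS = { voice: 0.45, energy: 0.25, imu: 0.40 }

// Each input is 0..1, or negative when that signal is unavailable.
// Returns 0..1, or -1 when no signal is available at all.
function wearerConfidence(voice: number, energy: number, imu: number): number {
  const terms: Array<[number, number]> = [
    [voice, WEIGHTS.voice],
    [energy, WEIGHTS.energy],
    [imu, WEIGHTS.imu],
  ]
  let weighted = 0
  let total = 0
  for (const [value, weight] of terms) {
    if (value < 0) continue // missing signal: its weight is redistributed
    weighted += Math.min(1, value) * weight
    total += weight
  }
  return total === 0 ? -1 : weighted / total
}

// Hypothetical gate: a high score alone is not enough; it must be backed by
// an enrolled voice or an IMU signal that itself points at the wearer.
function canAutoLabel(voice: number, energy: number, imu: number): boolean {
  const score = wearerConfidence(voice, energy, imu)
  const corroborated = voice >= 0 || imu >= 0.5
  return score >= 0.7 && corroborated
}
```

With only proximity energy available, wearerConfidence can still return 1.0 for the loudest speaker, which is exactly the case the corroboration check is there to block.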

A confident score (≥ 0.7, backed by an enrolled voice or a corroborating IMU signal so the loudest stranger is never crowned) auto-labels that speaker as "You" and self-learns — it blends the fresh sample back into the stored profile so detection sharpens over time. Anything left ambiguous at session end surfaces a one-tap "That's me" review modal, pre-selecting the highest-confidence speaker; confirming both relabels the transcript (and regenerates the summary with correct identities) and folds the sample into your profile.

Validate the IMU on real hardware before trusting it. Whether a given frame's IMU is sensitive enough to pick up speech vibration is an empirical question. IRIS ships a Settings calibration screen with self-scaling bars and a peak-held correlation %, so you can watch the score climb while you speak and stay flat while others do. On the G2 it works — but build the test, don't assume.

9 · Memory, sessions & the knowledge base

When a session ends, one Claude call (max_tokens: 2000) extracts a structured object: title, 2–3 sentence summary, key points (theme → sub-points), action items (todo/calendar/reminder), and candidate memory items split into people / tasks / events / places / ideas. The system prompt is strict about speaker attribution ("attributing a statement to the wrong person is the worst possible error") and about only saving personally relevant facts — never public figures or general concepts Claude already knows.

Sessions are stored locally (last 50), browsable by mode-folder and month, with full-text search and a per-session regenerate button.
Memory is a deduped long-term store the cue engine reads back as context, with AI-suggested merges to keep it tidy.
Follow-up reminders auto-fire: if an action item mentions a known person, the next time that name comes up in any future conversation, IRIS surfaces "Reminder about Sarah: ...".
Knowledge base — you can paste an export of your own chat history; IRIS merges it into a persistent context block so cues are personalised from day one.

Everything persists through the host bridge's key/value store, mirrored to browser localStorage as a fallback, and re-hydrated on boot.

10 · The Cloudflare Worker proxy

One small Worker (free tier) sits between the app and the paid APIs. It is intentionally thin. Routes:

Route	Purpose
`POST /anthropic`	Pass-through to the Claude Messages API (adds the version header and CORS, forwards your key). Avoids cross-origin restrictions in the WebView and lets you swap models server-side.
`POST /transcribe`	Batch STT fallback: wraps raw PCM into a WAV container and calls Deepgram (or Groq Whisper, with an optional translate path for non-English speech) when the live socket can't be used.
`GET /search`	Proxies a Tavily web search for the real-time and named-person triggers.
`/knowledge-base`, `/sessions-kb`	Read/write the personal KB blobs in R2.
`/webdav/*`	A minimal WebDAV server backed by an R2 bucket, so session notes sync straight into an Obsidian vault via the "Remotely Save" plugin.
`scheduled()` cron	Nightly: reads the last ~30 session notes from R2, strips transcripts, and rebuilds a compact "session digest" the app pulls in as context.

The proxy holds no secrets of yours beyond an optional vault token — the Claude/Deepgram/Tavily keys are sent per-request from the app, entered by each user in Settings. That's what makes IRIS safely bring-your-own-key.

11 · Surviving the background

A conversation copilot is useless if it dies the moment the phone screen locks. Mobile WebViews get aggressively throttled/suspended in the background, so IRIS layers several keep-alive tricks during a session:

Silent oscillator — a 1 Hz Web Audio tone at 0.001 gain keeps the audio context (and the tab) alive.
Web Lock — an indefinitely-held navigator.locks request signals "work in progress".
Heartbeat Web Worker — a worker ticking on a timer keeps a thread warm.
Geolocation watch — the location permission gives the app a reason to keep running when backgrounded (this is why IRIS requests location).
Visibility re-arm — on visibilitychange back to visible, re-enable the audio + IMU streams and kick a fresh insight in case anything was throttled.

None of these are guaranteed by spec; they're pragmatic and platform-dependent. Test on your actual target phone with the screen locked before promising anyone continuous operation.
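Two of these tricks, the silent oscillator and the held Web Lock, are sketched below. Everything is looked up on globalThis so the function degrades to a no-op where an API is missing; the function name and the 'iris-session' lock name are hypothetical, and browsers generally require a user gesture before an AudioContext will actually run.

```typescript
// Starts the keep-alive helpers that are available; returns a function that stops them.
function startKeepAlive(): () => void {
  const g = globalThis as any
  const stops: Array<() => void> = []

  // 1. Silent oscillator: a 1 Hz tone at 0.001 gain keeps the audio context busy.
  const Ctx = g.AudioContext ?? g.webkitAudioContext
  if (typeof Ctx === 'function') {
    const ctx = new Ctx()
    const osc = ctx.createOscillator()
    const gain = ctx.createGain()
    osc.frequency.value = 1
    gain.gain.value = 0.001
    osc.connect(gain)
    gain.connect(ctx.destination)
    osc.start()
    stops.push(() => { osc.stop(); void ctx.close() })
  }

  // 2. Web Lock: the lock is held for as long as the callback's promise is pending.
  if (g.navigator?.locks?.request) {
    let release: () => void = () => {}
    const held = new Promise<void>(resolve => { release = resolve })
    Promise.resolve(g.navigator.locks.request('iris-session', () => held)).catch(() => {})
    stops.push(() => release())
  }

  return () => stops.forEach(stop => stop())
}
```

Call startKeepAlive when a session starts (from the tap or click that starts it, so the audio context is allowed to run) and call the returned function when the session ends.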

12 · Build your own — step by step

Step 0 · What you need

A pair of Even Realities G2 glasses + the Even Hub host app, and a developer account / SDK access from Even Realities.
Node 18+ (recent create-vite releases may need Node 20 or newer, so check Vite's current requirement), and a Cloudflare account (free).
API keys (bring your own): Anthropic (Claude), Deepgram (STT), and optionally Tavily (web search) and Groq (cheap STT fallback / translation).

Step 1 · Scaffold the app

npm create vite@latest iris -- --template react-ts
cd iris
npm i @evenrealities/even_hub_sdk
npm i -D @evenrealities/evenhub-simulator

Add an app.json manifest declaring the package id, entrypoint (index.html), and the permissions you'll request — g2-microphone, network, and location (for background survival):

{
  "package_id": "com.you.iris",
  "name": "IRIS",
  "version": "0.1.0",
  "entrypoint": "index.html",
  "permissions": [
    {"name": "g2-microphone", "desc": "Live transcription"},
    {"name": "network", "desc": "AI + STT APIs"},
    {"name": "location", "desc": "Background operation when locked"}
  ],
  "supported_languages": ["en"]
}

Step 2 · Get a "hello lens" rendering

This is the smallest meaningful milestone — prove you can talk to the glasses:

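// Assumes these are named exports of the SDK package installed in Step 1.
import { waitForEvenAppBridge, CreateStartUpPageContainer, TextContainerProperty } from '@evenrealities/even_hub_sdk'

// Wait for the host to inject the bridge, then declare one full-size text container.
const bridge = await waitForEvenAppBridge()
await bridge.createStartUpPageContainer(new CreateStartUpPageContainer({
  containerTotalNum: 1,
  textObject: [ new TextContainerProperty({
    containerID: 1, containerName: 'main',
    xPosition: 0, yPosition: 0, width: 576, height: 252,
    content: 'Hello from IRIS', isEventCapture: 1,   // 1 = this container receives touch events
  })],
}))
// Every audio, IMU and touch event arrives through this one callback.
bridge.onEvenHubEvent(evt => { /* log audio / imu / touch here */ })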

Then wire audioControl(true) and confirm event.audioEvent.audioPcm chunks arrive when you speak. Render a tap counter to confirm touch events (remember the click=0 protobuf quirk). Until this works, nothing else matters.

Step 3 · Deploy the proxy

Create a Cloudflare Worker, paste a proxy with at minimum an /anthropic pass-through (add /search and R2 later). Deploy and copy the URL.

# wrangler.toml
name = "iris-proxy"
main = "worker.js"
compatibility_date = "2024-01-01"

npx wrangler deploy

The worker's /anthropic handler just forwards the body to https://api.anthropic.com/v1/messages, copying the x-api-key the app sends and adding the anthropic-version header and CORS. That's it.
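A minimal version of that handler is sketched below. The upstream URL and the x-api-key pass-through come from the text above; the function names, the wide-open CORS origin, and the 400 response for malformed requests are assumptions for the example.

```typescript
const ANTHROPIC_URL = 'https://api.anthropic.com/v1/messages'

// Permissive CORS for a single-user proxy; tighten the origin if you can.
const CORS_HEADERS: Record<string, string> = {
  'access-control-allow-origin': '*',
  'access-control-allow-methods': 'POST, OPTIONS',
  'access-control-allow-headers': 'content-type, x-api-key',
}

// Headers sent upstream: the caller's key plus the required API version.
function upstreamHeaders(apiKey: string): Record<string, string> {
  return {
    'content-type': 'application/json',
    'x-api-key': apiKey,
    'anthropic-version': '2023-06-01',
  }
}

async function handleAnthropic(request: Request): Promise<Response> {
  // Answer the browser's CORS preflight.
  if (request.method === 'OPTIONS') {
    return new Response(null, { status: 204, headers: CORS_HEADERS })
  }
  const apiKey = request.headers.get('x-api-key')
  if (request.method !== 'POST' || !apiKey) {
    return new Response('Bad request', { status: 400, headers: CORS_HEADERS })
  }
  // Forward the body untouched; the key is used for this request and never stored.
  const upstream = await fetch(ANTHROPIC_URL, {
    method: 'POST',
    headers: upstreamHeaders(apiKey),
    body: await request.text(),
  })
  return new Response(upstream.body, {
    status: upstream.status,
    headers: { ...CORS_HEADERS, 'content-type': 'application/json' },
  })
}
```

In worker.js, the default export's fetch handler would check the request's pathname and route /anthropic to handleAnthropic, returning 404 for anything it doesn't recognise.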

Step 4 · Live transcription

Open a Deepgram WebSocket with the params from Section 5. Feed it AGC-amplified PCM from each audio event. Render the running transcript in the phone UI first; get diarized [S0]/[S1] labels showing before you touch the glasses display. Add the KeepAlive ping and reconnect logic early — you'll need them within the first long test.

Step 5 · The cue loop

Buffer recent transcript, run the trigger ladder (Section 6), and on fire, call your proxy's /anthropic with a mode system prompt that demands strict {"type","content"} JSON. Parse, dedupe, and push the content into the lens container as a popup. Tune the debounce until it feels helpful, not chatty. This is where you'll spend most of your iteration time.

Step 6 · Sessions & memory

On stop, send the full transcript to a one-shot extraction prompt (title/summary/key points/action items/memory items). Persist sessions and memory through setLocalStorage. Read a compact memory slice back into the cue system prompt so it personalises.

Step 7 · Wearer ID (optional but high-value)

Add the proximity-energy term first (free, just divide AGC gain out). Then a voice-enrolment screen and embedding match. Then, if your hardware has an IMU, the vibration correlation — but ship the calibration screen alongside it so you can prove it works. Fall back gracefully when any signal is absent.

Step 8 · Pack & install

npm run build                 # tsc + vite build → dist/
evenhub pack app.json dist --output iris.ehpk
# install the .ehpk via the Even Hub developer flow / simulator

Iterate against the evenhub-simulator for UI/logic, then test the audio/IMU/keep-alive behaviour on real glasses + real phone, screen-locked, in a real conversation. The gap between simulator and hardware is exactly the interesting part (mic distance, IMU sensitivity, background throttling).

13 · Keys, cost & hard-won gotchas

APIs and roughly what each is for

Service	Role	Notes
Anthropic / Claude	All reasoning: cues, extraction, chat	`claude-sonnet-4-6`. Use prompt caching for the system block.
Deepgram	Live streaming STT + diarization	`nova-3`, `language=multi`. The latency-critical path.
Tavily	Web search for real-time / named-person cues	Optional. Without it, those triggers just answer from Claude's knowledge.
Groq	Cheap Whisper STT fallback + optional translate path	Optional. Kicks in when the live Deepgram socket can't be used (for example, when it is rate-limited).
Cloudflare	Proxy + R2 vault + cron	Free tier is plenty for one user.

Gotchas that will bite you

Protobuf zero-default = click. (Section 2.) Don't ignore events with no eventType.
Deepgram's 10 s idle timeout. Send KeepAlive or your stream dies during pauses.
Track the AGC gain, or proximity detection breaks. If you don't track and divide out the gain, "loudest = wearer" is meaningless because AGC normalised everyone to the same level.
Diarization over-splits. One person becomes S0 and S3. IRIS merges speakers whose embeddings are close once 2+ are established — budget for a collapse step.
The summariser will swap speakers if it doesn't know who "you" is. Tell it explicitly (pass the wearer's label) and regenerate after the user confirms identities.
Background suspension is real and unspec'd. Layer the keep-alive tricks and test on hardware; don't trust the simulator here.
Never address a cue to the wearer. A cue is a line to say to the other person, or a fact for the wearer — never "you should ask yourself...". This single rule does a lot for feel.

Security note for whoever you share this with: IRIS is bring-your-own-key. Every user enters their own Anthropic/Deepgram/Tavily keys in Settings; the proxy never stores them. Don't hard-code keys into the bundle, and if you stand up the R2 vault, put a real token on it — it's a private WebDAV endpoint.

That's the whole system. The hard parts aren't any single API call — they're the orchestration: knowing when to stay silent, keeping a flaky background WebView alive, and figuring out which of the voices in the room belongs to the person wearing the glasses. Build those three well and you have IRIS.