Before deploying any application that records or transcribes human speech, ensure participants are notified and that you have consent flows in place where required by the jurisdictions you operate in. This may apply whether the participant is the application user, a third party on a call, or anyone else whose voice is captured. See Recording, transcription, and consent.
This tutorial walks through building an OSDK application that captures microphone audio, streams it to a realtime model, and plays back the model's spoken response. The application authenticates as a Foundry user, so every interaction is grounded in that user's permissions and the Ontology they can access. For a high-level overview of realtime audio on Foundry, see Realtime audio.
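By the end of this tutorial you will have a TypeScript application running in the browser that authenticates as a Foundry user, connects to the realtime endpoint over a WebSocket, exposes Ontology-backed tools to the model, and streams microphone audio in and the model's spoken responses out.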
Before you begin, ensure you have the following:
- An OSDK application generated with @osdk/create-app. The generated application includes the OAuth scaffolding this tutorial relies on. For a full reference on the OSDK itself, see Ontology SDK.
- An application that can request the language-model-service:use-model scope used in Step 1.
- Access to the realtime model used in this tutorial, gpt-realtime-2.
- The @openai/agents package installed in your application:

```shell
npm install @openai/agents
```
The OSDK application generated by @osdk/create-app includes an OAuth client in src/client.ts. The client is created using createPublicOauthClient from @osdk/oauth and exposes a method to retrieve the current access token. The values for clientId, foundryUrl, and redirectUrl are read at runtime from osdk-* meta tags injected into index.html by @osdk/create-app based on your .env configuration.
Configure the OAuth client with the language-model-service:use-model scope so the resulting access token is permitted to connect to the realtime endpoint. Without this scope, the realtime endpoint rejects the WebSocket connection during authentication. This scope is only available on unrestricted custom applications — verify your application is unrestricted before continuing.
The snippet below shows the full shape of src/client.ts after adding the realtime audio scope. Most of this file is generated by @osdk/create-app; the only change needed for this tutorial is including language-model-service:use-model in the scopes array. The exports are used elsewhere in the tutorial: auth for token retrieval below, foundryUrl for building the WebSocket URL in Step 2, and client for OSDK queries in Step 3.
```typescript
import { createClient, type Client } from "@osdk/client";
import { createPublicOauthClient, type PublicOauthClient } from "@osdk/oauth";

function getMetaTagContent(tagName: string): string {
  const elements = document.querySelectorAll(`meta[name="${tagName}"]`);
  const element = elements.item(elements.length - 1);
  const value = element ? element.getAttribute("content") : null;
  if (value == null || value === "") {
    throw new Error(`Meta tag ${tagName} not found or empty`);
  }
  return value;
}

export const foundryUrl = getMetaTagContent("osdk-foundryUrl");
const clientId = getMetaTagContent("osdk-clientId");
const redirectUrl = getMetaTagContent("osdk-redirectUrl");
const ontologyRid = getMetaTagContent("osdk-ontologyRid");

const scopes = [
  "language-model-service:use-model",
  // Other scopes your application requires, for example:
  // "api:read-data",
  // "api:use-ontologies-read",
];

export const auth: PublicOauthClient = createPublicOauthClient(
  clientId,
  foundryUrl,
  redirectUrl,
  { scopes }
);

export const client: Client = createClient(foundryUrl, ontologyRid, auth);
export default client;
```
To retrieve the access token at runtime, call auth.getTokenOrUndefined(). If the user has not signed in, call auth.signIn() first and then retry:
```typescript
import { auth } from "../client";

let token = await auth.getTokenOrUndefined();
if (!token) {
  await auth.signIn();
  token = await auth.getTokenOrUndefined();
}
if (!token) {
  throw new Error("Unable to obtain access token. Please try again.");
}
```
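The retrieve-then-sign-in flow above can also be wrapped in a small helper so callers receive either a token or a clear error. The following is a minimal sketch, not part of the generated application: `TokenProvider` and `getAccessToken` are hypothetical names, and the interface assumes only the two methods used above (`getTokenOrUndefined` and `signIn`).

```typescript
// Hypothetical minimal interface covering the two methods this tutorial uses.
// The `auth` export from src/client.ts is expected to satisfy it structurally.
interface TokenProvider {
  getTokenOrUndefined(): string | undefined | Promise<string | undefined>;
  signIn(): Promise<unknown>;
}

// Returns an access token, signing in once if no token is available yet.
async function getAccessToken(provider: TokenProvider): Promise<string> {
  let token = await provider.getTokenOrUndefined();
  if (!token) {
    await provider.signIn();
    token = await provider.getTokenOrUndefined();
  }
  if (!token) {
    throw new Error("Unable to obtain access token. Please try again.");
  }
  return token;
}
```

With this helper, application code would call `getAccessToken(auth)` wherever a token is needed, which keeps the sign-in retry in one place.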
The realtime endpoint accepts a WebSocket connection at the following URL, where <your-foundry-domain> is the hostname of your Foundry environment (the same value used as foundryUrl in src/client.ts) and the model query parameter selects the realtime model to use:
wss://<your-foundry-domain>/language-model-service/ws/v1/open-ai/realtime?model=gpt-realtime-2
This is a Foundry proxy endpoint, analogous to the REST LLM-provider compatible APIs but exposed over a WebSocket. The proxy forwards the session to the underlying provider (OpenAI Direct or Azure OpenAI, depending on your enrollment) and preserves the provider's native realtime protocol. You can therefore use the OpenAI realtime SDK (@openai/agents/realtime) without modification.
The access token is passed as a WebSocket subprotocol value because browsers cannot set custom Authorization headers on WebSocket connections. Format the subprotocol value as Bearer-<access-token>.
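To make the URL and subprotocol format concrete, the sketch below derives both from a Foundry URL and an access token. `buildRealtimeConnection` and `RealtimeConnectionParams` are hypothetical names introduced for illustration; the path, the `model` query parameter, and the `Bearer-<access-token>` format come from the text above.

```typescript
interface RealtimeConnectionParams {
  url: string;
  protocols: string[];
}

// Builds the WebSocket URL and subprotocol list for the realtime proxy endpoint.
function buildRealtimeConnection(
  foundryUrl: string,
  token: string,
  model: string = "gpt-realtime-2",
): RealtimeConnectionParams {
  // Only the hostname (and port, if any) of the Foundry URL is reused.
  const host = new URL(foundryUrl).host;
  const url = new URL(`wss://${host}/language-model-service/ws/v1/open-ai/realtime`);
  url.searchParams.set("model", model);
  return {
    url: url.toString(),
    // The token travels as a subprotocol value, not as an Authorization header.
    protocols: [`Bearer-${token}`],
  };
}
```

In a browser, the result could be passed straight to the constructor as `new WebSocket(params.url, params.protocols)`.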
The access token is visible in the browser's developer tools network panel as part of the WebSocket subprotocol value. Treat it as a credential: do not log it, do not surface it in user-facing strings, and do not transmit it to any system other than your Foundry endpoint. Access tokens are short-lived; the WebSocket session uses the token it was opened with, so a long-running session may need to be reconnected when the token expires.
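Because a session keeps the token it was opened with, one way to handle expiry is to fetch a fresh token every time a session is started or restarted, rather than caching the token in application state. The sketch below is an assumption-laden illustration: `startWithFreshToken` is a hypothetical helper, `getToken` stands in for the token retrieval from Step 1, and `start` stands in for whatever function opens the session. It retries immediately; a production version would likely add a delay between attempts.

```typescript
// Starts a session with a newly fetched token, retrying a limited number of times.
// Call it again whenever the session closes unexpectedly (for example, after expiry).
async function startWithFreshToken<S>(
  getToken: () => Promise<string>,
  start: (token: string) => Promise<S>,
  maxAttempts: number = 3,
): Promise<S> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Fetch the token per attempt so an expired token is never reused.
      const token = await getToken();
      return await start(token);
    } catch (err) {
      lastError = err; // keep the most recent failure; never log the token itself
    }
  }
  throw lastError instanceof Error ? lastError : new Error("Unable to start session");
}
```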
Open the connection using the OpenAI realtime SDK, supplying a custom createWebSocket function that constructs the browser's WebSocket with the subprotocol value. The full connect flow is wrapped in an async function called startVoiceSession that returns once the session is ready. The function is built from three small named pieces, defined below and then composed at the end of the section. Save the result as, for example, src/voice/session.ts.
Start with the imports and constants:
```typescript
import {
  RealtimeAgent,
  RealtimeSession,
  OpenAIRealtimeWebSocket,
} from "@openai/agents/realtime";
import { foundryUrl } from "../client";

const PROXY_URL =
  `wss://${new URL(foundryUrl).host}/language-model-service/ws/v1/open-ai/realtime` +
  `?model=gpt-realtime-2`;
const CONNECT_TIMEOUT_MS = 10_000;
```
Define createTransport to construct the transport. The createWebSocket callback builds the browser WebSocket with the bearer token subprotocol and attaches error and close handlers that log diagnostic information. The useInsecureApiKey: true flag disables a browser-only SDK guard against using a non-ephemeral OpenAI key. That guard is not relevant here because authentication is handled by the WebSocket subprotocol, not by an OpenAI key.
```typescript
function createTransport(token: string): OpenAIRealtimeWebSocket {
  return new OpenAIRealtimeWebSocket({
    url: PROXY_URL,
    useInsecureApiKey: true, // auth is via the subprotocol below, not OpenAI's apiKey
    // eslint-disable-next-line @typescript-eslint/no-explicit-any
    createWebSocket: async ({ url }: { url: string }): Promise<any> => {
      const ws = new WebSocket(url, [`Bearer-${token}`]);

      ws.addEventListener("error", () => {
        console.error("WebSocket connection failed (check network tab for details)");
      });

      ws.addEventListener("close", (ev) => {
        // 1000 = normal close, do not surface as an error.
        if (ev.code !== 1000) {
          console.error(`WebSocket closed: ${ev.code}${ev.reason ? ` (${ev.reason})` : ""}`);
        }
      });

      return ws;
    },
  });
}
```
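When diagnosing failed connections, it can help to attach a short hint to the close code logged by the handler above. The sketch below is illustrative only: `describeClose` is a hypothetical helper, and the hints reflect the generic meanings of standard WebSocket close codes, not codes that the Foundry proxy is documented to send.

```typescript
// Generic meanings of a few standard WebSocket close codes (RFC 6455).
const CLOSE_CODE_HINTS: Record<number, string> = {
  1001: "endpoint going away",
  1006: "abnormal closure, no close frame received",
  1008: "policy violation",
  1011: "internal server error",
};

// Returns a log message for abnormal closes, or null for a normal close (1000).
function describeClose(code: number, reason: string = ""): string | null {
  if (code === 1000) {
    return null;
  }
  const hint = CLOSE_CODE_HINTS[code] ?? "unrecognized close code";
  return `WebSocket closed: ${code} (${hint})${reason ? `: ${reason}` : ""}`;
}
```

Inside the close handler, the message would be logged only when `describeClose(ev.code, ev.reason)` returns a non-null value.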
Define createSession to construct the session with audio input/output formats, server-side voice activity detection, and an inline transcription model so the user's spoken audio is transcribed as the conversation runs:
```typescript
function createSession(
  agent: RealtimeAgent,
  transport: OpenAIRealtimeWebSocket,
): RealtimeSession {
  return new RealtimeSession(agent, {
    transport,
    model: "gpt-realtime-2",
    tracingDisabled: true,
    config: {
      outputModalities: ["audio"],
      audio: {
        input: {
          format: "pcm16",
          transcription: { model: "whisper-1" },
          turnDetection: {
            type: "server_vad",
            threshold: 0.8,
            silenceDurationMs: 500,
            prefixPaddingMs: 300,
          },
        },
        output: { format: "pcm16" },
      },
    },
  });
}
```
Define awaitSessionReady to wait for the session to become usable. Calling session.connect() returns before the session can send and receive audio, so wait for session.created and then the first session.updated event. The 200 ms delay after that gives the SDK time to settle; without it, early audio chunks may be lost. If the session does not reach this state within CONNECT_TIMEOUT_MS, the promise rejects with a timeout error. The apiKey field is unused because authentication is handled by the WebSocket subprotocol; the SDK simply requires a non-empty string:
```typescript
function awaitSessionReady(session: RealtimeSession): Promise<RealtimeSession> {
  return new Promise<RealtimeSession>((resolve, reject) => {
    const timeout = setTimeout(() => {
      reject(new Error("Connection timeout"));
    }, CONNECT_TIMEOUT_MS);

    let created = false;
    let updatedCount = 0;

    session.on("transport_event", (ev) => {
      if (ev.type === "session.created") {
        created = true;
      }
      if (ev.type === "session.updated" && created) {
        updatedCount++;
        if (updatedCount === 1) {
          setTimeout(() => {
            clearTimeout(timeout);
            resolve(session);
          }, 200);
        }
      }
    });

    session.connect({ apiKey: "unused" });
  });
}
```
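The timeout handling in awaitSessionReady follows a general pattern: race some asynchronous work against a timer and clear the timer whichever way the work settles. As a sketch of that pattern in isolation, the hypothetical `withTimeout` helper below wraps any promise; it is not part of the SDK and is shown only to illustrate the technique.

```typescript
// Rejects with `message` if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  message: string = "Connection timeout",
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(message)), ms);
    work.then(
      (value) => {
        clearTimeout(timer); // settled in time: cancel the pending rejection
        resolve(value);
      },
      (err) => {
        clearTimeout(timer); // failed in time: surface the original error
        reject(err);
      },
    );
  });
}
```

A readiness promise built from the session.created and session.updated events could then be passed through `withTimeout(ready, CONNECT_TIMEOUT_MS)`.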
Finally, compose the three pieces into startVoiceSession:
```typescript
export async function startVoiceSession(
  agent: RealtimeAgent,
  token: string,
): Promise<RealtimeSession> {
  const transport = createTransport(token);
  const session = createSession(agent, transport);
  return awaitSessionReady(session);
}
```
Call startVoiceSession(agent, token) once you have the access token from Step 1 and an agent definition (see Step 3). The function returns when the session is ready to send and receive audio.
The example above sets up a generic assistant. To make it useful, give the agent tools that read from and write to the Ontology. The OpenAI realtime SDK supports tool calls: define a tool, register it on the agent, and the model invokes the tool's execute function when it decides it needs the information.
The tool's execute function runs in the browser with the user's OAuth token and OSDK client. The model never sees the Ontology directly — it sees only what the tool returns. You decide what data to expose, what filters to apply, and what to write back.
The example below shows a tool stub. Replace the body with your own OSDK queries against your Ontology. For OSDK query syntax and examples, see TypeScript OSDK.
```typescript
import { RealtimeAgent, tool } from "@openai/agents/realtime";
import { z } from "zod";
import { client } from "../client";

const lookupProduct = tool({
  name: "lookup_product",
  description:
    "Look up a product by name. Use this when the user asks about a specific product.",
  parameters: z.object({
    name: z.string().describe("The product name to look up"),
  }),
  execute: async ({ name }) => {
    // TODO: replace this stub with an OSDK query against your Ontology.
    // For example, if your Ontology has a Product object type:
    //
    //   const page = await client(Product)
    //     .where({ name: { $startsWith: name } })
    //     .fetchPage({ $pageSize: 5 });
    //   return JSON.stringify(page.data.map((p) => ({
    //     id: p.id, name: p.name, price: p.price, stock: p.stock,
    //   })));
    //
    return JSON.stringify({
      id: "stub-1",
      name,
      price: 0,
      stock: 0,
      note: "Replace the tool body with an OSDK query against your Ontology.",
    });
  },
});

const agent = new RealtimeAgent({
  name: "Assistant",
  instructions:
    "You help users find products. When the user asks about a product, call the lookup_product tool.",
  tools: [lookupProduct],
});
```
Pass this agent into startVoiceSession(agent, token) from Step 2. When the user speaks, the model decides whether to call the tool. The tool's execute function runs in the browser against the OSDK client, and the result is fed back to the model for the response.
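Since the model sees only what the tool returns, it can be useful to project query results onto an explicit set of fields before serializing them. The sketch below illustrates this under assumptions: `ProductSummary` and `summarizeProducts` are hypothetical names, and the fields mirror the hypothetical Product object type from the stub's comments.

```typescript
// The only fields this tool is willing to expose to the model.
interface ProductSummary {
  id: string;
  name: string;
  price: number;
  stock: number;
}

// Projects rows onto the allowed fields and caps the number of results,
// so any extra properties on the source objects never reach the model.
function summarizeProducts<T extends ProductSummary>(
  rows: readonly T[],
  limit: number = 5,
): string {
  const summaries: ProductSummary[] = rows
    .slice(0, limit)
    .map(({ id, name, price, stock }) => ({ id, name, price, stock }));
  return JSON.stringify(summaries);
}
```

A tool's execute function could return `summarizeProducts(page.data)` instead of serializing the query results directly.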
Tools can also write to the Ontology. To do this, call an action from inside the tool's execute function. Actions go through the standard Foundry permission and validation paths.
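As a rough sketch of a write tool, the body of an execute function can call an action and report success or failure back to the model as JSON. Everything named here is hypothetical: `UpdateStockParams`, `makeUpdateStockExecute`, and the injected `applyAction` callback, which in a real application might wrap a call such as `client(updateProductStock).applyAction(params)` for an action generated in your SDK. Verify the exact call shape against the TypeScript OSDK documentation.

```typescript
// Hypothetical parameters for a hypothetical "update product stock" action.
interface UpdateStockParams {
  productId: string;
  stock: number;
}

// Injected so the sketch stays independent of any generated SDK.
type ApplyAction<P> = (params: P) => Promise<unknown>;

// Builds an execute-style function suitable for use as a tool body.
function makeUpdateStockExecute(applyAction: ApplyAction<UpdateStockParams>) {
  return async (params: UpdateStockParams): Promise<string> => {
    try {
      // Permission and validation checks run on the Foundry side of this call.
      await applyAction(params);
      return JSON.stringify({ ok: true, productId: params.productId });
    } catch (err) {
      // Return the failure to the model so it can tell the user what happened.
      const message = err instanceof Error ? err.message : String(err);
      return JSON.stringify({ ok: false, error: message });
    }
  };
}
```

Returning a structured failure rather than throwing is a design choice: it lets the model explain a rejected action (for example, a validation failure) in its spoken response.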
Given the RealtimeSession returned by startVoiceSession, wire up microphone capture and audio playback. Capture microphone audio using the browser's MediaDevices API, convert it to 16-bit PCM at 24 kHz, and forward each chunk to the session via session.sendAudio(chunk) as the chunks arrive. The application enqueues the audio chunks received from the session for playback.
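Browser audio APIs typically deliver samples as 32-bit floats in the range -1 to 1, so the conversion to 16-bit PCM is a clamp-and-scale step. The sketch below shows that step only; `floatToPcm16` is a hypothetical helper, and it assumes the input is already mono audio at 24 kHz (for example, from an AudioContext created with a 24000 Hz sample rate, which is an assumption about your capture setup).

```typescript
// Converts float samples in [-1, 1] to little-endian signed 16-bit PCM.
function floatToPcm16(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Negative and positive ranges differ by one step in 16-bit PCM.
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(scaled), true); // true = little-endian
  }
  return buffer;
}
```

Each converted buffer can then be forwarded with `session.sendAudio(floatToPcm16(chunk))`.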
The microphone capture and playback queue implementation is out of scope for this tutorial. The session-level wiring for receiving audio looks like:
```typescript
const session = await startVoiceSession(agent, token);

// Surface mid-session errors from the SDK. WebSocket-level transport errors
// (such as a failure to establish the connection) are logged to the console
// from the handlers in `createTransport`; if the connection never establishes,
// `startVoiceSession` rejects after `CONNECT_TIMEOUT_MS`.
session.on("error", (event) => {
  console.error("Realtime session error:", event.error);
});

// Play back audio from the model
session.on("audio", (event) => {
  playbackQueue.enqueue(event.data); // your playback implementation
});

// Clear playback when the model is interrupted (for example, the user starts speaking)
session.on("audio_interrupted", () => {
  playbackQueue.clear();
});

// Send mic audio to the model as 16-bit PCM chunks at 24 kHz arrive
startMicCapture((pcm16) => session.sendAudio(pcm16)); // your mic capture implementation

// Optional: prompt the agent to greet the user immediately rather than
// waiting for the user to speak first.
session.sendMessage("(Session started. Greet the user briefly.)");
```
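Although the playback implementation is left to the application, the scheduling logic behind `enqueue` and `clear` can be sketched independently of any audio API. In the sketch below, `AudioSink`, `PlaybackHandle`, `PlaybackQueue`, and `pcm16ToFloat32` are all hypothetical names; the sink is an injected abstraction that a browser application might implement on top of the Web Audio API. The sketch assumes mono little-endian PCM at 24 kHz, matching the format configured earlier.

```typescript
interface PlaybackHandle {
  stop(): void;
}

// Injected audio backend: reports the current time and plays samples at a start time.
interface AudioSink {
  now(): number; // seconds
  play(samples: Float32Array, startAt: number): PlaybackHandle;
}

// Decodes little-endian signed 16-bit PCM into floats in [-1, 1).
function pcm16ToFloat32(pcm: ArrayBuffer): Float32Array {
  const view = new DataView(pcm);
  const out = new Float32Array(Math.floor(pcm.byteLength / 2));
  for (let i = 0; i < out.length; i++) {
    out[i] = view.getInt16(i * 2, true) / 0x8000;
  }
  return out;
}

class PlaybackQueue {
  private nextStart = 0;
  private active: PlaybackHandle[] = [];

  constructor(
    private readonly sink: AudioSink,
    private readonly sampleRate: number = 24000,
  ) {}

  // Schedules a chunk directly after the previously scheduled one.
  enqueue(pcm: ArrayBuffer): void {
    const samples = pcm16ToFloat32(pcm);
    const startAt = Math.max(this.sink.now(), this.nextStart);
    // A fuller implementation would also drop handles once they finish playing.
    this.active.push(this.sink.play(samples, startAt));
    this.nextStart = startAt + samples.length / this.sampleRate;
  }

  // Stops everything scheduled so far, for example when the model is interrupted.
  clear(): void {
    for (const handle of this.active) {
      handle.stop();
    }
    this.active = [];
    this.nextStart = 0;
  }
}
```

Scheduling each chunk at the end of the previous one avoids gaps between chunks, and resetting `nextStart` in `clear` lets playback resume immediately after an interruption.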
As a next step, extend the agent with additional tools whose execute functions integrate with the Ontology.