Build a voice-enabled OSDK application

Recording and consent are your responsibility

Before deploying any application that records or transcribes human speech, ensure participants are notified and that you have consent flows in place where required by the jurisdictions you operate in. This may apply whether the participant is the application user, a third party on a call, or anyone else whose voice is captured. See Recording, transcription, and consent.

This tutorial walks through building an OSDK application that captures microphone audio, streams it to a realtime model, and plays back the model's spoken response. The application authenticates as a Foundry user, so every interaction is grounded in that user's permissions and the Ontology they can access. For a high-level overview of realtime audio on Foundry, see Realtime audio.

What you will build

By the end of this tutorial you will have a TypeScript application running in the browser that:

Authenticates the user via the standard OSDK OAuth flow.
Opens a WebSocket connection directly from the browser to a Foundry proxy endpoint that forwards to a realtime model.
Gives the agent a tool that reads from the Ontology, running with the user's permissions.
Streams microphone audio to the model and plays back the model's spoken response.

Prerequisites

Before you begin, ensure you have the following:

An OSDK application set up via @osdk/create-app. The generated application includes the OAuth scaffolding this tutorial relies on. For a full reference on the OSDK itself, see Ontology SDK.
An unrestricted custom application registered in Developer Console. The OSDK frontend authenticates against this application. The application must be unrestricted so that the access token can include the language-model-service:use-model scope used in Step 1.
A realtime model enabled on your enrollment. See Available audio models for the list. This tutorial uses gpt-realtime-2.
The @openai/agents package installed in your application:

npm install @openai/agents

Step 1: Obtain an OAuth access token

The OSDK application generated by @osdk/create-app includes an OAuth client in src/client.ts. The client is created using createPublicOauthClient from @osdk/oauth and exposes a method to retrieve the current access token. The values for clientId, foundryUrl, and redirectUrl are read at runtime from osdk-* meta tags injected into index.html by @osdk/create-app based on your .env configuration.

Configure the OAuth client with the language-model-service:use-model scope so the resulting access token is permitted to connect to the realtime endpoint. Without this scope, the realtime endpoint rejects the WebSocket connection during authentication. This scope is only available on unrestricted custom applications — verify your application is unrestricted before continuing.

The snippet below shows the full shape of src/client.ts after adding the realtime audio scope. Most of this file is generated by @osdk/create-app; the only change needed for this tutorial is including language-model-service:use-model in the scopes array. The exports are used elsewhere in the tutorial: auth for token retrieval below, foundryUrl for building the WebSocket URL in Step 2, and client for OSDK queries in Step 3.

import { createClient, type Client } from "@osdk/client";
import { createPublicOauthClient, type PublicOauthClient } from "@osdk/oauth";

function getMetaTagContent(tagName: string): string {
  const elements = document.querySelectorAll(`meta[name="${tagName}"]`);
  const element = elements.item(elements.length - 1);
  const value = element ? element.getAttribute("content") : null;
  if (value == null || value === "") {
    throw new Error(`Meta tag ${tagName} not found or empty`);
  }
  return value;
}

export const foundryUrl = getMetaTagContent("osdk-foundryUrl");
const clientId = getMetaTagContent("osdk-clientId");
const redirectUrl = getMetaTagContent("osdk-redirectUrl");
const ontologyRid = getMetaTagContent("osdk-ontologyRid");

const scopes = [
  "language-model-service:use-model",
  // Other scopes your application requires, for example:
  // "api:read-data",
  // "api:use-ontologies-read",
];

export const auth: PublicOauthClient = createPublicOauthClient(
  clientId,
  foundryUrl,
  redirectUrl,
  { scopes }
);

export const client: Client = createClient(foundryUrl, ontologyRid, auth);
export default client;

To retrieve the access token at runtime, call auth.getTokenOrUndefined(). If the user has not signed in, call auth.signIn() first and then retry:

import { auth } from "../client";

let token = await auth.getTokenOrUndefined();
if (!token) {
  await auth.signIn();
  token = await auth.getTokenOrUndefined();
}
if (!token) {
  throw new Error("Unable to obtain access token. Please try again.");
}

Step 2: Connect to the realtime endpoint

The realtime endpoint accepts a WebSocket connection at the following URL, where <your-foundry-domain> is the hostname of your Foundry environment (the same value used as foundryUrl in src/client.ts) and the model query parameter selects the realtime model to use:

wss://<your-foundry-domain>/language-model-service/ws/v1/open-ai/realtime?model=gpt-realtime-2

This is a Foundry proxy endpoint, analogous to the REST LLM-provider compatible APIs but exposed over a WebSocket. The proxy forwards the session to the underlying provider (OpenAI Direct or Azure OpenAI, depending on your enrollment) and preserves the provider's native realtime protocol. You can therefore use the OpenAI realtime SDK (@openai/agents/realtime) without modification.
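
If you expect to switch between realtime models, the URL can be assembled by a small function rather than a fixed string. The sketch below is one way to do that; buildRealtimeUrl and REALTIME_PATH are hypothetical names introduced for illustration, and the path is the one shown above.

```typescript
// Path of the realtime proxy endpoint, as given in this tutorial.
const REALTIME_PATH = "/language-model-service/ws/v1/open-ai/realtime";

// Hypothetical helper: derive the WebSocket URL from the Foundry URL and a model name.
function buildRealtimeUrl(foundryUrl: string, model: string): string {
  // Keep only the host (and port, if any) from the configured Foundry URL.
  const { host } = new URL(foundryUrl);
  // Encode the model name so unexpected characters cannot alter the query string.
  return `wss://${host}${REALTIME_PATH}?model=${encodeURIComponent(model)}`;
}
```

Calling buildRealtimeUrl(foundryUrl, "gpt-realtime-2") with the foundryUrl exported from src/client.ts yields a URL of the form shown above.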

The access token is passed as a WebSocket subprotocol value because browsers cannot set custom Authorization headers on WebSocket connections. Format the subprotocol value as Bearer-<access-token>.

Treat the subprotocol value as a credential

The access token is visible in the browser's developer tools network panel as part of the WebSocket subprotocol value. Treat it as a credential: do not log it, do not surface it in user-facing strings, and do not transmit it to any system other than your Foundry endpoint. Access tokens are short-lived; the WebSocket session uses the token it was opened with, so a long-running session may need to be reconnected when the token expires.
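
One way to reduce the risk of a token reaching your logs is to pass diagnostic strings through a redaction step first. The sketch below is illustrative only: redactBearer is a hypothetical helper, and the character set it matches is an assumption about what a token may contain, not a documented token format.

```typescript
// Assumed token alphabet: letters, digits, and common URL-safe punctuation.
const BEARER_PATTERN = /Bearer-[A-Za-z0-9._~+\/=-]+/g;

// Hypothetical helper: mask bearer subprotocol values before text is logged.
function redactBearer(text: string): string {
  return text.replace(BEARER_PATTERN, "Bearer-[REDACTED]");
}
```

Redaction is a safety net rather than a substitute for keeping the token out of log statements in the first place.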

Open the connection using the OpenAI realtime SDK, supplying a custom createWebSocket function that constructs the browser's WebSocket with the subprotocol value. The full connect flow is wrapped in an async function called startVoiceSession that returns once the session is ready. The function is built from three small named pieces, defined below and then composed at the end of the section. Save the result as, for example, src/voice/session.ts.

Start with the imports and constants:

import {
  RealtimeAgent,
  RealtimeSession,
  OpenAIRealtimeWebSocket,
} from "@openai/agents/realtime";
import { foundryUrl } from "../client";

const PROXY_URL =
  `wss://${new URL(foundryUrl).host}/language-model-service/ws/v1/open-ai/realtime` +
  `?model=gpt-realtime-2`;
const CONNECT_TIMEOUT_MS = 10_000;

Define createTransport to construct the transport. The createWebSocket callback builds the browser WebSocket with the bearer token subprotocol and attaches error and close handlers that log diagnostic information. Setting useInsecureApiKey: true disables a browser-only SDK guard against using a non-ephemeral OpenAI key. That guard is not relevant here because authentication is handled by the WebSocket subprotocol, not by an OpenAI key.

function createTransport(token: string): OpenAIRealtimeWebSocket {
  return new OpenAIRealtimeWebSocket({
    url: PROXY_URL,
    useInsecureApiKey: true, // auth is via the subprotocol below, not OpenAI's apiKey
    // eslint-disable-next-line @typescript-eslint/no-explicit-any
    createWebSocket: async ({ url }: { url: string }): Promise<any> => {
      const ws = new WebSocket(url, [`Bearer-${token}`]);

      ws.addEventListener("error", () => {
        console.error("WebSocket connection failed (check network tab for details)");
      });

      ws.addEventListener("close", (ev) => {
        // 1000 = normal close, do not surface as an error.
        if (ev.code !== 1000) {
          console.error(`WebSocket closed: ${ev.code}${ev.reason ? ` (${ev.reason})` : ""}`);
        }
      });

      return ws;
    },
  });
}
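
The close handler above logs the raw close code. If you want more readable diagnostics, the code can be mapped to its standard meaning. The sketch below covers a few close codes defined by the WebSocket protocol (RFC 6455); describeClose is a hypothetical helper, and which codes the Foundry proxy actually sends is not specified in this tutorial.

```typescript
// A few standard WebSocket close codes and their protocol-level meanings.
const CLOSE_CODE_MEANINGS: Record<number, string> = {
  1000: "normal closure",
  1001: "endpoint going away",
  1006: "abnormal closure (no close frame received)",
  1008: "policy violation",
  1009: "message too big",
  1011: "internal server error",
};

// Hypothetical helper: build a log-friendly description of a close event.
function describeClose(code: number, reason = ""): string {
  const meaning = CLOSE_CODE_MEANINGS[code] ?? "unrecognized close code";
  return reason ? `${code} ${meaning}: ${reason}` : `${code} ${meaning}`;
}
```

Inside the close listener this could be used as console.error(`WebSocket closed: ${describeClose(ev.code, ev.reason)}`).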

Define createSession to construct the session with audio input/output formats, server-side voice activity detection, and an inline transcription model so the user's spoken audio is transcribed as the conversation runs:

function createSession(
  agent: RealtimeAgent,
  transport: OpenAIRealtimeWebSocket,
): RealtimeSession {
  return new RealtimeSession(agent, {
    transport,
    model: "gpt-realtime-2",
    tracingDisabled: true,
    config: {
      outputModalities: ["audio"],
      audio: {
        input: {
          format: "pcm16",
          transcription: { model: "whisper-1" },
          turnDetection: {
            type: "server_vad",
            threshold: 0.8,
            silenceDurationMs: 500,
            prefixPaddingMs: 300,
          },
        },
        output: { format: "pcm16" },
      },
    },
  });
}

Define awaitSessionReady to wait for the session to become usable. Calling session.connect() returns before the session can send and receive audio, so wait for session.created and then the first session.updated event. The 200 ms delay after that gives the SDK time to settle; without it, early audio chunks may be lost. If the session does not reach this state within CONNECT_TIMEOUT_MS, the promise rejects with a timeout error. The apiKey field is unused because authentication is handled by the WebSocket subprotocol; the SDK simply requires a non-empty string:

function awaitSessionReady(session: RealtimeSession): Promise<RealtimeSession> {
  return new Promise<RealtimeSession>((resolve, reject) => {
    const timeout = setTimeout(() => {
      reject(new Error("Connection timeout"));
    }, CONNECT_TIMEOUT_MS);

    let created = false;
    let updatedCount = 0;

    session.on("transport_event", (ev) => {
      if (ev.type === "session.created") {
        created = true;
      }
      if (ev.type === "session.updated" && created) {
        updatedCount++;
        if (updatedCount === 1) {
          setTimeout(() => {
            clearTimeout(timeout);
            resolve(session);
          }, 200);
        }
      }
    });

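    // Reject immediately if connect() itself fails instead of waiting for the timeout.
    session.connect({ apiKey: "unused" }).catch((err: unknown) => {
      clearTimeout(timeout);
      reject(err instanceof Error ? err : new Error(String(err)));
    });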
  });
}
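
The event-order check inside awaitSessionReady can be easier to reason about, and to unit test, when it is separated from the timers. The sketch below isolates that check as a pure function; createReadinessTracker is a hypothetical name, and the 200 ms settle delay and the connection timeout are deliberately left out.

```typescript
// Hypothetical helper: returns a function that consumes transport event types
// and reports whether the session has reached the "ready" state.
function createReadinessTracker(): (eventType: string) => boolean {
  let created = false;
  let ready = false;
  return (eventType: string): boolean => {
    if (eventType === "session.created") {
      created = true;
    } else if (eventType === "session.updated" && created) {
      // Only an update that follows session.created counts as ready.
      ready = true;
    }
    return ready;
  };
}
```

A session.updated event that arrives before session.created is ignored, which matches the created guard in awaitSessionReady.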

Finally, compose the three pieces into startVoiceSession:

export async function startVoiceSession(
  agent: RealtimeAgent,
  token: string,
): Promise<RealtimeSession> {
  const transport = createTransport(token);
  const session = createSession(agent, transport);
  return awaitSessionReady(session);
}

Call startVoiceSession(agent, token) once you have the access token from Step 1 and an agent definition (see Step 3). The function returns when the session is ready to send and receive audio.

Step 3: Give the agent a tool that reads from the Ontology

Step 2 sets up the session but leaves the agent itself undefined. To make the agent useful, give it tools that read from and write to the Ontology. The OpenAI realtime SDK supports tool calls: define a tool, register it on the agent, and the SDK runs the tool's execute function when the model decides it needs the information.

The tool's execute function runs in the browser with the user's OAuth token and OSDK client. The model never sees the Ontology directly — it sees only what the tool returns. You decide what data to expose, what filters to apply, and what to write back.

The example below shows a tool stub. Replace the body with your own OSDK queries against your Ontology. For OSDK query syntax and examples, see TypeScript OSDK.

import { RealtimeAgent, tool } from "@openai/agents/realtime";
import { z } from "zod";
import { client } from "../client";

const lookupProduct = tool({
  name: "lookup_product",
  description: "Look up a product by name. Use this when the user asks about a specific product.",
  parameters: z.object({
    name: z.string().describe("The product name to look up"),
  }),
  execute: async ({ name }) => {
    // TODO: replace this stub with an OSDK query against your Ontology.
    // For example, if your Ontology has a Product object type:
    //
    //   const page = await client(Product)
    //     .where({ name: { $startsWith: name } })
    //     .fetchPage({ $pageSize: 5 });
    //   return JSON.stringify(page.data.map((p) => ({
    //     id: p.id, name: p.name, price: p.price, stock: p.stock,
    //   })));
    //
    return JSON.stringify({
      id: "stub-1",
      name,
      price: 0,
      stock: 0,
      note: "Replace the tool body with an OSDK query against your Ontology.",
    });
  },
});

const agent = new RealtimeAgent({
  name: "Assistant",
  instructions:
    "You help users find products. When the user asks about a product, call the lookup_product tool.",
  tools: [lookupProduct],
});

Pass this agent into startVoiceSession(agent, token) from Step 2. When the user speaks, the model decides whether to call the tool. The tool's execute function runs in the browser against the OSDK client, and the result is fed back to the model for the response.
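
Because the model sees only what the tool returns, it can help to shape the result explicitly before returning it. The sketch below is one possible approach: ProductLike and toToolResult are hypothetical names, and the properties mirror the stub above rather than any real object type.

```typescript
// Hypothetical product shape; substitute the properties of your own object type.
interface ProductLike {
  id: string;
  name: string;
  price: number;
  stock: number;
  [extra: string]: unknown;
}

// Hypothetical helper: expose only selected fields and cap the number of results.
function toToolResult(products: ProductLike[], maxResults = 5): string {
  const visible = products
    .slice(0, maxResults)
    .map(({ id, name, price, stock }) => ({ id, name, price, stock }));
  return JSON.stringify({
    count: visible.length,
    truncated: products.length > maxResults, // tells the model more matches exist
    products: visible,
  });
}
```

Listing the exposed fields explicitly means properties added to the object type later are not passed to the model unless you opt in.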

Tools can also write to the Ontology. To do this, call an action from inside the tool's execute function. Actions go through the standard Foundry permission and validation paths.

Step 4: Stream microphone audio and play back the response

Given the RealtimeSession returned by startVoiceSession, wire up microphone capture and audio playback. Capture microphone audio using the browser's MediaDevices API, convert it to 16-bit PCM at 24 kHz, and forward each chunk to the session via session.sendAudio(chunk) as it arrives. On the output side, enqueue the audio chunks received from the session for playback.

The microphone capture and playback queue implementation is out of scope for this tutorial. The session-level wiring for receiving audio looks like:

const session = await startVoiceSession(agent, token);

// Surface mid-session errors from the SDK. WebSocket-level transport errors
// (such as a failure to establish the connection) are logged to the console
// from the handlers in `createTransport`; if the connection never establishes,
// `startVoiceSession` rejects after `CONNECT_TIMEOUT_MS`.
session.on("error", (event) => {
  console.error("Realtime session error:", event.error);
});

// Play back audio from the model
session.on("audio", (event) => {
  playbackQueue.enqueue(event.data); // your playback implementation
});

// Clear playback when the model is interrupted (for example, the user starts speaking)
session.on("audio_interrupted", () => {
  playbackQueue.clear();
});

// Send mic audio to the model as each 16-bit PCM chunk (24 kHz) arrives
startMicCapture((pcm16) => session.sendAudio(pcm16)); // your mic capture implementation

// Optional: prompt the agent to greet the user immediately rather than
// waiting for the user to speak first.
session.sendMessage("(Session started. Greet the user briefly.)");
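
The wiring above leaves sample conversion to your capture and playback code. As a starting point, the sketch below converts between the float samples that the Web Audio API produces and 16-bit PCM. All three function names are hypothetical, the input is assumed to be mono audio already resampled to 24 kHz, and little-endian byte order is assumed.

```typescript
const SAMPLE_RATE_HZ = 24_000; // sample rate used in this tutorial

// Convert float samples in [-1, 1] to little-endian 16-bit PCM.
function floatTo16BitPcm(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(scaled), true);
  }
  return buffer;
}

// Convert little-endian 16-bit PCM back to float samples for playback.
function pcm16ToFloat(buffer: ArrayBuffer): Float32Array {
  const view = new DataView(buffer);
  const out = new Float32Array(Math.floor(buffer.byteLength / 2));
  for (let i = 0; i < out.length; i++) {
    out[i] = view.getInt16(i * 2, true) / 0x8000;
  }
  return out;
}

// Duration of a mono PCM16 chunk, useful when scheduling playback.
function pcm16DurationMs(byteLength: number, sampleRateHz = SAMPLE_RATE_HZ): number {
  return (byteLength / 2 / sampleRateHz) * 1000;
}
```

A mic capture callback could pass floatTo16BitPcm(samples) to session.sendAudio, and a playback queue could use pcm16ToFloat and pcm16DurationMs to fill and schedule audio buffers.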

Next steps

Build out your application. Wire the session into your frontend, add the tools your application needs, and call OSDK queries and actions from inside the tool execute functions to integrate with the Ontology.
Improve agent instructions. See the OpenAI realtime prompting guide for guidance on writing effective instructions for realtime models.
