
Realtime transcription (Beta)
==============================

Learn how to transcribe audio in real time with the Realtime API.

You can use the Realtime API for transcription-only use cases, either with input from a microphone or from a file. For example, you can use it to generate subtitles or transcripts in real time. In transcription-only mode, the model does not generate responses.

If you want the model to produce responses, you can use the Realtime API in speech-to-speech conversation mode.

Realtime transcription sessions

To use the Realtime API for transcription, you need to create a transcription session, connecting via WebSockets or WebRTC.
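As a minimal sketch, a WebSocket connection from Python could look like the following. It uses the third-party websocket-client package; the endpoint URL with the ?intent=transcription query parameter and the headers shown are assumptions based on the standard Realtime API connection flow, so check the API reference for the exact values.

```python
# Minimal sketch: open a WebSocket for a Realtime transcription session.
# Uses the third-party websocket-client package (pip install websocket-client).
# The endpoint and ?intent=transcription query parameter are assumptions here;
# check the API reference for the exact connection details.
import json
import os

import websocket

ws = websocket.create_connection(
    "wss://api.openai.com/v1/realtime?intent=transcription",
    header=[
        "Authorization: Bearer " + os.environ["OPENAI_API_KEY"],
        "OpenAI-Beta: realtime=v1",
    ],
)

# The first server event confirms that the transcription session was created.
print(json.loads(ws.recv())["type"])
```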

Unlike regular Realtime API sessions for conversations, transcription sessions typically don't contain responses from the model.

The transcription session object is also different from regular Realtime API sessions:

```json
{
  object: "realtime.transcription_session",
  id: string,
  input_audio_format: string,
  input_audio_transcription: [{
    model: string,
    prompt: string,
    language: string
  }],
  turn_detection: {
    type: "server_vad",
    threshold: float,
    prefix_padding_ms: integer,
    silence_duration_ms: integer
  } | null,
  input_audio_noise_reduction: {
    type: "near_field" | "far_field"
  },
  include: list[string] | null
}
```

Some of the additional properties transcription sessions support are:

  • input_audio_transcription.model: The transcription model to use; currently gpt-4o-transcribe, gpt-4o-mini-transcribe, and whisper-1 are supported
  • input_audio_transcription.prompt: The prompt to use for the transcription, to guide the model (e.g. "Expect words related to technology")
  • input_audio_transcription.language: The language to use for the transcription, ideally in ISO-639-1 format (e.g. "en", "fr"...) to improve accuracy and latency
  • input_audio_noise_reduction: The noise reduction configuration to use for the transcription
  • include: The list of properties to include in the transcription events

Possible values for the input audio format are: pcm16 (default), g711_ulaw and g711_alaw.

You can find more information about the transcription session object in the API reference.
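As a rough sketch, these properties can be set by sending a transcription_session.update event over the WebSocket, reusing the ws connection from the sketch above. The payload mirrors the full update example shown later in this guide.

```python
# Sketch: configure the transcription model, prompt, and language for the session.
# Assumes `ws` is the WebSocket connection opened in the earlier sketch.
import json

ws.send(json.dumps({
    "type": "transcription_session.update",
    "input_audio_format": "pcm16",
    "input_audio_transcription": {
        "model": "gpt-4o-transcribe",
        "prompt": "Expect words related to technology",
        "language": "en",
    },
}))
```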

Handling transcriptions

When using the Realtime API for transcription, you can listen for the conversation.item.input_audio_transcription.delta and conversation.item.input_audio_transcription.completed events.

For whisper-1, the delta event contains the full turn transcript, the same as the completed event. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the delta event contains incremental transcripts as they are streamed out of the model.

Here is an example transcription delta event:

```json
{
  "event_id": "event_2122",
  "type": "conversation.item.input_audio_transcription.delta",
  "item_id": "item_003",
  "content_index": 0,
  "delta": "Hello,"
}
```

Here is an example transcription completion event:

```json
{
  "event_id": "event_2122",
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_003",
  "content_index": 0,
  "transcript": "Hello, how are you?"
}
```

Note that ordering between completion events from different speech turns is not guaranteed. You should use item_id to match these events to the input_audio_buffer.committed events, and use the previous_item_id field on input_audio_buffer.committed to handle the ordering.
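As a minimal sketch of that bookkeeping (assuming the ws connection from the earlier sketches), you can accumulate deltas per item_id and record the turn order from the committed events:

```python
# Sketch: accumulate transcripts per item_id and keep speech turns in order.
# Assumes `ws` is the open WebSocket from the earlier sketches.
import json

turn_order = []   # item_ids in the order their audio buffers were committed
transcripts = {}  # item_id -> transcript text so far

while True:
    event = json.loads(ws.recv())
    item_id = event.get("item_id")

    if event["type"] == "input_audio_buffer.committed":
        # committed events carry previous_item_id; here we simply record arrival order
        turn_order.append(item_id)
    elif event["type"] == "conversation.item.input_audio_transcription.delta":
        transcripts[item_id] = transcripts.get(item_id, "") + event["delta"]
    elif event["type"] == "conversation.item.input_audio_transcription.completed":
        transcripts[item_id] = event["transcript"]
        print(" ".join(transcripts.get(i, "") for i in turn_order))
```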

To send audio data to the transcription session, you can use the input_audio_buffer.append event.

You have two options (the second is sketched below):

  • Use a streaming microphone input
  • Stream data from a wav file
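For the wav-file option, a minimal sketch could look like the following. It assumes the file already matches the session's audio format (16-bit mono PCM, e.g. pcm16) and reuses the ws connection from the earlier sketches.

```python
# Sketch: stream audio from a wav file as input_audio_buffer.append events.
# Assumes the file is already 16-bit mono PCM at the sample rate the API expects,
# and that `ws` is the open WebSocket from the earlier sketches.
import base64
import json
import wave

with wave.open("speech.wav", "rb") as f:
    chunk_frames = f.getframerate() // 10  # ~100 ms of audio per event
    while True:
        frames = f.readframes(chunk_frames)
        if not frames:
            break
        ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(frames).decode("ascii"),
        }))
```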

Voice activity detection

The Realtime API supports automatic voice activity detection (VAD). Enabled by default, VAD controls when the input audio buffer is committed, and therefore when transcription begins.

Read more about configuring VAD in our Voice Activity Detection guide.

You can also disable VAD by setting the turn_detection property to null, and control when to commit the input audio on your end.
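A sketch of that manual flow, assuming the ws connection from earlier and a hypothetical audio_chunks list of base64-encoded audio, could look like this:

```python
# Sketch: disable server VAD and commit the audio buffer manually.
# Assumes `ws` is the open WebSocket; `audio_chunks` is a hypothetical list of
# base64-encoded pcm16 audio chunks.
import json

ws.send(json.dumps({
    "type": "transcription_session.update",
    "turn_detection": None,  # disable automatic VAD
}))

for chunk in audio_chunks:
    ws.send(json.dumps({"type": "input_audio_buffer.append", "audio": chunk}))

# Commit when you decide the turn is over; transcription runs on the committed audio.
ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
```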

Additional configurations

Noise reduction

You can use the input_audio_noise_reduction property to configure how to handle noise reduction in the audio stream.

The possible values are:

  • near_field: Use near-field noise reduction.
  • far_field: Use far-field noise reduction.
  • null: Disable noise reduction.

The default value is near_field, and you can disable noise reduction by setting the property to null.
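For example (a sketch reusing the ws connection from the earlier examples), noise reduction can be switched to far-field or disabled entirely:

```python
# Sketch: switch to far-field noise reduction, or disable it entirely with None.
# Assumes `ws` is the open WebSocket from the earlier sketches.
import json

ws.send(json.dumps({
    "type": "transcription_session.update",
    "input_audio_noise_reduction": {"type": "far_field"},
}))

# To disable noise reduction, send null for the property:
ws.send(json.dumps({
    "type": "transcription_session.update",
    "input_audio_noise_reduction": None,
}))
```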

Using logprobs

You can use the include property to include logprobs in the transcription events, using item.input_audio_transcription.logprobs.

Those logprobs can be used to calculate a confidence score for the transcription. For example, the following transcription_session.update event configures a full session and requests logprobs:

```json
{
  "type": "transcription_session.update",
  "input_audio_format": "pcm16",
  "input_audio_transcription": {
    "model": "gpt-4o-transcribe",
    "prompt": "",
    "language": ""
  },
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500
  },
  "input_audio_noise_reduction": {
    "type": "near_field"
  },
  "include": [
    "item.input_audio_transcription.logprobs"
  ]
}
```
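As a sketch of how the logprobs might be turned into a confidence score, you could average the per-token probabilities of a completed event. The exact shape of the logprobs payload (a list of entries with a logprob field) is an assumption here; check the API reference.

```python
# Sketch: turn token logprobs into a rough per-turn confidence score.
# Assumes `event` is a conversation.item.input_audio_transcription.completed event
# whose `logprobs` list entries carry a `logprob` field (see the API reference).
import math

def transcript_confidence(event):
    logprobs = event.get("logprobs") or []
    if not logprobs:
        return None
    # Average token probability; other aggregations (min, geometric mean) also work.
    probs = [math.exp(lp["logprob"]) for lp in logprobs]
    return sum(probs) / len(probs)
```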
