Realtime transcription (Beta)
=============================
Learn how to transcribe audio in real time with the Realtime API.

You can use the Realtime API for transcription-only use cases, either with input from a microphone or from a file. For example, you can use it to generate subtitles or transcripts in real time. In transcription-only mode, the model will not generate responses.
If you want the model to produce responses, you can use the Realtime API in speech-to-speech conversation mode.
Realtime transcription sessions
-------------------------------
To use the Realtime API for transcription, you need to create a transcription session and connect to it via WebSockets or WebRTC.

Unlike regular Realtime API sessions for conversations, transcription sessions typically don't contain responses from the model.
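For example, connecting over WebSockets might look like the following sketch. The `intent=transcription` query parameter and the `OpenAI-Beta` header match the API reference at the time of writing, but double-check them there.

```typescript
import WebSocket from "ws";

// Open a Realtime connection dedicated to transcription.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?intent=transcription",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => console.log("Transcription session connected"));
ws.on("message", (data) => console.log(JSON.parse(data.toString())));
```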
The transcription session object is also different from regular Realtime API sessions:
```
{
  object: "realtime.transcription_session",
  id: string,
  input_audio_format: string,
  input_audio_transcription: [{
    model: string,
    prompt: string,
    language: string
  }],
  turn_detection: {
    type: "server_vad",
    threshold: float,
    prefix_padding_ms: integer,
    silence_duration_ms: integer
  } | null,
  input_audio_noise_reduction: {
    type: "near_field" | "far_field"
  },
  include: list[string] | null
}
```
Some of the additional properties transcription sessions support are:

- `input_audio_transcription.model`: The transcription model to use; currently `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `whisper-1` are supported
- `input_audio_transcription.prompt`: The prompt to use for the transcription, to guide the model (e.g. "Expect words related to technology")
- `input_audio_transcription.language`: The language to use for the transcription, ideally in ISO-639-1 format (e.g. "en", "fr", ...), to improve accuracy and latency
- `input_audio_noise_reduction`: The noise reduction configuration to use for the transcription
- `include`: The list of properties to include in the transcription events
Possible values for the input audio format are `pcm16` (default), `g711_ulaw`, and `g711_alaw`.
You can find more information about the transcription session object in the API reference.
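If you need an ephemeral client token, for example to connect directly from a browser, you can first create the transcription session over REST. The sketch below assumes the `/v1/realtime/transcription_sessions` endpoint and a `client_secret` field in the response as described in the API reference; verify both there.

```typescript
// Create a transcription session server-side to mint an ephemeral token.
const response = await fetch(
  "https://api.openai.com/v1/realtime/transcription_sessions",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      input_audio_format: "pcm16",
      input_audio_transcription: { model: "gpt-4o-transcribe" },
    }),
  }
);

const session = await response.json();
// The ephemeral secret is what you hand to the browser client (assumed shape).
console.log(session.client_secret);
```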
Handling transcriptions
-----------------------
When using the Realtime API for transcription, you can listen for the `conversation.item.input_audio_transcription.delta` and `conversation.item.input_audio_transcription.completed` events.
For `whisper-1`, the `delta` event will contain the full turn transcript, the same as the `completed` event. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the `delta` event will contain incremental transcripts as they are streamed out from the model.
Here is an example transcription delta event:
```json
{
  "event_id": "event_2122",
  "type": "conversation.item.input_audio_transcription.delta",
  "item_id": "item_003",
  "content_index": 0,
  "delta": "Hello,"
}
```
Here is an example transcription completion event:
```json
{
  "event_id": "event_2122",
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_003",
  "content_index": 0,
  "transcript": "Hello, how are you?"
}
```
Note that ordering between completion events from different speech turns is not guaranteed. You should use `item_id` to match these events to the `input_audio_buffer.committed` events and use `input_audio_buffer.committed.previous_item_id` to handle the ordering.
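A minimal sketch of that matching logic follows. The event fields are the ones shown above, but the assumption that the first committed item has a `previous_item_id` of `null` is ours, so verify it against the event reference.

```typescript
type RealtimeEvent = {
  type: string;
  item_id?: string;
  previous_item_id?: string | null;
  transcript?: string;
};

const transcripts = new Map<string, string>();        // item_id -> transcript
const previousOf = new Map<string, string | null>();  // item_id -> previous_item_id

function handleEvent(event: RealtimeEvent) {
  if (event.type === "input_audio_buffer.committed") {
    // Committed events carry the turn ordering.
    previousOf.set(event.item_id!, event.previous_item_id ?? null);
  } else if (
    event.type === "conversation.item.input_audio_transcription.completed"
  ) {
    // Completion events may arrive out of order; key them by item_id.
    transcripts.set(event.item_id!, event.transcript ?? "");
  }
}

// Walk the previous_item_id chain to recover transcripts in turn order.
function orderedTranscripts(): string[] {
  const nextOf = new Map<string | null, string>();
  for (const [id, prev] of previousOf) nextOf.set(prev, id);

  const ordered: string[] = [];
  let current = nextOf.get(null); // assumed: first item has no predecessor
  while (current !== undefined) {
    const transcript = transcripts.get(current);
    if (transcript !== undefined) ordered.push(transcript);
    current = nextOf.get(current);
  }
  return ordered;
}
```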
To send audio data to the transcription session, you can use the `input_audio_buffer.append` event.

You have two options:

- Use a streaming microphone input
- Stream data from a wav file (sketched after this list)
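For the file option, a rough sketch is shown below. It assumes a 24kHz, 16-bit mono PCM WAV file with a canonical 44-byte header; real WAV files can contain extra chunks, so use a proper parser in production.

```typescript
import fs from "node:fs";
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?intent=transcription",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => {
  // Skip the assumed 44-byte RIFF header to get at the raw pcm16 samples.
  const pcm = fs.readFileSync("audio.wav").subarray(44);

  // Send ~100ms chunks: 24000 samples/s * 0.1s * 2 bytes per sample.
  const chunkSize = 4800;
  for (let offset = 0; offset < pcm.length; offset += chunkSize) {
    ws.send(
      JSON.stringify({
        type: "input_audio_buffer.append",
        audio: pcm.subarray(offset, offset + chunkSize).toString("base64"),
      })
    );
  }
});
```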
Voice activity detection
------------------------

The Realtime API supports automatic voice activity detection (VAD). Enabled by default, VAD will control when the input audio buffer is committed, and therefore when transcription begins.
Read more about configuring VAD in our Voice Activity Detection guide.
You can also disable VAD by setting the `turn_detection` property to `null`, and control when to commit the input audio on your end.
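With VAD disabled, you commit the buffer yourself once a complete utterance has been streamed. A minimal sketch, assuming an open WebSocket `ws` as in the earlier examples and the update event shape shown at the end of this guide:

```typescript
// Turn off server VAD; the client now decides when a turn ends.
ws.send(
  JSON.stringify({
    type: "transcription_session.update",
    turn_detection: null,
  })
);

// ...stream audio with input_audio_buffer.append events...

// Commit the buffer explicitly to end the turn and trigger transcription.
ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
```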
Additional configurations
-------------------------

### Noise reduction
You can use the `input_audio_noise_reduction` property to configure how to handle noise reduction in the audio stream.
The possible values are:

- `near_field`: Use near-field noise reduction.
- `far_field`: Use far-field noise reduction.
- `null`: Disable noise reduction.

The default value is `near_field`, and you can disable noise reduction by setting the property to `null`.
### Using logprobs

You can use the `include` property to include logprobs in the transcription events, using `item.input_audio_transcription.logprobs`. Those logprobs can be used to calculate the confidence score of the transcription.

For example, the following `transcription_session.update` event enables logprobs alongside the rest of the session configuration:
```json
{
  "type": "transcription_session.update",
  "input_audio_format": "pcm16",
  "input_audio_transcription": {
    "model": "gpt-4o-transcribe",
    "prompt": "",
    "language": ""
  },
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500
  },
  "input_audio_noise_reduction": {
    "type": "near_field"
  },
  "include": [
    "item.input_audio_transcription.logprobs"
  ]
}
```
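To turn those logprobs into a single confidence number, one simple option is the geometric mean of the token probabilities. The payload shape assumed below (a list of objects with a `logprob` field) is an assumption; check the event reference for the authoritative schema.

```typescript
// Rough confidence estimate from per-token logprobs (assumed shape).
function transcriptConfidence(logprobs: { logprob: number }[]): number {
  if (logprobs.length === 0) return 0;
  const avgLogprob =
    logprobs.reduce((sum, lp) => sum + lp.logprob, 0) / logprobs.length;
  return Math.exp(avgLogprob); // geometric mean of token probabilities, in [0, 1]
}
```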