Speech Recognition

The Speech Recognition API creates a transcript of the speech in an audio or video file. You can then pass this output to other Haven OnDemand APIs, such as Concept Extraction or Add to Text Index, for further analysis.

The Speech Recognition API currently supports broadcast-quality content in several languages, as well as telephony-grade audio for some of those languages.

Quick Start

You can upload a video to the API as a file, in which case you must use a POST method.

curl -X POST http://api.havenondemand.com/1/api/async/recognizespeech/v2 --form "language_model=en-US" --form "file=@hpnext.mp4"

You can also input a URL or a Haven OnDemand reference.

Note: Because input files for this API can be large and take a long time to process, the API runs only in asynchronous mode. See Get the Results.

Note: This API has rate and duration limits:

Input files are truncated after 30 minutes.
The processing terminates after two hours and returns only what it has completed within that time.

For more information, see Rate Limiting, Quotas, Data Expiry, and Maximums.

The API provides a segmented transcription of the entire file. It also returns a start and end time offset and a confidence value for each word in the output.

For example, the sample file hpnext.mp4 returns the following (output truncated to the first few items):


"items": [{
    "start_time_offset": 0.84,
    "end_time_offset": 0.95,
    "text": "we",
    "confidence": 97
}, {
    "start_time_offset": 0.95,
    "end_time_offset": 1.14,
    "text": "want",
    "confidence": 70
}, {
    "start_time_offset": 1.14,
    "end_time_offset": 1.2,
    "text": "to",
    "confidence": 78
}, {
    "start_time_offset": 1.2,
    "end_time_offset": 1.41,
    "text": "hear",
    "confidence": 90
}, {
    "start_time_offset": 1.41,
    "end_time_offset": 1.61,
    "text": "from",
    "confidence": 95
}, {
    "start_time_offset": 1.61,
    "end_time_offset": 1.93,
    "text": "you",
    "confidence": 86
}, {
    "start_time_offset": 2.45,
    "end_time_offset": 2.67,
    "text": "let's",
    "confidence": 91
}]
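The per-word items can be joined back into a plain transcript, and the confidence values can be used to flag words that may need review. The following is a minimal Python sketch, assuming the response has already been parsed into a dictionary. The function names and the threshold of 80 are illustrative choices, not part of the API.

```python
def transcript(items):
    """Join the recognized words into one string, in the order returned."""
    return " ".join(item["text"] for item in items)


def low_confidence(items, threshold=80):
    """Return the words whose confidence is below an illustrative threshold."""
    return [item["text"] for item in items if item["confidence"] < threshold]


# The items from the truncated sample output for hpnext.mp4.
sample = {
    "items": [
        {"start_time_offset": 0.84, "end_time_offset": 0.95, "text": "we", "confidence": 97},
        {"start_time_offset": 0.95, "end_time_offset": 1.14, "text": "want", "confidence": 70},
        {"start_time_offset": 1.14, "end_time_offset": 1.2, "text": "to", "confidence": 78},
        {"start_time_offset": 1.2, "end_time_offset": 1.41, "text": "hear", "confidence": 90},
        {"start_time_offset": 1.41, "end_time_offset": 1.61, "text": "from", "confidence": 95},
        {"start_time_offset": 1.61, "end_time_offset": 1.93, "text": "you", "confidence": 86},
        {"start_time_offset": 2.45, "end_time_offset": 2.67, "text": "let's", "confidence": 91},
    ]
}

print(transcript(sample["items"]))      # we want to hear from you let's
print(low_confidence(sample["items"]))  # ['want', 'to']
```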

If you provide a URL, it must link directly to an audio or video file. You cannot link to a page with an embedded video (such as a news page or YouTube link).

/1/api/async/recognizespeech/v2?url=https://www.havenondemand.com/sample-content/videos/hpnext.mp4&language_model=en-US
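When you build this request in code, it is usually safer to percent-encode the query parameters than to concatenate them by hand. The following Python sketch shows one way to do that with the standard library. The function name and the YOUR_API_KEY placeholder are assumptions for illustration; the host is the one used in the curl examples on this page.

```python
from urllib.parse import urlencode

API_ROOT = "http://api.havenondemand.com"
ENDPOINT = "/1/api/async/recognizespeech/v2"


def build_request_url(media_url, language_model, apikey):
    """Build the request URL with percent-encoded query parameters."""
    query = urlencode({
        "url": media_url,
        "language_model": language_model,
        "apikey": apikey,
    })
    return f"{API_ROOT}{ENDPOINT}?{query}"


print(build_request_url(
    "https://www.havenondemand.com/sample-content/videos/hpnext.mp4",
    "en-US",
    "YOUR_API_KEY",  # placeholder, replace with your own key
))
```

The encoded form differs from the literal query string shown above only in that reserved characters in the media URL (such as : and /) are escaped.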

Specify the Language

You must specify the language model for the recognition engine to use via the language_model parameter.

curl -X POST http://api.havenondemand.com/1/api/async/recognizespeech/v2 --form "language_model=en-US" --form "file=@hpnext.mp4"

The Speech Recognition API provides Broadband and Telephony language options. Each language option is trained on a large body of representative data. The Broadband options are trained on many hours of broadcast-quality content, such as TV news programs, while the Telephony options are trained on many hours of voice calls.

For the highest accuracy, use the option and model that most closely resemble your voice data. For example, if you are processing voice mails recorded over a telephone network, use the Telephony language option, if available. If you have a reasonable-quality recording, such as a webinar recorded with a good-quality microphone, use the Broadband language option.

For more information about how to pick the appropriate language option to get the best results for your data, see Language Models.

Get the Results

The asynchronous mode returns a job-id, which you can then use to extract your results. There are two methods for this:

Use /1/job/status/ to get the status of the job, including results if the job is finished.
Use /1/job/result/, which waits until the job has finished and then returns the result.

Note: Because /result has to wait for the job to finish before it can return a response, using it for longer operations such as processing a large video file can result in an HTTP request timeout response. The /result method returns a response either when the result is available, or after 120 seconds, whichever is sooner. If the job is not complete after 120 seconds, the /result method returns a code 7010 (job result request timeout) response. This means that your asynchronous job is still in progress. To avoid the timeout, use /status instead.
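The /status approach amounts to a polling loop. The following Python sketch shows the general shape, with the HTTP call left to a caller-supplied function so that the example runs without network access. The function names are illustrative, and the response shape is an assumption: the sketch expects a top-level "status" field that reads "finished" or "failed" once the job is done, which you should confirm against a real response. The default maximum wait matches the two-hour processing limit described in the Quick Start.

```python
import time


def poll_job(fetch_status, job_id, interval=5.0, max_wait=7200.0, sleep=time.sleep):
    """Call fetch_status(job_id) until the job is no longer in progress.

    fetch_status is supplied by the caller: it should request /1/job/status/
    for the given job ID and return the parsed JSON response as a dict.
    ASSUMPTION: the response has a top-level "status" field that becomes
    "finished" or "failed" when the job is done.
    """
    waited = 0.0
    while True:
        response = fetch_status(job_id)
        if response.get("status") in ("finished", "failed"):
            return response
        if waited >= max_wait:
            raise TimeoutError(f"job {job_id} still in progress after {waited:.0f} s")
        sleep(interval)
        waited += interval


# Stand-in for a real HTTP call: three canned responses, no actual waiting.
fake_responses = iter([
    {"status": "queued"},
    {"status": "in progress"},
    {"status": "finished"},
])
result = poll_job(lambda job_id: next(fake_responses), "example-job-id",
                  sleep=lambda seconds: None)
print(result["status"])  # finished
```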

Optimize Results

The quality of the audio file that you send can have a large effect on the quality of the speech recognition output. For example, the location of the microphone, background noise, and audio compression can all have an effect on how well the Speech Recognition API detects the words in a particular audio file. For more information on how to get the best results from this API, see Speech Processing Concepts.

Parameter	Description
apikey	The API key to use to authenticate the API request.

Required
Name	Type	Description
file	binary	A media file containing the speech to transcribe. Multipart POST only.
reference	string	A Haven OnDemand object store reference obtained from either the Expand Container or Store Object API. The corresponding video is passed to the API.
url	string	A publicly accessible HTTP URL from which a video or audio file can be retrieved.
language_model	resource	The language of the provided speech. For the highest accuracy, use the option and model that most closely resemble your voice data. For example, to process voice mails recorded over a telephone network, use the Telephony language option, if available. If you have a reasonable-quality recording, such as a webinar recorded with a good-quality microphone, use the Broadband language option.

{
  "properties": {
    "source_information": {
      "properties": {
        "mime_type": { "type": "string" },
        "video_information": {
          "properties": {
            "width": { "type": "integer", "minimum": 1 },
            "height": { "type": "integer", "minimum": 1 },
            "codec": { "type": "string" },
            "pixel_aspect_ratio": { "type": "string" }
          },
          "type": "object",
          "required": [ "width", "height", "codec", "pixel_aspect_ratio" ]
        },
        "audio_information": {
          "properties": {
            "codec": { "type": "string" },
            "sample_rate": { "type": "integer" },
            "channels": { "type": "integer" }
          },
          "type": "object",
          "required": [ "codec", "sample_rate", "channels" ]
        }
      },
      "required": [ "mime_type" ],
      "type": "object"
    },
    "items": {
      "items": {
        "properties": {
          "start_time_offset": { "type": "number" },
          "end_time_offset": { "type": "number" },
          "text": { "type": "string" },
          "confidence": { "type": "number" }
        },
        "required": [ "start_time_offset", "end_time_offset", "text", "confidence" ],
        "type": "object"
      },
      "type": "array"
    }
  },
  "required": [ "items" ],
  "type": "object"
}
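Before processing a response, it can help to check that it has the shape the schema describes. The following Python sketch checks only the required "items" array and the four required fields of each item. It is a hand-written illustration with hypothetical names, not a full JSON Schema validator, and it ignores the optional source_information object.

```python
# Required item fields and the Python types accepted for each,
# following the "items" part of the response schema.
REQUIRED_ITEM_FIELDS = {
    "start_time_offset": (int, float),
    "end_time_offset": (int, float),
    "text": str,
    "confidence": (int, float),
}


def validate_response(response):
    """Return a list of problems found; an empty list means the checks passed."""
    items = response.get("items")
    if not isinstance(items, list):
        return ['"items" is missing or is not an array']
    problems = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"items[{index}]: not an object")
            continue
        for field, expected in REQUIRED_ITEM_FIELDS.items():
            if field not in item:
                problems.append(f"items[{index}]: missing {field}")
            # bool is a subclass of int in Python, so reject it explicitly.
            elif isinstance(item[field], bool) or not isinstance(item[field], expected):
                problems.append(f"items[{index}]: {field} has the wrong type")
    return problems


good = {"items": [{"start_time_offset": 0.84, "end_time_offset": 0.95,
                   "text": "we", "confidence": 97}]}
bad = {"items": [{"start_time_offset": 0.84, "text": "we", "confidence": "high"}]}

print(validate_response(good))  # []
print(validate_response(bad))
```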

Speech Recognition Response {
	source_information ( Source_information , optional)	Metadata information about a media file
	items ( array[Items] )	The format of speech transcription results in the response.
}

Speech Recognition Response:Source_information {
	mime_type ( string )	MIME type of the document.
	video_information ( Video_information , optional)	Information about the video track if one is present.
	audio_information ( Audio_information , optional)	Information about the audio track if one is present.
}

Speech Recognition Response:Source_information:Video_information {
	width ( integer )	The width of the video in pixels.
	height ( integer )	The height of the video in pixels.
	codec ( string )	The algorithm used to encode the video.
	pixel_aspect_ratio ( string )	The aspect ratio of pixels in the video. For example, if the video is made up of square pixels this value is 1:1.
}

Speech Recognition Response:Source_information:Audio_information {
	codec ( string )	The algorithm used to encode the audio.
	sample_rate ( integer )	The frequency at which the audio was sampled.
	channels ( integer )	The number of channels present in the audio. For example, for stereo this value is 2.
}

Speech Recognition Response:Items {
	start_time_offset ( number )	Time from the start of the media to where the word starts. This value is expressed as a non-integer number.
	end_time_offset ( number )	Time from the start of the media to where the word ends. This value is expressed as a non-integer number.
	text ( string )	The word(s) being spoken at the specified time.
	confidence ( number )	The confidence in the transcription, as a value from 0 to 100.
}
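Because each item carries both a start and an end offset, the gaps between consecutive words can be used to break the transcript into phrases. The following Python sketch is one possible approach, not something the API does for you. The function name and the 0.5 gap threshold (in the same units as the offsets) are illustrative assumptions.

```python
def split_on_pauses(items, min_gap=0.5):
    """Group consecutive words into phrases.

    A new phrase starts when the gap between the previous word's
    end_time_offset and the next word's start_time_offset is at least min_gap.
    """
    phrases = []
    current = []
    previous_end = None
    for item in items:
        if previous_end is not None and item["start_time_offset"] - previous_end >= min_gap:
            phrases.append(" ".join(current))
            current = []
        current.append(item["text"])
        previous_end = item["end_time_offset"]
    if current:
        phrases.append(" ".join(current))
    return phrases


# The last three items of the sample output shown in the Quick Start.
words = [
    {"start_time_offset": 1.41, "end_time_offset": 1.61, "text": "from", "confidence": 95},
    {"start_time_offset": 1.61, "end_time_offset": 1.93, "text": "you", "confidence": 86},
    {"start_time_offset": 2.45, "end_time_offset": 2.67, "text": "let's", "confidence": 91},
]

print(split_on_pauses(words))  # ['from you', "let's"]
```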

Notifications