You can upload a video to the API as a file, in which case you must use a POST method.
curl -X POST http://api.havenondemand.com/1/api/async/recognizespeech/v2 --form "language_model=en-US" --form "file=@hpnext.mp4"
You can also input a URL or a Haven OnDemand reference.
Note: Because input files for this API can be large and take a long time to process, the API runs only in asynchronous mode. See Get the Results.
Note: This API has rate and duration limits:
- Input files are truncated after 30 minutes.
- The processing terminates after two hours and returns only what it has completed within that time.
For more information, see Rate Limiting, Quotas, Data Expiry, and Maximums.
The API provides a segmented transcription of the entire file. It also returns a start and end time offset and a confidence value for each word in the output.
For example, the sample file hpnext.mp4 returns the following (output truncated to the first few items):
"items": [{
    "start_time_offset": 0.84,
    "end_time_offset": 0.95,
    "text": "we",
    "confidence": 97
}, {
    "start_time_offset": 0.95,
    "end_time_offset": 1.14,
    "text": "want",
    "confidence": 70
}, {
    "start_time_offset": 1.14,
    "end_time_offset": 1.2,
    "text": "to",
    "confidence": 78
}, {
    "start_time_offset": 1.2,
    "end_time_offset": 1.41,
    "text": "hear",
    "confidence": 90
}, {
    "start_time_offset": 1.41,
    "end_time_offset": 1.61,
    "text": "from",
    "confidence": 95
}, {
    "start_time_offset": 1.61,
    "end_time_offset": 1.93,
    "text": "you",
    "confidence": 86
}, {
    "start_time_offset": 2.45,
    "end_time_offset": 2.67,
    "text": "let's",
    "confidence": 91
}]
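As an illustration of how the per-word output might be consumed, the following Python sketch joins the words into a plain transcript and locates pauses between words. The helper names build_transcript and find_pauses are hypothetical and not part of the API, and the sketch assumes that confidence is a value from 0 to 100, as the sample output suggests.

```python
from typing import List, Tuple

# The "items" array from the truncated sample output for hpnext.mp4.
sample_items = [
    {"start_time_offset": 0.84, "end_time_offset": 0.95, "text": "we", "confidence": 97},
    {"start_time_offset": 0.95, "end_time_offset": 1.14, "text": "want", "confidence": 70},
    {"start_time_offset": 1.14, "end_time_offset": 1.2, "text": "to", "confidence": 78},
    {"start_time_offset": 1.2, "end_time_offset": 1.41, "text": "hear", "confidence": 90},
    {"start_time_offset": 1.41, "end_time_offset": 1.61, "text": "from", "confidence": 95},
    {"start_time_offset": 1.61, "end_time_offset": 1.93, "text": "you", "confidence": 86},
    {"start_time_offset": 2.45, "end_time_offset": 2.67, "text": "let's", "confidence": 91},
]


def build_transcript(items: List[dict], min_confidence: int = 0) -> str:
    """Join the recognized words, skipping any word below min_confidence.

    Assumption: confidence is on a 0-100 scale, as in the sample output.
    """
    words = [item["text"] for item in items if item["confidence"] >= min_confidence]
    return " ".join(words)


def find_pauses(items: List[dict], min_gap: float = 0.5) -> List[Tuple[float, float]]:
    """Return (pause_start, pause_end) pairs, in seconds, for every gap of at
    least min_gap seconds between the end of one word and the start of the next."""
    pauses = []
    for previous, current in zip(items, items[1:]):
        gap = current["start_time_offset"] - previous["end_time_offset"]
        if gap >= min_gap:
            pauses.append((previous["end_time_offset"], current["start_time_offset"]))
    return pauses


print(build_transcript(sample_items))
print(build_transcript(sample_items, min_confidence=80))
print(find_pauses(sample_items))
```

Filtering on confidence drops words rather than correcting them, so a threshold such as 80 is only a starting point to tune against your own recordings.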
If you provide a URL, it must link directly to an audio or video file. You cannot link to a page with an embedded video (such as a news page or YouTube link).
/1/api/async/recognizespeech/v2?url=https://www.havenondemand.com/sample-content/videos/hpnext.mp4&language_model=en-US
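If you build this request in code, it is safer to percent-encode the media URL than to paste it into the query string by hand. The following Python sketch shows one way to do that with the standard library. The function name build_request_path is hypothetical, and the sketch assumes that the API accepts a percent-encoded url value, which is standard behavior for query strings.

```python
from urllib.parse import urlencode

# Endpoint path taken from the example above.
API_PATH = "/1/api/async/recognizespeech/v2"


def build_request_path(media_url: str, language_model: str) -> str:
    """Return the request path with url and language_model as query parameters.

    urlencode percent-encodes reserved characters such as ':' and '/' in the
    media URL so that they cannot be confused with the outer query string.
    """
    query = urlencode({"url": media_url, "language_model": language_model})
    return f"{API_PATH}?{query}"


path = build_request_path(
    "https://www.havenondemand.com/sample-content/videos/hpnext.mp4",
    "en-US",
)
print(path)
```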
Specify the Language
Use the required language_model parameter to specify the language model that the recognition engine uses.
curl -X POST http://api.havenondemand.com/1/api/async/recognizespeech/v2 --form "language_model=en-US" --form "file=@hpnext.mp4"
The Recognize Speech API provides Broadband and Telephony language options. Each language is trained over a large body of representative data. The Broadband language options are trained on many hours of broadcast-quality content, such as TV news programs, while the Telephony language options are trained over many hours of voice calls.
For the highest accuracy, use the option and model that most closely resemble your voice data. For example, if you are processing voice mails recorded over a telephone network, use the Telephony language option, if available. If you have a reasonable-quality recording, such as a webinar recorded with a good-quality microphone, use the Broadband language option.
For more information about how to pick the appropriate language option to get the best results for your data, see Language Models.
Get the Results
The asynchronous mode returns a job-id, which you can then use to extract your results. There are two methods for this:
- Use /1/job/status/ to get the status of the job, including results if the job is finished.
- Use /1/job/result/, which waits until the job has finished and then returns the result.
Note: Because /result has to wait for the job to finish before it can return a response, using it for longer operations such as processing a large video file can result in an HTTP request timeout response. The /result method returns a response either when the result is available, or after 120 seconds, whichever is sooner. If the job is not complete after 120 seconds, the /result method returns a code 7010 (job result request timeout) response. This means that your asynchronous job is still in progress. To avoid the timeout, use /status instead.
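The polling approach recommended in the note can be sketched in Python as follows. This is a minimal illustration, not a client library. The function wait_for_job and its fetch_status argument are hypothetical, and the sketch assumes that the /status response is a JSON object whose status field reads "finished" or "failed" once the job is done. The default max_wait of 7200 seconds mirrors the two-hour processing limit described earlier.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    poll_interval: float = 5.0,
    max_wait: float = 7200.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job is done, then return the last response.

    fetch_status is any callable that performs one /status request for the job
    and returns the decoded JSON body.
    Assumption: the body has a "status" field that becomes "finished" or
    "failed" when the job ends; adjust the values to match the real responses.
    """
    waited = 0.0
    while True:
        response = fetch_status()
        if response.get("status") in ("finished", "failed"):
            return response
        if waited >= max_wait:
            raise TimeoutError(f"job still in progress after {waited:.0f} seconds")
        sleep(poll_interval)
        waited += poll_interval


# Usage with canned responses in place of real HTTP calls.
canned = iter([{"status": "queued"}, {"status": "in progress"}, {"status": "finished"}])
final = wait_for_job(lambda: next(canned), poll_interval=1.0, sleep=lambda seconds: None)
print(final["status"])
```

Passing the HTTP call in as fetch_status keeps the loop independent of any particular HTTP library and lets you exercise it without network access.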
Optimize Results
The quality of the audio file that you send can have a large effect on the quality of the speech recognition output. For example, the location of the microphone, background noise, and audio compression can all affect how well the Recognize Speech API detects the words in a particular audio file. For more information on how to get the best results from this API, see Speech Processing Concepts.