Automatic Speech Recognition, Shownotes and Chapters

Auphonic has built a layer on top of Automatic Speech Recognition Services:
Our classifiers generate metadata during the analysis of an audio signal (music segments, silence, multiple speakers, etc.) to divide the audio file into small and meaningful segments, which are then processed by the speech recognition engine. The speech recognition services support multiple languages and return text results for all audio segments.
The results from all segments are then combined, and meaningful timestamps, simple punctuation and structuring are added to the resulting text.
With the Automatic Shownotes and Chapters feature enabled, you can also get AI-generated summaries, tags and chapters from your audio, which automatically show up in your result files and in your audio files’ metadata.

This is especially interesting when used with our Multitrack Algorithms, where all speakers have been recorded on separate tracks:
In Multitrack Speech Recognition, all the tracks are processed by the speech recognition service individually, so there is no crosstalk or spill and the recognition is much more accurate, as it is only dealing with one speaker at a time.
This also means that we can show individual speaker names in the transcript output file and audio player because we know exactly who is saying what at any given time.

Automatic Speech Recognition (ASR) is most useful for making audio searchable: although automatically generated transcripts won’t be perfect and might be difficult to read (spoken text is very different from written text), they are very valuable if you try to find a specific topic within a one-hour audio file or the exact time of a quote in an audio archive.

We also include a complete Transcript Editor directly in our HTML output file, which displays word confidence values so you can instantly see which sections should be checked manually, supports direct audio playback and HTML/PDF/WebVTT export, and allows you to share the editor with someone else for further editing.

../_images/inspector-mt-closed.png


Our WebVTT-based audio player with search in speech recognition transcripts and exact speaker names of a Multitrack Production.

How to use Speech Recognition within Auphonic

To activate speech recognition within Auphonic, you can either select our self-hosted Auphonic Whisper ASR service or one of the integrated external services in the section Speech Recognition and Automatic Shownotes when creating a new Production or Preset.
If you want to use an external service, you first have to connect your Auphonic account to an external speech recognition service at the External Services page. This connection process is very different depending on the speech recognition provider, therefore please visit our step-by-step tutorials. Once that is done, you will have the option to choose whichever service you signed up for.
For our self-hosted service, you can skip the connection process and directly select Auphonic Whisper ASR as service:

../_images/SpeechRecognition_WhisperShownotes_2306.png


The speech recognition transcripts will be combined with our other Audio Algorithms and generate different Output Formats, depending on your selections.
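
If you create productions programmatically instead of using the web form, the same selection can be made through the Auphonic REST API. The following is a minimal sketch, assuming a Preset in which Auphonic Whisper ASR (or one of the external services) is already enabled; the endpoint and payload field names reflect our reading of the Auphonic API documentation and should be verified there:

    import requests

    API_BASE = "https://auphonic.com/api"
    AUTH = ("your_username", "your_password")  # Auphonic also supports OAuth bearer tokens

    # Create a production from a preset that already has "Speech Recognition
    # and Automatic Shownotes" configured, attach an input file URL and start
    # processing right away. Verify the payload fields against the API docs.
    payload = {
        "preset": "YOUR_PRESET_UUID",                     # preset with Whisper ASR enabled
        "input_file": "https://example.com/episode.wav",  # or upload the file separately
        "metadata": {"title": "Episode 42"},
        "action": "start",
    }

    response = requests.post(f"{API_BASE}/productions.json", json=payload, auth=AUTH)
    response.raise_for_status()
    print("Production UUID:", response.json()["data"]["uuid"])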

Automatic Shownotes and Chapters

With this feature you can fully automate the time-consuming process of summarizing and structuring the content of your recording to find useful positions for chapter marks and to create shownotes. The autogenerated data is also stored in the metadata of your recording.

Shownotes usually contain short summaries of the main topics of your episode, links and other information, while inserted chapter marks allow you to timestamp sections with different topics of a podcast or video.
They make your content more accessible and user-friendly, enabling listeners to quickly navigate to specific sections of the episode or find a previous episode to brush up on a particular topic.

Besides the obvious use of creating shownotes and chapters for podcasts, you can also use this feature to easily generate an abstract of your lecture recording, take the summary of your show as the starting point for a social media post, or choose your favorite chapter title as the podcast name.

How does it work?

When the Automatic Shownotes and Chapters feature is selected, the first step is speech transcription by either our internal Auphonic Whisper ASR or any integrated External ASR Service of your choice.

Open-source tools and ChatGPT then summarize the resulting ASR text at different levels of detail, analyze the content to identify sections covering the different topics discussed, and finally complete each section with timestamps for easy navigation.
Starting with the generation of a Long Summary, the text is condensed further into a Brief Summary, and from the brief summary a Subtitle and some Keywords for the main topics are extracted.

Depending on the duration of the input audio or video file, the level of detail of the thematic sections is also slightly adjusted, resulting in a reasonable number of chapters for very short 5-minute audio files as well as for long 180-minute audio files.
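
As a purely conceptual sketch of this staged reduction (not Auphonic’s actual implementation), the placeholder summarize() function below stands in for the open-source tools and ChatGPT models that do the real work:

    # Conceptual sketch only: summarize() is a placeholder for the
    # open-source tools and ChatGPT models mentioned above.
    def summarize(text: str, max_chars: int) -> str:
        """Return a summary of `text` no longer than `max_chars` (placeholder)."""
        return text[:max_chars]

    def generate_shownotes(transcript: str) -> dict:
        long_summary = summarize(transcript, max_chars=2000)    # Long Summary
        brief_summary = summarize(long_summary, max_chars=400)  # further reduced
        subtitle = summarize(brief_summary, max_chars=120)      # extracted from the brief summary
        keywords = sorted({w.strip(".,") for w in brief_summary.split() if len(w) > 7})[:10]
        return {"long_summary": long_summary, "brief_summary": brief_summary,
                "subtitle": subtitle, "keywords": keywords}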

Automatically Generate Data

You can automatically generate shownotes and chapters by checking the Automatic Shownotes and Chapters Checkbox in the Auphonic singletrack or multitrack Production Form with any of our ASR Services enabled.

../_images/SpeechRecognition_WhisperShownotes_2306.png

Once your production is done, the generated data will show up in your transcript result files and in the Auphonic Transcript Editor above the speech recognition transcript section, as you can see in our Example Production.
By clicking on a chapter title in the Chapters section of the transcript editor, you can jump directly to that chapter in your transcript to review and edit that section.
Unless you have manually entered content before, the generated data will also be stored in your audio files’ metadata as follows:

  • Generated Long Summary stored in metadata field Summary.

  • Generated Subtitle stored in metadata field Subtitle.

  • Generated Keywords stored in metadata field Tags.

  • Generated Timestamps for thematic sections stored as Start Time of Chapter Marks.

  • Generated Headlines for thematic sections stored as Chapter Title of Chapter Marks.

Please note that not all of our supported Output File Formats are designed to use metadata. For details about metadata see our previous blog posts: ID3 Tags Metadata (used in MP3 output files), Vorbis Comment Metadata (used in FLAC, Opus and Ogg Vorbis output files) and MPEG-4 iTunes-style Metadata (used in AAC, M4A/M4B/MP4 and ALAC output files).
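
For example, for an MP3 output file you can check where the generated data ended up using the mutagen Python library (a third-party tool, not part of Auphonic); chapter marks are stored as ID3 CHAP frames:

    from mutagen.id3 import ID3  # pip install mutagen

    tags = ID3("episode.mp3")

    # Each CHAP frame carries the generated headline (TIT2 sub-frame)
    # and the chapter start time in milliseconds.
    for chap in tags.getall("CHAP"):
        titles = chap.sub_frames.getall("TIT2")
        title = titles[0].text[0] if titles else ""
        print(f"{chap.start_time / 1000:8.1f}s  {title}")

    # List all other frames to see where summary, subtitle and tags were written.
    for frame in tags.values():
        print(frame.FrameID, frame)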

Edit Autogenerated Data

If you are unhappy with any detail in the generated keywords, summaries or chapters used as metadata, you can return to your production’s status page and click Edit Production under the feedback section. The previously generated metadata will now show up in the production form (as Extended Metadata and Chapter Marks) for you to edit.

When you are finished editing, simply rerun the production to apply your changes to your audio file’s metadata. Re-running the production will NOT cost you any additional credits on your Auphonic account if you only change the metadata and settings, but do NOT change the input file! Manually edited metadata will not be overwritten; however, we recommend turning off Speech Recognition and Shownote Generation for the second run to save time and avoid additional costs for external ASR services.

Of course, you could also use any third-party software of your choice to edit the metadata after downloading.
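
If you prefer scripting over the web form, metadata can also be edited and the production restarted through the Auphonic REST API. This is only a rough sketch; the endpoint paths and the exact metadata field names are assumptions based on our reading of the API documentation and should be verified there before use:

    import requests

    API_BASE = "https://auphonic.com/api"
    AUTH = ("your_username", "your_password")
    UUID = "your_production_uuid"

    # Overwrite autogenerated metadata (field names are illustrative).
    edited = {"metadata": {"summary": "Hand-edited episode summary.",
                           "tags": ["shownotes", "chapters"]}}
    requests.post(f"{API_BASE}/production/{UUID}.json", json=edited, auth=AUTH).raise_for_status()

    # Restart the production; with an unchanged input file this does not
    # consume additional credits (see the note above).
    requests.post(f"{API_BASE}/production/{UUID}/start.json", auth=AUTH).raise_for_status()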

Auphonic Whisper ASR

Using OpenAI’s open-source model Whisper, we offer a self-hosted automatic speech recognition (ASR) service.
For an overview and comparison to our integrated external ASR services, please see our Services Comparison Table.
Most important facts about Auphonic Whisper ASR:

Price and Supported languages

Whisper supports transcriptions in about 100 languages, which you can integrate into your Auphonic audio post-production workflow without creating an external account and at no additional cost.
Whisper also provides reliable automatic language detection, except in the case of strong accents.

Timestamps, Confidence Values and Punctuation

Whisper returns timestamps and confidence values that allow you to locate specific phrases in your recording; combined with our Transcript Editor, you can easily find sections that should be checked manually.
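
The hosted Auphonic Whisper ASR needs no setup on your side, but if you want to see what kind of segment timestamps the underlying open-source model produces, here is a minimal local sketch using the openai-whisper Python package (run independently of Auphonic):

    import whisper  # pip install openai-whisper (ffmpeg must be installed as well)

    model = whisper.load_model("small")        # trade-off between speed and quality
    result = model.transcribe("episode.mp3")   # the language is auto-detected if not specified

    print("Detected language:", result["language"])
    for segment in result["segments"]:
        print(f"[{segment['start']:7.2f}s - {segment['end']:7.2f}s] {segment['text'].strip()}")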

ASR Speed and Quality

By using Auphonic Whisper ASR, your data does not have to leave our Auphonic servers for speech recognition processing, which speeds up ASR processing. The quality of Whisper transcripts is absolutely comparable to the “best” services in our comparison table.

Besides the fact that Whisper is integrated into the Auphonic web service by default, so you need no external account and incur no extra costs, its outstanding feature is the automatic language detection.

Integrated Speech Recognition Services

Besides our self-hosted Auphonic Whisper ASR (automatic speech recognition) service, we also support the following integrated external ASR services: Wit.ai, Google Cloud Speech API, Amazon Transcribe and Speechmatics.
For an overview and comparison, please see our Services Comparison Table.

Wit.ai

  • Wit.ai, owned by Facebook, provides an online natural language processing platform, which also includes speech recognition.

  • Wit is free, including for commercial use. See FAQ and Terms.

  • It supports many languages, but you have to create a separate Wit.ai service for every language you use!

Google Cloud Speech API

  • Google Cloud Speech API is the speech-to-text engine developed by Google and supports over 100 languages.

  • 60 minutes of audio per month and model are free; for more, see Pricing (about $1.50/h for the default model).

  • For English (en-US), we use the more expensive Enhanced Model (about $2.20/h), which gives much better results.

  • In your Google Cloud account, you can optionally give Google permission to apply Data Logging to reduce your costs to ~$1/h for the default and ~$1.50/h for the enhanced model.

  • It is possible to add keywords to improve speech recognition accuracy for specific words and phrases or to add additional words to the vocabulary. For details see Word and Phrase Hints.

A great feature of the Google Speech API is the possibility to add keywords (also available in Amazon Transcribe and Speechmatics), which can improve the recognition quality a lot! We automatically send words and phrases from your metadata (title, artist, chapters, track names, tags, etc.) to Google, and you can add additional keywords manually (see screenshot above).
This provides context for the recognizer and allows the recognition of non-famous names (e.g. the podcast host) or out-of-vocabulary words.

Amazon Transcribe

Amazon Transcribe offers accurate transcriptions in many languages at low costs, including keywords, word confidence, timestamps, and punctuation.

Pricing

The free tier offers 60 minutes of free usage a month for 12 months. After that, it is billed monthly at a rate of $0.0004 per second ($1.44/h). More information is available at Amazon Transcribe Pricing.

Custom Vocabulary (Keywords) Support

Custom Vocabulary (called Keywords in Auphonic) gives you the ability to expand and customize the speech recognition vocabulary specific to your use case (e.g. product names, domain-specific terminology, or names of individuals).
The same feature is also available in the Google Cloud Speech API and in Speechmatics.

Timestamps, Word Confidence, and Punctuation

Amazon Transcribe returns a timestamp and confidence value for each word so that you can easily locate the audio in the original recording by searching for the text.
It also adds some punctuation, which we automatically combine with our own punctuation and formatting.

The high quality, especially in combination with keywords, and low costs of Amazon Transcribe make the service very attractive.
However, the processing time of Amazon Transcribe is much slower compared to all our other integrated services!

Speechmatics

Speechmatics offers accurate transcriptions in many languages including word confidence values, timestamps, punctuation and custom dictionary (called Keywords in Auphonic).

Languages and Keywords

Speechmatics supports all major European and American languages as well as some Asian languages, and takes a global language approach. That means various accents and dialects, like Indian English or Austrian German, are recognized very well even though you cannot explicitly select a specific language region.
The Custom Dictionary feature (called Keywords in Auphonic), which further improves the results, is also available in the Speechmatics API.

Model Accuracy

There are two levels of accuracy for Speechmatics that you can choose from when you create the service within Auphonic for the first time. The Standard Model works much faster than the Enhanced Model and costs less, while still being in the same accuracy range as the other services.
For the Enhanced Model you have to be more patient, as processing takes nearly as long as with Amazon Transcribe, but the quality of the result is correspondingly high. In terms of punctuation and small details in pronunciation, the Speechmatics Enhanced Model is exceptionally good.

Note

If you want to use both the Standard and the Enhanced Model of Speechmatics from time to time, you need to create two separate services (one service for each model) in your Auphonic account!

Timestamps, Word Confidence, and Punctuation

Like Amazon Transcribe, Speechmatics creates timestamps, word confidence values, and punctuation for every single word.

Pricing

Speechmatics offers 4 hours of free speech recognition per month (2h Standard Model plus 2h Enhanced Model). Once you exceed these 4 hours, however, Speechmatics costs about the same as the Google Cloud Speech API (or less).
Pricing starts at $1.25 per hour of audio for the Standard Model and goes up to $1.90/h for the Enhanced Model. They offer significant discounts for users requiring higher volumes. If you process a lot of content, you should contact them directly at hello@speechmatics.com and say that you wish to use it with Auphonic.
More information is available at Speechmatics Pricing.

Services Comparison Table

Auphonic Speech Recognition Services Comparison

Auphonic Whisper
  • Price: free, also for commercial use
  • ASR Quality English: best
  • ASR Quality German: best
  • Keyword Support: Yes
  • Timestamps and Confidence Values: Yes
  • Speed: fast
  • Supported Languages: about 100 languages

Wit.ai
  • Price: free, also for commercial use
  • ASR Quality English: basic
  • ASR Quality German: basic
  • Keyword Support: No
  • Timestamps and Confidence Values: No
  • Speed: slow
  • Supported Languages: over 100 languages

Google Speech API
  • Price: 1+1h free per month (Enhanced + Default Model), then ~$0.96-$2.16/h (depending on user settings)
  • ASR Quality English: good (Enhanced Model)
  • ASR Quality German: basic (Default Model)
  • Keyword Support: Yes
  • Timestamps and Confidence Values: No
  • Speed: fast
  • Supported Languages: over 100 languages

Amazon Transcribe
  • Price: 1h free per month (first 12 months), then ~$1.44/h, cheaper for high volumes
  • ASR Quality English: very good
  • ASR Quality German: very good
  • Keyword Support: Yes
  • Timestamps and Confidence Values: Yes
  • Speed: much slower
  • Supported Languages: about 40 languages

Speechmatics Standard
  • Price: 2h free per month, then ~$1.25/h, much cheaper for high volumes
  • ASR Quality English: very good
  • ASR Quality German: very good
  • Keyword Support: Yes
  • Timestamps and Confidence Values: Yes
  • Speed: medium
  • Supported Languages: about 50 languages

Speechmatics Enhanced
  • Price: 2h free per month, then ~$1.90/h, much cheaper for high volumes
  • ASR Quality English: best
  • ASR Quality German: best
  • Keyword Support: Yes
  • Timestamps and Confidence Values: Yes
  • Speed: fast
  • Supported Languages: about 50 languages

(Last Update: March 2023)

More Details about the comparison:

ASR Quality:

We tried to compare the relative speech recognition quality of all services in English and German (best means just the best one of our integrated services).
Please let us know if you get different results or if you compare services in other languages!

Keyword Support:

Support for keywords to expand the speech recognition vocabulary and to recognize out-of-vocabulary words.
This feature is called Word and Phrase Hints in the Google Cloud Speech API, Custom Vocabulary in Amazon Transcribe and Custom Dictionary in Speechmatics.

Timestamps and Confidence Value:

A timestamp and confidence value is returned for each word or phrase.
This is relevant for our Transcript Editor, to play each word or phrase separately and to instantly see which sections should be checked manually (low confidence).

Speed:

The relative processing speed of all services. All services are faster than real-time, but Amazon Transcribe and the Speechmatics Enhanced Model are significantly slower than all the other services.

Supported Languages:

Links to pages with recent supported languages and variants.

We will add additional services if and when we find services that offer improved cost benefits or better final results and support at least two languages (that’s an important step for a speech recognition company).

Auphonic Output Formats

Auphonic produces three output formats from speech recognition results: an HTML transcript file (readable by humans), a JSON or XML file with all data (readable by machines), and a WebVTT subtitles/captions file as an exchange format between systems.

HTML Transcript File

Examples: EN Singletrack, EN Multitrack, DE Singletrack

The HTML output file contains the transcribed text with timestamps for each new paragraph; hovering with the mouse shows the time for each text segment, and speaker names are displayed for multitrack productions. Sections are automatically generated from chapter marks, and the HTML file includes the audio metadata as well.

The transcription text can be copied into WordPress or other content management systems so that you can search within the transcript and find the corresponding timestamps (if you don’t have an audio player that supports search in WebVTT/transcripts).

Our HTML output file also includes the Auphonic Transcript Editor for easy-to-use transcription editing.

WebVTT File

Examples: EN Singletrack, EN Multitrack, DE Singletrack

WebVTT is the open specification for subtitles, captions, chapters, etc. The WebVTT file can be added as a track element within the HTML5 audio/video element. For an introduction, see Getting started with the HTML5 track element.
It is supported by all major browsers, and many other systems already use it (screen readers, (web) audio players with WebVTT display and search like the player from Podlove, software libraries, etc.).
It is possible to add other time-based metadata as well: not only the transcription text, but also speaker names, styling, or any other custom data such as GPS coordinates.

Search engines could parse WebVTT files in audio/video tags, as the format is well defined; then we would have searchable audio/video.
It is also possible to link to an external WebVTT file in an RSS feed, then podcast players and other feed-based systems could parse the transcript as well (for details see this discussion).

WebVTT is therefore a great exchange format between different systems: audio players, speech recognition systems, human transcriptions, feeds, search engines, CMS, etc.
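
Because the cue format is simple, your own tooling can make a WebVTT transcript searchable in a few lines. The following simplified Python sketch (it ignores NOTE blocks, cue settings and styling details) prints the start time of every cue that contains a search term:

    import re

    def search_webvtt(path: str, term: str) -> None:
        """Print start time and text of every cue containing `term` (simplified parser)."""
        with open(path, encoding="utf-8") as f:
            blocks = f.read().split("\n\n")
        for block in blocks:
            lines = [line for line in block.strip().splitlines() if line]
            timing = next((line for line in lines if "-->" in line), None)
            if timing is None:
                continue  # WEBVTT header, NOTE block, etc.
            text = " ".join(lines[lines.index(timing) + 1:])
            text = re.sub(r"<[^>]+>", "", text)  # strip <v Speaker> and styling tags
            if term.lower() in text.lower():
                print(timing.split("-->")[0].strip(), text)

    search_webvtt("transcript.vtt", "speech recognition")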

JSON/XML Output File

Examples: EN Singletrack, EN Multitrack, DE Singletrack
Examples with word timestamps: EN Singletrack, EN Multitrack, DE Singletrack

This file contains all the speech recognition details in JSON or XML format. This includes the text, punctuation and paragraphs with timestamps and confidence values.
Word timestamps and confidence values are also available if you use Speechmatics or Amazon Transcribe.
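
If you want to post-process the machine-readable file yourself, for instance to list all low-confidence words that deserve a manual check, a sketch along the following lines works. Note that the field names used here (segments, words, confidence, start) are placeholders, not the documented schema, and need to be adapted to the structure of your actual JSON result file:

    import json

    # Field names below are placeholders; adapt them to your actual JSON output.
    with open("transcript.json", encoding="utf-8") as f:
        data = json.load(f)

    LOW_CONFIDENCE = 0.6  # arbitrary threshold for words to review manually

    for segment in data.get("segments", []):
        for word in segment.get("words", []):
            if word.get("confidence", 1.0) < LOW_CONFIDENCE:
                print(f"check {word.get('start', '?')}s: {word.get('word', '')!r} "
                      f"(confidence {word['confidence']:.2f})")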

Tips to Improve Speech Recognition Accuracy

Audio quality is important
  • Reverberant audio is quite a problem; put the microphone as close to the speaker as possible.

  • Try to avoid background sounds and noises during recording.

  • Don’t use background music either (unless you use our multitrack version).

  • Only use fast and stable Skype/Hangouts connections.

Speak clearly
  • Pronunciation and grammar are important.

  • Dialects are more difficult to understand, use the correct language variant if available (e.g. English-UK vs. English-US).

  • Don’t interrupt other speakers. This has a huge impact on the accuracy!

  • Don’t mix languages.

Use a lot of metadata and keywords
  • This is a big help for the Google Speech API, Amazon Transcribe and Speechmatics.

  • Metadata and keywords contribute a lot to making the recognition of special names, terms and out-of-vocabulary words easier.

  • As always, accurate metadata is important!

Use our multitrack version
  • If you record a separate track for each speaker, use our multitrack speech recognition.

  • This will lead to better results and more accurate information on the timing of each speaker.

  • Background music/sounds should be put into a separate track so that they do not interfere with speech recognition.

Auphonic Transcript Editor

../_images/TranscriptEditor.png

We’ve rolled out a brand new Transcript Editor, designed to enhance your editing experience and streamline your workflow like never before.

To learn all about the latest features, check out our recent blog post: New Auphonic Transcript Editor

We’re currently updating our help page to include details about our new Transcript Editor. Keep an eye out for the upcoming changes!