This app is in the research stage.
For speech-to-text, I'm using Whisper for the moment. However, the "turbo" model takes 45 minutes to transcribe an hour of audio on my machine. I still need to check out reverb-asr and Apple's transcriptions.
The nice thing about Whisper, though, is that it gives you word timestamps. (Maybe others do, too?) Using that, I'm able to find gaps in speech and break the transcribed text into stretches of sentences that usually make sense to people.
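The gap-finding idea can be sketched as a small pure function. It assumes each word is a dict shaped like `{"word": str, "start": float, "end": float}`, which is what Whisper emits with `word_timestamps=True`; the 0.8-second gap threshold is a made-up starting point, not a tuned value.

```python
# Sketch: group word timestamps into "stretches" split at gaps in speech.
# Assumes Whisper-style word dicts: {"word": " Hello", "start": 0.0, "end": 0.4}.
GAP_SECONDS = 0.8  # assumed threshold; would need tuning per show

def split_into_stretches(words, gap=GAP_SECONDS):
    """Return a list of stretches; each stretch is a list of word dicts."""
    stretches = []
    current = []
    prev_end = None
    for w in words:
        # Start a new stretch when the silence before this word exceeds the gap.
        if prev_end is not None and w["start"] - prev_end > gap:
            stretches.append(current)
            current = []
        current.append(w)
        prev_end = w["end"]
    if current:
        stretches.append(current)
    return stretches

def stretch_text(stretch):
    # Whisper words carry their own leading spaces, so plain joining works.
    return "".join(w["word"] for w in stretch).strip()
```

Because each stretch keeps its word dicts, the first word's `start` and last word's `end` also give the audio range to seek to later.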
I can then summarize each stretch. BART's text summarizer does better than Falconsai's, but no text summarizer is going to be consistently correct. So, ultimately, the UI will have to bill this as "flavor" and encourage users to click through to the transcription if a stretch's summary interests them.
Of course, the transcription could be wrong, too, so we need to provide a way to listen to the corresponding audio.
Another form of summarization that I'm working on is tf-idf. It could be used to call out terms that are unusual to the episode or to the podcast as a whole, which will help users who are just searching for information about a particular topic.
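A minimal, hand-rolled version of that idea: score each term in one episode's transcript against the rest of the podcast, so terms concentrated in a single episode float to the top. The tokenizer here is deliberately naive (lowercase word characters only); a real pass would want stop-word removal and stemming.

```python
# Sketch: tf-idf to surface terms unusual to one episode relative to the corpus.
import math
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer for illustration only.
    return re.findall(r"[a-z']+", text.lower())

def top_tfidf_terms(episodes, index, k=5):
    """Score terms in episodes[index] against all episodes; return the top k."""
    docs = [Counter(tokenize(e)) for e in episodes]
    n = len(docs)
    target = docs[index]
    total = sum(target.values())
    scores = {}
    for term, count in target.items():
        tf = count / total
        df = sum(1 for d in docs if term in d)
        idf = math.log(n / df)  # 0 when the term appears in every episode
        scores[term] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Running the same scoring with the whole podcast as one "document" against a broader corpus would give the podcast-level variant mentioned above.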