-
I was able to package the ONNX models in Electron so that, when the app is installed (at least via AppImage), the models actually load on the target machine and transcription works. I'd had a lot of doubts about whether this would actually work.
-
I got transcription working in an Electron app using Moonshine via ONNX. A lot of experimenting (a.k.a. repeated guessing) was needed to get the models loaded and audio passed to them.
I had to import llama-tokenizer-js like this:
let llamaTokenizer;

// Dynamically import the package inside an async IIFE, then construct the tokenizer.
(async function importTokenizer() {
  const tokenizerModule = await import('llama-tokenizer-js');
  llamaTokenizer = new tokenizerModule.LlamaTokenizer();
})();
Instead of:
import llamaTokenizer from 'llama-tokenizer-js';
Very weird, but I'm not willing to do the research to find out what to blame for this. -
I tried out Moonshine for speech-to-text. It took only five minutes to transcribe thirty minutes of audio on my laptop, though the fans were blasting. Still, that's in the realm of usable. And it claims: "Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER."
So, this project might live?
But then I have to think about how I would install Moonshine on a user's machine. I really have no idea. -
Oh, also, the reverb-asr model is 21 GB.
-
Impressive: rev.com transcribed a 32-minute podcast in under eight minutes. The transcript looks fairly accurate.
I don't think they offer an API. You can download and run their model locally, but I'm guessing it would be slow on a typical laptop. -
I tried using Vosk's vosk-model-en-us-0.42-gigaspeech model for transcription. Last time I tried, I got nothing. This time I realized I had forgotten to make sure the WAV file was 16-bit, and once I fixed that, I got output.
It certainly wasn't faster than Whisper. Forty minutes in, I killed it off because it still wasn't done. Then I noticed the Node API has an async method and thought that could make it a lot faster.
Unfortunately, it crashes right away if you try to transcribe more than one segment at a time. (If you do one segment at a time with the async method, which is pointless, it runs out of memory and gets killed.) Issues like this make me realize that the Node bindings are slapped together: "I have been learning the hard way that the acceptWaveformAsync() method is a very dangerous beast, and calling free() in the middle of its processing is not the only issue with it. It cannot be simply used like typical Node.js single-threaded, asynchronous code style."
I don't blame Vosk; they probably didn't know what was involved in making Node bindings.
So, I could try one of their smaller models via their sync method, but I wouldn't be surprised if I still hit some instability. -
On the content analysis front, I realized I needed a script that runs the entire pipeline, from transcription to tf-idf, so I can compare different configurations of that pipeline easily.
It should have been easy, but no, Python. It's surprisingly hard to run a Python script in a venv from inside a bash script. Doing things with `source` just wouldn't work for me. After thirty minutes of trying various things, I realized the bash script could directly run the executables in venv/bin (e.g. call venv/bin/python on the script) instead of trying to get the effects of venv/bin/activate into the current shell in order to run a Python script. -
-
OK, I've implemented a rough tf-idf. When I filter the words in each stretch (each somewhere around a minute long) of a podcast about local politics for a tf-idf score above 0.5, here's what I get:
"chelly, pronouns, danielle, thrilled, fourth",
At a glance, I think maybe it could tell you where to go in the podcast if you want to know about school issues. Later, I'll see if it lines up with what I hear.
"loves, colette's, bakery, bread, partial",
"method, politics, tired, story",
"summer, announced, john, amaral, read",
"escaped",
"",
"square, damn, prompted, half, day",
"immediately, sworn, aides, mentioned, -time",
"stepping",
"named",
"explanation, website, roles, responsibilities, working",
"friday, attending, groundwork, minutes, released",
"defined, differing, opinions, respond, recent",
"connect, staff, directors, overwhelming, figure",
"specific, met, wages, recognized",
"areas, active, math, view, buildings",
"space, districts, mcglynn, neighborhood, including",
"numbers, miss, everett, lines, partnering",
"plumbing, shop, grown, leaps, bounds",
"broad, throw, appreciated, systemic, thinks",
"balance, effectual, moment, displacement, gentrification",
"impact, resolution, represented, resources, free",
"",
"googled, speaking, cities",
"success, pros, cons, private, opinion",
"situation, expired, agreed, table",
"scheduled, forgive, turns, associations, weeks",
"gap, sides, negotiate, represents",
"approve, decides, committee's, parties, agreement",
"classes, reimbursement, constraint",
"license, number, pdps, informs, bigger",
"classes, reimbursement, constraint",
"license, number, pdps, informs, bigger",
"goal, spectrum, curious, beliefs, finish",
"grade, prom, experience, concise, warm",
"symphony, algebra, level, test, scores",
"positions, incumbency, choosing, step, rare",
"commitment, disappear, teases, tough, money",
"2023, 20, november, calendar, order",
"excited, son, stem, awesome, cool",
"selling, -shirts, present",
"remarks, proud, strong, history, dr",
"",
"brings, attention, homework, club, escaping",
"asks, absolutely, nicely, dovetails, comment",
"importance, reporting, harder",
"",
"mention",
"apparently, plenty",
"participants, crossed, mind, suggest, idea",
"nice, presenting, formal, body, laid",
"tonight's, current, teacher's, supporting, purchasing",
"feedback, future, medfordpod, gmail, .com" -
This app is in a research state.
For speech-to-text, I'm using Whisper for the moment. However, the "turbo" model takes 45 minutes to transcribe an hour of audio on my machine. I still need to check out reverb-asr and Apple's transcriptions.
The nice thing about Whisper, though, is that it gives you word timestamps. (Maybe others do, too?) Using those, I'm able to find gaps in speech and break the transcribed text into stretches of sentences that usually make sense to people.
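Here's a minimal sketch of that idea, assuming the openai-whisper Python package; the 1.5-second pause threshold and the file name are placeholders, not the app's actual values.
import whisper

model = whisper.load_model("turbo")
result = model.transcribe("episode.mp3", word_timestamps=True)

# Flatten the per-segment word lists into one list of {word, start, end} dicts.
words = [w for segment in result["segments"] for w in segment["words"]]

# Start a new stretch whenever the pause between consecutive words is long enough.
stretches = []
current = []
for prev, nxt in zip(words, words[1:]):
    current.append(prev["word"].strip())
    if nxt["start"] - prev["end"] > 1.5:
        stretches.append(" ".join(current))
        current = []
if words:
    current.append(words[-1]["word"].strip())
    stretches.append(" ".join(current))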
I can then summarize each stretch. BART's text summarizer does better than Falconsai's, but no text summarizer is going to be consistently correct. So, ultimately, the UI will have to bill this as "flavor" and encourage users to click through to the transcription if the summary of the stretch interests them.
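For reference, a sketch of summarizing one stretch with a Hugging Face summarization pipeline; facebook/bart-large-cnn is my guess at the BART checkpoint in question, and the length limits are arbitrary.
from transformers import pipeline

# Load a BART-based summarizer once; swap in another checkpoint to compare models.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_stretch(text: str) -> str:
    # do_sample=False keeps the output deterministic from run to run.
    out = summarizer(text, max_length=60, min_length=15, do_sample=False)
    return out[0]["summary_text"]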
Of course, the transcription could be wrong, too, so we need to provide a way to listen to the corresponding audio.
Another form of summarization that I'm working on is tf-idf. This could be used to call out terms that are unusual to the episode or to the podcast as a whole. This will help users who are just searching for information about a particular topic.
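A rough sketch of that, assuming scikit-learn and treating each stretch as its own document (the 0.5 cutoff matches the rough implementation above):
from sklearn.feature_extraction.text import TfidfVectorizer

def notable_terms(stretches, threshold=0.5):
    # Each stretch is a "document"; scores are tf-idf weights per stretch.
    vectorizer = TfidfVectorizer(stop_words="english")
    scores = vectorizer.fit_transform(stretches).toarray()
    vocab = vectorizer.get_feature_names_out()
    # For each stretch, keep the terms scoring above the threshold, best first.
    return [
        [vocab[i] for i in row.argsort()[::-1] if row[i] > threshold]
        for row in scores
    ]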