
I'm building an application that calculates the delay until a keyword is detected in speech. My current method gives inaccurate, or simply wrong, results. This is the method I use:

@Override
public void onResults(Bundle results) {
    progressBar.setVisibility(View.GONE);
    ArrayList<String> matches = results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);

    if (matches != null && !matches.isEmpty()) {
        String transcript = matches.get(0);
        textSTT.setText(transcript);

        if (transcript.contains(keyword)) {
            long endTime = System.nanoTime(); // Using nano
            delay = endTime - startTime;

            double delayInSeconds = delay / 1_000_000_000.0;
            double roundedDelay = roundToTwoDecimalPlaces(delayInSeconds); // Rounding delay

            delayResult.setText(String.format("Delay: %.2f seconds\nTranscription result: %s", roundedDelay, transcript));
        } else {
            delayResult.setText("Keyword not found.");
        }
    } else {
        delayResult.setText("No result from speech recognition.");
    }
}

Condition:

  • I press the button to start recording and say the keyword almost immediately (at about the 1-second mark), but the reported delay is 2 seconds or more.

Tools:

  • Speech Recognizer (Android default)

I have set a 10-second countdown. When I press the button to start recording and say the keyword right away (at about the 1-second mark), the reported delay is 2 seconds or more. Similarly, when I say the keyword at about the 5-second mark, the reported delay is 7 seconds or more.

I want the delay calculation to be more accurate. For example, if I say the keyword at the 7-second mark, the delay should be 7 seconds.

  • We do not know what startTime is or how it relates to what you are trying to do. Also, we do not know when onResults() gets called and how it relates to what you are trying to do. If we guess that onResults() is tied to receiving the speech recognition results, then endTime is the time when you get those results, and that has nothing to do with when in a recording any particular thing is said. Commented Dec 4, 2024 at 17:53

2 Answers

1

Speech recognition takes time, and the time it takes is not constant. If you want the offset into a sound clip at which a word occurs, live analysis is the wrong tool. What you are measuring this way is not the time until you said the keyword, but that time plus the time the speech recognizer takes to recognize that you said it. The measured value will always be greater, because at the very least the recognizer needs some time after the word to decide that the word has ended. The speech recognition engine is also not designed to return the kind of data you're looking for; it's meant for transcription only.

A better way to do this would be to use a custom speech engine that returns word timing data, or to take a custom speech engine and alter it to return its guesses with timestamps. The approach you're trying will never work.
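To make the arithmetic concrete, here is a minimal sketch in plain Java. The class `DelayBreakdown`, the method `measuredDelayMillis`, and all the numbers are hypothetical and chosen only for illustration; real recognizer latency varies from call to call and cannot be read from the code in the question.

```java
import java.util.Locale;

class DelayBreakdown {
    /** The delay seen in onResults(): when the word was spoken plus recognizer latency. */
    public static long measuredDelayMillis(long spokenOffsetMillis, long recognizerLatencyMillis) {
        return spokenOffsetMillis + recognizerLatencyMillis;
    }

    public static void main(String[] args) {
        // Hypothetical values: the keyword is spoken at 1 s and at 5 s,
        // and the recognizer needs a different amount of time on each run.
        long[] spokenOffsets = {1_000, 5_000};
        long[] latencies = {1_200, 2_100};

        for (int i = 0; i < spokenOffsets.length; i++) {
            long measured = measuredDelayMillis(spokenOffsets[i], latencies[i]);
            System.out.printf(Locale.ROOT, "spoken at %.2f s, reported %.2f s%n",
                    spokenOffsets[i] / 1000.0, measured / 1000.0);
        }
    }
}
```

Because the latency term is unknown and not constant, subtracting a fixed correction from the reported value would not recover the spoken offset either.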


1

You're using the time at which the result is returned, but there is a processing delay, so that is not the time the word was spoken. The speech recognizer needs to return that timestamp itself.

The Android SpeechRecognizer can provide this if you request it with RecognizerIntent#EXTRA_REQUEST_WORD_TIMING. The timestamps are returned under the RECOGNITION_PARTS key in the results Bundle.

It doesn't seem to work with the regular Recognizer, though. The on-device Recognizer supports it, but it may not work on all devices (it works on Pixel).

Alternatively you can look for a third-party speech recognition library that supports something like this.
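Once per-word timestamps are available, from RECOGNITION_PARTS or from a third-party library, finding the delay is a lookup rather than a clock measurement. Below is a minimal sketch in plain Java. `KeywordTiming`, `TimedWord`, and `findKeywordOffsetMillis` are hypothetical names, and `TimedWord` is only a stand-in for whatever type the recognizer returns (on Android that should be `RecognitionPart`, but check the API reference for the exact accessors). It also assumes each timestamp marks the start of the word relative to the start of the audio.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.OptionalLong;

class KeywordTiming {
    /** Hypothetical stand-in for one recognized word and its offset into the audio. */
    public static final class TimedWord {
        final String text;
        final long timestampMillis;

        public TimedWord(String text, long timestampMillis) {
            this.text = text;
            this.timestampMillis = timestampMillis;
        }
    }

    /** Returns the offset of the first part containing the keyword, ignoring case. */
    public static OptionalLong findKeywordOffsetMillis(List<TimedWord> parts, String keyword) {
        String wanted = keyword.toLowerCase(Locale.ROOT);
        for (TimedWord part : parts) {
            if (part.text.toLowerCase(Locale.ROOT).contains(wanted)) {
                return OptionalLong.of(part.timestampMillis);
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Made-up recognizer output, for illustration only.
        List<TimedWord> parts = Arrays.asList(
                new TimedWord("please", 300),
                new TimedWord("start", 1_050),
                new TimedWord("now", 1_600));

        OptionalLong offset = findKeywordOffsetMillis(parts, "start");
        if (offset.isPresent()) {
            System.out.printf(Locale.ROOT, "Delay: %.2f seconds%n", offset.getAsLong() / 1000.0);
        } else {
            System.out.println("Keyword not found.");
        }
    }
}
```

With this approach the value shown to the user no longer depends on when onResults() happens to be called, so System.nanoTime() is not needed for the delay at all.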
