
Commit bc21038

Add speech recognition phrase list to the Web Speech API (#145)
* Add speech recognition context to the Web Speech API

Introduce a new speech recognition context feature for contextual biasing

* Add phrases instead of context

Remove SpeechRecognitionContext and add SpeechRecognitionPhraseList to SpeechRecognition directly
Remove updateContext and always update phrases instead
Rename context-not-supported error code to phrases-not-supported
Add removeItem to SpeechRecognitionPhraseList

* Minor updates for comments

* Add descriptions for corner cases
1 parent fef50de commit bc21038

2 files changed

Lines changed: 159 additions & 3 deletions


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -1 +1,3 @@
 index.html
+.DS_Store
+.idea/

index.bs

Lines changed: 157 additions & 3 deletions
@@ -151,6 +151,18 @@ The term "interim result" indicates a SpeechRecognitionResult in which the final
 A boolean flag representing whether the speech recognition started. The initial value is <code>false</code>.
 </dl>
 
+<dl dfn-type=attribute dfn-for="SpeechRecognition">
+: <dfn>[[mode]]</dfn>
+::
+A {{SpeechRecognitionMode}} enum to determine where speech recognition takes place. The initial value is <code>ondevice-preferred</code>.
+</dl>
+
+<dl dfn-type=attribute dfn-for="SpeechRecognition">
+: <dfn>[[phrases]]</dfn>
+::
+A {{SpeechRecognitionPhraseList}} representing a list of phrases for contextual biasing. The initial value is null.
+</dl>
+
 <xmp class="idl">
 [Exposed=Window]
 interface SpeechRecognition : EventTarget {
@@ -162,6 +174,7 @@ interface SpeechRecognition : EventTarget {
 attribute boolean interimResults;
 attribute unsigned long maxAlternatives;
 attribute SpeechRecognitionMode mode;
+attribute SpeechRecognitionPhraseList phrases;
 
 // methods to drive the speech interaction
 undefined start();
@@ -192,7 +205,8 @@ enum SpeechRecognitionErrorCode {
 "network",
 "not-allowed",
 "service-not-allowed",
-"language-not-supported"
+"language-not-supported",
+"phrases-not-supported"
 };
 
 enum SpeechRecognitionMode {
@@ -254,12 +268,29 @@ dictionary SpeechRecognitionEventInit : EventInit {
 unsigned long resultIndex = 0;
 required SpeechRecognitionResultList results;
 };
+
+// The object representing a phrase for contextual biasing.
+[Exposed=Window]
+interface SpeechRecognitionPhrase {
+constructor(DOMString phrase, optional float boost = 1.0);
+readonly attribute DOMString phrase;
+readonly attribute float boost;
+};
+
+// The object representing a list of phrases for contextual biasing.
+[Exposed=Window]
+interface SpeechRecognitionPhraseList {
+constructor(sequence<SpeechRecognitionPhrase> phrases);
+readonly attribute unsigned long length;
+SpeechRecognitionPhrase item(unsigned long index);
+undefined addItem(SpeechRecognitionPhrase item);
+undefined removeItem(unsigned long index);
+};
 </xmp>
 
 <h4 id="speechreco-attributes">SpeechRecognition Attributes</h4>
 
 <dl>
-
 <dt><dfn attribute for=SpeechRecognition>lang</dfn> attribute</dt>
 <dd>This attribute will set the language of the recognition for the request, using a valid BCP 47 language tag. [[!BCP47]]
 If unset it remains unset for getting in script, but will default to use the language of the html document root element and associated hierarchy.
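The two interfaces added to the IDL block above, together with the constructor and method steps this commit adds under the new SpeechRecognitionPhrase and SpeechRecognitionPhraseList headings, can be modeled as plain TypeScript classes. This is an illustrative sketch only. The class names `ModelPhrase` and `ModelPhraseList` are hypothetical and are not the browser's classes. The sketch performs no WebIDL argument conversion. It also throws the JavaScript `SyntaxError` where the spec's {{SyntaxError}} would normally denote a DOMException of that name.

```typescript
// In-memory model of the two interfaces added by this commit.
// ModelPhrase / ModelPhraseList are hypothetical names, not browser classes,
// and no WebIDL argument conversion is performed.

class ModelPhrase {
  readonly phrase: string; // [[phrase]]
  readonly boost: number; // [[boost]]

  constructor(phrase: string, boost: number = 1.0) {
    // Step 1: reject a boost outside the inclusive range [0.0, 10.0].
    // A JavaScript SyntaxError stands in for the spec's {{SyntaxError}}.
    if (boost < 0.0 || boost > 10.0) {
      throw new SyntaxError(`boost ${boost} is outside [0.0, 10.0]`);
    }
    this.phrase = phrase;
    this.boost = boost;
  }
}

class ModelPhraseList {
  private readonly phrases: ModelPhrase[]; // [[phrases]]

  constructor(phrases: readonly ModelPhrase[]) {
    // WebIDL sequences are passed by value, so keep a private copy.
    this.phrases = [...phrases];
  }

  get length(): number {
    return this.phrases.length;
  }

  // Shared bounds check used by item() and removeItem().
  // Under real WebIDL `unsigned long` conversion a negative number wraps
  // modulo 2^32 and then fails the upper-bound test instead.
  private checkIndex(index: number): void {
    if (index < 0 || index >= this.phrases.length) {
      throw new RangeError(`index ${index} is out of range`);
    }
  }

  item(index: number): ModelPhrase {
    this.checkIndex(index);
    return this.phrases[index];
  }

  addItem(item: ModelPhrase): void {
    // Duplicates are allowed; their interpretation is left to the recognizer.
    this.phrases.push(item);
  }

  removeItem(index: number): void {
    this.checkIndex(index);
    this.phrases.splice(index, 1);
  }
}
```

Because the range check in step 1 is inclusive, `new ModelPhrase("x", 10.0)` is accepted and `new ModelPhrase("x", 10.5)` throws. Copying the incoming array in the list constructor follows WebIDL's rule that sequences are passed by value. As a result, later changes to the caller's array do not affect the list.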
@@ -283,7 +314,35 @@ dictionary SpeechRecognitionEventInit : EventInit {
 The default value is 1.</dd>
 
 <dt><dfn attribute for=SpeechRecognition>mode</dfn> attribute</dt>
-<dd>An enum to determine where speech recognition takes place. The default value is "ondevice-preferred".</dd>
+<dd>
+This attribute represents where speech recognition takes place.
+</dd>
+<dd>
+The getter steps are to return the value of {{SpeechRecognition/[[mode]]}}.
+</dd>
+<dd>
+The setter steps are:
+1. If the {{SpeechRecognitionPhraseList/length}} of {{SpeechRecognition/phrases}} is greater than 0
+and the system using the given value for {{SpeechRecognition/[[mode]]}} does not support contextual biasing,
+throw a {{SpeechRecognitionErrorEvent}} with the {{SpeechRecognitionErrorCode/phrases-not-supported}}
+error code and abort these steps.
+1. Set {{SpeechRecognition/[[mode]]}} to the given value.
+</dd>
+
+<dt><dfn attribute for=SpeechRecognition>phrases</dfn> attribute</dt>
+<dd>
+This attribute represents a list of phrases for contextual biasing.
+</dd>
+<dd>
+The getter steps are to return the value of {{SpeechRecognition/[[phrases]]}}.
+</dd>
+<dd>
+The setter steps are:
+1. If the {{SpeechRecognitionPhraseList/length}} of the given value is greater than 0 and the system does not support contextual biasing,
+throw a {{SpeechRecognitionErrorEvent}} with the {{phrases-not-supported}} error code and abort these steps.
+1. Set {{SpeechRecognition/[[phrases]]}} to the given value.
+1. Send a copy of {{SpeechRecognition/[[phrases]]}} to the system for initializing or updating the phrases for contextual biasing implementation.
+</dd>
 </dl>
 
 <p class=issue>The group has discussed whether WebRTC might be used to specify selection of audio sources and remote recognizers.
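The `mode` and `phrases` setter steps above share one guard: a non-empty phrase list is refused when the recognizer cannot do contextual biasing. The following is a minimal sketch under several stated assumptions:

- `supportsBiasing` is a hypothetical callback. It stands in for the user agent's knowledge of the recognizer behind each mode.
- The phrase list is simplified to a plain array.
- The spec text says to "throw a {{SpeechRecognitionErrorEvent}}". An event is not an exception type, so the sketch throws an `Error` whose message is the `phrases-not-supported` code.
- [[phrases]] starts as null, so the sketch treats null as an empty list. This keeps the mode setter's length check well defined.

```typescript
// Sketch of the `mode` and `phrases` setter steps from this commit.
// Hypothetical: RecognitionModel, supportsBiasing(), and the Error-based
// stand-in for "throw a SpeechRecognitionErrorEvent".

interface PhraseEntry {
  phrase: string;
  boost: number;
}
type PhraseArray = readonly PhraseEntry[];

// An event is not an exception type, so this sketch throws an Error whose
// message carries the phrases-not-supported error code instead.
function phrasesNotSupported(): Error {
  return new Error("phrases-not-supported");
}

class RecognitionModel {
  private modeSlot: string = "ondevice-preferred"; // [[mode]] initial value
  private phrasesSlot: PhraseArray | null = null; // [[phrases]] initial value
  // Last copy "sent to the system"; recorded only so the effect is observable.
  sentToSystem: PhraseArray | null = null;
  private readonly supportsBiasing: (mode: string) => boolean;

  constructor(supportsBiasing: (mode: string) => boolean) {
    this.supportsBiasing = supportsBiasing;
  }

  get mode(): string {
    return this.modeSlot;
  }

  set mode(value: string) {
    // [[phrases]] may still be null; treat that as an empty list (assumption).
    const count = this.phrasesSlot === null ? 0 : this.phrasesSlot.length;
    if (count > 0 && !this.supportsBiasing(value)) {
      throw phrasesNotSupported(); // [[mode]] is left unchanged
    }
    this.modeSlot = value;
  }

  get phrases(): PhraseArray | null {
    return this.phrasesSlot;
  }

  set phrases(value: PhraseArray | null) {
    const count = value === null ? 0 : value.length;
    // The spec says "the system"; this sketch asks about the current mode.
    if (count > 0 && !this.supportsBiasing(this.modeSlot)) {
      throw phrasesNotSupported();
    }
    this.phrasesSlot = value;
    // Send a copy, so later edits to the caller's array are not seen by it.
    this.sentToSystem = value === null ? null : value.map((p) => ({ ...p }));
  }
}
```

The final setter step sends a copy rather than the live list. One reading is that later `addItem()` or `removeItem()` calls reach the recognizer only when `phrases` is assigned again. This matches the commit message's "always update phrases" in place of a separate `updateContext` call.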
@@ -479,6 +538,9 @@ For example, some implementations may fire <a event for=SpeechRecognition>audioe
 
 <dt><dfn enum-value for=SpeechRecognitionErrorCode>"language-not-supported"</dfn></dt>
 <dd>The language was not supported.</dd>
+
+<dt><dfn enum-value for=SpeechRecognitionErrorCode>"phrases-not-supported"</dfn></dt>
+<dd>The speech recognition model does not support phrases for contextual biasing.</dd>
 </dl>
 </dd>
 
@@ -557,6 +619,98 @@ For a non-continuous recognition it will hold only a single value.</p>
 Note that when resultIndex equals results.length, no new results are returned, this may occur when the array length decreases to remove one or more interim results.</dd>
 </dl>
 
+<h4 id="speechreco-phrase">SpeechRecognitionPhrase</h4>
+
+<p>The SpeechRecognitionPhrase object represents a phrase for contextual biasing and has the following internal slots:</p>
+
+<dl dfn-type=attribute dfn-for="SpeechRecognitionPhrase">
+: <dfn>[[phrase]]</dfn>
+::
+A {{DOMString}} representing the text string to be boosted. The initial value is null.
+An empty value is allowed but should be ignored by the speech recognition model.
+</dl>
+
+<dl dfn-type=attribute dfn-for="SpeechRecognitionPhrase">
+: <dfn>[[boost]]</dfn>
+::
+A float representing approximately the natural log of the number of times more likely the website thinks this phrase is
+than what the speech recognition model knows.
+A valid boost must be a float value inside the range [0.0, 10.0], with a default value of 1.0 if not specified.
+A boost of 0.0 means the phrase is not boosted at all, and a higher boost means the phrase is more likely to appear.
+A boost of 10.0 means the phrase is extremely likely to appear and should be rarely set.
+</dl>
+
+<dl>
+<dt><dfn constructor for=SpeechRecognitionPhrase>SpeechRecognitionPhrase(|phrase|, |boost|)</dfn> constructor</dt>
+<dd>
+When this constructor is invoked, run the following steps:
+1. If |boost| is smaller than 0.0 or greater than 10.0, throw a {{SyntaxError}} and abort these steps.
+1. Let |phr| be a new object of type {{SpeechRecognitionPhrase}}.
+1. Set |phr|.{{[[phrase]]}} to be the value of |phrase|.
+1. Set |phr|.{{[[boost]]}} to be the value of |boost|.
+1. Return |phr|.
+</dd>
+
+<dt><dfn attribute for=SpeechRecognitionPhrase>phrase</dfn> attribute</dt>
+<dd>This attribute returns the value of {{[[phrase]]}}.</dd>
+
+<dt><dfn attribute for=SpeechRecognitionPhrase>boost</dfn> attribute</dt>
+<dd>This attribute returns the value of {{[[boost]]}}.</dd>
+</dl>
+
+<h4 id="speechreco-phraselist">SpeechRecognitionPhraseList</h4>
+
+<p>The SpeechRecognitionPhraseList object holds a list of phrases for contextual biasing and has the following internal slot:</p>
+
+<dl dfn-type=attribute dfn-for="SpeechRecognitionPhraseList">
+: <dfn>[[phrases]]</dfn>
+::
+A list of {{SpeechRecognitionPhrase}} representing the phrases to be boosted. The initial value is an empty list.
+</dl>
+
+<dl>
+<dt><dfn constructor for=SpeechRecognitionPhraseList>SpeechRecognitionPhraseList(|phrases|)</dfn> constructor</dt>
+<dd>
+When this constructor is invoked, run the following steps:
+1. Let |list| be a new object of type {{SpeechRecognitionPhraseList}}.
+1. Set |list|.{{SpeechRecognitionPhraseList/[[phrases]]}} to be the value of |phrases|.
+1. Return |list|.
+</dd>
+
+<dt><dfn attribute for=SpeechRecognitionPhraseList>length</dfn> attribute</dt>
+<dd>
+This attribute indicates the number of phrases in the list.
+When invoked, return the number of items in {{SpeechRecognitionPhraseList/[[phrases]]}}.
+</dd>
+
+<dt><dfn method for=SpeechRecognitionPhraseList>item(|index|)</dfn> method</dt>
+<dd>
+This method gets the {{SpeechRecognitionPhrase}} object at the |index| of the list.
+When invoked, run the following steps:
+1. If |index| is smaller than 0, or greater than or equal to {{SpeechRecognitionPhraseList/length}},
+throw a {{RangeError}} and abort these steps.
+1. Return the {{SpeechRecognitionPhrase}} at the |index| of {{SpeechRecognitionPhraseList/[[phrases]]}}.
+</dd>
+
+<dt><dfn method for=SpeechRecognitionPhraseList>addItem(|item|)</dfn> method</dt>
+<dd>
+This method adds the {{SpeechRecognitionPhrase}} object |item| to the list.
+When invoked, add |item| to the end of {{SpeechRecognitionPhraseList/[[phrases]]}}.
+The list is allowed to have multiple {{SpeechRecognitionPhrase}} objects with the same {{SpeechRecognitionPhrase/[[phrase]]}} value,
+and the speech recognition model should use the last {{SpeechRecognitionPhrase/[[boost]]}} value
+for this {{SpeechRecognitionPhrase/[[phrase]]}} in the list.
+</dd>
+
+<dt><dfn method for=SpeechRecognitionPhraseList>removeItem(|index|)</dfn> method</dt>
+<dd>
+This method removes the {{SpeechRecognitionPhrase}} object at the |index| of the list.
+When invoked, run the following steps:
+1. If |index| is smaller than 0, or greater than or equal to {{SpeechRecognitionPhraseList/length}},
+throw a {{RangeError}} and abort these steps.
+1. Remove the {{SpeechRecognitionPhrase}} object at the |index| of {{SpeechRecognitionPhraseList/[[phrases]]}}.
+</dd>
+</dl>
+
 <h3 id="tts-section">The SpeechSynthesis Interface</h3>
 
 <p>The SpeechSynthesis interface is the scripted web API for controlling a text-to-speech output.</p>
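The [[boost]] definition and the `addItem()` note describe how a recognizer is expected to interpret the list rather than an API surface. Boost can be read as approximately ln(k), where k is how many times more likely the site believes the phrase is than the model does. On that reading, k ≈ e^boost. The helpers below are hypothetical and not part of the spec. They sketch that reading and the rule that duplicate phrases use the last boost in the list. Two details are assumptions: comparing duplicates by exact string equality, and skipping empty phrases (drawn from the [[phrase]] note that an empty value should be ignored).

```typescript
// Hypothetical recognizer-side helpers; neither function is part of the spec.

interface PhraseEntry {
  phrase: string;
  boost: number;
}

// boost ≈ ln(k), where k is "how many times more likely" the site thinks
// the phrase is, so the implied likelihood multiplier is k ≈ e^boost.
function likelihoodMultiplier(boost: number): number {
  return Math.exp(boost);
}

// Collapse a phrase list into one boost per phrase:
// - empty phrases are skipped (they "should be ignored");
// - duplicates keep the LAST boost, as the addItem() note asks;
// - phrases are compared by exact string equality (an assumption).
function effectiveBoosts(list: readonly PhraseEntry[]): Map<string, number> {
  const result = new Map<string, number>();
  for (const entry of list) {
    if (entry.phrase === "") continue;
    result.set(entry.phrase, entry.boost); // later entries overwrite earlier ones
  }
  return result;
}
```

Under this reading, the default boost of 1.0 makes a phrase roughly 2.7 times more likely, and the maximum of 10.0 makes it roughly 22,000 times more likely. That scale fits the advice that 10.0 should rarely be set.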
