Authentication of intended speech as part of an enrollment process

Bibliographic Details
Title: Authentication of intended speech as part of an enrollment process
Patent Number: 11,869,510
Publication Date: January 09, 2024
Appl. No: 17/191538
Application Filed: March 03, 2021
Abstract: Described are systems, methods, and apparatus that detect keywords in one or more speech segments to authenticate that the speech is generated by the speaker as part of an intentional enrollment by the speaker into a service. For example, as a speech segment is received as part of an enrollment process, the speech segment may be converted into a log melspectrogram and the log melspectrogram may be processed using a machine learning model to determine if an expected keyword is represented by the log melspectrogram. If the keyword is detected, it may be determined that the speech output by the speaker is output as part of an intentional enrollment process.
Inventors: Amazon Technologies, Inc. (Seattle, WA, US)
Assignees: Amazon Technologies, Inc. (Seattle, WA, US)
Claim: 1. A computer-implemented method, comprising: presenting, on a display of a portable device, a first phrase to be spoken as part of an enrollment process; as the first phrase is presented on the portable device, recording, at the portable device, a first audio data collected by the portable device; converting, at the portable device, at least a portion of the first audio data recorded by the portable device into a first log melspectrogram; processing, at the portable device and with a machine learning model trained with log melspectrograms representative of a plurality of keywords, the first log melspectrogram to determine that at least a portion of the first log melspectrogram represents a first keyword of the plurality of keywords; presenting, on the display of the portable device, a second phrase to be spoken as part of the enrollment process; as the second phrase is presented on the portable device, recording, at the portable device, a second audio data collected by the portable device; converting, at the portable device, at least a portion of the second audio data recorded by the portable device into a second log melspectrogram; processing, at the portable device and with the machine learning model, the second log melspectrogram to determine that at least a portion of the second log melspectrogram represents a second keyword of the plurality of keywords; in response to determining that the at least a portion of the first log melspectrogram represents the first keyword and determining that the at least a portion of the second log melspectrogram represents the second keyword, confirming that the first audio data and the second audio data correspond to phrases intentionally spoken as part of the enrollment process; and in response to confirming that the first audio data and the second audio data correspond to phrases intentionally spoken: sending, to a second device that is separate from the portable device, a request for a third audio data generated by the second device; and generating an embedding vector representative of a speech, based at least in part on: at least a first portion of the first audio data recorded by the portable device; at least a second portion of the second audio data recorded by the portable device; and at least a third portion of the third audio data received from the second device.
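Illustrative sketch (not part of the patent text): claim 1 converts recorded audio into a log melspectrogram before keyword detection. The record does not state a sample rate, window size, hop size, or number of mel bands, so the values below (`sample_rate=16000`, `n_fft=64`, `hop=32`, `n_mels=8`) and all function names are assumptions chosen to keep a pure-Python example small. The mel conversion uses the common 2595·log10(1 + f/700) formula, which is one of several conventions and may differ from what the inventors used.

```python
import cmath
import math


def hz_to_mel(hz):
    # Common HTK-style mel formula (an assumed convention).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    hz_points = [mel_to_hz(i * top / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_freqs = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    filters = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        row = []
        for f in bin_freqs:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        filters.append(row)
    return filters


def power_spectrum(frame):
    """Power of a Hann-windowed frame via a naive DFT (slow but dependency-free)."""
    n = len(frame)
    windowed = [x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_mel_spectrogram(samples, sample_rate=16000, n_fft=64, hop=32, n_mels=8):
    """Return a list of frames, each a list of n_mels log mel-band energies."""
    filters = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        power = power_spectrum(samples[start:start + n_fft])
        energies = [sum(w * p for w, p in zip(row, power)) for row in filters]
        # A small floor avoids log(0) on silent bands.
        frames.append([math.log(e + 1e-10) for e in energies])
    return frames


# Usage: a synthetic 1 kHz tone stands in for recorded speech.
tone = [math.sin(2.0 * math.pi * 1000.0 * i / 16000.0) for i in range(256)]
frames = log_mel_spectrogram(tone)
```

The resulting frames-by-bands matrix is the kind of input a keyword model would consume. A practical implementation would normally use an FFT library instead of the naive DFT shown here.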
Claim: 2. The computer-implemented method of claim 1, wherein processing the first log melspectrogram further includes: generating, as output from the machine learning model, at least: a first probability score indicative of a first probability that the at least a portion of the first log melspectrogram is representative of the first keyword; and a second probability score indicative of a second probability that the at least a portion of the first log melspectrogram is representative of the second keyword.
Claim: 3. The computer-implemented method of claim 2, wherein: a first confidence threshold is associated with the first keyword; a second confidence threshold that is different than the first confidence threshold is associated with the second keyword; and processing the first log melspectrogram further includes determining that the first probability score exceeds the first confidence threshold corresponding to the first keyword.
Claim: 4. The computer-implemented method of claim 1, further comprising: determining that an order of detection of the first keyword and the second keyword corresponds to an order of presentation of the first phrase and the second phrase.
Claim: 5. The computer-implemented method of claim 1, wherein the neutral emotional speech profile indicates one or more of a pitch of a voice represented in at least the third audio data, a tone of the voice represented in at least the third audio data, or a cadence of the voice represented in at least the third audio data.
Claim: 6. The computer-implemented method of claim 1, further comprising: obtaining from a second device that is different than the portable device, the third audio data.
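Illustrative sketch (not part of the patent text): claims 2 and 3 describe a model that outputs one probability score per keyword, with each keyword compared against its own confidence threshold. The keyword names, score values, thresholds, and the `detect_keywords` helper below are hypothetical; "exceeds" is read here as a strict greater-than comparison.

```python
def detect_keywords(scores, thresholds, default_threshold=0.5):
    """Return the keywords whose probability score exceeds their own threshold.

    scores: mapping of keyword -> probability score from the model.
    thresholds: mapping of keyword -> confidence threshold for that keyword.
    default_threshold: assumed fallback for keywords without a threshold.
    """
    detected = []
    for keyword, score in scores.items():
        if score > thresholds.get(keyword, default_threshold):
            detected.append(keyword)
    return detected


# Hypothetical model output for one log melspectrogram.
scores = {"enroll": 0.82, "voice": 0.11, "profile": 0.07}
# Different keywords may carry different thresholds (claim 3).
thresholds = {"enroll": 0.80, "voice": 0.60, "profile": 0.90}
detected = detect_keywords(scores, thresholds)
```

Per-keyword thresholds let a keyword that the model tends to confuse with other sounds require a higher score than an easily recognized one.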
Claim: 7. A computing system, comprising: one or more processors; and a memory storing program instructions that, when executed by the one or more processors, cause the one or more processors to at least: receive a plurality of audio data segments representative of a speech; for each of the plurality of audio data segments: convert the audio data segment into a log melspectrogram; process the log melspectrogram to determine that a keyword of a plurality of keywords is represented by at least a portion of the log melspectrogram; and determine a keyword count indicative of a number of keywords determined to be represented in the plurality of audio data segments; determine if the keyword count exceeds an authentication threshold; and in response to a determination that the keyword count exceeds the authentication threshold: send, to a device that is separate from the computing system, a request for a second plurality of audio data segments generated by the device; and generate an embedding vector representative of at least a portion of the speech, based at least in part on: at least a first portion of the audio data received at the computing system; and at least a second portion of the second plurality of audio data segments generated by the device.
Claim: 8. The computing system of claim 7, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least: in response to a determination that the keyword count does not exceed the authentication threshold: determine that the speech is not intended as part of an enrollment in a service; and cause the plurality of audio data segments to be discarded by the computing system.
Claim: 9. The computing system of claim 8, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least: in response to the determination that the keyword count does not exceed the authentication threshold, cause a second plurality of audio data segments generated by a device that is separate from the computing system to be discarded.
Claim: 10. The computing system of claim 7, wherein the program instructions that, when executed by the one or more processors to process the log melspectrogram, further include instructions that, when executed by the one or more processors, further cause the one or more processors to at least: process the log melspectrogram using a machine learning model to determine, for each of the plurality of keywords, a probability score indicative of a probability that the keyword is represented by at least a portion of the log melspectrogram.
Claim: 11. The computing system of claim 7, wherein the program instructions that, when executed by the one or more processors to process the log melspectrogram, further include instructions that, when executed by the one or more processors, further cause the one or more processors to at least: process the log melspectrogram to determine, for each of a plurality of keywords, a probability score indicative of a probability that the keyword is represented by at least a portion of the log melspectrogram; and determine if a probability score determined for a keyword of the plurality of keywords exceeds a confidence threshold corresponding to the keyword.
Claim: 12. The computing system of claim 7, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least: determine a detection order of the keywords that are determined to be represented by at least a portion of a log melspectrogram; and determine if the detection order corresponds to an expected detection order; and wherein confirmation that the speech is intended is further in response to a determination that the detection order corresponds to the expected detection order.
Claim: 13. The computing system of claim 7, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least, for each of the plurality of audio data segments: determine that a length of the audio data segment is less than a defined length; and in response to a determination that the length of the audio data segment is less than the defined length, add a padding to the audio data segment such that the audio data segment is of the defined length.
Claim: 14. The computing system of claim 7, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least, for each of the plurality of audio data segments: determine that a length of the audio data segment is greater than a defined length; and in response to a determination that the length of the audio data segment is greater than the defined length, truncate a portion of the audio data segment such that the audio data segment is of the defined length.
Claim: 15. The computing system of claim 14, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least: determine a first portion of the audio data segment in which a keyword is anticipated; and wherein truncation of the audio data segment corresponds to a second portion of the audio data segment that is different than the first portion.
Claim: 16. The computing system of claim 7, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least: subsequent to causing the embedding vector to be generated, receive an indication of a first keyword that if detected as spoken is to cause an action to be performed; subsequent to receiving the indication of the first keyword, receive a first audio data segment; convert the first audio data segment into a first log melspectrogram; process the first log melspectrogram to determine a first probability score indicative of a first probability that the first audio data segment includes the first keyword; determine that the first probability score exceeds a first confidence threshold corresponding to the first keyword; and in response to a determination that the first probability score exceeds the first confidence threshold, cause the action to be performed.
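Illustrative sketch (not part of the patent text): claims 7 through 9 count how many audio segments contain a keyword, compare that count against an authentication threshold, and discard the audio when the threshold is not exceeded. The `authenticate_enrollment` function and the `detect` callable (a stand-in for the log melspectrogram conversion plus model inference) are hypothetical names.

```python
def authenticate_enrollment(segments, detect, authentication_threshold):
    """Decide whether speech was intended as enrollment.

    segments: sequence of audio data segments.
    detect: callable returning True when a keyword is found in a segment
            (assumed to wrap spectrogram conversion and model inference).
    authentication_threshold: the keyword count must exceed this value.

    Returns (authenticated, retained_segments). When authentication fails
    the retained list is empty, mirroring the discard step in claim 8.
    """
    keyword_count = sum(1 for segment in segments if detect(segment))
    if keyword_count > authentication_threshold:
        return True, list(segments)
    return False, []


# Usage with a toy detector: strings stand in for audio segments.
toy_segments = ["kw-one", "noise", "kw-two", "kw-three"]
ok, kept = authenticate_enrollment(toy_segments, lambda s: s.startswith("kw"), 2)
```

A count threshold below the number of prompted phrases would tolerate an occasional missed detection while still rejecting audio that contains few or no expected keywords.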
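Illustrative sketch (not part of the patent text): claim 12 (like claims 4 and 19) requires that the order in which keywords are detected corresponds to an expected order. The record does not say how "corresponds" is evaluated; the hypothetical `order_matches` helper below takes one possible reading, in which the detected keywords must appear in the same relative order as the expected sequence while missed keywords are tolerated.

```python
def order_matches(detected, expected):
    """True when `detected` is an in-order subsequence of `expected`.

    Each membership test advances the shared iterator, so a keyword that
    appears earlier in `expected` cannot be matched after a later one.
    """
    remaining = iter(expected)
    return all(keyword in remaining for keyword in detected)


# Hypothetical keyword order, following the order of presented phrases.
expected_order = ["first", "second", "third"]
in_order = order_matches(["first", "third"], expected_order)
out_of_order = order_matches(["third", "first"], expected_order)
```

A stricter reading would require `detected == expected`; which reading applies would depend on details outside this record.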
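Illustrative sketch (not part of the patent text): claims 13 through 15 bring every audio segment to a defined length by padding short segments and truncating long ones, with truncation applied away from the portion where a keyword is anticipated. The `fit_to_length` function, the zero-valued padding, and the centring strategy below are assumptions; the claims do not specify where padding is added or how the retained window is chosen.

```python
def fit_to_length(samples, defined_length, keyword_start=None, keyword_end=None):
    """Pad or truncate `samples` to exactly `defined_length` items.

    keyword_start / keyword_end: optional sample indices (end exclusive)
    bounding the region where a keyword is anticipated. When given, the
    retained window is centred on that region so truncation falls elsewhere.
    """
    n = len(samples)
    if n < defined_length:
        # Assumed padding: append silence (zeros) at the end.
        return list(samples) + [0.0] * (defined_length - n)
    if n == defined_length:
        return list(samples)
    start = 0
    if keyword_start is not None and keyword_end is not None:
        centre = (keyword_start + keyword_end) // 2
        start = centre - defined_length // 2
        # Clamp so the window stays inside the segment.
        start = max(0, min(start, n - defined_length))
    return list(samples[start:start + defined_length])


# Usage: integers stand in for audio samples.
padded = fit_to_length([1, 2, 3], 5)
trimmed = fit_to_length(list(range(10)), 4, keyword_start=6, keyword_end=8)
```

Fixed-length input is a common requirement for models that consume spectrograms of a fixed shape, which may be why the claims normalize segment length before conversion.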
Claim: 17. An apparatus, comprising: a microphone; one or more processors communicatively coupled to the microphone; and a memory storing program instructions that, when executed by the one or more processors, cause the one or more processors to at least: receive, from the microphone and as part of an enrollment process, a first audio data segment; convert the first audio data segment into a first log melspectrogram; process the first log melspectrogram to determine that at least a portion of the first log melspectrogram represents a first keyword; receive, from the microphone and as part of the enrollment process, a second audio data segment; convert the second audio data segment into a second log melspectrogram; process the second log melspectrogram to determine that at least a portion of the second log melspectrogram represents a second keyword; and in response to a determination that the at least a portion of the first log melspectrogram represents the first keyword and a determination that the at least a portion of the second log melspectrogram represents the second keyword, authenticate the first audio data segment and the second audio data segment as intentionally spoken as part of the enrollment process; in response to authentication that the first audio data segment and the second audio data segment correspond to phrases intentionally spoken: cause a recording of the phrases intentionally spoken to be transmitted from a second device to the apparatus; and generate an embedding vector based at least in part on the phrases, the first audio data segment, and the second audio data segment.
Claim: 18. The apparatus of claim 17, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least: in response to a determination that at least a portion of the first log melspectrogram represents the first keyword, increase a keyword count; in response to a determination that at least a portion of the second log melspectrogram represents the second keyword, increase the keyword count; receive, from the microphone and as part of the enrollment process, a third audio data segment; convert the third audio data segment into a third log melspectrogram; process the third log melspectrogram to determine that the third log melspectrogram does not represent a third keyword; in response to a determination that the third log melspectrogram does not represent the third keyword, not increase the keyword count; and determine that the keyword count exceeds an authentication threshold; and wherein the program instruction that, when executed by the one or more processors to authenticate the first audio data segment and the second audio data segment, are further executed in response to a determination that the keyword count exceeds the authentication threshold.
Claim: 19. The apparatus of claim 17, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least: determine that an order in which the first keyword is determined and the second keyword is determined corresponds to an expected keyword order; and wherein the program instructions that, when executed by the one or more processors to authenticate the first audio data segment and the second audio data segment, are further executed in response to a determination that the order in which the first keyword is determined and the second keyword is determined correspond to the expected keyword order.
Claim: 20. The apparatus of claim 17, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least: determine that a length of the second audio data segment is greater than a defined length; and in response to a determination that the length of the second audio data segment is greater than the defined length, segment the second audio data segment into a third audio data segment of the defined length and a fourth audio data segment of the defined length; and wherein the program instructions that, when executed by the one or more processors to convert the second audio data segment into the second log melspectrogram, further include instructions that, when executed by the one or more processors, further cause the one or more processors to at least: convert the third audio data segment into a third log melspectrogram; and convert the fourth audio data segment into a fourth log melspectrogram; and wherein the program instructions that, when executed by the one or more processors to process the second log melspectrogram, further include instructions that, when executed by the one or more processors, further cause the one or more processors to at least: process the third log melspectrogram and the fourth log melspectrogram to determine that at least a portion of the third log melspectrogram or at least a portion of the fourth log melspectrogram represent the second keyword.
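Illustrative sketch (not part of the patent text): claim 20 splits an over-long audio segment into sub-segments of the defined length, processes each one separately, and treats the keyword as detected if any sub-segment contains it. The function names and the `detect` callable are hypothetical, and zero-padding the final chunk up to the defined length is an assumption, since the claim only states that each resulting segment has the defined length.

```python
def split_into_segments(samples, defined_length):
    """Split `samples` into consecutive chunks of exactly `defined_length`."""
    segments = []
    for start in range(0, len(samples), defined_length):
        chunk = list(samples[start:start + defined_length])
        # Assumed handling of a short final chunk: pad with zeros.
        chunk += [0.0] * (defined_length - len(chunk))
        segments.append(chunk)
    return segments


def keyword_in_any_segment(samples, defined_length, detect):
    """True when `detect` reports a keyword in at least one sub-segment.

    detect: callable standing in for spectrogram conversion plus inference.
    """
    return any(detect(segment)
               for segment in split_into_segments(samples, defined_length))


# Usage: integers stand in for audio samples; the toy detector looks for 4.
parts = split_into_segments([1, 2, 3, 4, 5], 3)
found = keyword_in_any_segment([1, 2, 3, 4, 5], 3, lambda seg: 4 in seg)
```

Non-overlapping chunks can cut a keyword in half at a boundary; overlapping windows would be one way to reduce that risk, though the claim does not describe them.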
Patent References Cited: 20050288930 December 2005 Shaw
20060149558 July 2006 Kahn
20090241201 September 2009 Wootton
20150301796 October 2015 Visser
20180261213 September 2018 Arik
20190243956 August 2019 Sheets
20210233525 July 2021 Jaiswal
20210264948 August 2021 Mizutani
20210272584 September 2021 McAlpine
20220130415 April 2022 Garrison
20220343895 October 2022 Tomar
Assistant Examiner: Mueller, Paul J.
Primary Examiner: Washburn, Daniel C
Attorney, Agent or Firm: Athorus, PLLC
Document Code: edspgr.11869510
Database: USPTO Patent Grants