SR-06/SR-07 Speech Recognition Kit

Construction Manual & User Guide

Published by Images SI Inc.

109 Woods of Arden Road, Staten Island NY 10312

Voice: 718.966.3694 | Fax: 718.966.3695

Introduction

The speech recognition kit is a complete, easy-to-build, programmable speech recognition circuit. It is programmable in the sense that you train the words (or vocal utterances) you want the circuit to recognize. This kit allows you to experiment with many facets of speech recognition technology.

Features of the kit include:

Self-contained, stand-alone speech recognition circuit
User programmable
40 or 20 word vocabulary
Multi-lingual capability
Non-volatile memory backup
Easily interfaced to control external circuits & appliances

Speech recognition is poised to become a primary method for controlling appliances, toys, tools, and computers. At its most basic level, a speech-controlled appliance or tool lets the user perform parallel tasks: the hands and eyes can stay busy elsewhere while the user operates the tool or appliance by voice.

The core of the circuit is the HM2007 speech recognition IC. This IC can recognize either 40 words, each with a length of 0.96 seconds, or 20 words, each with a length of 1.92 seconds.

Applications

There are several areas for the application of voice recognition technology:

Speech controlled appliances and toys
Speech assisted computer games
Speech assisted virtual reality
Telephone assistance systems
Voice recognition security
Speech to speech translation

Circuit Construction

The SR-07 Speech Recognition Circuit schematic is shown in Figure 1. The SR-07 uses three separate printed circuit boards (PCBs). Components are mounted on the top side of each PCB, which carries the white silk screen component drawings. The leads are soldered on the opposite side of the PCB, and any excess wire is clipped off after soldering.

Chip Installation

When installing integrated circuit (IC) chips, first identify the top of the chip. The top is typically marked with a half-circle cutout or a small mark next to pin 1. Match this mark to the white silk screen drawing (usually a half-circle cutout) or the parts placement drawing, then install the IC into its socket.

Figure 2: Diagram showing component placement on the display board and a photograph of the finished display board with two 7-segment LED displays.

Display Board:

Construction begins with the display board (see Figure 2). Mount and solder the sixteen 220 ohm resistors (color bands red-red-brown, followed by gold or silver). Next, solder the two 14-pin sockets for the LED displays (U8 and U9). Install the LED displays into the sockets so that the dots on the display chips face the bottom of the PCB.

Mount and solder the two 16-pin sockets for the 4511 ICs (U4 and U5), ensuring correct orientation. Install the 4511 ICs into their sockets, again ensuring proper orientation. Below U4, there are three solder pads in a row; solder a jumper wire from the center pad to the right pad marked with a "C". Finish the display board by mounting and soldering the 10-pin female header to the PCB.

Figure 3: Photograph of the main circuit board illustrating component placement.

Main Circuit Board:

The PCB layout for the main board is shown in Figure 3. Begin construction by mounting and soldering the three IC sockets: the HM2007 PLCC uses a 52-pin square socket (U1), the 8K static RAM uses a 28-pin socket (U2), and the 74LS373 uses a 20-pin socket (U3).

Mount and solder resistor R1 (100K, brown-black-yellow-gold). Solder resistor R2 (6.8K, blue-grey-red-gold). Solder resistor R3 (22K, red-red-orange-gold). Solder resistor R4 (330 ohm, orange-orange-brown-gold).
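The color bands listed above can be checked against the standard resistor color code before soldering. The short Python sketch below is only an illustration; the function name resistance_ohms and the table names are made up for this example and are not part of the kit.

```python
# Standard resistor color code: digit value of each band color.
DIGITS = {
    "black": 0, "brown": 1, "red": 2, "orange": 3, "yellow": 4,
    "green": 5, "blue": 6, "violet": 7, "grey": 8, "white": 9,
}

def resistance_ohms(band1, band2, multiplier):
    """Decode the first three bands of a four-band resistor into ohms."""
    return (DIGITS[band1] * 10 + DIGITS[band2]) * 10 ** DIGITS[multiplier]

# Band colors as listed in this manual. The fourth band (gold or silver)
# is the tolerance and does not change the nominal value.
KIT_RESISTORS = {
    "R1": ("brown", "black", "yellow"),            # 100K
    "R2": ("blue", "grey", "red"),                 # 6.8K
    "R3": ("red", "red", "orange"),                # 22K
    "R4": ("orange", "orange", "brown"),           # 330 ohm
    "display resistors": ("red", "red", "brown"),  # 220 ohm
}

for name, bands in KIT_RESISTORS.items():
    print(name, resistance_ohms(*bands), "ohms")
```

If a resistor decodes to a different value than the one printed in the instructions, re-read the bands from the other end before mounting it.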

Mount and solder diodes D1 and D2, ensuring the black band faces the correct direction as shown in the drawings. Mount and solder the 3.57 MHz crystal (XTAL). Mount and solder the red LED next, aligning its short lead with the flat side of the silkscreen circle marked LED.

Mount and solder capacitors C1 to C7. C2 and C3 are small 22 pF capacitors. C5, C6, and C7 are 0.1 uF capacitors. C4 is a 0.0047 uF capacitor. C1 can be any value between 47 and 100 uF.

Mount and solder the 7805 voltage regulator and the on-off slide switch. Mount and solder the microphone jack, button battery holder, and the 9-volt battery cap. Keep the wires on the 9-volt battery cap short, approximately 1.5 inches.

Mount and solder the 10-pin right-angle header in the upper left corner of the board (identified as R1). Mount and solder the 7-pin right-angle header in the lower left corner of the board.

Mount and solder a 2-pin header in the WD location next to R4. Install the integrated circuits into their appropriate IC sockets, ensuring correct orientation.

Keypad

The keypad is constructed using 12 normally open momentary contact switches. Place each switch in its mounting position and bend the leads inward to secure it to the PCB for soldering. After mounting and soldering the 12 keypad switches to the top of the keypad PCB, connect the 7-pin female header to the bottom of the keypad PCB.

Non-Volatile Memory Back-up

The PCB-mounted coin battery holder accommodates a 2032 coin battery, which supplies backup power for the SRAM. This allows word patterns to be retained in memory even when the main circuit is turned off.

Selecting Vocabulary Size and Word Length

Figure 4: Diagram showing the keypad configuration and a photograph of the assembled keypad.

The default vocabulary and word configuration for the circuit is 40 words, each with a length of 0.96 seconds. To change this to a 20-word configuration (1.92 seconds each), place a jumper on the two-pin WD header. If the 40-word vocabulary is not needed, configuring the circuit for the 20-word vocabulary is suggested, as this configuration usually provides better recognition accuracy.
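The two configurations trade vocabulary size against word length: either way, the total amount of speech that can be stored works out to the same 38.4 seconds. The sketch below shows the arithmetic in Python, using milliseconds to keep the numbers whole. The names MODES, vocabulary_size, and total_speech_ms are invented for this illustration.

```python
# Word length in milliseconds for each vocabulary size, as given in this manual.
MODES = {40: 960, 20: 1920}

def vocabulary_size(wd_jumper_installed):
    """Vocabulary size selected by the two-pin WD header."""
    return 20 if wd_jumper_installed else 40

def total_speech_ms(words):
    """Total trainable speech, in milliseconds, for a given vocabulary size."""
    return words * MODES[words]

for jumper in (False, True):
    words = vocabulary_size(jumper)
    print("WD jumper installed:", jumper,
          "| words:", words,
          "| ms per word:", MODES[words],
          "| total ms:", total_speech_ms(words))
```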

Using The Speech Recognition Circuit

The keypad and digital display are used to communicate with and program the HM2007 chip. Plug the digital display into the 10-pin header on the main circuit board. Plug the keypad into the 7-pin header on the main circuit board. Plug the headset microphone into the microphone jack. Adjust the microphone to be positioned about 1 inch away from your mouth.

Keypad Use:

The keypad is made up of 12 normally open momentary contact switches.

Keypad layout: 1, 2, 3, 4, 5, 6, 7, 8, 9, ❌ CLR, 0, ▶️ TRN.

Training Words for Recognition

The ❌ CLR key functions as Clear, and the ▶️ TRN key functions as Train.

When the circuit is turned on, "00" appears on the digital display, and the red LED (READY) is lit, indicating the circuit is waiting for a command.

Photograph of the finished SR-07 circuit.

To Train:

Press "1" on the keypad (the display will show "01" and the LED will turn off). Then, press the ▶️ TRN key (the LED will turn on) to place the circuit in training mode for word one.

Say the target word clearly into the headset microphone. The circuit signals acceptance of the voice input by blinking the LED off then on. The word (or utterance) is now identified as the "01" word. If the LED did not flash, restart by pressing "1" and then the ▶️ TRN key.

You can continue training new words. Press "2" then ▶️ TRN to train the second word, and so on. The circuit can accept and recognize up to 40 words (numbers 1 through 40). It is not necessary to train all word spaces; if you only require 10 target words, that is all you need to train.

Testing Recognition:

Repeat a trained word into the microphone. The number of the word should be displayed on the digital display. For example, if the word "directory" was trained as word number 25, saying "directory" into the microphone will cause the number 25 to be displayed.

Error Codes

The chip provides the following error codes:

55 = word too long
66 = word too short
77 = no match
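When the display is read by a person or by other equipment, the error codes must be told apart from word numbers. The Python sketch below is one possible way to interpret a two-digit reading; the function interpret_display and its return strings are hypothetical and only restate the codes listed above.

```python
# Error codes as listed in this manual.
ERROR_CODES = {55: "word too long", 66: "word too short", 77: "no match"}

def interpret_display(value, vocabulary_size=40):
    """Describe a two-digit display reading (assumed to be 0 through 99)."""
    if value in ERROR_CODES:
        return "error: " + ERROR_CODES[value]
    if value == 0:
        return "ready"  # "00" is shown at power-up while waiting for a command
    if 1 <= value <= vocabulary_size:
        return "recognized word {:02d}".format(value)
    return "unexpected reading"

for reading in (0, 25, 55, 66, 77):
    print("{:02d}".format(reading), "->", interpret_display(reading))
```

Because the error codes lie above 40, they cannot be confused with a valid word number in either vocabulary configuration.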

Clearing Memory

To erase all words in memory, press "99" and then "❌ CLR". The numbers displayed will be "19"; this is not an error. The numbers will quickly scroll by on the digital display as the memory is erased.

Changing & Erasing Words

Trained words can be easily changed by overwriting the original word. For instance, if word six was "Capital" and you want to change it to "State", simply retrain the word space by pressing "6", then the ▶️ TRN key, and saying "State" into the microphone.

To erase a word without replacing it, press the word number (e.g., six), then press the ❌ CLR key. Word six is now erased.
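The training, retraining, and erasing procedures can be summarized as a small model of the keypad. The Python sketch below is a toy illustration of the key sequences described in this manual, not a model of the HM2007 internals. The class KeypadModel and its method names are invented for this example, and the spoken word is passed in as a text label in place of real audio.

```python
class KeypadModel:
    """Toy model of the keypad procedure (train, retrain, erase, clear all)."""

    def __init__(self, vocabulary_size=40):
        self.vocabulary_size = vocabulary_size
        self.words = {}   # word number -> label standing in for the trained word
        self.entry = ""   # digits keyed in so far (two-digit display)

    def press_digit(self, digit):
        # Keep only the last two digits, like a two-digit display.
        self.entry = (self.entry + str(digit))[-2:]

    def _take_entry(self):
        number = int(self.entry or "0")
        self.entry = ""
        return number

    def press_trn(self, spoken_word):
        """TRN key: train (or overwrite) the word space that was keyed in."""
        number = self._take_entry()
        if not 1 <= number <= self.vocabulary_size:
            raise ValueError("no such word space: %d" % number)
        self.words[number] = spoken_word

    def press_clr(self):
        """CLR key: erase one word space, or all of memory after keying 99."""
        number = self._take_entry()
        if number == 99:
            self.words.clear()
        else:
            self.words.pop(number, None)

kit = KeypadModel()
kit.press_digit(6); kit.press_trn("Capital")   # train word six
kit.press_digit(6); kit.press_trn("State")     # retraining overwrites it
print(kit.words)
kit.press_digit(6); kit.press_clr()            # erase word six only
print(kit.words)
```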

Simulated Independent Recognition

The speech recognition system is speaker-dependent, meaning the voice that trained the system yields the highest recognition accuracy. However, you can simulate independent speech recognition.

To simulate speaker independence, use more than one word space for each target word. Set the SR-07 for a 40-word vocabulary and use four word spaces per target word. This allows for four different enunciations of each target word (speaker independent).

The four word spaces are chosen to keep the software and hardware interfacing as simple as possible. This is accomplished by ensuring all four word spaces share the same Least Significant Digit (LSD). By decoding only the LSD number on the digital display, the target word can be recognized regardless of which of its four word spaces matched.

Using this procedure, word spaces 01, 11, 21, and 31 are allocated to the first target word. The Most Significant Digit (MSD) is dropped by the interfacing circuits. By decoding only the LSD number (e.g., 1 of "X1", where X is any number), the target word can be recognized.

Continue this for the remaining word spaces. For instance, the second target word will use word spaces 02, 12, 22, and 32. Continue this process until all words are programmed.
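The allocation scheme above can be written out as a short calculation. The Python sketch below lists the word spaces that share a given LSD and maps a recognized word number back to its target word. The function names word_spaces and target_word are made up for this illustration; the handling of LSD 0 (word spaces 10, 20, 30, and 40) is an assumption that follows from the 01 to 40 numbering.

```python
def word_spaces(lsd, copies=4):
    """Word spaces in the 40-word vocabulary that share the digit `lsd`."""
    if not 0 <= lsd <= 9:
        raise ValueError("LSD must be a single digit")
    first = lsd if lsd != 0 else 10   # there is no word space 00
    return [first + 10 * i for i in range(copies)]

def target_word(recognized):
    """Drop the MSD of a recognized word number, keeping only the LSD.

    Readings outside 1..40 (for example the error codes 55, 66, 77)
    are rejected by returning None.
    """
    if not 1 <= recognized <= 40:
        return None
    return recognized % 10

print(word_spaces(1))    # word spaces for the first target word
print(word_spaces(2))    # word spaces for the second target word
print(target_word(31))   # LSD shared by 01, 11, 21 and 31
```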

When experimenting with speaker independence, use different people to train a target word. This enables the system to recognize different voices, inflections, and enunciations of the target word. Allocating more system resources for independent recognition makes the circuit more robust.

For designing the most robust and accurate system possible, train target words using one voice with varying inflections and enunciations.

Rhyming words

Rhyming words sound alike (e.g., cat, bat, sat, fat). Because of their similar sounds, they can confuse the speech recognition circuit. When choosing target words, avoid using rhyming words.

The Voice With Stress & Excitement

Stress and excitement alter a person's voice, affecting the accuracy of the circuit's recognition. For example, if you are at your workbench programming target words like "fire", "left", "right", "forward", etc., and then use the circuit to control a flight simulator game, you might yell "FIRE!... Fire!...FIRE!!...LEFT...go RIGHT!". In the heat of action, your voice will sound very different than when you were sitting calmly programming the circuit. To achieve higher accuracy, mimic the excitement in your voice when programming the circuit.

These factors are important for achieving the highest possible accuracy. This becomes increasingly critical when the speech recognition circuit is used outside the lab in real-world applications.

Interfacing The Circuit To The Outside World

The circuit design for interfacing the speech recognition system to the outside world controls ten switches. This design idea aligns with the robust speech recognition system discussed previously. While the effective vocabulary may drop from forty to ten words, this approach yields a more robust and accurate system.

Interface Circuit

Figure 5: Schematic diagrams for interface circuits, illustrating control of loads using a 4028 decoder, 74LS373 buffer, 4013 flip-flop, and relays.

The interface circuit connects to the 10-pin Right Angle interface header on the circuit board, which is also used for the Digital Display board.

The 4028 has ten output lines. The output line whose number matches the LSD shown on the display is brought high. This high signal can be connected to an NPN transistor to control a DC load (as shown in box A), or it can control an AC or DC load through a simple relay (as shown in box B).
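The decoding step can be pictured with a few lines of Python. This is a behavioral sketch only: decode_4028 and outputs_for_display are hypothetical names, and only the digits 0 through 9 are modelled. How the real 4028 responds to other input codes is not covered here.

```python
def decode_4028(bcd_digit):
    """Return the ten output levels (index 0..9) for one BCD digit."""
    if not 0 <= bcd_digit <= 9:
        raise ValueError("only BCD digits 0-9 are modelled in this sketch")
    return [1 if line == bcd_digit else 0 for line in range(10)]

def outputs_for_display(display_value):
    """Decode only the least significant digit of a two-digit reading."""
    lsd = display_value % 10
    return decode_4028(lsd)

# Word number 25 on the display brings output line 5 high.
print(outputs_for_display(25))
```

Exactly one of the ten lines is high for any reading, which is why only one switch can be on at a time with this simple arrangement.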

A disadvantage of this simple setup is that only one switch out of ten can be turned on at any given time. A solution is to insert a flip-flop (shown in box C) between the 4028 and the NPN transistor. The 4013 IC contains two flip-flops; only one is shown in the drawing. The flip-flop acts as simple memory: when the input line goes high, its output line goes high, turning on the NPN transistor. When the input line goes low, the output stays high. A second high signal on the flip-flop's input line brings the output low.

Consider a real-world example: a printer powered through the speech recognition circuit and a 4013-controlled switch or relay. If the target word is "printer", speaking the command word "printer" applies power and turns the printer on. Other circuits connected to the speech board can then be turned on or off, because the 4013 keeps its output high even after its input signal goes low. To turn the printer off, repeat the command "printer". The second time the line goes high, the 4013 output goes low. The same command therefore turns the unit both on and off, and each line can be toggled without affecting the state of the other output lines.
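The toggling behavior in the printer example can be sketched in Python as follows. The class ToggleFlipFlop is a hypothetical behavioral model that assumes the flip-flop changes state on each low-to-high transition of its input; the actual 4013 wiring is the one shown in box C of Figure 5.

```python
class ToggleFlipFlop:
    """Behavioral model: the output flips on each rising edge of the input."""

    def __init__(self):
        self.output = 0
        self._last_input = 0

    def update(self, input_level):
        # Only a low-to-high transition changes the stored state.
        if input_level and not self._last_input:
            self.output ^= 1
        self._last_input = input_level
        return self.output

printer = ToggleFlipFlop()
print(printer.update(1))  # "printer" spoken: line goes high, printer turns on
print(printer.update(0))  # line goes low again: printer stays on
print(printer.update(1))  # "printer" spoken a second time: printer turns off
print(printer.update(0))  # line goes low: printer stays off
```

Giving each output line its own flip-flop lets every load hold its state while other commands are being recognized.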

Voice Security System

This circuit is not designed for commercial voice security applications, but experimentation for this purpose is encouraged. A common approach involves using three or four keywords spoken and recognized in sequence to unlock or grant entry.
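For experimenters, the keyword-sequence idea can be tried out in software before any hardware is built. The Python sketch below is a minimal illustration that assumes the recognized word numbers are already available to a program; the class SequenceLock and the example word numbers are invented for this sketch. Any wrong word resets progress, so the keywords must be recognized in order.

```python
class SequenceLock:
    """Unlocks only when the secret word numbers are heard in order."""

    def __init__(self, secret):
        if not secret:
            raise ValueError("the secret sequence must not be empty")
        self.secret = list(secret)
        self.position = 0   # how many keywords have been matched so far

    def hear(self, word_number):
        """Feed one recognized word number; return True when unlocked."""
        if word_number == self.secret[self.position]:
            self.position += 1
        else:
            # Wrong word: start over, counting it if it is the first keyword.
            self.position = 1 if word_number == self.secret[0] else 0
        if self.position == len(self.secret):
            self.position = 0
            return True
        return False

lock = SequenceLock([3, 17, 8])   # three hypothetical keyword numbers
for word in (3, 17, 8):
    print(word, lock.hear(word))
```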

CPU Mode

The HM2007 speech recognition chip has a CPU mode for use when connected to a host computer system or microcontroller. Interfacing the HM2007 to a host computer requires writing driver software and designing/building the hardware interface to the computer data bus.

Aural Interfaces

It has been found that mixing visual and aural information is not effective. Products requiring visual confirmation of an aural command significantly reduce efficiency. To create an effective Aural User Interface (AUI), products need to understand (recognize) commands given in an unstructured and efficient manner, similar to how people typically communicate verbally.

Learning To Listen

The ability to listen to one person speak among several at a party is beyond the capabilities of current speech recognition systems. Speech recognition systems cannot yet separate and filter out extraneous noise.

Speech recognition is not the same as understanding speech. Understanding the meaning of words is a higher intellectual function. While a circuit can respond to a vocal command, it does not mean it understands the spoken command. In the future, voice recognition systems may be able to distinguish nuances of speech and meanings of words, enabling them to "Do what I mean, not what I say!"

Speaker Dependent / Speaker Independent

Speech recognition is divided into two broad processing categories: speaker-dependent and speaker-independent.

Speaker-dependent systems are trained by the individual user. These systems can achieve a high command count and over 95% accuracy for word recognition. The drawback is that the system only responds accurately to the individual who trained it. This is the most common approach in software for personal computers.

Speaker-independent systems are trained to respond to a word regardless of who speaks. Therefore, the system must respond to a wide variety of speech patterns, inflections, and enunciations of the target word. The command word count is typically lower than speaker-dependent systems, but high accuracy can still be maintained within processing limits. Industrial applications often require speaker-independent voice recognition systems.

Recognition Style

In addition to speaker-dependent/independent classification, speech recognition also considers the style of speech it can recognize. There are three styles: isolated, connected, and continuous.

Isolated:

Words are spoken separately or isolated. This is the most common speech recognition system available today. The user must pause between each word or command spoken.

Connected:

This is an intermediate step between isolated word and continuous speech recognition. It allows users to speak multiple words. The HM2007 can be configured to identify words or phrases up to 1.92 seconds in length, reducing the word recognition dictionary number to 20.

Continuous:

This is the natural conversational speech used in everyday life. It is extremely difficult for a recognizer to sift through the sound as words tend to merge. For instance, "Hi, how are you doing?" might sound like "Hi, howyadoin" to a computer. Continuous speech recognition systems are on the market and under continual development.

More On The HM2007 Chip

The HM2007 is a CMOS voice recognition LSI (Large Scale Integration) circuit. The chip includes an analog front end, voice analysis, recognition, and system control functions. It can be used in a stand-alone mode or connected to a CPU.

Features:

Single-chip voice recognition CMOS LSI
Speaker dependent
External RAM support
Maximum 40-word recognition (0.96 second per word)
Maximum word length 1.92 seconds (20 words)
Microphone support
Manual and CPU modes available
Response time less than 300 milliseconds
5V power supply

More information on the HM2007 chip is available in the HM2007 data booklet (DS-HM2007).

Parts List

Placement	Item	Quantity
	PCB	3 pieces
Keypad	Push-button Switches	12
U1	HM2007 PLCC	1
	52-pin socket	1
U2	SRAM 8K X 8	1
	28-pin socket	1
U3	74LS373	1
	20-pin socket	1
	7805 Voltage Regulator	1
U4 U5	4511	2
	16-pin socket	2
U6 U7	220ohm 1/8W Resistors	16
U8 U9	7-Segment Displays	2
	14-pin socket	2
X1	XTAL 3.57 MHz	1
S1	Toggle Switch	1
BT1	9V Battery Snap	1
BT2	Coin Battery Holder	1
R1	100K 1/4W Resistor	1
R2	6.8K 1/4W Resistor	1
R3	22K 1/4W Resistor	1
R4	330ohm 1/4W Resistor	1
C1	100 uF Capacitor	1
C2 C3	22 pF Capacitors	2
C4	.0047 uF Capacitor	1
C5 C6 C7	.01 uF Capacitors	3
D1 D2	1N914 diodes	2
D3	Red LED	1
	9V Battery Holder	1
P1	PC mount microphone jack	1
P5	2-position header	1
	Jumper	1
	Headset Microphone	1
	3V Coin Battery	1
	2-56 Hex nuts, Screws & Lock washers	2 each
	7-pin headers (male and female)	1 each
	10-pin headers (male and female)	1 each