There is an amazing file of words and pronunciation information at Carnegie Mellon University, with over 100 thousand words and variations. From their web site: "Its entries are particularly useful for speech recognition and synthesis, as it has mappings from words to their pronunciations in the ARPAbet phoneme set, a standard for English pronunciation. The current phoneme set contains 39 phonemes, vowels carry a lexical stress marker". A typical line in the file looks like this:
RUNAROUND  R AH1 N AH0 R AW2 N D
The file is easy to work with: every line ends with a single newline character, the word is separated from the data by two spaces, and each data item is separated from the next by a single space. The data items are described in another file that has entries like this:
Phoneme  Example  Translation
-------  -------  -----------
AA       odd      AA D
AE       at       AE T
AH       hut      HH AH T
AO       ought    AO T
and so on, for all 39 phonemes.
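To make the line format concrete, here is a minimal JSL sketch that splits one dictionary line into its word and its phonemes. It uses Words() rather than the pattern matcher this post is about, and the names line, parts, word, and phonemes are made up for the illustration.

```jsl
// One line in the format described above: word, two blanks, then phonemes separated by single blanks.
line = "RUNAROUND  R AH1 N AH0 R AW2 N D";
// Words() treats a run of delimiters as one break, so the two blanks do not produce an empty item.
parts = Words( line, " " );
word = parts[1];                          // the first item is the word itself
phonemes = parts[2 :: N Items( parts )];  // everything after the word
Show( word, phonemes );
```

This is fine for a single line, but it does no validation of the phonemes; the pattern-based loader below checks every symbol as it goes.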
Here's some JSL to load the 100K+ words into an associative array.
source = Load Text File( "$DESKTOP/grammar/cmudict-0.7b", Charset( "windows-1252" ) );
symbols = Load Text File( "$DESKTOP/grammar/cmudict-0.7b.symbols" );
symlist = Words( symbols, "\!n" );
symPat = symlist; // make a copy
Substitute Into( symPat, Expr( {} ), Expr( Pat Altern() ) ); // convert list to alternation pattern
sympat = Eval( sympat ); // build the pattern, once
lineEnding = "\!n";
commentLine = ";" + Pat Break( lineEnding ) + lineEnding;
dataLine = Pat Break( " " ) >> word + " " + Pat Repeat( " " + symPat ) >> pronounce[word] + lineEnding;
pronounce = Associative Array();
rc = Pat Match( source, Pat Pos( 0 ) + Pat Repeat( (dataLine | commentLine) + Pat Fence() ) + Pat R Pos( 0 ) );
If( rc, Show( pronounce["ELEPHANTS"] ), Print( "failed" ) );
pronounce["ELEPHANTS"] = " EH1 L AH0 F AH0 N T S";
Lines 1 and 2 load the downloaded files from a folder on the desktop into JSL strings. The huge file has entries like the RUNAROUND example above. The tiny file has a single phoneme symbol on each line. These symbols are loaded into a list on line 3: the Words() function splits the text at each newline into a list of individual symbols. symlist holds that list, and on line 4 a copy is made in symPat.
On line 5, Substitute Into() manipulates the JSL list expression of comma-separated strings, turning it into a call to Pat Altern() with the same comma-separated strings as arguments. Line 6 evaluates the Pat Altern() call, producing an actual pattern value. This is roughly equivalent to somehow constructing this JSL and evaluating it:
sympat = "AA" | "AA0" | ... | "Z" | "ZH"; // every symbol in the symbols file: the 39 phonemes plus the stress-marked vowel variants
Manipulating expressions is an advanced topic, and can be very useful.
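Here is a small sketch of the same list-to-pattern trick on a three-item list, so the intermediate expression is short enough to look at. The names tiny and tinyPat are invented for this example, and the three symbols are just a sample from the phoneme table.

```jsl
tiny = {"AA", "AE", "AH"};                                  // a three-symbol stand-in for symlist
Substitute Into( tiny, Expr( {} ), Expr( Pat Altern() ) );  // swap the list head for Pat Altern
Show( Name Expr( tiny ) );                                  // display the rewritten expression without evaluating it
tinyPat = Eval( tiny );                                     // build the pattern value, as line 6 does
// "AE" is one of the alternatives; "ZZ" contains none of them.
Show( Pat Match( "AE", tinyPat ), Pat Match( "ZZ", tinyPat ) );
```

The first Pat Match is expected to succeed and the second to fail, since Pat Match returns 1 only when some alternative is found in the source string.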
Line 7 assigns a simple newline string to a JSL variable; the variable name is easier to read than the escape sequence. Line 8 describes what a comment line looks like in the huge file. Here's the block of comments that Line 8 will be skipping over:
;;; # CMUdict -- Major Version: 0.07
;;; # ========================================================================
;;; # Copyright (C) 1993-2015 Carnegie Mellon University. All rights reserved.
;;; # Redistribution and use in source and binary forms, with or without
;;; # modification, are permitted provided that the following conditions
;;; # are met:
;;; # 1. Redistributions of source code must retain the above copyright
;;; # notice, this list of conditions and the following disclaimer.
;;; # The contents of this file are deemed to be source code.
;;; # 2. Redistributions in binary form must reproduce the above copyright
;;; # notice, this list of conditions and the following disclaimer in
;;; # the documentation and/or other materials provided with the
;;; # distribution.
;;; # This work was supported in part by funding from the Defense Advanced
;;; # Research Projects Agency, the Office of Naval Research and the National
;;; # Science Foundation of the United States of America, and by member
;;; # companies of the Carnegie Mellon Sphinx Speech Consortium. We acknowledge
;;; # the contributions of many volunteers to the expansion and improvement of
;;; # this dictionary.
;;; # THIS SOFTWARE IS PROVIDED BY CARNEGIE MELLON UNIVERSITY ``AS IS'' AND
;;; # ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
;;; # THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
;;; # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL CARNEGIE MELLON UNIVERSITY
;;; # NOR ITS EMPLOYEES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
;;; # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
;;; # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
;;; # DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
;;; # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
;;; # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
;;; # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Line 9 describes the data lines in the huge file. It does a lot of work, so here it is again:
dataLine = Pat Break( " " ) >> word + " " + Pat Repeat( " " + symPat ) >> pronounce[word] + lineEnding;
Pat Break() matches text up to a blank and stores the text in word. There are two blanks after the word; one is matched before Pat Repeat and the other is matched the first time Pat Repeat runs its pattern. The Pat Repeat pattern uses symPat to validate every phoneme; if a bad phoneme were present, the Pat Match would fail and only part of the dictionary would be loaded into the associative array.
Pat Repeat is GREEDY and goes as far as it can, gobbling up blanks and symPats. When it hits the lineEnding, it can't go further, and the >> immediate assignment operator stores the repeated match into pronounce[word]. MAGIC! The value of word used to index pronounce is the value just matched! A dictionary entry was just created!
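To watch that happen on a single entry, here is a cut-down sketch of the data-line pattern applied to a one-line string. It is a simplification, not the loader itself: Pat Span() over uppercase letters and the digits 0 to 2 stands in for symPat, so phonemes are not validated against the symbols file, and the names line and onePat are invented for the example.

```jsl
pronounce = Associative Array();
line = "ZEBRA  Z IY1 B R AH0\!n";   // one entry in the file's format, ending with a newline
// Same shape as the data-line pattern, with a looser stand-in for symPat.
onePat = Pat Break( " " ) >> word + " " + Pat Repeat( " " + Pat Span( "ABCDEFGHIJKLMNOPQRSTUVWXYZ012" ) ) >> pronounce[word] + "\!n";
rc = Pat Match( line, onePat );
// If the match succeeded, the entry was created during the match, keyed by the word just captured.
Show( rc, word, pronounce["ZEBRA"] );
```

The stored value should have the same shape as the results shown in this post, including the leading blank that Pat Repeat picks up before the first phoneme.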
Line 10 creates the dictionary, an associative array. It's OK to create it after defining the pattern on Line 9; the pattern won't run until the next line.
pronounce = Associative Array();
Line 11:
rc = Pat Match( source, Pat Pos( 0 ) + Pat Repeat( (dataLine | commentLine) + Pat Fence() ) + Pat R Pos( 0 ) );
uses Pat Pos and Pat R Pos to make sure the entire string is matched. In between is another Pat Repeat that parses either a data line or a comment line, over and over, 100K+ times. The Pat Fence is an optimization that tells the pattern matcher not to save backtracking information. If the pattern failed near the end, backtracking would allow the matcher to back up and try different choices. For this file, that would be pointless, and the memory required to hold all the backtracking information would be large. The Pat Fence hint speeds it up.
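The effect of the two anchors is easy to check on a toy string. This is only an illustration of Pat Pos( 0 ) and Pat R Pos( 0 ); the string "abc" has nothing to do with the dictionary.

```jsl
// Anchored at the start only: "ab" is found at position 0, so this match succeeds.
Show( Pat Match( "abc", Pat Pos( 0 ) + "ab" ) );
// Anchored at both ends: after "ab" the matcher is not at the end of the string, so this match fails.
Show( Pat Match( "abc", Pat Pos( 0 ) + "ab" + Pat R Pos( 0 ) ) );
// Anchored at both ends with a pattern that covers the whole string: this match succeeds.
Show( Pat Match( "abc", Pat Pos( 0 ) + "abc" + Pat R Pos( 0 ) ) );
```

In the loader, the same pair of anchors is what turns a stray bad line anywhere in the file into an overall failure rather than a silent partial match.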
pronounce["ZEBRA"]
" Z IY1 B R AH0"
pronounce["AARDVARK"]
" AA1 R D V AA2 R K"
pronounce["SUPERCALIFRAGILISTICEXPEALIDOSHUS"]
" S UW2 P ER0 K AE2 L AH0 F R AE1 JH AH0 L IH2 S T IH0 K EH2 K S P IY0 AE2 L AH0 D OW1 SH AH0 S"
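Looking up a word that is not in the dictionary is worth guarding against. Below is a hedged sketch of a small lookup helper; the function name lookup and the "(not in dictionary)" text are made up here, and the one-entry array stands in for the fully loaded pronounce so the snippet runs on its own.

```jsl
// Stand-in for the loaded dictionary: one entry, stored with the leading blank as shown above.
pronounce = Associative Array();
pronounce["ZEBRA"] = " Z IY1 B R AH0";
// Hypothetical helper: uppercase the word to match the file's keys, then check before indexing.
lookup = Function( {w}, {key},
	key = Uppercase( w );
	If( Contains( pronounce, key ),
		Trim( pronounce[key] ),   // drop the leading blank
		"(not in dictionary)"
	)
);
Show( lookup( "zebra" ), lookup( "qwzx" ) );
```

Uppercasing the key is a choice made for this sketch because the entries shown in this post are all uppercase.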