The arpabet2ipa() function incorrectly places stress marks on the first vowel in a word instead of the vowel that has the stress digit (1 or 2) in the ARPABET input.
Root Cause
The bug occurs in the attach_tones_to_vowels() function when combined with how ARPABET stress markers are converted.
The problem:
- When translate_string() converts an ARPABET symbol such as ER1 (stressed "er" sound), it produces the tokens ['ɝ', 'ˈ'], with the stress marker AFTER the vowel in the list.
- The attach_tones_to_vowels() function searches backward (searchstep=-1) to find a vowel to attach each stress marker to.
- When it finds a stress marker (e.g., at position 8), it searches backward for a vowel.
- It encounters the vowel that the stress CAME FROM (e.g., ɝ at position 7), but since the stress marker is after it in the list, the backward search continues past it to find the previous vowel (e.g., æ at position 0).
- The stress gets attached to the wrong vowel.
Problematic Code
From phonecodes.py lines 48-62:
def attach_tones_to_vowels(il, tones, vowels, searchstep, catdir):
    """Return a copy of il, with each tone attached to nearest vowel if any.
    searchstep=1 means search for next vowel, searchstep=-1 means prev vowel.
    catdir>=0 means concatenate after vowel, catdir<0 means cat before vowel.
    Tones are not combined, except those also included in the vowels set.
    """
    ol = il.copy()
    v = 0 if searchstep > 0 else len(ol) - 1
    t = -1
    while 0 <= v and v < len(ol):
        if (ol[v] in vowels or (len(ol[v]) > 1 and ol[v][0] in vowels)) and t >= 0:
            ol[v] = ol[v] + ol[t] if catdir >= 0 else ol[t] + ol[v]
            ol = ol[0:t] + ol[(t + 1) :]  # Remove the tone
            t = -1  # Done with that tone
        if v < len(ol) and ol[v] in tones:
            t = v
        v += searchstep
    return ol
When searching backward (searchstep=-1):
- The algorithm finds a stress marker at some position t
- It continues decrementing v to find a vowel
- Bug: When the stress marker appears RIGHT AFTER its vowel in the token list, the vowel at position v = t - 1 is skipped, and the stress attaches to an earlier vowel
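The trace above can be checked in isolation by copying the quoted function and running it on a hand-built token list. This is only a minimal sketch: the token list, the tone set, and both vowel sets are assumptions made for illustration, since this report does not show the tables that arpabet2ipa() actually passes in. In this standalone run the marker attaches to ɝ when ɝ is in the assumed vowel set and falls through to æ when it is not, so the contents of the library's vowel table may be worth checking alongside the search direction.

```python
def attach_tones_to_vowels(il, tones, vowels, searchstep, catdir):
    """Copy of the function quoted in this report, so the sketch runs standalone."""
    ol = il.copy()
    v = 0 if searchstep > 0 else len(ol) - 1
    t = -1
    while 0 <= v and v < len(ol):
        if (ol[v] in vowels or (len(ol[v]) > 1 and ol[v][0] in vowels)) and t >= 0:
            ol[v] = ol[v] + ol[t] if catdir >= 0 else ol[t] + ol[v]
            ol = ol[0:t] + ol[(t + 1):]  # Remove the tone
            t = -1  # Done with that tone
        if v < len(ol) and ol[v] in tones:
            t = v
        v += searchstep
    return ol


# Hand-built tokens for "AE0 D V ER1 T" (assumed layout: marker right after its vowel).
tokens = ["æ", "d", "v", "ɝ", "ˈ", "t"]
TONES = {"ˈ", "ˌ"}  # assumed tone set

# Assumption A: the vowel set contains ɝ.
with_er = attach_tones_to_vowels(tokens, TONES, {"æ", "ɝ", "ə"}, -1, -1)
# Assumption B: the vowel set lacks ɝ.
without_er = attach_tones_to_vowels(tokens, TONES, {"æ", "ə"}, -1, -1)

print(" ".join(with_er))     # stress placement under assumption A
print(" ".join(without_er))  # stress placement under assumption B
```

Comparing the two printed lines against the actual output of arpabet2ipa() should help narrow down whether the token order, the search direction, or the vowel set is responsible.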
Minimal Reproducible Example
from phonecodes import phonecodes as pc

# Test case: ARPABET with stress on second vowel (ER1)
arpabet = "AE0 D V ER1 T AH0 Z M AH0 N T"
# Stress digits: AE0 (none), ER1 (PRIMARY!), AH0 (none), AH0 (none)
ipa = pc.arpabet2ipa(arpabet)
print(f"Input: {arpabet}")
print(f"Output: {ipa}")
print()

# Expected: stress on ɝ (from ER1)
# Actual: stress on æ (from AE0) ❌
tokens = ipa.split()
print("Token breakdown:")
for i, token in enumerate(tokens):
    stress_marker = " ← STRESS" if 'ˈ' in token or 'ˌ' in token else ""
    print(f" {i}: {token}{stress_marker}")
Output:
Input: AE0 D V ER1 T AH0 Z M AH0 N T
Output: ˈæ d v ɝ t ə z m ə n t
Token breakdown:
0: ˈæ ← STRESS
1: d
2: v
3: ɝ
4: t
5: ə
...
Expected output: æ d v ˈɝ t ə z m ə n t (stress on ɝ, the vowel from ER1)
Actual output: ˈæ d v ɝ t ə z m ə n t (stress on æ, the vowel from AE0)
Additional Test Cases
# Works correctly with AE1 (stress on first vowel)
print(pc.arpabet2ipa("AE1 D V ER0 T"))
# Output: ˈæ d v ɚ t ✅ Correct!
# With two stresses, the first attaches correctly but the second does not
print(pc.arpabet2ipa("AE1 D V ER1 T"))
# Output: ˈæ d v ɝˈ t
# Note: the second stress appears AFTER ɝ, showing the order issue
# Fails with single stress on later vowel
print(pc.arpabet2ipa("AE0 ER1 T"))
# Output: ˈæ ɝ t ❌ Wrong! Should be: æ ˈɝ t
Impact
This bug affects any ARPABET conversion where:
- The word has exactly one stress marker (most common case)
- The stress is NOT on the first vowel
Users relying on this library for CMUDict→IPA conversion will therefore get incorrect stress placement for every word that meets both conditions.
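To gauge how many entries in a pronunciation list would be hit, the two conditions can be written as a predicate over an ARPABET string. This is a minimal sketch: the name matches_reported_pattern is hypothetical and not part of phonecodes, and it assumes CMUdict-style symbols in which only vowels end in a stress digit (0, 1, or 2).

```python
import re

# ARPABET vowels carry a trailing stress digit: 0 (none), 1 (primary), 2 (secondary).
VOWEL_WITH_STRESS = re.compile(r"^[A-Z]+[012]$")


def matches_reported_pattern(pron: str) -> bool:
    """True if pron has exactly one stressed vowel and it is not the first vowel."""
    # Keep only the stress digit of each vowel, in order of appearance.
    digits = [sym[-1] for sym in pron.split() if VOWEL_WITH_STRESS.match(sym)]
    # Indices (among vowels) that carry primary or secondary stress.
    stressed = [i for i, d in enumerate(digits) if d in "12"]
    return len(stressed) == 1 and stressed[0] > 0


for pron in ["AE0 D V ER1 T AH0 Z M AH0 N T", "AE1 D V ER0 T", "AE0 ER1 T"]:
    print(pron, "->", matches_reported_pattern(pron))
```

Running the predicate over a full dictionary and counting the True results would give a concrete figure for the share of affected words.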
Expected Behavior
Stress markers should be placed on the vowel that has the stress digit (1 or 2) in the original ARPABET, not redistributed to other vowels.
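The expected behavior can be stated as a small reference conversion that decides each vowel's stress mark from its own digit, so no search for a neighboring vowel is needed. This is a sketch for comparison only, not a proposed patch: the function name reference_arpabet2ipa and the tables below are hypothetical, and the tables cover only the symbols used in this report rather than the library's full mapping.

```python
# Illustrative tables limited to the symbols appearing in this report.
# They are NOT the library's tables.
VOWELS = {"AE0": "æ", "AE1": "æ", "ER0": "ɚ", "ER1": "ɝ", "AH0": "ə"}
CONSONANTS = {"D": "d", "V": "v", "T": "t", "Z": "z", "M": "m", "N": "n"}
STRESS_MARK = {"0": "", "1": "ˈ", "2": "ˌ"}


def reference_arpabet2ipa(pron: str) -> str:
    """Convert symbol by symbol, prefixing each vowel with its own stress mark."""
    out = []
    for sym in pron.split():
        if sym in VOWELS:
            # The stress digit is the last character of the vowel symbol.
            out.append(STRESS_MARK[sym[-1]] + VOWELS[sym])
        else:
            out.append(CONSONANTS[sym])
    return " ".join(out)


print(reference_arpabet2ipa("AE0 D V ER1 T AH0 Z M AH0 N T"))
print(reference_arpabet2ipa("AE0 ER1 T"))
```

The strings it returns for the inputs in this report can serve as expected values in a regression test for arpabet2ipa().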