feat: Frequency-ranked daily words and supplements for 38 languages #118
Conversation
Use FrequencyWords (OpenSubtitles frequency data) to generate:

- `daily_words.txt`: top 2,000 most common words from existing word lists
- `supplement.txt`: thousands of additional valid 5-letter guesses

This addresses the core UX problem: 53% of all guesses across all languages are rejected as invalid. Languages like Italian (120% invalid rate), French (75%), and Spanish (67%) should see dramatic improvement.

Languages processed: ar, br, ca, cs, da, de, el, eo, es, et, eu, fa, fr, gl, he, hr, hu, hy, hyw, is, it, ka, lt, lv, mk, nb, nn, nl, pt, ro, ru, sk, sl, sr, sv, tr, uk, vi

Excluded (already high quality): en, fi, pl, bg, ko

Not covered (no frequency data): az, ckb, fo, fur, fy, ga, gd, ia, ie, lb, ltg, mi, mn, nds, ne, oc, pau, qya, rw, tk, tlh
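The selection step described above can be sketched as follows. This is a minimal illustration, not the actual script: it assumes FrequencyWords' plain-text format of `word count` lines sorted by descending frequency, and the function name `select_daily_words` is hypothetical.

```python
def select_daily_words(
    freq_lines: list[str], main_words: set[str], limit: int = 2000
) -> list[str]:
    """Pick the top `limit` 5-letter words, by corpus frequency, that
    already exist in the language's main word list."""
    daily: list[str] = []
    seen: set[str] = set()
    for line in freq_lines:  # assumed format: "word count", most frequent first
        if not line.strip():
            continue
        word = line.split()[0].lower()
        if len(word) == 5 and word in main_words and word not in seen:
            seen.add(word)
            daily.append(word)
            if len(daily) >= limit:
                break
    return daily
```

Because candidates are drawn only from the existing main list, every daily word is guaranteed to be a valid guess; the supplement list is the separate, additive step.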
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds a new CLI script that ingests FrequencyWords and wordfreq data to generate per-language daily and supplement 5-letter lists; introduces generated language data and SOURCES, test helpers and new daily-word tests, keyboard/blocklist updates, a common-names filter, README edits, and a .gitignore entry for the FrequencyWords cache.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User / CLI
    participant Script as improve_word_lists.py
    participant FS as File System
    participant Repo as FrequencyWords (GitHub)
    participant Wordfreq as wordfreq corpus
    User->>Script: run download / process / batch
    Script->>FS: check cache scripts/.freq_data/
    alt cache miss
        Script->>Repo: fetch selected language frequency files
        Repo-->>FS: write frequency files
    end
    Script->>FS: load language chars, main word list, blocklist, common names, existing supplement
    Script->>Wordfreq: query/extract candidate words
    Script->>Script: filter valid 5-letter words, remove blocklist/names/roman numerals, score by frequency
    Script->>FS: write {lang}_daily_words.txt, {lang}_5words_supplement.txt, create/patch SOURCES.md
    Script-->>User: print summary (counts, matches, stats)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 12
🧹 Nitpick comments (3)
scripts/improve_word_lists.py (2)
262-293: Template hardcodes "wooorm/dictionaries (Hunspell)" as the base source for all languages.

Not all languages source from wooorm/dictionaries. Consider adding a per-language source map, or at minimum noting in the template that the source should be verified. Also, `write_text` on line 292 should use `encoding="utf-8"`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/improve_word_lists.py` around lines 262 - 293, The SOURCES.md template currently hardcodes "wooorm/dictionaries (Hunspell)" for every language and writes with write_text(sources_md) without an explicit encoding; update the generation to (1) look up a per-language source via a new mapping (e.g., SOURCES_MAP) using LANG_NAMES.get(lang, lang) as fallback and include that source string in sources_md (or if missing, include a clear "verify source for this language" note), and (2) call sources_path.write_text(sources_md, encoding="utf-8") to ensure proper encoding when writing the file; change references in the block that build sources_md and the write_text call accordingly (use sources_path, LANG_NAMES.get(lang, lang), and sources_md to locate where to modify).
311-327: Ruff S607: `subprocess.run` with partial executable path `"git"`.

This is flagged by static analysis but is standard practice for dev/build scripts. If you want to suppress the warning, you could add a `# noqa: S607` comment, or resolve `git` via `shutil.which("git")` with a clear error if not found.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/improve_word_lists.py` around lines 311 - 327, The subprocess.run calls use the bare "git" executable which triggers S607; modify the script to resolve git via shutil.which("git") (and raise a clear error if None) and use that absolute path in the two subprocess.run invocations that clone and run sparse-checkout (the calls referencing target, repo_dir and the "git" args), or alternatively append a "# noqa: S607" on those subprocess.run lines if you prefer to suppress the warning; ensure both occurrences are updated consistently.

tests/test_word_lists.py (1)
283-296: `test_supplement_disjoint_from_main` doesn't test daily words and belongs in `TestWordListBasics`.

This test checks `supplement ∩ main = ∅` and never references daily words. Placing it in `TestDailyWords` is misleading and makes the class semantics unclear. It should either live in `TestWordListBasics` alongside the other word-list invariant tests, or be renamed/reworked to test `daily ∩ supplement = ∅` (which follows transitively but is worth an explicit assertion).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_word_lists.py` around lines 283 - 296, The test test_supplement_disjoint_from_main is in the wrong test class (TestDailyWords) because it asserts supplement ∩ main = ∅ and never touches daily words; move this test method into TestWordListBasics (where other word-list invariants live) or alternatively change its assertion to explicitly test daily ∩ supplement = ∅ using load_daily (so it actually exercises daily semantics). Update the class container for test_supplement_disjoint_from_main (or rename/rewrite the test to reference load_daily) so the test name and class accurately reflect what is being validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/improve_word_lists.py`:
- Around line 62-83: The three loader functions load_characters, load_word_list,
and load_existing_supplement use path.read_text() without specifying an encoding
which can break on Windows locales; update each to read text with UTF-8 (e.g.,
path.read_text(encoding="utf-8") or open the file with encoding="utf-8") so they
match load_frequency_data and reliably handle diacritics and non-ASCII
characters.
- Around line 247-260: The overwrite guard only checks daily_path so
supplement_path can be overwritten silently; update the guard in the block that
checks overwrite to also check supplement_path.exists() and skip/return (set
result["status"]="skipped" and reason) if either path exists and overwrite is
False; when writing files, call daily_path.write_text(..., encoding="utf-8") and
supplement_path.write_text(..., encoding="utf-8") to match the read path; refer
to the variables/functions daily_path, supplement_path, overwrite, write_text,
daily_words and supplement_sorted to locate the changes.
In `@tests/conftest.py`:
- Around line 53-59: The test helper load_daily_words currently reads lines
verbatim which diverges from production: update the function to ignore lines
that start with '#' (comment lines) and normalize each word to lowercase using
.lower() before returning; locate the function load_daily_words in
tests/conftest.py and change the list comprehension/filtering so it skips blank
lines and lines where stripped value startswith '#' and returns each word as
stripped_value.lower().
In `@tests/test_word_lists.py`:
- Around line 235-296: Run Black to reformat this test file so CI passes: apply
the project's formatting (e.g., black --line-length 100 tests/test_word_lists.py
or black webapp/ tests/) and commit the resulting changes; this will adjust
formatting in class TestDailyWords and its methods
(test_daily_words_subset_of_main, test_daily_words_no_duplicates,
test_daily_words_are_5_letters, test_supplement_disjoint_from_main) and preserve
the SUPPLEMENT_OVERLAP_XFAIL constant while ensuring the file matches the
repository's Black style.
- Line 239: Annotate the class-level mutable set SUPPLEMENT_OVERLAP_XFAIL with
ClassVar to satisfy Ruff RUF012: add an import for ClassVar (and Set if using
typing.Set) and change the declaration to something like
SUPPLEMENT_OVERLAP_XFAIL: ClassVar[Set[str]] = {"pl", "ckb"} (or
ClassVar[set[str]] if using Py3.9+ built-ins) so the attribute is explicitly a
class variable and the linter stops flagging it.
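The `ClassVar` fix suggested above is a one-line annotation change; a minimal sketch (the set contents are taken from the review comment):

```python
from typing import ClassVar


class TestDailyWords:
    # Ruff RUF012 flags bare mutable class attributes; ClassVar makes explicit
    # that this set is shared class state, not a per-instance default.
    SUPPLEMENT_OVERLAP_XFAIL: ClassVar[set[str]] = {"pl", "ckb"}
```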
In `@webapp/data/languages/ar/SOURCES.md`:
- Around line 5-7: The SOURCES.md template generation in improve_word_lists.py
currently inserts a wooorm/dictionaries (Hunspell) attribution for all
languages; update the template generation code that produces SOURCES.md to
either (a) pull a per-language source from the language metadata (e.g., a
language-specific dict or JSON entry) when available, or (b) explicitly mark the
attribution as a default placeholder (e.g., "Default placeholder: verify
source") when no per-language source exists; locate the template/rendering block
in improve_word_lists.py responsible for creating SOURCES.md and modify it to
prefer per-language source data and fall back to the placeholder text.
In `@webapp/data/languages/br/br_daily_words.txt`:
- Around line 240-259: The br_daily_words.txt contains many pure Roman numerals
(letters m,d,c,l,x,v,i) that should be removed; update
scripts/improve_word_lists.py to add a post-processing step that filters out any
candidate word where every character is in the set {m,d,c,l,x,v,i} and that
matches a Roman-numeral regex (e.g.
^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$ case-insensitive),
removing those entries before writing/regenerating br_daily_words.txt and ensure
the step runs as the final cleanup when building the daily-answer pool.
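The Roman-numeral filter proposed above can be sketched like this; it uses the exact regex from the review comment, while the helper names `is_roman_numeral` and `drop_roman_numerals` are hypothetical:

```python
import re

# Strict Roman-numeral matcher from the review comment (case-insensitive).
ROMAN_RE = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)


def is_roman_numeral(word: str) -> bool:
    # The regex matches the empty string, so require at least one character.
    return bool(word) and ROMAN_RE.match(word) is not None


def drop_roman_numerals(words: list[str]) -> list[str]:
    """Final cleanup step: remove entries like 'xviii' before writing lists."""
    return [w for w in words if not is_roman_numeral(w)]
```

Checking that every character is in `{m,d,c,l,x,v,i}` is implied by the regex itself, so a single pattern suffices.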
In `@webapp/data/languages/ca/ca_daily_words.txt`:
- Line 1959: The word list contains Roman numerals like "xviii" that should be
excluded from the daily-words pool; update the corpus filtering logic (e.g., in
functions that load or select daily words such as loadFrequencyCorpus,
generateDailyWords, or selectDailyWord) to detect and remove Roman-numeral
patterns before selection — implement a regex-based filter for valid Catalan
words (reject strings matching Roman numeral regex like ^[ivxlcdmIVXLCDM]+$ and
variants with punctuation) and apply it where the ca_daily_words.txt entries are
normalized so numerals are never offered as daily answers.
- Around line 430-431: The list contains both variants "danès" and the
nonstandard "danés", which causes mismatches; remove the acute-accent entry
"danés" from the daily-answer candidates (or replace/alias it so only the
standard Central Catalan form "danès" is accepted) to ensure only the normative
form remains in the candidate pool and prevent rejection of correct guesses.
In `@webapp/data/languages/da/da_daily_words.txt`:
- Line 351: The daily words file contains offensive entries that bypass
blocklist filtering because _get_daily_word calls
get_daily_word_consistent_hash(self.daily_words, set(), ...) with an empty
blocklist; to fix, either (A) remove the offensive tokens from the
da/daily_words.txt source (remove entries like "dildo", "fisse", "kusse",
"pisse", "bitch"), or (B) modify _get_daily_word to apply the app's blocklist by
passing the actual blocklist set (instead of set()) into
get_daily_word_consistent_hash or by filtering self.daily_words with the
blocklist before selection; update load_daily_words/_get_daily_word to ensure
daily_words are filtered against the same blocklist used for the main word list.
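Option (B) above — filtering daily words against the language blocklist at load time — might look like the following sketch. The loader name, file-layout assumptions (`{lang}/{lang}_daily_words.txt`, `{lang}/{lang}_blocklist.txt`), and signature are illustrative, not the app's actual API:

```python
from pathlib import Path


def load_daily_words(lang: str, data_dir: Path) -> list[str]:
    """Load daily words with the same blocklist filtering as the main list."""
    daily_path = data_dir / lang / f"{lang}_daily_words.txt"
    block_path = data_dir / lang / f"{lang}_blocklist.txt"

    blocklist: set[str] = set()
    if block_path.exists():
        blocklist = {
            line.strip().lower()
            for line in block_path.read_text(encoding="utf-8").splitlines()
            if line.strip()
        }

    words: list[str] = []
    for line in daily_path.read_text(encoding="utf-8").splitlines():
        w = line.strip().lower()
        if w and not w.startswith("#"):  # skip blanks and comment lines
            words.append(w)

    return [w for w in words if w not in blocklist]
```

Applying the filter in the loader means the selector never needs to be handed a blocklist at all, which removes the `set()` footgun entirely.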
In `@webapp/data/languages/de/de_daily_words.txt`:
- Line 100: The daily-word pipeline allows de_daily_words.txt entries to bypass
blocklist filtering because app.py constructs/passes an empty blocklist (set())
into the daily-word selector; remove offensive entries from de_daily_words.txt
or, preferably, enforce blocklist filtering by applying de_blocklist.txt when
loading/selecting daily words—update the function that reads/selects daily words
(the daily-word loader/selector called in app.py where blocklist is passed) to
merge de_blocklist.txt into the blocklist instead of using set(), and filter out
any word present in de_blocklist.txt (e.g., apply a simple set difference or
contains check before accepting a daily word).
In `@webapp/data/languages/el/el_5words_supplement.txt`:
- Around line 312-315: Lines contain 15 Greek words using the non-final sigma
character 'σ' (U+03C3) instead of the final sigma 'ς' (U+03C2); replace all
occurrences for the listed words (e.g., εκτος/εκτοσ -> εκτός,
καλως/καλωσ -> καλώς, μερες/μερεσ -> μέρες, μερος/μεροσ -> μέρος, μηνες/μηνεσ ->
μήνες, τελος/τελοσ -> τέλος, χωρις/χωρισ -> χωρίς and the other 8 single
violations) so they use final sigma 'ς' and delete the 7 duplicate entries that
are the same word differing only by σ vs ς (the pairs starting with εκτος/εκτοσ,
καλως/καλωσ, μερες/μερεσ, μερος/μεροσ, μηνες/μηνεσ, τελος/τελοσ, χωρις/χωρισ);
ensure all 15 affected lines are normalized to the correct final-sigma spelling
(U+03C2).
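The final-sigma normalization plus de-duplication described above can be sketched as two small helpers (names hypothetical):

```python
FINAL_SIGMA = "\u03c2"      # ς
NON_FINAL_SIGMA = "\u03c3"  # σ


def normalize_final_sigma(word: str) -> str:
    """Greek words ending in non-final sigma σ should end in final sigma ς."""
    if word.endswith(NON_FINAL_SIGMA):
        return word[:-1] + FINAL_SIGMA
    return word


def dedupe_normalized(words: list[str]) -> list[str]:
    """Normalize every word, then drop entries that differ only by σ vs ς."""
    seen: set[str] = set()
    out: list[str] = []
    for w in words:
        n = normalize_final_sigma(w)
        if n not in seen:
            seen.add(n)
            out.append(n)
    return out
```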
---
Nitpick comments:
In `@scripts/improve_word_lists.py`:
- Around line 262-293: The SOURCES.md template currently hardcodes
"wooorm/dictionaries (Hunspell)" for every language and writes with
write_text(sources_md) without an explicit encoding; update the generation to
(1) look up a per-language source via a new mapping (e.g., SOURCES_MAP) using
LANG_NAMES.get(lang, lang) as fallback and include that source string in
sources_md (or if missing, include a clear "verify source for this language"
note), and (2) call sources_path.write_text(sources_md, encoding="utf-8") to
ensure proper encoding when writing the file; change references in the block
that build sources_md and the write_text call accordingly (use sources_path,
LANG_NAMES.get(lang, lang), and sources_md to locate where to modify).
- Around line 311-327: The subprocess.run calls use the bare "git" executable
which triggers S607; modify the script to resolve git via shutil.which("git")
(and raise a clear error if None) and use that absolute path in the two
subprocess.run invocations that clone and run sparse-checkout (the calls
referencing target, repo_dir and the "git" args), or alternatively append a "#
noqa: S607" on those subprocess.run lines if you prefer to suppress the warning;
ensure both occurrences are updated consistently.
In `@tests/test_word_lists.py`:
- Around line 283-296: The test test_supplement_disjoint_from_main is in the
wrong test class (TestDailyWords) because it asserts supplement ∩ main = ∅ and
never touches daily words; move this test method into TestWordListBasics (where
other word-list invariants live) or alternatively change its assertion to
explicitly test daily ∩ supplement = ∅ using load_daily (so it actually
exercises daily semantics). Update the class container for
test_supplement_disjoint_from_main (or rename/rewrite the test to reference
load_daily) so the test name and class accurately reflect what is being
validated.
```
aroma
arosa
array
arsch
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, verify the file exists and check the specific lines mentioned
if [ -f webapp/data/languages/de/de_daily_words.txt ]; then
  echo "=== File exists. Checking specified lines ===";
  echo "Line 100 (arsch):";
  sed -n '100p' webapp/data/languages/de/de_daily_words.txt;
  echo "Line 535 (fotze):";
  sed -n '535p' webapp/data/languages/de/de_daily_words.txt;
  echo "Line 1229 (nsdap):";
  sed -n '1229p' webapp/data/languages/de/de_daily_words.txt;
  echo "Line 1231 (nutte):";
  sed -n '1231p' webapp/data/languages/de/de_daily_words.txt;
else
  echo "File not found";
fi
```

Repository: Hugo0/wordle
Length of output: 199
🏁 Script executed:
```shell
# Check app.py for load_daily_words function and blocklist filtering
echo "=== Searching for load_daily_words and filtering logic in app.py ===" && \
rg -n "load_daily_words|blocklist|block_list|filter" webapp/app.py -A 5 -B 2
```

Repository: Hugo0/wordle
Length of output: 6567
🏁 Script executed:
```shell
# Search for any blocklist files
echo "=== Looking for blocklist files ===" && \
find webapp -type f \( -name "*blocklist*" -o -name "*block_list*" \) 2>/dev/null
```

Repository: Hugo0/wordle
Length of output: 1057
🏁 Script executed:
```shell
# Check the load_daily_words function implementation
echo "=== Searching for load_daily_words function definition ===" && \
ast-grep --pattern 'function load_daily_words($_) {
  $$$
}'
```

Repository: Hugo0/wordle
Length of output: 114
🏁 Script executed:
```shell
# Check if the offensive words are in the German blocklist
echo "=== Checking de_blocklist.txt for offensive words ===" && \
for word in "arsch" "fotze" "nsdap" "nutte"; do
  echo "Checking '$word':";
  grep -c "^${word}$" webapp/data/languages/de/de_blocklist.txt && echo "  FOUND in blocklist" || echo "  NOT in blocklist";
done
```

Repository: Hugo0/wordle
Length of output: 265
🏁 Script executed:
```shell
# Count words in both files to understand the scope
echo "=== File sizes ===" && \
wc -l webapp/data/languages/de/de_daily_words.txt && \
wc -l webapp/data/languages/de/de_blocklist.txt
```

Repository: Hugo0/wordle
Length of output: 168
🏁 Script executed:
```shell
# Check lines around the get_daily_word_consistent_hash call with daily_words
echo "=== Detailed view of daily_words priority handling ===" && \
sed -n '615,630p' webapp/app.py
```

Repository: Hugo0/wordle
Length of output: 711
Offensive/vulgar words in the daily answer pool with no blocklist protection
This file contains offensive German words—arsch (line 100), fotze (line 535), nutte (line 1231)—and sensitive content like nsdap (line 1229, Nazi party abbreviation). These words are not listed in de_blocklist.txt and will become valid daily answers without remediation.
The issue is architectural: when de_daily_words.txt exists, it bypasses all blocklist filtering. The code at lines 619–622 of app.py passes an empty blocklist (set()) to the daily word selector, deliberately excluding the protection applied to fallback word lists (line 624).
Either remove offensive entries from this file or apply downstream blocklist filtering to daily words before deployment.
Also applies to: 535-535, 1231-1231
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@webapp/data/languages/de/de_daily_words.txt` at line 100, The daily-word
pipeline allows de_daily_words.txt entries to bypass blocklist filtering because
app.py constructs/passes an empty blocklist (set()) into the daily-word
selector; remove offensive entries from de_daily_words.txt or, preferably,
enforce blocklist filtering by applying de_blocklist.txt when loading/selecting
daily words—update the function that reads/selects daily words (the daily-word
loader/selector called in app.py where blocklist is passed) to merge
de_blocklist.txt into the blocklist instead of using set(), and filter out any
word present in de_blocklist.txt (e.g., apply a simple set difference or
contains check before accepting a daily word).
Addressed — offensive words (huren, neger, negro) have been added to the German blocklist and removed from daily word rotation.
Rate Limit Exceeded
@Hugo0 has exceeded the limit for the number of chat messages per hour. Please wait 1 minute and 55 seconds before sending another message.
…DME cleanup

- Fix Black formatting on test_word_lists.py (CI blocker)
- Add explicit encoding="utf-8" to all read_text()/write_text() calls
- Overwrite guard now checks both daily_words and supplement paths
- load_daily_words test helper now matches production (comment skip, .lower())
- Clean up README: add TODO section, update data sources, streamline intro
Actionable comments posted: 3
🧹 Nitpick comments (4)
tests/test_word_lists.py (1)
280-293: `test_supplement_disjoint_from_main` is semantically misplaced in `TestDailyWords`.

This test validates the supplement-vs-main relationship, not anything about daily words. Placing it (along with `SUPPLEMENT_OVERLAP_XFAIL`) in `TestDailyWords` will confuse contributors looking for supplement tests in `TestWordListBasics`. Consider moving it there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_word_lists.py` around lines 280 - 293, The test test_supplement_disjoint_from_main and its related constant SUPPLEMENT_OVERLAP_XFAIL are misplaced in TestDailyWords; move both into the TestWordListBasics test class (or the module scope where other supplement/main list tests live) so the test lives with other word-list validation tests; update imports/references if necessary (e.g., ensure load_supplement_words and load_word_list are available) and run the test suite to confirm no name collisions or xfail-scoping issues after relocating them.

scripts/improve_word_lists.py (3)
280-284: Redundant `is_valid_word` call in supplement comprehension.

`valid_freq` was already filtered by `is_valid_word(w, char_set)` at line 229, so every key is already a valid word. The second check in the set comprehension is dead code.

♻️ Proposed fix
```diff
-    new_supplement = {
-        w for w in valid_freq if w not in existing_word_set and is_valid_word(w, char_set)
-    }
+    new_supplement = {w for w in valid_freq if w not in existing_word_set}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/improve_word_lists.py` around lines 280 - 284, The set comprehension building new_supplement redundantly re-checks is_valid_word for each w even though valid_freq already contains only words that passed is_valid_word(char_set) earlier; update the new_supplement expression to simply include w for w in valid_freq if w not in existing_word_set (remove the is_valid_word(w, char_set) predicate) so the dead code is eliminated and behavior remains the same.
262-264: Use `raise` instead of `assert` for the daily-words safety invariant.

`assert` is a no-op when Python runs with `-O` (`python -O scripts/improve_word_lists.py`), silently allowing corrupt output to be written. Swap to an explicit `raise` so the guard is always active.

🛡️ Proposed fix
```diff
-    assert not invalid_daily, f"BUG: daily words not in _5words.txt: {invalid_daily[:5]}"
+    if invalid_daily:
+        raise ValueError(f"BUG: daily words not in _5words.txt: {invalid_daily[:5]}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/improve_word_lists.py` around lines 262 - 264, Replace the assert-based guard that checks daily_words against existing_word_set with an explicit exception raise so it cannot be disabled by Python optimization: compute invalid_daily (as already done), and if invalid_daily is truthy, raise a RuntimeError (or ValueError) with the same informative message (including invalid_daily[:5]) instead of using assert; update the check around invalid_daily, daily_words, and existing_word_set accordingly so the script always fails loudly on this invariant violation.
381-399: Optional: resolve S607 partial-executable-path lint warning with `shutil.which`.

Both `subprocess.run` calls use the bare `"git"` name. For maximum reliability, use a fully qualified path for the executable; to search for an unqualified name on `PATH`, use `shutil.which()`. For a developer-only script the real-world risk is low, but this silences the Ruff S607 flag.

🔧 Proposed fix
```diff
+import shutil
+
+GIT = shutil.which("git") or "git"
+
 def download_frequency_words():
     ...
     subprocess.run(
         [
-            "git",
+            GIT,
             "clone",
             ...
         ],
         ...
     )
     subprocess.run(
-        ["git", "sparse-checkout", "set", "content/2018"],
+        [GIT, "sparse-checkout", "set", "content/2018"],
         ...
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/improve_word_lists.py` around lines 381 - 399, The subprocess.run calls that currently use the bare "git" string (the clone call and the sparse-checkout call) should use a resolved executable path via shutil.which("git") instead: import shutil, call shutil.which("git") once (e.g., git_path), assert or raise a clear error if it returns None, and pass git_path as the first element of the args lists for the two subprocess.run invocations (keeping the existing arguments and check=True); update both occurrences that reference "git" (the clone invocation and the ["git", "sparse-checkout", "set", "content/2018"] invocation) to use the resolved git_path variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Line 16: The "Adding a new language" heading is currently an h3 (###) while
the previous heading is h1 ("# Wordle Global"), violating markdownlint MD001;
change the heading "### Adding a new language" to an h2 ("## Adding a new
language") so headings increment by one level (or alternatively insert an
intervening h2 if that fits the structure) to satisfy the rule.
In `@scripts/improve_word_lists.py`:
- Around line 536-542: The current handler for args.command == "process" only
treats result["status"] == "error" as a failure, so skipped/unknown languages
with result["status"] == "skipped" still exit 0; update the check after calling
process_language (the call and result variable) to treat any non-"ok" status as
a failure (or explicitly check for "skipped" and "error") and print the reason
to stderr and sys.exit(1); reference the process_language call and the
result["status"] field when making this change so automation receives a non-zero
exit code for skipped/unknown languages.
- Around line 192-194: The exclusion reason is hardcoded to "excluded (already
high quality)" which is incorrect for some languages (notably "ko" which is
excluded because FrequencyWords uses syllable blocks); change EXCLUDE from a set
to a mapping (e.g., EXCLUDE = {"en": "already high quality", "ko": "uses
syllable blocks", ...}) and update the exclusion branch that sets
result["status"] and result["reason"] to look up EXCLUDE[lang] (falling back to
a generic message if missing); also remove "ko" from the priority list (or
ensure the priority list only contains languages not present in EXCLUDE) so it
does not always print as skipped.
---
Duplicate comments:
In `@tests/test_word_lists.py`:
- Line 239: The class-level mutable set SUPPLEMENT_OVERLAP_XFAIL should be
annotated with ClassVar to indicate it's not an instance attribute; update its
declaration to include a ClassVar type (e.g., SUPPLEMENT_OVERLAP_XFAIL:
ClassVar[set[str]] = {"pl", "ckb"}) and ensure ClassVar (and typing Set/str
generics if your codebase prefers) is imported from typing at the top of
tests/test_word_lists.py so Ruff RUF012 is silenced.
---
Nitpick comments:
In `@scripts/improve_word_lists.py`:
- Around line 280-284: The set comprehension building new_supplement redundantly
re-checks is_valid_word for each w even though valid_freq already contains only
words that passed is_valid_word(char_set) earlier; update the new_supplement
expression to simply include w for w in valid_freq if w not in existing_word_set
(remove the is_valid_word(w, char_set) predicate) so the dead code is eliminated
and behavior remains the same.
- Around line 262-264: Replace the assert-based guard that checks daily_words
against existing_word_set with an explicit exception raise so it cannot be
disabled by Python optimization: compute invalid_daily (as already done), and if
invalid_daily is truthy, raise a RuntimeError (or ValueError) with the same
informative message (including invalid_daily[:5]) instead of using assert;
update the check around invalid_daily, daily_words, and existing_word_set
accordingly so the script always fails loudly on this invariant violation.
- Around line 381-399: The subprocess.run calls that currently use the bare
"git" string (the clone call and the sparse-checkout call) should use a resolved
executable path via shutil.which("git") instead: import shutil, call
shutil.which("git") once (e.g., git_path), assert or raise a clear error if it
returns None, and pass git_path as the first element of the args lists for the
two subprocess.run invocations (keeping the existing arguments and check=True);
update both occurrences that reference "git" (the clone invocation and the
["git", "sparse-checkout", "set", "content/2018"] invocation) to use the
resolved git_path variable.
In `@tests/test_word_lists.py`:
- Around line 280-293: The test test_supplement_disjoint_from_main and its
related constant SUPPLEMENT_OVERLAP_XFAIL are misplaced in TestDailyWords; move
both into the TestWordListBasics test class (or the module scope where other
supplement/main list tests live) so the test lives with other word-list
validation tests; update imports/references if necessary (e.g., ensure
load_supplement_words and load_word_list are available) and run the test suite
to confirm no name collisions or xfail-scoping issues after relocating them.
Pipeline improvements:
- Filter blocklisted words from daily word rotation (Hebrew: 88% overlap → 0%)
- Filter Roman numerals from daily words and supplements (Breton: 107 → 0)
- Filter common proper nouns (387 international names) from daily words
- Normalize Greek final sigma (σ→ς) in supplements with de-dup

Data fixes:
- German blocklist: add arsch, fotze, nsdap
- Greek supplement: fix 15 words with wrong final sigma

Keyboard layouts for 6 languages:
- Macedonian (mk): Cyrillic layout based on Serbian
- Armenian (hy, hyw): typewriter + phonetic layouts
- Esperanto (eo): QWERTY with accented letters
- Basque (eu): QWERTY with ñ, ç
- Breton (br): AZERTY with ê, ñ, ù

Tests: add TestDailyWordQuality (blocklist overlap + Roman numeral checks)

Cleanup: remove dead code in app.py

Regenerate daily words and supplements for all 38 languages
Actionable comments posted: 2
🧹 Nitpick comments (5)
tests/conftest.py (2)
62-64: `line.strip()` evaluated up to three times per iteration in both helpers.

Each list/set comprehension calls `line.strip()` once for the empty-check, once for the `startswith` guard, and once for the result value. A walrus operator resolves this at no cost to readability:

♻️ Proposed refactor
```diff
-    return [
-        line.strip().lower() for line in f if line.strip() and not line.strip().startswith("#")
-    ]
+    return [
+        stripped.lower()
+        for line in f
+        if (stripped := line.strip()) and not stripped.startswith("#")
+    ]
```

```diff
-    return {
-        line.strip().lower() for line in f if line.strip() and not line.strip().startswith("#")
-    }
+    return {
+        stripped.lower()
+        for line in f
+        if (stripped := line.strip()) and not stripped.startswith("#")
+    }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/conftest.py` around lines 62 - 64, The list/set comprehensions in the helpers repeatedly call line.strip() multiple times per iteration; change each comprehension to compute the stripped line once using the walrus operator (e.g., assign stripped = line.strip() in the comprehension's if/for expressions) and then use that single variable for the empty-check, startswith guard, and the lowercased/result value so you avoid redundant strip calls in the helper that returns the list (lines around the shown comprehension) and the analogous helper at lines 73-75.
35-41: `load_word_list` does not lowercase, creating a silent asymmetry with `load_daily_words`.

New `TestDailyWords` tests likely assert that every daily word exists in the main word list (e.g., `word in set(load_word_list(lang))`). `load_daily_words` returns lowercased words; `load_word_list` does not normalize case. If a future `_5words.txt` file contains any uppercase entry, the subset check would produce a false failure. Applying the same `.lower()` normalization to `load_word_list` (and `load_supplement_words`) keeps the helpers consistent and test results trustworthy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/conftest.py` around lines 35 - 41, load_word_list currently returns words without lowercasing, causing inconsistency with load_daily_words which lowercases entries; update load_word_list to normalize each returned word with .lower() (and apply the same normalization to load_supplement_words) so all helper loaders consistently return lowercase strings, ensuring subset checks like in TestDailyWords behave deterministically; locate and modify the functions load_word_list and load_supplement_words to map .lower() over stripped lines before returning.scripts/improve_word_lists.py (2)
355-359: Redundant `is_valid_word` check.

`valid_freq` is already built from `freq_data` filtered by `is_valid_word` (line 283), so the second `is_valid_word(w, char_set)` guard on line 358 is always true for any `w in valid_freq`.

♻️ Proposed simplification

```diff
 new_supplement = {
-    w for w in valid_freq if w not in existing_word_set and is_valid_word(w, char_set)
+    w for w in valid_freq if w not in existing_word_set
 }
```
Verify each finding against the current code and only fix it if needed. In `@scripts/improve_word_lists.py` around lines 355 - 359, The set comprehension building new_supplement redundantly re-checks is_valid_word for items already filtered into valid_freq; update the comprehension to only include words from valid_freq that are not in existing_word_set (remove the is_valid_word(w, char_set) condition) so new_supplement is computed as {w for w in valid_freq if w not in existing_word_set}; keep the variable names new_supplement, valid_freq, existing_word_set unchanged.
466-484: Consider adding a `timeout` to `subprocess.run` calls.

These `git` operations involve network I/O and could hang indefinitely on flaky connections. A timeout (e.g., `timeout=300`) would make batch/CI usage more predictable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/improve_word_lists.py` around lines 466 - 484, The two networked git calls using subprocess.run in scripts/improve_word_lists.py (the clone call and the sparse-checkout call) should include a timeout to avoid hanging in CI; update both subprocess.run invocations (the one cloning "https://github.com/hermitdave/FrequencyWords.git" and the one running "git sparse-checkout set content/2018") to pass a timeout parameter (e.g., timeout=300) and optionally wrap them in a try/except for subprocess.TimeoutExpired to handle/log timeout failures cleanly.

tests/test_word_lists.py (1)
340-359: Roman numeral logic is duplicated from `scripts/improve_word_lists.py`.

`_ROMAN_RE` and `_is_roman_numeral` here are nearly identical to `_ROMAN_RE` / `is_roman_numeral` in `scripts/improve_word_lists.py` (lines 173–197). If either copy diverges, false positives or negatives will go unnoticed.

Consider extracting the shared logic into a common module (e.g., a small `utils`, or reuse the script's function directly) so there's a single source of truth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_word_lists.py` around lines 340 - 359, Duplicate Roman numeral logic exists as _ROMAN_RE and _is_roman_numeral in tests/test_word_lists.py and as _ROMAN_RE / is_roman_numeral in scripts/improve_word_lists.py; refactor by extracting the shared logic into a single utility (e.g., utils/roman.py) or by importing the existing is_roman_numeral from scripts/improve_word_lists.py, then update tests to call the shared function instead of the local _is_roman_numeral and remove the duplicated _ROMAN_RE/_is_roman_numeral definitions so there is a single source of truth.
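If the logic does get extracted, a minimal shared checker might look like the sketch below. The regex and function name are illustrative; the repo's actual `_ROMAN_RE` may differ in detail:

```python
import re

# Strict Roman-numeral pattern for 1-3999: optional thousands, then the
# hundreds, tens, and units groups with their subtractive pairs.
_ROMAN_RE = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)

def is_roman_numeral(word: str) -> bool:
    """True for non-empty strings that parse as a valid Roman numeral."""
    return bool(word) and _ROMAN_RE.match(word) is not None

print(is_roman_numeral("xviii"))  # → True
print(is_roman_numeral("civic"))  # → False: 'IC' is not a valid pairing
```

A single module owning this function lets both the generator script and the tests import it, so a fix to subtractive-pair handling lands everywhere at once.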
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/common_names.txt`:
- Around line 214-217: The list contains a duplicate entry "jones"; remove the
repeated "jones" occurrence so the name appears only once in the names list
(delete the later duplicate), and quickly scan the surrounding entries to ensure
there are no other accidental duplicates.
In `@webapp/data/languages/de/de_daily_words.txt`:
- Line 749: Remove the three offensive entries from the daily-answer pool by
deleting the words "huren", "neger", and "negro" from de_daily_words.txt (these
are the entries at/around the reported locations) so they cannot be served as
daily answers; also add those same words to de_blocklist.txt as a secondary
safeguard (even though app.py currently bypasses blocklist for daily-words,
keeping them in the blocklist prevents other paths from using them).
---
Duplicate comments:
In `@scripts/improve_word_lists.py`:
- Around line 621-627: The current CLI handler treats process_language results
with result["status"] == "skipped" as success; update the command branch that
calls process_language (the args.command == "process" block) to treat any
non-"ok" status as failure: after calling process_language(args.lang,
args.daily_count, args.dry_run, args.overwrite) check result["status"] and if it
is not "ok" (e.g., "skipped", "error", or any unknown value) print the reason to
stderr and call sys.exit(1) (preserving the existing error print for "error"
cases), so skipped/unmapped languages produce a non-zero exit; refer to the
variables args.command, process_language(...), result["status"], and sys.exit(1)
when making the change.
- Line 539: The language "ko" is in EXCLUDE so it's always skipped and should be
removed from the priority list to avoid noisy summaries; edit the priority list
variable that currently contains "ko" (e.g., BATCH_PRIORITY or PRIORITY_LANGS)
and delete the "ko" entry so EXCLUDE remains authoritative and the summary
output no longer lists a redundant skipped language.
- Around line 75-76: The EXCLUDE set uses a single blanket comment but the code
path that logs exclusions (the branch that prints "excluded (already high
quality)") needs per-language reasons; change EXCLUDE from a set to a dict
mapping language code → reason (e.g., {"ko": "FrequencyWords uses syllable
blocks", "en": "already high quality", ...}), update any membership checks
against EXCLUDE to key lookups (use EXCLUDE.get(lang) to obtain the reason), and
replace the fixed log text "excluded (already high quality)" with the
per-language reason when reporting exclusions; update references to EXCLUDE in
the script (e.g., where EXCLUDE is checked and the exclusion message is emitted)
to use the new dict semantics.
In `@tests/test_word_lists.py`:
- Around line 238-242: In class TestDailyWords update the mutable class-level
variable SUPPLEMENT_OVERLAP_XFAIL to include a ClassVar annotation to satisfy
Ruff RUF012 (e.g. add "from typing import ClassVar" and change the declaration
to "SUPPLEMENT_OVERLAP_XFAIL: ClassVar[set[str]] = {\"pl\", \"ckb\"}"); locate
the symbol SUPPLEMENT_OVERLAP_XFAIL in TestDailyWords and add the import and
annotation accordingly (alternatively use ClassVar[frozenset[str]] with a
frozenset literal if you prefer immutability).
In `@webapp/data/languages/ca/ca_daily_words.txt`:
- Around line 426-427: Remove the non-normative duplicated entry "danés" and
keep only the normative Central Catalan form "danès" in the daily pool; locate
the two adjacent entries "danès" and "danés" in the ca_daily_words list and
delete the "danés" line so only "danès" remains.
In `@webapp/data/languages/de/de_blocklist.txt`:
- Around line 27-32: The blocklist entries in de_blocklist.txt (e.g., "arsch",
"fotze", "nsdap") are ineffective because app.py bypasses this file when
de_daily_words.txt exists (it passes set()), and "nutte" remains in
de_daily_words.txt; to fix, remove all offensive words (including "nutte")
directly from de_daily_words.txt so they never appear in the daily-word pool,
and update app.py to pass the parsed blocklist (not an empty set) into the
daily-word selector or explicitly document/ensure the intended precedence
between de_daily_words.txt and de_blocklist.txt (search for usages of
de_daily_words.txt and the code path that passes set() in app.py to locate and
correct the behavior).
In `@webapp/data/languages/de/de_daily_words.txt`:
- Line 1230: Remove the offensive word "nutte" from the daily German word pool
by deleting it from webapp/data/languages/de/de_daily_words.txt (line containing
"nutte") and add the same token to webapp/data/languages/de/de_blocklist.txt to
prevent future re-addition; ensure the blocklist entry matches the exact
casing/spelling used in de_daily_words.txt.
---
Nitpick comments:
In `@scripts/improve_word_lists.py`:
- Around line 355-359: The set comprehension building new_supplement redundantly
re-checks is_valid_word for items already filtered into valid_freq; update the
comprehension to only include words from valid_freq that are not in
existing_word_set (remove the is_valid_word(w, char_set) condition) so
new_supplement is computed as {w for w in valid_freq if w not in
existing_word_set}; keep the variable names new_supplement, valid_freq,
existing_word_set unchanged.
- Around line 466-484: The two networked git calls using subprocess.run in
scripts/improve_word_lists.py (the clone call and the sparse-checkout call)
should include a timeout to avoid hanging in CI; update both subprocess.run
invocations (the one cloning "https://github.com/hermitdave/FrequencyWords.git"
and the one running "git sparse-checkout set content/2018") to pass a timeout
parameter (e.g., timeout=300) and optionally wrap them in a try/except for
subprocess.TimeoutExpired to handle/log timeout failures cleanly.
In `@tests/conftest.py`:
- Around line 62-64: The list/set comprehensions in the helpers repeatedly call
line.strip() multiple times per iteration; change each comprehension to compute
the stripped line once using the walrus operator (e.g., assign stripped =
line.strip() in the comprehension's if/for expressions) and then use that single
variable for the empty-check, startswith guard, and the lowercased/result value
so you avoid redundant strip calls in the helper that returns the list (lines
around the shown comprehension) and the analogous helper at lines 73-75.
- Around line 35-41: load_word_list currently returns words without lowercasing,
causing inconsistency with load_daily_words which lowercases entries; update
load_word_list to normalize each returned word with .lower() (and apply the same
normalization to load_supplement_words) so all helper loaders consistently
return lowercase strings, ensuring subset checks like in TestDailyWords behave
deterministically; locate and modify the functions load_word_list and
load_supplement_words to map .lower() over stripped lines before returning.
In `@tests/test_word_lists.py`:
- Around line 340-359: Duplicate Roman numeral logic exists as _ROMAN_RE and
_is_roman_numeral in tests/test_word_lists.py and as _ROMAN_RE /
is_roman_numeral in scripts/improve_word_lists.py; refactor by extracting the
shared logic into a single utility (e.g., utils/roman.py) or by importing the
existing is_roman_numeral from scripts/improve_word_lists.py, then update tests
to call the shared function instead of the local _is_roman_numeral and remove
the duplicated _ROMAN_RE/_is_roman_numeral definitions so there is a single
source of truth.
wordfreq integration:
- Add wordfreq library as second data source (Wikipedia, Reddit, Twitter, Google Books) to supplement FrequencyWords (OpenSubtitles)
- Smart foreign-word filter: skip words more common in English than the target language (removes Wikipedia noise like tech terms, English words)
- Massive supplement growth: Italian 3.5K→11.8K, French 3.1K→15.4K, Spanish 3.6K→14K, German 3.4K→9.9K, Esperanto 0.7K→20.3K

Keyboard layouts for 17 remaining languages:
- QWERTY: ga, gd, fy, ia, ie, la, rw, qya, tlh, az, ltg, tk, fur
- QWERTZ: lb, nds
- AZERTY: oc
- Cyrillic: mn (ЙЦУКЕН-based)

Data quality fixes:
- Portuguese: lowercase 24 proper nouns, remove 10 duplicates, add to blocklist
- Latin: remove 35 abbreviations with periods, clean character set
- xfail updates for la, az, oc, qya (word list quality issues)
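The "smart foreign-word filter" described above can be sketched as a comparison of Zipf frequencies. `is_likely_foreign`, the `margin` threshold, and the stubbed frequency table are assumptions for illustration; real use would pass `wordfreq.zipf_frequency` as the `zipf` callable:

```python
def is_likely_foreign(word: str, lang: str, zipf, margin: float = 1.0) -> bool:
    # Flag words noticeably more common in English than in the target
    # language -- these are usually Wikipedia/tech noise, not native words.
    return zipf(word, "en") > zipf(word, lang) + margin

# Stand-in frequency table so the demo is self-contained.
_FAKE = {
    ("server", "it"): 2.0, ("server", "en"): 4.5,
    ("pasta", "it"): 5.0, ("pasta", "en"): 4.2,
}

def fake_zipf(word, lang):
    # Mimics the wordfreq.zipf_frequency(word, lang) signature.
    return _FAKE.get((word, lang), 0.0)

print(is_likely_foreign("server", "it", fake_zipf))  # → True  (English noise)
print(is_likely_foreign("pasta", "it", fake_zipf))   # → False (genuinely Italian)
```

The margin keeps loanwords that are common in both corpora while still dropping entries that only exist in the target-language corpus because of English contamination.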
Actionable comments posted: 3
🧹 Nitpick comments (5)
webapp/data/languages/az/az_keyboard.json (1)
1-13: Overall structure and Azerbaijani letter coverage look correct.

The JSON is valid, the `"default"` key matches the layout name, navigation keys (⇨/⌫) are in place, and all 32 Azerbaijani Latin letters are represented (pending removal of `"w"` per the comment above). One optional nit: the `"label"` value `"QWERTY"` could be `"QÜERTY"` to match the official name of the Azerbaijani layout, but this is cosmetic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@webapp/data/languages/az/az_keyboard.json` around lines 1 - 13, Update the Azerbaijani layout's label to the official spelling by changing the "label" value inside the "azerbaijani_qwerty" layout from "QWERTY" to "QÜERTY"; locate the "azerbaijani_qwerty" object and edit its "label" property accordingly (optionally, also follow up on the earlier note about removing the "w" key from the first row if that upstream change is applied).

tests/test_language_config.py (1)
156-156: Annotate `KEYBOARD_COVERAGE_XFAIL` with `ClassVar` to silence Ruff RUF012

Ruff flags mutable class-level attributes that aren't annotated as `ClassVar`. The same pattern applies to `KEYBOARD_COVERAGE_XFAIL` here and in `test_word_lists.py`.

♻️ Proposed fix

```diff
+from typing import ClassVar
+
 class TestKeyboardConfig:
     # Languages with known keyboard coverage gaps (complex scripts, incomplete keyboards)
-    KEYBOARD_COVERAGE_XFAIL = {"vi", "ko", "el", "pt", "pau", "la", "az", "oc", "qya"}
+    KEYBOARD_COVERAGE_XFAIL: ClassVar[set[str]] = {"vi", "ko", "el", "pt", "pau", "la", "az", "oc", "qya"}
```
Verify each finding against the current code and only fix it if needed. In `@tests/test_language_config.py` at line 156, KEYBOARD_COVERAGE_XFAIL is a mutable class-level attribute flagged by Ruff RUF012; annotate it as a ClassVar to indicate it is not an instance attribute. Update the declaration of KEYBOARD_COVERAGE_XFAIL (and the same variable in test_word_lists.py) to use typing.ClassVar, e.g. ClassVar[set[str]] or ClassVar[FrozenSet[str]] if you prefer immutability, and add the necessary import for ClassVar at the top of each file.

tests/test_word_lists.py (2)
337-382: `_is_roman_numeral` duplicates `is_roman_numeral` from `scripts/improve_word_lists.py`

`TestDailyWordQuality._is_roman_numeral` (lines 342–359) is a near-verbatim copy of `is_roman_numeral` in `scripts/improve_word_lists.py` (lines 225–246). Any future fix to the logic (e.g., correcting subtractive-pair validation) must be applied in both places.

If the test harness cannot import from `scripts/`, consider extracting the function into a shared utility module (e.g., `webapp/utils.py` or a dedicated `wordle_utils` package) that both the test and the script can import from.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_word_lists.py` around lines 337 - 382, TestDailyWordQuality._is_roman_numeral duplicates the logic in scripts.improve_word_lists.is_roman_numeral; extract the Roman-numeral checker into a single shared utility (e.g., create function is_roman_numeral in webapp.utils or a new wordle_utils module) and replace the copy in tests/test_word_lists.py by importing that shared is_roman_numeral, and update scripts/improve_word_lists.py to import the same shared function instead of defining its own; ensure the test class references the imported function (or calls the shared is_roman_numeral) and remove the _is_roman_numeral method to eliminate duplication.
145-145: Annotate `KEYBOARD_COVERAGE_XFAIL` with `ClassVar` to silence Ruff RUF012

Identical pattern to `test_language_config.py` line 156.

♻️ Proposed fix

```diff
+from typing import ClassVar
+
 class TestKeyboardCoverage:
-    KEYBOARD_COVERAGE_XFAIL = {"vi", "ko", "el", "pt", "pau", "la", "az", "oc", "qya"}
+    KEYBOARD_COVERAGE_XFAIL: ClassVar[set[str]] = {"vi", "ko", "el", "pt", "pau", "la", "az", "oc", "qya"}
```
Verify each finding against the current code and only fix it if needed. In `@tests/test_word_lists.py` at line 145, The module-level constant KEYBOARD_COVERAGE_XFAIL should be annotated as a ClassVar to satisfy Ruff RUF012: add a typing import for ClassVar (and Set if not already imported) and change the declaration of KEYBOARD_COVERAGE_XFAIL to something like KEYBOARD_COVERAGE_XFAIL: ClassVar[Set[str]] = {"vi", "ko", "el", "pt", "pau", "la", "az", "oc", "qya"} so the static analyzer recognizes it as a class/module-level constant.

scripts/improve_word_lists.py (1)
586-604: Ruff S607: use the full path to `git` instead of a bare executable name

Passing a partial executable path to `subprocess.run` is flagged by Ruff S607. For a developer tool this is low risk, but using `shutil.which("git")` to resolve the full path eliminates the PATH-injection vector.

♻️ Proposed fix

```diff
+import shutil
+
 def download_frequency_words():
     ...
+    git_path = shutil.which("git")
+    if not git_path:
+        raise RuntimeError("git not found in PATH")
     subprocess.run(
-        ["git", "clone", "--depth", "1", "--filter=blob:none", "--sparse",
-         "https://github.com/hermitdave/FrequencyWords.git"],
+        [git_path, "clone", "--depth", "1", "--filter=blob:none", "--sparse",
+         "https://github.com/hermitdave/FrequencyWords.git"],
         cwd=target,
         check=True,
     )
     subprocess.run(
-        ["git", "sparse-checkout", "set", "content/2018"],
+        [git_path, "sparse-checkout", "set", "content/2018"],
         cwd=repo_dir,
         check=True,
     )
```
Verify each finding against the current code and only fix it if needed. In `@scripts/improve_word_lists.py` around lines 586 - 604, Replace bare "git" invocations passed to subprocess.run with the full path resolved via shutil.which("git") to avoid PATH-injection; in scripts/improve_word_lists.py obtain git_path = shutil.which("git") near where subprocess.run is called, validate git_path is not None (raise an informative exception if it is), and then use [git_path, "clone", ...] and [git_path, "sparse-checkout", "set", "content/2018"] instead of the string "git"; keep existing cwd and check=True arguments and reuse variables repo_dir and target to locate the calls.
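Combining this with the earlier timeout suggestion, a hardened wrapper might look like the sketch below. `run_git` and its defaults are illustrative, not the script's actual code:

```python
import shutil
import subprocess

def run_git(args, cwd=None, timeout=300):
    """Run a git subcommand with an absolute executable path and a timeout.

    shutil.which avoids Ruff S607 (partial executable path); the timeout
    keeps networked operations like clone from hanging CI runs.
    """
    git_path = shutil.which("git")
    if git_path is None:
        raise RuntimeError("git not found in PATH")
    return subprocess.run([git_path, *args], cwd=cwd, check=True, timeout=timeout)

# Usage (networked, so not executed here):
# run_git(["clone", "--depth", "1", "--sparse",
#          "https://github.com/hermitdave/FrequencyWords.git"], cwd=cache_dir)
```

Callers can catch `subprocess.TimeoutExpired` to log a clear failure instead of letting a batch run hang.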
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Line 12: The pyproject currently lists the package "wordfreq" under production
dependencies but it's only imported via "import wordfreq" in an optional script
and not used by the webapp; move "wordfreq" out of the main dependencies and add
it to the dev (or development) dependency group in pyproject.toml so it is
installed only for development/test builds, leaving the try/except usage in the
script intact and ensuring no runtime imports in the webapp (e.g., module
webapp.app) depend on it.
In `@webapp/data/languages/az/az_keyboard.json`:
- Line 7: Remove the spurious "w" entry from the first keyboard row array (the
row currently ["q","w","e","ə","r","t","y","u","ü","i","ı","o","ö","p"]) so the
row matches the Azerbaijani QÜERTY layout (keep "ü" in place and do not add any
other characters); locate the array in az_keyboard.json (look for the array
starting with "q") and delete only the "w" string so the keyboard letters match
the official Azerbaijani 32-letter set.
In `@webapp/data/languages/da/da_5words_supplement.txt`:
- Around line 1-2798: The supplement file da_5words_supplement.txt includes
OCR-corrupted entries (e.g., "haiio","hjaip","iiget","ldiot","roiig") that pass
is_valid_word() because characters are alphabetic; add a post-generation
blocklist/deny-list pass after is_valid_word() that filters known-bad patterns
(e.g., sequences of >=3 identical vowels like "iii", suspicious repeated 'i'/'l'
runs, common l/I confusion patterns like leading 'l' where 'i' expected) or load
a curated per-language deny-list to remove these entries from the FrequencyWords
output; implement this as a small filter function (e.g.,
filter_bad_ocr_artifacts()) invoked before finalizing the supplement so the
listed examples are excluded.
---
Duplicate comments:
In `@scripts/improve_word_lists.py`:
- Around line 744-747: The current post-call handling of process_language only
exits non-zero on result["status"] == "error", so skipped languages
(result["status"] == "skipped") return 0; change the logic in the block after
process_language (the code referencing result, result["status"], and printing
the message) to treat "skipped" as a failure for CI by printing the
warning/error to stderr and calling sys.exit(1) when result["status"] is not
"ok" (or explicitly when it's "skipped" or "error"), keeping the same message
logic via result.get("reason") so callers receive a non-zero exit code on
skipped/unknown languages.
- Around line 79-80: Change EXCLUDE from a set to a dict mapping language code
to an exclusion reason (e.g., EXCLUDE = {"en":"already high quality",
"ko":"syllable blocks produce zero 5-letter matches", ...}), then update
process_language to check EXCLUDE.get(lang) and set result["reason"] to the dict
value (not the hardcoded "excluded (already high quality)"); finally, remove
"ko" from the batch_process priority list so it isn't listed as "skipped" in
summaries. This touches the EXCLUDE constant, the process_language function
where result["reason"] is assigned, and the batch_process priority list.
In `@tests/test_word_lists.py`:
- Line 242: The constant SUPPLEMENT_OVERLAP_XFAIL should be annotated with
ClassVar to satisfy RUF012; update its definition in tests/test_word_lists.py to
include a ClassVar type annotation (e.g., SUPPLEMENT_OVERLAP_XFAIL:
ClassVar[set[str]] = {"pl", "ckb"}) and ensure ClassVar is imported from typing
at the top of the module (add "from typing import ClassVar" if missing).
---
Nitpick comments:
In `@scripts/improve_word_lists.py`:
- Around line 586-604: Replace bare "git" invocations passed to subprocess.run
with the full path resolved via shutil.which("git") to avoid PATH-injection; in
scripts/improve_word_lists.py obtain git_path = shutil.which("git") near where
subprocess.run is called, validate git_path is not None (raise an informative
exception if it is), and then use [git_path, "clone", ...] and [git_path,
"sparse-checkout", "set", "content/2018"] instead of the string "git"; keep
existing cwd and check=True arguments and reuse variables repo_dir and target to
locate the calls.
In `@tests/test_language_config.py`:
- Line 156: KEYBOARD_COVERAGE_XFAIL is a mutable class-level attribute flagged
by Ruff RUF012; annotate it as a ClassVar to indicate it is not an instance
attribute. Update the declaration of KEYBOARD_COVERAGE_XFAIL (and the same
variable in test_word_lists.py) to use typing.ClassVar, e.g. ClassVar[set[str]]
or ClassVar[FrozenSet[str]] if you prefer immutability, and add the necessary
import for ClassVar at the top of each file.
In `@tests/test_word_lists.py`:
- Around line 337-382: TestDailyWordQuality._is_roman_numeral duplicates the
logic in scripts.improve_word_lists.is_roman_numeral; extract the Roman-numeral
checker into a single shared utility (e.g., create function is_roman_numeral in
webapp.utils or a new wordle_utils module) and replace the copy in
tests/test_word_lists.py by importing that shared is_roman_numeral, and update
scripts/improve_word_lists.py to import the same shared function instead of
defining its own; ensure the test class references the imported function (or
calls the shared is_roman_numeral) and remove the _is_roman_numeral method to
eliminate duplication.
- Line 145: The module-level constant KEYBOARD_COVERAGE_XFAIL should be
annotated as a ClassVar to satisfy Ruff RUF012: add a typing import for ClassVar
(and Set if not already imported) and change the declaration of
KEYBOARD_COVERAGE_XFAIL to something like KEYBOARD_COVERAGE_XFAIL:
ClassVar[Set[str]] = {"vi", "ko", "el", "pt", "pau", "la", "az", "oc", "qya"} so
the static analyzer recognizes it as a class/module-level constant.
In `@webapp/data/languages/az/az_keyboard.json`:
- Around line 1-13: Update the Azerbaijani layout's label to the official
spelling by changing the "label" value inside the "azerbaijani_qwerty" layout
from "QWERTY" to "QÜERTY"; locate the "azerbaijani_qwerty" object and edit its
"label" property accordingly (optionally, also follow up on the earlier note
about removing the "w" key from the first row if that upstream change is
applied).
- Move wordfreq to dev dependencies (only used in scripts)
- Remove docs/DICTIONARY_IMPROVEMENT_PLAN.md from repo
- Remove 'w' from Azerbaijani keyboard (not in Azerbaijani alphabet)
- Remove danés from Catalan daily words (keep danès, the Catalan form)
- Remove huren/neger/negro from German daily words + add to blocklist
- Remove 20 OCR-corrupted entries from Danish supplement
- Remove duplicate jones from common_names.txt
- Fix README heading level (### → ##)
- Fix ko exclusion reason and process command exit code
- Add ClassVar type annotation for SUPPLEMENT_OVERLAP_XFAIL
- Remove 'dəng:' (colon artifact) from main word list
- Lowercase 3 proper nouns in supplement (Ağyol, Allah, Quran)
- Remove az from CHARACTER_XFAIL and KEYBOARD_COVERAGE_XFAIL
- 5 more tests passing (1972 passed, 28 xfailed)
@coderabbitai review

✅ Actions performed: Review triggered.
Data quality fixes:
- Portuguese: remove 29 invalid entries (whitespace, periods, asterisks), complete diacritic_map for keyboard coverage
- Palauan: lowercase 24 proper nouns, remove 222 duplicates, 1 non-alpha word
- Quenya: remove 7 words with punctuation (!, -)
- Occitan: NFC-normalize combining accents, remove 1 resulting wrong-length word
- Polish: remove 2,406 supplement/main overlaps
- Kurdish: remove 990 supplement/main overlaps + 1 wrong-length word

Keyboard/diacritic fixes:
- Greek: add diacritic_map (ά→α, έ→ε, etc.)
- Vietnamese: add comprehensive diacritic_map (59 tonal variants)
- Latin: remove from xfail (already passing)

Result: 2004 passed, 289 skipped, 2 xfailed (Korean only)
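The diacritic_map additions amount to a simple character fold applied when comparing keyboard input to word-list entries. A minimal sketch (map contents abbreviated, function name assumed):

```python
# Abbreviated Greek diacritic map, in the spirit of the commit's addition.
DIACRITIC_MAP = {
    "ά": "α", "έ": "ε", "ή": "η", "ί": "ι",
    "ό": "ο", "ύ": "υ", "ώ": "ω",
}

def fold_diacritics(word: str) -> str:
    # Replace each accented character with its base letter; pass everything
    # else through unchanged.
    return "".join(DIACRITIC_MAP.get(ch, ch) for ch in word)

print(fold_diacritics("καλά"))  # → "καλα"
```

With a fold like this, a player typing unaccented letters on a bare keyboard can still match accented dictionary entries.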
Actionable comments posted: 5
🧹 Nitpick comments (4)
scripts/improve_word_lists.py (2)
586-604: Ruff S607: subprocess invoked with partial executable path `"git"`

Ruff flags both `subprocess.run` calls as S607 because `"git"` is resolved via `PATH`, which could be manipulated in non-trusted environments. For a dev script this is low-risk, but the fix is trivial using `shutil.which`:

♻️ Proposed fix

```diff
+import shutil
+
+_GIT = shutil.which("git") or "git"
+
 subprocess.run(
     [
-        "git",
+        _GIT,
         "clone",
         ...
     ],
     ...
 )
 subprocess.run(
-    ["git", "sparse-checkout", "set", "content/2018"],
+    [_GIT, "sparse-checkout", "set", "content/2018"],
     ...
 )
```
Verify each finding against the current code and only fix it if needed. In `@scripts/improve_word_lists.py` around lines 586 - 604, The subprocess.run calls that invoke "git" (the two occurrences in scripts/improve_word_lists.py using subprocess.run([...], cwd=target, ...) and subprocess.run([...], cwd=repo_dir, ...)) should resolve the git executable with shutil.which and use that absolute path to avoid PATH injection; locate where "git" is passed as the first element of the argv lists, call shutil.which("git"), raise a clear exception if None, and replace the literal "git" entries with the returned path for both clone and sparse-checkout subprocess.run invocations.
491-491: Ruff RUF003: ambiguous Unicode σ character in source comment

The literal σ (U+03C3, GREEK SMALL LETTER SIGMA) in the comment can render inconsistently across editors and is flagged by Ruff. Spell it out or use the Unicode escape:

♻️ Proposed fix

```diff
-    # Greek: normalize final sigma (σ at word end → ς)
+    # Greek: normalize final sigma (U+03C3 at word end → U+03C2 final sigma)
```
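The normalization the comment describes amounts to a one-character rewrite at word end. This sketch (function name assumed) shows the behavior:

```python
def normalize_final_sigma(word: str) -> str:
    """Rewrite Greek medial sigma to final sigma in word-final position.

    U+03C3 (medial sigma) becomes U+03C2 (final sigma) only at the end of
    a word; medial occurrences are left alone.
    """
    if word.endswith("\u03c3"):
        return word[:-1] + "\u03c2"
    return word

print(normalize_final_sigma("λογοσ"))  # → "λογος"
print(normalize_final_sigma("σοφια"))  # unchanged: sigma is not word-final
```

Using the `\u03c3`/`\u03c2` escapes in the comparison keeps the source ASCII-clean, which is exactly what the RUF003 fix above asks for in the comment text.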
Verify each finding against the current code and only fix it if needed. In `@scripts/improve_word_lists.py` at line 491, Summary: the inline comment contains the literal Greek small letter sigma (σ) which Ruff flags as ambiguous; update the comment to avoid the direct Unicode glyph. Fix: edit the comment that reads like "# Greek: normalize final sigma (σ at word end → ς)" and replace the literal σ with either the spelled-out word "sigma" or the Unicode escape sequence "\u03C3" (e.g. "# Greek: normalize final sigma (sigma at word end → ς)" or "# Greek: normalize final sigma (\u03C3 at word end → ς)"); keep the rest of the comment intact and ensure the chosen form matches project linting preferences.

tests/test_language_config.py (1)
156-156: Ruff RUF012: mutable `set` as a class attribute — add `ClassVar`

The narrowing to `{"ko"}` is a welcome improvement. However, Ruff (RUF012) flags the mutable set literal as a class attribute without `ClassVar`. Add the annotation to silence the warning and clarify intent:

♻️ Proposed fix

```diff
+from typing import ClassVar
+
 class TestKeyboardConfig:
     # Languages with known keyboard coverage gaps (complex scripts, incomplete keyboards)
-    KEYBOARD_COVERAGE_XFAIL = {"ko"}
+    KEYBOARD_COVERAGE_XFAIL: ClassVar[set[str]] = {"ko"}
```
Verify each finding against the current code and only fix it if needed. In `@tests/test_language_config.py` at line 156, KEYBOARD_COVERAGE_XFAIL is defined as a mutable set at class scope and triggers Ruff RUF012; update its annotation to be a ClassVar to indicate it's not an instance attribute (e.g., annotate KEYBOARD_COVERAGE_XFAIL as ClassVar[Set[str]]), and add the necessary imports (ClassVar and Set) to the module if missing; keep the assigned value as {"ko"} but change the type to ClassVar[Set[str]] to silence the linter and clarify intent.

tests/test_daily_word.py (1)
149-149: Annotate with `ClassVar` to match the fix already applied in `test_word_lists.py`.

Ruff RUF012 flags this mutable class-level default. The same issue was already addressed for `SUPPLEMENT_OVERLAP_XFAIL` in `test_word_lists.py` by adding a `ClassVar` annotation — apply the same pattern here for consistency.

🔧 Proposed fix

```diff
+from typing import ClassVar
+
 class TestWordCycling:
     """Tests for word list cycling behavior."""

     # Languages with duplicate words causing early repeats
-    REPEAT_XFAIL: set[str] = set()
+    REPEAT_XFAIL: ClassVar[set[str]] = set()
```
Verify each finding against the current code and only fix it if needed. In `@tests/test_daily_word.py` at line 149, The class-level mutable default REPEAT_XFAIL should be annotated as a ClassVar like the similar fix in test_word_lists.py to silence RUF012; update the declaration of REPEAT_XFAIL to use ClassVar[set[str]] = set() and add/import ClassVar from typing at the top of tests/test_daily_word.py if it isn't already present so the mutable set is explicitly a class variable.
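As a quick illustration of why the annotation fits here: `ClassVar` documents that the set lives on the class and is shared by all instances, which is exactly the semantics an xfail registry wants. The class below is a stand-in, not the real test class:

```python
from typing import ClassVar

class DemoConfig:
    # ClassVar tells type checkers (and satisfies Ruff RUF012) that this
    # mutable default belongs to the class, not to each instance.
    REPEAT_XFAIL: ClassVar[set[str]] = set()

a, b = DemoConfig(), DemoConfig()
DemoConfig.REPEAT_XFAIL.add("ko")
print(a.REPEAT_XFAIL is b.REPEAT_XFAIL)  # → True: one shared class-level set
print("ko" in b.REPEAT_XFAIL)            # → True
```

The annotation changes nothing at runtime; it only makes the shared-mutable intent explicit to linters and readers.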
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@webapp/data/languages/ca/ca_daily_words.txt`:
- Around line 78-79: Remove the non-standard duplicate "aprés" (with acute
accent) so only the normative "après" (with grave accent) remains in the word
list; locate the two adjacent entries "après" and "aprés" and delete the "aprés"
entry to avoid rejecting correct guesses that use the normative form.
In `@webapp/data/languages/da/da_5words_supplement.txt`:
- Line 490: Remove the four offensive entries ("cunth", "negro", "nigga",
"pussy") from the Danish supplement file and add them to the Danish blocklist:
delete these exact tokens from da_5words_supplement.txt and append them (one per
line, matching existing blocklist formatting and avoiding duplicates) to
da_blocklist.txt so the blocklist infrastructure will catch them.
- Line 1055: Remove the listed OCR-corrupted entries from
webapp/data/languages/da/da_5words_supplement.txt (delete: hjæip, iader, iagde,
iangt, iaver, ieder, iever, iugte, iukke, iyder, iykke, iyset, iyver, iægen,
iægge, iænge, iærer, iøber, iøbet) and delete the four offensive entries (cunth,
negro, nigga, pussy); then update scripts/improve_word_lists.py to extend the
OCR filter (the validation function that currently filters OCR artifacts, e.g.
is_valid_word/filter_word) to reject words matching the new regexes ^i[yæø] and
^ia so these corrupted l→i forms are filtered out going forward, and add the
four offensive words to webapp/data/languages/da/da_blocklist.txt.
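The OCR-filter extension suggested above can be sketched as a small regex check. This is a hypothetical sketch, not the script's actual implementation: the function name `looks_like_ocr_artifact` is invented, and note that the broad `^ia` pattern could over-filter legitimate loanwords, so it should be validated against the real word list before adoption.

```python
import re

# Patterns for OCR l->i corruptions seen in the Danish supplement,
# e.g. "iykke" (lykke), "iøber" (løber), "iader" (lader).
OCR_ARTIFACT_PATTERNS = [
    re.compile(r"^i[yæø]"),  # l->i before y/æ/ø: iykke, iæger, iøber
    re.compile(r"^ia"),      # l->i before a: iader, iagde, iangt
]


def looks_like_ocr_artifact(word: str) -> bool:
    """Return True if the word matches a known OCR-corruption pattern."""
    return any(p.match(word) for p in OCR_ARTIFACT_PATTERNS)
```

Running existing supplement files through such a check would surface any remaining corrupted forms in one pass.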
In `@webapp/data/languages/de/de_blocklist.txt`:
- Around line 27-36: The de_blocklist.txt offensive section is missing four
German offensive terms that remain in de_daily_words.txt; update
de_blocklist.txt (offensive/inappropriate section) to include "nutte", "dildo",
"dirne", and "tunte" so those tokens are blocked by the general blocklist used
outside app.py, ensuring defense-in-depth across code paths (referenced files:
de_blocklist.txt, de_daily_words.txt, and app.py).
In `@webapp/data/languages/de/de_daily_words.txt`:
- Around line 341-345: de_daily_words.txt currently contains offensive German
terms and is bypassing the blocklist, so remove the listed entries from the
daily word file and add them to the blocklist: remove "dildo", "dirne", "fürze",
"nutte", and "tunte" from de_daily_words.txt and append them to de_blocklist.txt
(one per line); verify the daily-word selector now receives the actual blocklist
set rather than an empty set after this change and run the selector to confirm
those words are no longer served.
---
Duplicate comments:
In `@tests/test_word_lists.py`:
- Around line 30-32: Add a ClassVar annotation for each mutable class-level set
to satisfy Ruff RUF012: import ClassVar from typing, then change the type of
each class attribute like LOWERCASE_XFAIL, DUPLICATE_XFAIL,
SUPPLEMENT_LENGTH_XFAIL, SUPPLEMENT_OVERLAP_XFAIL and the other three set[str]
attributes to ClassVar[set[str]] so the mutable sets are explicitly marked as
class variables.
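The `ClassVar` fix requested here can be illustrated with a minimal sketch; the class name and set values below are invented for illustration and are not the actual test-suite attributes.

```python
from typing import ClassVar


class TestWordLists:
    # ClassVar marks these mutable sets as shared class-level attributes,
    # which satisfies Ruff RUF012 (mutable default in a class body).
    LOWERCASE_XFAIL: ClassVar[set[str]] = set()
    DUPLICATE_XFAIL: ClassVar[set[str]] = {"ko"}
```

Without the annotation, Ruff cannot tell whether the set was meant as a per-instance default (a common bug source) or an intentional class constant; `ClassVar` makes the intent explicit.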
---
Nitpick comments:
In `@scripts/improve_word_lists.py`:
- Around line 586-604: The subprocess.run calls that invoke "git" (the two
occurrences in scripts/improve_word_lists.py using subprocess.run([...],
cwd=target, ...) and subprocess.run([...], cwd=repo_dir, ...)) should resolve
the git executable with shutil.which and use that absolute path to avoid PATH
injection; locate where "git" is passed as the first element of the argv lists,
call shutil.which("git"), raise a clear exception if None, and replace the
literal "git" entries with the returned path for both clone and sparse-checkout
subprocess.run invocations.
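The suggested `shutil.which` hardening could look like the following sketch; the helper name `run_git` is an assumption, not code from the script.

```python
import shutil
import subprocess


def run_git(args: list[str], cwd: str) -> None:
    """Run git via its resolved absolute path to avoid PATH injection."""
    git = shutil.which("git")
    if git is None:
        raise RuntimeError("git executable not found on PATH")
    subprocess.run([git, *args], cwd=cwd, check=True)
```

Both the clone and the sparse-checkout invocations would then call this helper instead of passing the bare string `"git"` to `subprocess.run`.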
- Line 491: Summary: the inline comment contains the literal Greek small letter
sigma (σ) which Ruff flags as ambiguous; update the comment to avoid the direct
Unicode glyph. Fix: edit the comment that reads like "# Greek: normalize final
sigma (σ at word end → ς)" and replace the literal σ with either the spelled-out
word "sigma" or the Unicode escape sequence "\u03C3" (e.g. "# Greek: normalize
final sigma (sigma at word end → ς)" or "# Greek: normalize final sigma (\u03C3
at word end → ς)"); keep the rest of the comment intact and ensure the chosen
form matches project linting preferences.
In `@tests/test_daily_word.py`:
- Line 149: The class-level mutable default REPEAT_XFAIL should be annotated as
a ClassVar like the similar fix in test_word_lists.py to silence RUF012; update
the declaration of REPEAT_XFAIL to use ClassVar[set[str]] = set() and add/import
ClassVar from typing at the top of tests/test_daily_word.py if it isn't already
present so the mutable set is explicitly a class variable.
In `@tests/test_language_config.py`:
- Line 156: KEYBOARD_COVERAGE_XFAIL is defined as a mutable set at class scope
and triggers Ruff RUF012; update its annotation to be a ClassVar to indicate
it’s not an instance attribute (e.g., annotate KEYBOARD_COVERAGE_XFAIL as
ClassVar[Set[str]]), and add the necessary imports (ClassVar and Set) to the
module if missing; keep the assigned value as {"ko"} but change the type to
ClassVar[Set[str]] to silence the linter and clarify intent.
- Remove 7 dead xfail sets and their check blocks (all resolved)
- Deduplicate `is_roman_numeral()`: tests now import from script
- Fix docstring placement in test_keyboard_covers_all_word_characters
- Fix `is_likely_foreign()` fallback: return False for unmapped languages instead of using lang code directly (could silently filter all words)
- Use `final_form_map` config instead of hardcoded Greek sigma normalization
- Fix SOURCES.md template: don't claim Hunspell for all languages, mention wordfreq when used
- Add comment explaining FREQ_LANG_MAP vs WORDFREQ_LANG_MAP differences
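A config-driven final-form normalization, as opposed to a hardcoded Greek-only rule, can be sketched as a per-language map from medial letters to their word-final forms. The map contents and function name below are illustrative assumptions, not the script's actual config.

```python
# Hypothetical per-language map: medial letter -> required word-final form.
FINAL_FORM_MAP: dict[str, dict[str, str]] = {
    "el": {"\u03c3": "ς"},  # Greek: medial sigma -> final sigma
    "he": {"כ": "ך", "מ": "ם", "נ": "ן", "פ": "ף", "צ": "ץ"},  # Hebrew finals
}


def normalize_final_form(word: str, lang: str) -> str:
    """Replace a word-final medial letter with its final form, if mapped."""
    mapping = FINAL_FORM_MAP.get(lang)
    if not mapping or not word:
        return word
    last = word[-1]
    return word[:-1] + mapping.get(last, last)
```

The same data-driven shape also sidesteps the Ruff ambiguous-glyph warning: the sigma lives in config data (or as a `\u03c3` escape) rather than in an inline comment.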
@coderabbitai review
✅ Actions performed: Review triggered.
- Catalan: remove non-standard aprés (keep normative après)
- Danish: remove 19 more OCR l→i artifacts + 4 offensive English words
- Danish: add cunth/negro/nigga/pussy to blocklist
- German: remove dildo/dirne/fürze/nutte/tunte from daily words
- German: add dildo/dirne/fürze/nutte/tunte to blocklist
Blocklist should only prevent words from being daily answers, not from being valid guesses. Players should be able to type any real word.
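That distinction, blocklisted words stay out of the daily rotation but remain typeable, can be sketched with two separate checks. The function and variable names are hypothetical, chosen only to illustrate the rule.

```python
def is_valid_guess(word: str, word_set: set[str], supplement: set[str]) -> bool:
    """Any real word is accepted as a guess, blocklisted or not."""
    return word in word_set or word in supplement


def is_daily_candidate(word: str, word_set: set[str], blocklist: set[str]) -> bool:
    """The blocklist only excludes words from being daily answers."""
    return word in word_set and word not in blocklist
```

Keeping the two predicates separate means adding a word to the blocklist can never cause a player's legitimate guess to be rejected.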
feat: Frequency-ranked daily words and supplements for 38 languages
Summary
Major word list and keyboard quality overhaul for 38+ languages.
Word list improvements
Keyboard layouts
Diacritic maps
Data quality fixes
Code quality
- `is_roman_numeral()`: tests import from script instead of copy
- `final_form_map` config instead of hardcoded Greek sigma normalization
- `is_likely_foreign()` fallback bug fixed (could silently filter all words)
- `wordfreq` moved to dev dependencies (only used in scripts)

Infrastructure
Test results
Test plan
- `uv run pytest tests/`: all pass
- `pnpm test`: all pass