
feat: Frequency-ranked daily words and supplements for 38 languages#118

Merged
Hugo0 merged 10 commits into main from feat/frequency-word-lists on Feb 21, 2026

Conversation

@Hugo0 (Owner) commented Feb 21, 2026

Summary

Major word list and keyboard quality overhaul for 38+ languages.

Word list improvements

  • FrequencyWords integration: Daily word ranking and supplement generation using subtitle frequency data for 38 languages
  • wordfreq integration: Additional valid guesses from Wikipedia, Reddit, Twitter, Google Books corpus data
  • Smart foreign-word filter: Rejects words that are more common in English than the target language (removes Wikipedia noise; sketched below)
  • Blocklist system: Per-language blocklists + global common names list to filter proper nouns from daily words
  • Roman numeral filter: Prevents Roman numerals (e.g., xviii) from being daily answers (also sketched below)
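The two filters above can be sketched as follows; the function names match the PR text, but the fallback behavior is an assumption rather than the exact logic in scripts/improve_word_lists.py (the Roman numeral regex is the one quoted in the review comments further down):

```python
import re

from wordfreq import word_frequency  # dev dependency added by this PR

# Well-formed Roman numerals, e.g. "xviii" (regex quoted in the review below).
_ROMAN_RE = re.compile(r"^m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$")


def is_roman_numeral(word: str) -> bool:
    """True for well-formed Roman numerals such as 'xviii'."""
    return bool(word) and _ROMAN_RE.match(word.lower()) is not None


def is_likely_foreign(word: str, lang: str) -> bool:
    """Reject words more common in English than in the target language
    (typical Wikipedia noise: tech terms and English loanwords)."""
    try:
        target = word_frequency(word, lang)
    except LookupError:
        return False  # no frequency data for lang: don't filter everything
    return word_frequency(word, "en") > target
```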

Keyboard layouts

  • Created proper keyboard layouts for 17 languages (previously auto-generated alphabetical layouts)
  • QWERTY, QWERTZ, AZERTY, and Cyrillic layouts matched to regional conventions
  • Fixed Breton keyboard (added ê, ñ, ù)
  • Fixed Azerbaijani keyboard (removed spurious 'w', cleaned word data)

Diacritic maps

  • Greek: added diacritic_map for accented vowels (ά→α, etc.; applied as sketched below)
  • Vietnamese: comprehensive diacritic_map for 59 tonal variants
  • Portuguese: completed diacritic_map with rare chars (è, î, ö, ü, ũ, ẽ)
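A hypothetical illustration of how such a diacritic_map might be applied when normalizing a guess; the keys/values follow the Greek example above, but the real config format and call site may differ:

```python
GREEK_DIACRITIC_MAP = {
    "ά": "α", "έ": "ε", "ή": "η", "ί": "ι",
    "ό": "ο", "ύ": "υ", "ώ": "ω", "ϊ": "ι", "ΰ": "υ",
}


def strip_diacritics(word: str, diacritic_map: dict[str, str]) -> str:
    # Map each accented character to its base form; leave others alone.
    return "".join(diacritic_map.get(ch, ch) for ch in word)


assert strip_diacritics("χωρίς", GREEK_DIACRITIC_MAP) == "χωρις"
```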

Data quality fixes

  • Portuguese: removed 29 invalid entries (whitespace, periods, asterisks, multi-word)
  • Palauan: lowercased 24 proper nouns, removed 222 duplicates
  • Azerbaijani: removed colon artifact, lowercased proper nouns, removed 'w' from keyboard
  • German: added offensive words to blocklist, removed from daily rotation
  • Catalan: removed duplicate danés variant (kept danès)
  • Danish: removed 20 OCR-corrupted entries from supplement
  • Latin: removed 35 abbreviations containing periods
  • Quenya: removed 7 words with punctuation
  • Occitan: NFC-normalized combining accents
  • Polish: removed 2,406 supplement/main overlaps
  • Kurdish: removed 990 supplement/main overlaps + 1 wrong-length word

Code quality

  • Deduplicated is_roman_numeral() — tests now import it from the script instead of keeping a copy
  • Removed 7 dead xfail sets (all data issues resolved)
  • Used final_form_map config instead of hardcoded Greek sigma normalization (sketched below)
  • Fixed is_likely_foreign() fallback bug (could silently filter all words)
  • Fixed SOURCES.md template (don't claim Hunspell for all languages)
  • wordfreq moved to dev dependencies (only used in scripts)
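The final_form_map change might look like the sketch below; final_form_map is the config key named above, while the helper itself is illustrative:

```python
FINAL_FORM_MAP = {"σ": "ς"}  # Greek: non-final sigma -> final sigma at word end


def apply_final_forms(word: str, final_form_map: dict[str, str]) -> str:
    # Config-driven replacement for the old hardcoded sigma special case.
    if word and word[-1] in final_form_map:
        return word[:-1] + final_form_map[word[-1]]
    return word


assert apply_final_forms("τελοσ", FINAL_FORM_MAP) == "τελος"
```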

Infrastructure

  • Pipeline improvements: overwrite guards, encoding fixes, better error handling
  • Removed dead code and stale documentation

Test results

  • 2004 passed, 289 skipped, 2 xfailed (Korean only — Hangul jamo encoding issue)
  • 54 TypeScript tests passed
  • Down from 28 xfails to 2

Test plan

  • uv run pytest tests/ — all pass
  • pnpm test — all pass
  • Spot-checked supplement quality for Italian, French, Spanish, German
  • Verified no offensive words in German daily rotation
  • All 17 new keyboards pass character coverage tests
  • Self-review: fixed DRY violations, dead code, docstring placement, bug in foreign word filter

Use FrequencyWords (OpenSubtitles frequency data) to generate (sketched below):
- daily_words.txt: top 2,000 most common words from existing word lists
- supplement.txt: thousands of additional valid 5-letter guesses
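A rough sketch of that generation step, assuming FrequencyWords' plain "word count" line format; the file names and 2,000 cutoff follow this commit message, everything else is illustrative:

```python
from pathlib import Path


def build_lists(freq_file: Path, main_words: set[str], daily_count: int = 2000):
    seen: set[str] = set()
    ranked: list[str] = []
    for line in freq_file.read_text(encoding="utf-8").splitlines():
        word, _, _count = line.partition(" ")
        if len(word) == 5 and word not in seen:
            seen.add(word)
            ranked.append(word)
    # daily_words.txt: most frequent words already present in the main list
    daily = [w for w in ranked if w in main_words][:daily_count]
    # supplement.txt: frequent 5-letter words not yet accepted as guesses
    supplement = sorted(w for w in ranked if w not in main_words)
    return daily, supplement
```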

This addresses the core UX problem: 53% of all guesses across all languages
are rejected as invalid. Languages like Italian (120% invalid rate), French
(75%), and Spanish (67%) should see dramatic improvement.

Languages processed: ar, br, ca, cs, da, de, el, eo, es, et, eu, fa, fr,
gl, he, hr, hu, hy, hyw, is, it, ka, lt, lv, mk, nb, nn, nl, pt, ro, ru,
sk, sl, sr, sv, tr, uk, vi

Excluded (already high quality): en, fi, pl, bg, ko
Not covered (no frequency data): az, ckb, fo, fur, fy, ga, gd, ia, ie,
lb, ltg, mi, mn, nds, ne, oc, pau, qya, rw, tk, tlh

coderabbitai Bot commented Feb 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new CLI script that ingests FrequencyWords and wordfreq data to generate per-language daily and supplement 5‑letter lists; introduces generated language data and SOURCES, test helpers and new daily-word tests, keyboard/blocklist updates, a common-names filter, README edits, and a .gitignore entry for FrequencyWords cache.

Changes

  • Ignore Rules (.gitignore): Adds a header for FrequencyWords data and ignores scripts/.freq_data/.
  • Word-list tooling & data (scripts/improve_word_lists.py, scripts/common_names.txt): New CLI script to download/process FrequencyWords + wordfreq, validate/filter 5-letter words, apply blocklists/common names, and produce {lang}_daily_words.txt and {lang}_5words_supplement.txt; adds a common-names filter file.
  • Tests (tests/conftest.py, tests/test_word_lists.py, tests/test_daily_word.py, tests/test_language_config.py): Adds test helpers load_daily_words() and load_blocklist(); new TestDailyWords and TestDailyWordQuality tests (subset, uniqueness, 5-letter, blocklist, Roman numeral checks); narrows keyboard xfails and removes some xfails.
  • Generated language data & sources (webapp/data/languages/{ar,br,ca,cs,da,de,el,eo}/SOURCES.md, .../{lang}_daily_words.txt, .../{lang}_5words_supplement.txt): Adds SOURCES.md, ~2,000-line {lang}_daily_words.txt files, and large {lang}_5words_supplement.txt files for multiple languages (Arabic, Breton, Catalan, Czech, Danish, German, Greek, Esperanto, etc.).
  • Keyboard & blocklist updates (webapp/data/languages/br/br_keyboard.json, webapp/data/languages/az/az_keyboard.json, webapp/data/languages/de/de_blocklist.txt): Adds Breton and Azerbaijani keyboard layouts; augments the German blocklist with offensive words and political acronyms.
  • Docs & README (docs/DICTIONARY_IMPROVEMENT_PLAN.md, README.md): Deletes or replaces DICTIONARY_IMPROVEMENT_PLAN.md; updates the README with structured "Adding a new language" guidance and general cleanup.
  • App minor cleanup (webapp/app.py): Removes commented-out fallback checks (comment-only removal; no behavior change).
  • Project config (pyproject.toml): Adds wordfreq>=3.1.1 to dev dependencies.
  • Other small data edits (webapp/data/languages/az/az_5words.txt, webapp/data/languages/az/az_5words_supplement.txt, webapp/data/languages/br/br_keyboard.json): EOF/newline normalization and case normalization for some AZ supplement entries; replaces the Breton keyboard's empty array with a layout.
  • README badges & credits (README.md): Removes badges, reorganizes data-sourcing and contributor notes, adds a TODO section and explicit language-add steps.

Sequence Diagram

sequenceDiagram
    participant User as User / CLI
    participant Script as improve_word_lists.py
    participant FS as File System
    participant Repo as FrequencyWords (GitHub)
    participant Wordfreq as wordfreq corpus

    User->>Script: run download / process / batch
    Script->>FS: check cache `scripts/.freq_data/`
    alt cache miss
        Script->>Repo: fetch selected language frequency files
        Repo-->>FS: write frequency files
    end
    Script->>FS: load language chars, main word list, blocklist, common names, existing supplement
    Script->>Wordfreq: query/extract candidate words
    Script->>Script: filter valid 5‑letter words, remove blocklist/names/roman numerals, score by frequency
    Script->>FS: write `{lang}_daily_words.txt`, `{lang}_5words_supplement.txt`, create/patch `SOURCES.md`
    Script-->>User: print summary (counts, matches, stats)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰
I hopped through lists both wide and bright,
Gathered ranks by lantern-light,
Filtered names and nudged out Roman lore,
Seeded five-letter doors galore.
A tiny hop — new words take flight ✨

🚥 Pre-merge checks | ✅ 3 passed

  • Title check: ✅ Passed. The PR title clearly and concisely describes the main objective: adding frequency-ranked daily words and supplements for 38 languages. It directly aligns with the primary focus of the changeset.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 87.88%, which meets the required threshold of 80.00%.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

🧹 Nitpick comments (3)
scripts/improve_word_lists.py (2)

262-293: Template hardcodes "wooorm/dictionaries (Hunspell)" as the base source for all languages.

Not all languages source from wooorm/dictionaries. Consider adding a per-language source map, or at minimum noting in the template that the source should be verified. Also, write_text on line 292 should use encoding="utf-8".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/improve_word_lists.py` around lines 262 - 293, The SOURCES.md
template currently hardcodes "wooorm/dictionaries (Hunspell)" for every language
and writes with write_text(sources_md) without an explicit encoding; update the
generation to (1) look up a per-language source via a new mapping (e.g.,
SOURCES_MAP) using LANG_NAMES.get(lang, lang) as fallback and include that
source string in sources_md (or if missing, include a clear "verify source for
this language" note), and (2) call sources_path.write_text(sources_md,
encoding="utf-8") to ensure proper encoding when writing the file; change
references in the block that build sources_md and the write_text call
accordingly (use sources_path, LANG_NAMES.get(lang, lang), and sources_md to
locate where to modify).

311-327: Ruff S607: subprocess.run with partial executable path "git".

This is flagged by static analysis but is standard practice for dev/build scripts. If you want to suppress the warning, you could add a # noqa: S607 comment, or resolve git via shutil.which("git") with a clear error if not found.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/improve_word_lists.py` around lines 311 - 327, The subprocess.run
calls use the bare "git" executable which triggers S607; modify the script to
resolve git via shutil.which("git") (and raise a clear error if None) and use
that absolute path in the two subprocess.run invocations that clone and run
sparse-checkout (the calls referencing target, repo_dir and the "git" args), or
alternatively append a "# noqa: S607" on those subprocess.run lines if you
prefer to suppress the warning; ensure both occurrences are updated
consistently.
tests/test_word_lists.py (1)

283-296: test_supplement_disjoint_from_main doesn't test daily words and belongs in TestWordListBasics

This test checks supplement ∩ main = ∅ and never references daily words. Placing it in TestDailyWords is misleading and makes the class semantics unclear. It should either live in TestWordListBasics alongside the other word-list invariant tests, or be renamed/reworked to test daily ∩ supplement = ∅ (which follows transitively but is worth an explicit assertion).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_word_lists.py` around lines 283 - 296, The test
test_supplement_disjoint_from_main is in the wrong test class (TestDailyWords)
because it asserts supplement ∩ main = ∅ and never touches daily words; move
this test method into TestWordListBasics (where other word-list invariants live)
or alternatively change its assertion to explicitly test daily ∩ supplement = ∅
using load_daily (so it actually exercises daily semantics). Update the class
container for test_supplement_disjoint_from_main (or rename/rewrite the test to
reference load_daily) so the test name and class accurately reflect what is
being validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/improve_word_lists.py`:
- Around line 62-83: The three loader functions load_characters, load_word_list,
and load_existing_supplement use path.read_text() without specifying an encoding
which can break on Windows locales; update each to read text with UTF-8 (e.g.,
path.read_text(encoding="utf-8") or open the file with encoding="utf-8") so they
match load_frequency_data and reliably handle diacritics and non-ASCII
characters.
- Around line 247-260: The overwrite guard only checks daily_path so
supplement_path can be overwritten silently; update the guard in the block that
checks overwrite to also check supplement_path.exists() and skip/return (set
result["status"]="skipped" and reason) if either path exists and overwrite is
False; when writing files, call daily_path.write_text(..., encoding="utf-8") and
supplement_path.write_text(..., encoding="utf-8") to match the read path; refer
to the variables/functions daily_path, supplement_path, overwrite, write_text,
daily_words and supplement_sorted to locate the changes.
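A minimal sketch of that guard, using the variable names quoted in the comment above (the function wrapper itself is hypothetical):

```python
from pathlib import Path


def write_outputs(
    daily_path: Path,
    supplement_path: Path,
    daily_words: list[str],
    supplement_sorted: list[str],
    overwrite: bool,
) -> dict:
    # Guard both output paths, not just daily_path, before writing anything.
    if not overwrite and (daily_path.exists() or supplement_path.exists()):
        return {"status": "skipped", "reason": "output exists (use --overwrite)"}
    daily_path.write_text("\n".join(daily_words) + "\n", encoding="utf-8")
    supplement_path.write_text("\n".join(supplement_sorted) + "\n", encoding="utf-8")
    return {"status": "ok"}
```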

In `@tests/conftest.py`:
- Around line 53-59: The test helper load_daily_words currently reads lines
verbatim which diverges from production: update the function to ignore lines
that start with '#' (comment lines) and normalize each word to lowercase using
.lower() before returning; locate the function load_daily_words in
tests/conftest.py and change the list comprehension/filtering so it skips blank
lines and lines where stripped value startswith '#' and returns each word as
stripped_value.lower().

In `@tests/test_word_lists.py`:
- Around line 235-296: Run Black to reformat this test file so CI passes: apply
the project's formatting (e.g., black --line-length 100 tests/test_word_lists.py
or black webapp/ tests/) and commit the resulting changes; this will adjust
formatting in class TestDailyWords and its methods
(test_daily_words_subset_of_main, test_daily_words_no_duplicates,
test_daily_words_are_5_letters, test_supplement_disjoint_from_main) and preserve
the SUPPLEMENT_OVERLAP_XFAIL constant while ensuring the file matches the
repository's Black style.
- Line 239: Annotate the class-level mutable set SUPPLEMENT_OVERLAP_XFAIL with
ClassVar to satisfy Ruff RUF012: add an import for ClassVar (and Set if using
typing.Set) and change the declaration to something like
SUPPLEMENT_OVERLAP_XFAIL: ClassVar[Set[str]] = {"pl", "ckb"} (or
ClassVar[set[str]] if using Py3.9+ built-ins) so the attribute is explicitly a
class variable and the linter stops flagging it.

In `@webapp/data/languages/ar/SOURCES.md`:
- Around line 5-7: The SOURCES.md template generation in improve_word_lists.py
currently inserts a wooorm/dictionaries (Hunspell) attribution for all
languages; update the template generation code that produces SOURCES.md to
either (a) pull a per-language source from the language metadata (e.g., a
language-specific dict or JSON entry) when available, or (b) explicitly mark the
attribution as a default placeholder (e.g., "Default placeholder: verify
source") when no per-language source exists; locate the template/rendering block
in improve_word_lists.py responsible for creating SOURCES.md and modify it to
prefer per-language source data and fall back to the placeholder text.

In `@webapp/data/languages/br/br_daily_words.txt`:
- Around line 240-259: The br_daily_words.txt contains many pure Roman numerals
(letters m,d,c,l,x,v,i) that should be removed; update
scripts/improve_word_lists.py to add a post-processing step that filters out any
candidate word where every character is in the set {m,d,c,l,x,v,i} and that
matches a Roman-numeral regex (e.g.
^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$ case-insensitive),
removing those entries before writing/regenerating br_daily_words.txt and ensure
the step runs as the final cleanup when building the daily-answer pool.

In `@webapp/data/languages/ca/ca_daily_words.txt`:
- Line 1959: The word list contains Roman numerals like "xviii" that should be
excluded from the daily-words pool; update the corpus filtering logic (e.g., in
functions that load or select daily words such as loadFrequencyCorpus,
generateDailyWords, or selectDailyWord) to detect and remove Roman-numeral
patterns before selection — implement a regex-based filter for valid Catalan
words (reject strings matching Roman numeral regex like ^[ivxlcdmIVXLCDM]+$ and
variants with punctuation) and apply it where the ca_daily_words.txt entries are
normalized so numerals are never offered as daily answers.
- Around line 430-431: The list contains both variants "danès" and the
nonstandard "danés", which causes mismatches; remove the acute-accent entry
"danés" from the daily-answer candidates (or replace/alias it so only the
standard Central Catalan form "danès" is accepted) to ensure only the normative
form remains in the candidate pool and prevent rejection of correct guesses.

In `@webapp/data/languages/da/da_daily_words.txt`:
- Line 351: The daily words file contains offensive entries that bypass
blocklist filtering because _get_daily_word calls
get_daily_word_consistent_hash(self.daily_words, set(), ...) with an empty
blocklist; to fix, either (A) remove the offensive tokens from the
da/daily_words.txt source (remove entries like "dildo", "fisse", "kusse",
"pisse", "bitch"), or (B) modify _get_daily_word to apply the app's blocklist by
passing the actual blocklist set (instead of set()) into
get_daily_word_consistent_hash or by filtering self.daily_words with the
blocklist before selection; update load_daily_words/_get_daily_word to ensure
daily_words are filtered against the same blocklist used for the main word list.

In `@webapp/data/languages/de/de_daily_words.txt`:
- Line 100: The daily-word pipeline allows de_daily_words.txt entries to bypass
blocklist filtering because app.py constructs/passes an empty blocklist (set())
into the daily-word selector; remove offensive entries from de_daily_words.txt
or, preferably, enforce blocklist filtering by applying de_blocklist.txt when
loading/selecting daily words—update the function that reads/selects daily words
(the daily-word loader/selector called in app.py where blocklist is passed) to
merge de_blocklist.txt into the blocklist instead of using set(), and filter out
any word present in de_blocklist.txt (e.g., apply a simple set difference or
contains check before accepting a daily word).

In `@webapp/data/languages/el/el_5words_supplement.txt`:
- Around line 312-315: Lines contain 15 Greek words using the non-final sigma
character 'σ' (U+03C3) instead of the final sigma 'ς' (U+03C2); replace all
occurrences for the listed words (e.g., εκτος/εκτοσ -> εκτός, καλως/καλωσ ->
καλώς, μερες/μερεσ -> μέρες, μερος/μεροσ -> μέρος, μηνες/μηνεσ -> μήνες,
τελος/τελοσ -> τέλος, χωρις/χωρισ -> χωρίς and the other 8 single violations)
so they use final sigma 'ς', and delete the 7 duplicate entries that are the
same word differing only by σ vs ς (the pairs starting with εκτος/εκτοσ,
καλως/καλωσ, μερες/μερεσ, μερος/μεροσ, μηνες/μηνεσ, τελος/τελοσ, χωρις/χωρισ);
ensure all 15 affected lines are normalized to the correct final-sigma spelling
(U+03C2).

---

Nitpick comments:
In `@scripts/improve_word_lists.py`:
- Around line 262-293: The SOURCES.md template currently hardcodes
"wooorm/dictionaries (Hunspell)" for every language and writes with
write_text(sources_md) without an explicit encoding; update the generation to
(1) look up a per-language source via a new mapping (e.g., SOURCES_MAP) using
LANG_NAMES.get(lang, lang) as fallback and include that source string in
sources_md (or if missing, include a clear "verify source for this language"
note), and (2) call sources_path.write_text(sources_md, encoding="utf-8") to
ensure proper encoding when writing the file; change references in the block
that build sources_md and the write_text call accordingly (use sources_path,
LANG_NAMES.get(lang, lang), and sources_md to locate where to modify).
- Around line 311-327: The subprocess.run calls use the bare "git" executable
which triggers S607; modify the script to resolve git via shutil.which("git")
(and raise a clear error if None) and use that absolute path in the two
subprocess.run invocations that clone and run sparse-checkout (the calls
referencing target, repo_dir and the "git" args), or alternatively append a "#
noqa: S607" on those subprocess.run lines if you prefer to suppress the warning;
ensure both occurrences are updated consistently.

In `@tests/test_word_lists.py`:
- Around line 283-296: The test test_supplement_disjoint_from_main is in the
wrong test class (TestDailyWords) because it asserts supplement ∩ main = ∅ and
never touches daily words; move this test method into TestWordListBasics (where
other word-list invariants live) or alternatively change its assertion to
explicitly test daily ∩ supplement = ∅ using load_daily (so it actually
exercises daily semantics). Update the class container for
test_supplement_disjoint_from_main (or rename/rewrite the test to reference
load_daily) so the test name and class accurately reflect what is being
validated.

Comment thread scripts/improve_word_lists.py Outdated
Comment thread scripts/improve_word_lists.py
Comment thread tests/conftest.py Outdated
Comment thread tests/test_word_lists.py
Comment thread tests/test_word_lists.py Outdated
Comment thread webapp/data/languages/ca/ca_daily_words.txt Outdated
Comment thread webapp/data/languages/ca/ca_daily_words.txt Outdated
Comment thread webapp/data/languages/da/da_daily_words.txt
aroma
arosa
array
arsch

@coderabbitai coderabbitai Bot Feb 21, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, verify the file exists and check the specific lines mentioned
if [ -f webapp/data/languages/de/de_daily_words.txt ]; then
  echo "=== File exists. Checking specified lines ===";
  echo "Line 100 (arsch):";
  sed -n '100p' webapp/data/languages/de/de_daily_words.txt;
  echo "Line 535 (fotze):";
  sed -n '535p' webapp/data/languages/de/de_daily_words.txt;
  echo "Line 1229 (nsdap):";
  sed -n '1229p' webapp/data/languages/de/de_daily_words.txt;
  echo "Line 1231 (nutte):";
  sed -n '1231p' webapp/data/languages/de/de_daily_words.txt;
else
  echo "File not found";
fi

Repository: Hugo0/wordle

Length of output: 199


🏁 Script executed:

# Check app.py for load_daily_words function and blocklist filtering
echo "=== Searching for load_daily_words and filtering logic in app.py ===" && \
rg -n "load_daily_words|blocklist|block_list|filter" webapp/app.py -A 5 -B 2

Repository: Hugo0/wordle

Length of output: 6567


🏁 Script executed:

# Search for any blocklist files
echo "=== Looking for blocklist files ===" && \
find webapp -type f \( -name "*blocklist*" -o -name "*block_list*" \) 2>/dev/null

Repository: Hugo0/wordle

Length of output: 1057


🏁 Script executed:

# Check the load_daily_words function implementation
echo "=== Searching for load_daily_words function definition ===" && \
ast-grep --pattern 'function load_daily_words($_) {
  $$$
}'

Repository: Hugo0/wordle

Length of output: 114


🏁 Script executed:

# Check if the offensive words are in the German blocklist
echo "=== Checking de_blocklist.txt for offensive words ===" && \
for word in "arsch" "fotze" "nsdap" "nutte"; do
  echo "Checking '$word':";
  grep -c "^${word}$" webapp/data/languages/de/de_blocklist.txt && echo "  FOUND in blocklist" || echo "  NOT in blocklist";
done

Repository: Hugo0/wordle

Length of output: 265


🏁 Script executed:

# Count words in both files to understand the scope
echo "=== File sizes ===" && \
wc -l webapp/data/languages/de/de_daily_words.txt && \
wc -l webapp/data/languages/de/de_blocklist.txt

Repository: Hugo0/wordle

Length of output: 168


🏁 Script executed:

# Check lines around the get_daily_word_consistent_hash call with daily_words
echo "=== Detailed view of daily_words priority handling ===" && \
sed -n '615,630p' webapp/app.py

Repository: Hugo0/wordle

Length of output: 711


Offensive/vulgar words in the daily answer pool with no blocklist protection

This file contains offensive German words—arsch (line 100), fotze (line 535), nutte (line 1231)—and sensitive content like nsdap (line 1229, Nazi party abbreviation). These words are not listed in de_blocklist.txt and will become valid daily answers without remediation.

The issue is architectural: when de_daily_words.txt exists, it bypasses all blocklist filtering. The code at lines 619–622 of app.py passes an empty blocklist (set()) to the daily word selector, deliberately excluding the protection applied to fallback word lists (line 624).

Either remove offensive entries from this file or apply downstream blocklist filtering to daily words before deployment.

Also applies to: 535-535, 1231-1231

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/languages/de/de_daily_words.txt` at line 100, The daily-word
pipeline allows de_daily_words.txt entries to bypass blocklist filtering because
app.py constructs/passes an empty blocklist (set()) into the daily-word
selector; remove offensive entries from de_daily_words.txt or, preferably,
enforce blocklist filtering by applying de_blocklist.txt when loading/selecting
daily words—update the function that reads/selects daily words (the daily-word
loader/selector called in app.py where blocklist is passed) to merge
de_blocklist.txt into the blocklist instead of using set(), and filter out any
word present in de_blocklist.txt (e.g., apply a simple set difference or
contains check before accepting a daily word).
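The remediation boils down to filtering the pool before selection; a minimal sketch, with names assumed from the review text:

```python
def filtered_daily_pool(daily_words: list[str], blocklist: set[str]) -> list[str]:
    # Apply the same blocklist used for the main word list before any daily
    # word is chosen, instead of passing set() to the selector.
    return [w for w in daily_words if w not in blocklist]
```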

@Hugo0 (Owner, Author) replied:

Addressed — offensive words (huren, neger, negro) have been added to the German blocklist and removed from daily word rotation.


Rate Limit Exceeded

@Hugo0 has exceeded the limit for the number of chat messages per hour. Please wait 1 minute and 55 seconds before sending another message.

Comment thread webapp/data/languages/el/el_5words_supplement.txt
…DME cleanup

- Fix Black formatting on test_word_lists.py (CI blocker)
- Add explicit encoding="utf-8" to all read_text()/write_text() calls
- Overwrite guard now checks both daily_words and supplement paths
- load_daily_words test helper now matches production (comment skip, .lower())
- Clean up README: add TODO section, update data sources, streamline intro

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
tests/test_word_lists.py (1)

280-293: test_supplement_disjoint_from_main is semantically misplaced in TestDailyWords.

This test validates the supplement-vs-main relationship, not anything about daily words. Placing it (along with SUPPLEMENT_OVERLAP_XFAIL) in TestDailyWords will confuse contributors looking for supplement tests in TestWordListBasics. Consider moving it there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_word_lists.py` around lines 280 - 293, The test
test_supplement_disjoint_from_main and its related constant
SUPPLEMENT_OVERLAP_XFAIL are misplaced in TestDailyWords; move both into the
TestWordListBasics test class (or the module scope where other supplement/main
list tests live) so the test lives with other word-list validation tests; update
imports/references if necessary (e.g., ensure load_supplement_words and
load_word_list are available) and run the test suite to confirm no name
collisions or xfail-scoping issues after relocating them.
scripts/improve_word_lists.py (3)

280-284: Redundant is_valid_word call in supplement comprehension.

valid_freq was already filtered by is_valid_word(w, char_set) at line 229, so every key is already a valid word. The second check in the set comprehension is dead code.

♻️ Proposed fix
-    new_supplement = {
-        w for w in valid_freq if w not in existing_word_set and is_valid_word(w, char_set)
-    }
+    new_supplement = {w for w in valid_freq if w not in existing_word_set}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/improve_word_lists.py` around lines 280 - 284, The set comprehension
building new_supplement redundantly re-checks is_valid_word for each w even
though valid_freq already contains only words that passed
is_valid_word(char_set) earlier; update the new_supplement expression to simply
include w for w in valid_freq if w not in existing_word_set (remove the
is_valid_word(w, char_set) predicate) so the dead code is eliminated and
behavior remains the same.

262-264: Use raise instead of assert for the daily-words safety invariant.

assert is a no-op when Python runs with -O (python -O scripts/improve_word_lists.py), silently allowing corrupt output to be written. Swap to an explicit raise so the guard is always active.

🛡️ Proposed fix
-    assert not invalid_daily, f"BUG: daily words not in _5words.txt: {invalid_daily[:5]}"
+    if invalid_daily:
+        raise ValueError(f"BUG: daily words not in _5words.txt: {invalid_daily[:5]}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/improve_word_lists.py` around lines 262 - 264, Replace the
assert-based guard that checks daily_words against existing_word_set with an
explicit exception raise so it cannot be disabled by Python optimization:
compute invalid_daily (as already done), and if invalid_daily is truthy, raise a
RuntimeError (or ValueError) with the same informative message (including
invalid_daily[:5]) instead of using assert; update the check around
invalid_daily, daily_words, and existing_word_set accordingly so the script
always fails loudly on this invariant violation.

381-399: Optional: resolve S607 partial-executable-path lint warning with shutil.which.

Both subprocess.run calls use the bare "git" name. For maximum reliability, use a fully qualified path for the executable; to search for an unqualified name on PATH, use shutil.which(). For a developer-only script the real-world risk is low, but this silences the Ruff S607 flag.

🔧 Proposed fix
+import shutil

+GIT = shutil.which("git") or "git"
+
 def download_frequency_words():
     ...
     subprocess.run(
         [
-            "git",
+            GIT,
             "clone",
             ...
         ],
         ...
     )
     subprocess.run(
-        ["git", "sparse-checkout", "set", "content/2018"],
+        [GIT, "sparse-checkout", "set", "content/2018"],
         ...
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/improve_word_lists.py` around lines 381 - 399, The subprocess.run
calls that currently use the bare "git" string (the clone call and the
sparse-checkout call) should use a resolved executable path via
shutil.which("git") instead: import shutil, call shutil.which("git") once (e.g.,
git_path), assert or raise a clear error if it returns None, and pass git_path
as the first element of the args lists for the two subprocess.run invocations
(keeping the existing arguments and check=True); update both occurrences that
reference "git" (the clone invocation and the ["git", "sparse-checkout", "set",
"content/2018"] invocation) to use the resolved git_path variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Line 16: The "Adding a new language" heading is currently an h3 (###) while
the previous heading is h1 ("# Wordle Global"), violating markdownlint MD001;
change the heading "### Adding a new language" to an h2 ("## Adding a new
language") so headings increment by one level (or alternatively insert an
intervening h2 if that fits the structure) to satisfy the rule.

In `@scripts/improve_word_lists.py`:
- Around line 536-542: The current handler for args.command == "process" only
treats result["status"] == "error" as a failure, so skipped/unknown languages
with result["status"] == "skipped" still exit 0; update the check after calling
process_language (the call and result variable) to treat any non-"ok" status as
a failure (or explicitly check for "skipped" and "error") and print the reason
to stderr and sys.exit(1); reference the process_language call and the
result["status"] field when making this change so automation receives a non-zero
exit code for skipped/unknown languages.
- Around line 192-194: The exclusion reason is hardcoded to "excluded (already
high quality)" which is incorrect for some languages (notably "ko" which is
excluded because FrequencyWords uses syllable blocks); change EXCLUDE from a set
to a mapping (e.g., EXCLUDE = {"en": "already high quality", "ko": "uses
syllable blocks", ...}) and update the exclusion branch that sets
result["status"] and result["reason"] to look up EXCLUDE[lang] (falling back to
a generic message if missing); also remove "ko" from the priority list (or
ensure the priority list only contains languages not present in EXCLUDE) so it
does not always print as skipped.
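EXCLUDE as a reason map might look like this; the "ko" reason is quoted from the review, and the "already high quality" entries follow the PR description:

```python
EXCLUDE: dict[str, str] = {
    "en": "already high quality",
    "fi": "already high quality",
    "pl": "already high quality",
    "bg": "already high quality",
    "ko": "FrequencyWords uses syllable blocks",
}


def exclusion_reason(lang: str) -> str:
    # Fall back to a generic message for languages without a recorded reason.
    return EXCLUDE.get(lang, "excluded (no reason recorded)")
```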

---

Duplicate comments:
In `@tests/test_word_lists.py`:
- Line 239: The class-level mutable set SUPPLEMENT_OVERLAP_XFAIL should be
annotated with ClassVar to indicate it's not an instance attribute; update its
declaration to include a ClassVar type (e.g., SUPPLEMENT_OVERLAP_XFAIL:
ClassVar[set[str]] = {"pl", "ckb"}) and ensure ClassVar (and typing Set/str
generics if your codebase prefers) is imported from typing at the top of
tests/test_word_lists.py so Ruff RUF012 is silenced.

---

Nitpick comments:
In `@scripts/improve_word_lists.py`:
- Around line 280-284: The set comprehension building new_supplement redundantly
re-checks is_valid_word for each w even though valid_freq already contains only
words that passed is_valid_word(char_set) earlier; update the new_supplement
expression to simply include w for w in valid_freq if w not in existing_word_set
(remove the is_valid_word(w, char_set) predicate) so the dead code is eliminated
and behavior remains the same.
- Around line 262-264: Replace the assert-based guard that checks daily_words
against existing_word_set with an explicit exception raise so it cannot be
disabled by Python optimization: compute invalid_daily (as already done), and if
invalid_daily is truthy, raise a RuntimeError (or ValueError) with the same
informative message (including invalid_daily[:5]) instead of using assert;
update the check around invalid_daily, daily_words, and existing_word_set
accordingly so the script always fails loudly on this invariant violation.
- Around line 381-399: The subprocess.run calls that currently use the bare
"git" string (the clone call and the sparse-checkout call) should use a resolved
executable path via shutil.which("git") instead: import shutil, call
shutil.which("git") once (e.g., git_path), assert or raise a clear error if it
returns None, and pass git_path as the first element of the args lists for the
two subprocess.run invocations (keeping the existing arguments and check=True);
update both occurrences that reference "git" (the clone invocation and the
["git", "sparse-checkout", "set", "content/2018"] invocation) to use the
resolved git_path variable.

In `@tests/test_word_lists.py`:
- Around line 280-293: The test test_supplement_disjoint_from_main and its
related constant SUPPLEMENT_OVERLAP_XFAIL are misplaced in TestDailyWords; move
both into the TestWordListBasics test class (or the module scope where other
supplement/main list tests live) so the test lives with other word-list
validation tests; update imports/references if necessary (e.g., ensure
load_supplement_words and load_word_list are available) and run the test suite
to confirm no name collisions or xfail-scoping issues after relocating them.

Comment thread README.md Outdated
Comment thread scripts/improve_word_lists.py Outdated
Comment thread scripts/improve_word_lists.py
Pipeline improvements:
- Filter blocklisted words from daily word rotation (Hebrew: 88% overlap → 0%)
- Filter Roman numerals from daily words and supplements (Breton: 107 → 0)
- Filter common proper nouns (387 international names) from daily words
- Normalize Greek final sigma (σ→ς) in supplements with de-dup

Data fixes:
- German blocklist: add arsch, fotze, nsdap
- Greek supplement: fix 15 words with wrong final sigma

Keyboard layouts for 6 languages:
- Macedonian (mk): Cyrillic layout based on Serbian
- Armenian (hy, hyw): typewriter + phonetic layouts
- Esperanto (eo): QWERTY with accented letters
- Basque (eu): QWERTY with ñ, ç
- Breton (br): AZERTY with ê, ñ, ù

Tests: add TestDailyWordQuality (blocklist overlap + Roman numeral checks)
Cleanup: remove dead code in app.py
Regenerate daily words and supplements for all 38 languages
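A plausible shape for the TestDailyWordQuality blocklist check mentioned above; the helper import path and the parametrized language subset are assumptions:

```python
import pytest

from conftest import load_blocklist, load_daily_words  # helpers added in this PR


@pytest.mark.parametrize("lang", ["de", "he", "br"])  # illustrative subset
def test_daily_words_avoid_blocklist(lang):
    daily = set(load_daily_words(lang))
    blocked = load_blocklist(lang)
    assert not daily & blocked, f"{lang}: blocklisted words in daily pool"
```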

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
tests/conftest.py (2)

62-64: line.strip() evaluated up to three times per iteration in both helpers

Each list/set comprehension calls line.strip() once for the empty-check, once for the startswith guard, and once for the result value. A walrus operator resolves this at no cost to readability:

♻️ Proposed refactor
-        return [
-            line.strip().lower() for line in f if line.strip() and not line.strip().startswith("#")
-        ]
+        return [
+            stripped.lower()
+            for line in f
+            if (stripped := line.strip()) and not stripped.startswith("#")
+        ]
-        return {
-            line.strip().lower() for line in f if line.strip() and not line.strip().startswith("#")
-        }
+        return {
+            stripped.lower()
+            for line in f
+            if (stripped := line.strip()) and not stripped.startswith("#")
+        }

Also applies to: 73-75

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/conftest.py` around lines 62 - 64, The list/set comprehensions in the
helpers repeatedly call line.strip() multiple times per iteration; change each
comprehension to compute the stripped line once using the walrus operator (e.g.,
assign stripped = line.strip() in the comprehension's if/for expressions) and
then use that single variable for the empty-check, startswith guard, and the
lowercased/result value so you avoid redundant strip calls in the helper that
returns the list (lines around the shown comprehension) and the analogous helper
at lines 73-75.

35-41: load_word_list does not lowercase, creating a silent asymmetry with load_daily_words

New TestDailyWords tests likely assert that every daily word exists in the main word list (e.g., word in set(load_word_list(lang))). load_daily_words returns lowercased words; load_word_list does not normalize case. If a future _5words.txt file contains any uppercase entry, the subset check would produce a false failure. Applying the same .lower() normalization to load_word_list (and load_supplement_words) keeps the helpers consistent and test results trustworthy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/conftest.py` around lines 35 - 41, load_word_list currently returns
words without lowercasing, causing inconsistency with load_daily_words which
lowercases entries; update load_word_list to normalize each returned word with
.lower() (and apply the same normalization to load_supplement_words) so all
helper loaders consistently return lowercase strings, ensuring subset checks
like in TestDailyWords behave deterministically; locate and modify the functions
load_word_list and load_supplement_words to map .lower() over stripped lines
before returning.
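Putting both conftest suggestions together, the helpers might end up like this sketch; the data-directory layout and file naming are assumptions based on the review text:

```python
from pathlib import Path

DATA_DIR = Path("webapp/data/languages")  # assumed repo layout


def _load_lines(path: Path) -> list[str]:
    # Strip once via the walrus operator, skip comments, normalize case.
    with path.open(encoding="utf-8") as f:
        return [
            stripped.lower()
            for line in f
            if (stripped := line.strip()) and not stripped.startswith("#")
        ]


def load_word_list(lang: str) -> list[str]:
    return _load_lines(DATA_DIR / lang / f"{lang}_5words.txt")


def load_daily_words(lang: str) -> list[str]:
    return _load_lines(DATA_DIR / lang / f"{lang}_daily_words.txt")
```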
scripts/improve_word_lists.py (2)

355-359: Redundant is_valid_word check.

valid_freq is already built from freq_data filtered by is_valid_word (line 283), so the second is_valid_word(w, char_set) guard on line 358 is always true for any w in valid_freq.

♻️ Proposed simplification
     new_supplement = {
-        w for w in valid_freq if w not in existing_word_set and is_valid_word(w, char_set)
+        w for w in valid_freq if w not in existing_word_set
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/improve_word_lists.py` around lines 355 - 359, The set comprehension
building new_supplement redundantly re-checks is_valid_word for items already
filtered into valid_freq; update the comprehension to only include words from
valid_freq that are not in existing_word_set (remove the is_valid_word(w,
char_set) condition) so new_supplement is computed as {w for w in valid_freq if
w not in existing_word_set}; keep the variable names new_supplement, valid_freq,
existing_word_set unchanged.

466-484: Consider adding a timeout to subprocess.run calls.

These git operations involve network I/O and could hang indefinitely on flaky connections. A timeout (e.g., timeout=300) would make batch/CI usage more predictable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/improve_word_lists.py` around lines 466 - 484, The two networked git
calls using subprocess.run in scripts/improve_word_lists.py (the clone call and
the sparse-checkout call) should include a timeout to avoid hanging in CI;
update both subprocess.run invocations (the one cloning
"https://github.com/hermitdave/FrequencyWords.git" and the one running "git
sparse-checkout set content/2018") to pass a timeout parameter (e.g.,
timeout=300) and optionally wrap them in a try/except for
subprocess.TimeoutExpired to handle/log timeout failures cleanly.
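The suggested hardening, sketched with the clone arguments shown elsewhere in this review; only the timeout handling is the point here:

```python
import subprocess

try:
    subprocess.run(
        ["git", "clone", "--depth", "1", "--filter=blob:none", "--sparse",
         "https://github.com/hermitdave/FrequencyWords.git"],
        check=True,
        timeout=300,  # fail fast on flaky networks instead of hanging CI
    )
except subprocess.TimeoutExpired as exc:
    raise RuntimeError("FrequencyWords clone timed out after 300s") from exc
```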
tests/test_word_lists.py (1)

340-359: Roman numeral logic is duplicated from scripts/improve_word_lists.py.

_ROMAN_RE and _is_roman_numeral here are nearly identical to _ROMAN_RE / is_roman_numeral in scripts/improve_word_lists.py (lines 173–197). If either copy diverges, false positives or negatives will go unnoticed.

Consider extracting the shared logic into a common module (e.g., a small utils or reuse the script's function directly) so there's a single source of truth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_word_lists.py` around lines 340 - 359, Duplicate Roman numeral
logic exists as _ROMAN_RE and _is_roman_numeral in tests/test_word_lists.py and
as _ROMAN_RE / is_roman_numeral in scripts/improve_word_lists.py; refactor by
extracting the shared logic into a single utility (e.g., utils/roman.py) or by
importing the existing is_roman_numeral from scripts/improve_word_lists.py, then
update tests to call the shared function instead of the local _is_roman_numeral
and remove the duplicated _ROMAN_RE/_is_roman_numeral definitions so there is a
single source of truth.
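One way to de-duplicate, per the suggestion above, is to have the test module import the script's implementation; the path juggling below is an assumption about the repo layout (scripts/ not being a package):

```python
import sys
from pathlib import Path

# Make scripts/ importable from tests/ without turning it into a package.
sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "scripts"))

from improve_word_lists import is_roman_numeral  # single source of truth
```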
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/common_names.txt`:
- Around line 214-217: The list contains a duplicate entry "jones"; remove the
repeated "jones" occurrence so the name appears only once in the names list
(delete the later duplicate), and quickly scan the surrounding entries to ensure
there are no other accidental duplicates.

In `@webapp/data/languages/de/de_daily_words.txt`:
- Line 749: Remove the three offensive entries from the daily-answer pool by
deleting the words "huren", "neger", and "negro" from de_daily_words.txt (these
are the entries at/around the reported locations) so they cannot be served as
daily answers; also add those same words to de_blocklist.txt as a secondary
safeguard (even though app.py currently bypasses blocklist for daily-words,
keeping them in the blocklist prevents other paths from using them).

---

Duplicate comments:
In `@scripts/improve_word_lists.py`:
- Around line 621-627: The current CLI handler treats process_language results
with result["status"] == "skipped" as success; update the command branch that
calls process_language (the args.command == "process" block) to treat any
non-"ok" status as failure: after calling process_language(args.lang,
args.daily_count, args.dry_run, args.overwrite) check result["status"] and if it
is not "ok" (e.g., "skipped", "error", or any unknown value) print the reason to
stderr and call sys.exit(1) (preserving the existing error print for "error"
cases), so skipped/unmapped languages produce a non-zero exit; refer to the
variables args.command, process_language(...), result["status"], and sys.exit(1)
when making the change.
- Line 539: The language "ko" is in EXCLUDE so it's always skipped and should be
removed from the priority list to avoid noisy summaries; edit the priority list
variable that currently contains "ko" (e.g., BATCH_PRIORITY or PRIORITY_LANGS)
and delete the "ko" entry so EXCLUDE remains authoritative and the summary
output no longer lists a redundant skipped language.
- Around line 75-76: The EXCLUDE set uses a single blanket comment but the code
path that logs exclusions (the branch that prints "excluded (already high
quality)") needs per-language reasons; change EXCLUDE from a set to a dict
mapping language code → reason (e.g., {"ko": "FrequencyWords uses syllable
blocks", "en": "already high quality", ...}), update any membership checks
against EXCLUDE to key lookups (use EXCLUDE.get(lang) to obtain the reason), and
replace the fixed log text "excluded (already high quality)" with the
per-language reason when reporting exclusions; update references to EXCLUDE in
the script (e.g., where EXCLUDE is checked and the exclusion message is emitted)
to use the new dict semantics.

In `@tests/test_word_lists.py`:
- Around line 238-242: In class TestDailyWords update the mutable class-level
variable SUPPLEMENT_OVERLAP_XFAIL to include a ClassVar annotation to satisfy
Ruff RUF012 (e.g. add "from typing import ClassVar" and change the declaration
to "SUPPLEMENT_OVERLAP_XFAIL: ClassVar[set[str]] = {\"pl\", \"ckb\"}"); locate
the symbol SUPPLEMENT_OVERLAP_XFAIL in TestDailyWords and add the import and
annotation accordingly (alternatively use ClassVar[frozenset[str]] with a
frozenset literal if you prefer immutability).

In `@webapp/data/languages/ca/ca_daily_words.txt`:
- Around line 426-427: Remove the non-normative duplicated entry "danés" and
keep only the normative Central Catalan form "danès" in the daily pool; locate
the two adjacent entries "danès" and "danés" in the ca_daily_words list and
delete the "danés" line so only "danès" remains.

In `@webapp/data/languages/de/de_blocklist.txt`:
- Around line 27-32: The blocklist entries in de_blocklist.txt (e.g., "arsch",
"fotze", "nsdap") are ineffective because app.py bypasses this file when
de_daily_words.txt exists (it passes set()), and "nutte" remains in
de_daily_words.txt; to fix, remove all offensive words (including "nutte")
directly from de_daily_words.txt so they never appear in the daily-word pool,
and update app.py to pass the parsed blocklist (not an empty set) into the
daily-word selector or explicitly document/ensure the intended precedence
between de_daily_words.txt and de_blocklist.txt (search for usages of
de_daily_words.txt and the code path that passes set() in app.py to locate and
correct the behavior).

In `@webapp/data/languages/de/de_daily_words.txt`:
- Line 1230: Remove the offensive word "nutte" from the daily German word pool
by deleting it from webapp/data/languages/de/de_daily_words.txt (line containing
"nutte") and add the same token to webapp/data/languages/de/de_blocklist.txt to
prevent future re-addition; ensure the blocklist entry matches the exact
casing/spelling used in de_daily_words.txt.

---

Nitpick comments:
In `@scripts/improve_word_lists.py`:
- Around line 355-359: The set comprehension building new_supplement redundantly
re-checks is_valid_word for items already filtered into valid_freq; update the
comprehension to only include words from valid_freq that are not in
existing_word_set (remove the is_valid_word(w, char_set) condition) so
new_supplement is computed as {w for w in valid_freq if w not in
existing_word_set}; keep the variable names new_supplement, valid_freq,
existing_word_set unchanged.
- Around line 466-484: The two networked git calls using subprocess.run in
scripts/improve_word_lists.py (the clone call and the sparse-checkout call)
should include a timeout to avoid hanging in CI; update both subprocess.run
invocations (the one cloning "https://github.com/hermitdave/FrequencyWords.git"
and the one running "git sparse-checkout set content/2018") to pass a timeout
parameter (e.g., timeout=300) and optionally wrap them in a try/except for
subprocess.TimeoutExpired to handle/log timeout failures cleanly.

In `@tests/conftest.py`:
- Around line 62-64: The list/set comprehensions in the helpers repeatedly call
line.strip() multiple times per iteration; change each comprehension to compute
the stripped line once using the walrus operator (e.g., assign stripped =
line.strip() in the comprehension's if/for expressions) and then use that single
variable for the empty-check, startswith guard, and the lowercased/result value
so you avoid redundant strip calls in the helper that returns the list (lines
around the shown comprehension) and the analogous helper at lines 73-75.
- Around line 35-41: load_word_list currently returns words without lowercasing,
causing inconsistency with load_daily_words which lowercases entries; update
load_word_list to normalize each returned word with .lower() (and apply the same
normalization to load_supplement_words) so all helper loaders consistently
return lowercase strings, ensuring subset checks like in TestDailyWords behave
deterministically; locate and modify the functions load_word_list and
load_supplement_words to map .lower() over stripped lines before returning.

In `@tests/test_word_lists.py`:
- Around line 340-359: Duplicate Roman numeral logic exists as _ROMAN_RE and
_is_roman_numeral in tests/test_word_lists.py and as _ROMAN_RE /
is_roman_numeral in scripts/improve_word_lists.py; refactor by extracting the
shared logic into a single utility (e.g., utils/roman.py) or by importing the
existing is_roman_numeral from scripts/improve_word_lists.py, then update tests
to call the shared function instead of the local _is_roman_numeral and remove
the duplicated _ROMAN_RE/_is_roman_numeral definitions so there is a single
source of truth.

Comment thread scripts/common_names.txt Outdated
Comment thread webapp/data/languages/de/de_daily_words.txt Outdated
wordfreq integration:
- Add wordfreq library as second data source (Wikipedia, Reddit, Twitter,
  Google Books) to supplement FrequencyWords (OpenSubtitles)
- Smart foreign-word filter: skip words more common in English than the
  target language (removes Wikipedia noise like tech terms, English words)
- Massive supplement growth: Italian 3.5K→11.8K, French 3.1K→15.4K,
  Spanish 3.6K→14K, German 3.4K→9.9K, Esperanto 0.7K→20.3K

Keyboard layouts for 17 remaining languages:
- QWERTY: ga, gd, fy, ia, ie, la, rw, qya, tlh, az, ltg, tk, fur
- QWERTZ: lb, nds
- AZERTY: oc
- Cyrillic: mn (ЙЦУКЕН-based)

Data quality fixes:
- Portuguese: lowercase 24 proper nouns, remove 10 duplicates, add to blocklist
- Latin: remove 35 abbreviations with periods, clean character set
- xfail updates for la, az, oc, qya (word list quality issues)

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (5)
webapp/data/languages/az/az_keyboard.json (1)

1-13: Overall structure and Azerbaijani letter coverage look correct.

The JSON is valid, the "default" key matches the layout name, navigation keys are in place, and all 32 Azerbaijani Latin letters are represented (pending removal of "w" per the comment above). One optional nit: the "label" value "QWERTY" could be "QÜERTY" to match the official name of the Azerbaijani layout, but this is cosmetic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/languages/az/az_keyboard.json` around lines 1 - 13, Update the
Azerbaijani layout's label to the official spelling by changing the "label"
value inside the "azerbaijani_qwerty" layout from "QWERTY" to "QÜERTY"; locate
the "azerbaijani_qwerty" object and edit its "label" property accordingly
(optionally, also follow up on the earlier note about removing the "w" key from
the first row if that upstream change is applied).
tests/test_language_config.py (1)

156-156: Annotate KEYBOARD_COVERAGE_XFAIL with ClassVar to silence Ruff RUF012

Ruff flags mutable class-level attributes that aren't annotated as ClassVar. The same pattern applies to KEYBOARD_COVERAGE_XFAIL here and in test_word_lists.py.

♻️ Proposed fix
+from typing import ClassVar
 
 class TestKeyboardConfig:
     # Languages with known keyboard coverage gaps (complex scripts, incomplete keyboards)
-    KEYBOARD_COVERAGE_XFAIL = {"vi", "ko", "el", "pt", "pau", "la", "az", "oc", "qya"}
+    KEYBOARD_COVERAGE_XFAIL: ClassVar[set[str]] = {"vi", "ko", "el", "pt", "pau", "la", "az", "oc", "qya"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_language_config.py` at line 156, KEYBOARD_COVERAGE_XFAIL is a
mutable class-level attribute flagged by Ruff RUF012; annotate it as a ClassVar
to indicate it is not an instance attribute. Update the declaration of
KEYBOARD_COVERAGE_XFAIL (and the same variable in test_word_lists.py) to use
typing.ClassVar, e.g. ClassVar[set[str]] or ClassVar[FrozenSet[str]] if you
prefer immutability, and add the necessary import for ClassVar at the top of
each file.
tests/test_word_lists.py (2)

337-382: _is_roman_numeral duplicates is_roman_numeral from scripts/improve_word_lists.py

TestDailyWordQuality._is_roman_numeral (lines 342–359) is a near-verbatim copy of is_roman_numeral in scripts/improve_word_lists.py (lines 225–246). Any future fix to the logic (e.g., correcting subtractive-pair validation) must be applied in both places.

If the test harness cannot import from scripts/, consider extracting the function into a shared utility module (e.g., webapp/utils.py or a dedicated wordle_utils package) that both the test and the script can import from.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_word_lists.py` around lines 337 - 382,
TestDailyWordQuality._is_roman_numeral duplicates the logic in
scripts.improve_word_lists.is_roman_numeral; extract the Roman-numeral checker
into a single shared utility (e.g., create function is_roman_numeral in
webapp.utils or a new wordle_utils module) and replace the copy in
tests/test_word_lists.py by importing that shared is_roman_numeral, and update
scripts/improve_word_lists.py to import the same shared function instead of
defining its own; ensure the test class references the imported function (or
calls the shared is_roman_numeral) and remove the _is_roman_numeral method to
eliminate duplication.

145-145: Annotate KEYBOARD_COVERAGE_XFAIL with ClassVar to silence Ruff RUF012

Identical pattern to test_language_config.py line 156.

♻️ Proposed fix
+from typing import ClassVar
 
 class TestKeyboardCoverage:
-    KEYBOARD_COVERAGE_XFAIL = {"vi", "ko", "el", "pt", "pau", "la", "az", "oc", "qya"}
+    KEYBOARD_COVERAGE_XFAIL: ClassVar[set[str]] = {"vi", "ko", "el", "pt", "pau", "la", "az", "oc", "qya"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_word_lists.py` at line 145, the class-level constant
KEYBOARD_COVERAGE_XFAIL should be annotated as a ClassVar to satisfy Ruff
RUF012: add a typing import for ClassVar (and Set if not already imported) and
change the declaration of KEYBOARD_COVERAGE_XFAIL to something like
KEYBOARD_COVERAGE_XFAIL: ClassVar[Set[str]] = {"vi", "ko", "el", "pt", "pau",
"la", "az", "oc", "qya"} so the static analyzer recognizes it as a
class/module-level constant.
scripts/improve_word_lists.py (1)

586-604: Ruff S607: use the full path to git instead of a bare executable name

Passing a partial executable path to subprocess.run is flagged by Ruff S607. For a developer tool this is low risk, but using shutil.which("git") to resolve the full path eliminates the PATH-injection vector.

♻️ Proposed fix
+import shutil
+
 def download_frequency_words():
     ...
+    git_path = shutil.which("git")
+    if not git_path:
+        raise RuntimeError("git not found in PATH")
     subprocess.run(
-        ["git", "clone", "--depth", "1", "--filter=blob:none", "--sparse",
-         "https://github.com/hermitdave/FrequencyWords.git"],
+        [git_path, "clone", "--depth", "1", "--filter=blob:none", "--sparse",
+         "https://github.com/hermitdave/FrequencyWords.git"],
         cwd=target, check=True,
     )
     subprocess.run(
-        ["git", "sparse-checkout", "set", "content/2018"],
+        [git_path, "sparse-checkout", "set", "content/2018"],
         cwd=repo_dir, check=True,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/improve_word_lists.py` around lines 586-604, replace bare "git"
invocations passed to subprocess.run with the full path resolved via
shutil.which("git") to avoid PATH-injection; in scripts/improve_word_lists.py
obtain git_path = shutil.which("git") near where subprocess.run is called,
validate git_path is not None (raise an informative exception if it is), and
then use [git_path, "clone", ...] and [git_path, "sparse-checkout", "set",
"content/2018"] instead of the string "git"; keep existing cwd and check=True
arguments and reuse variables repo_dir and target to locate the calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Line 12: The pyproject currently lists the package "wordfreq" under production
dependencies but it's only imported via "import wordfreq" in an optional script
and not used by the webapp; move "wordfreq" out of the main dependencies and add
it to the dev (or development) dependency group in pyproject.toml so it is
installed only for development/test builds, leaving the try/except usage in the
script intact and ensuring no runtime imports in the webapp (e.g., module
webapp.app) depend on it.

In `@webapp/data/languages/az/az_keyboard.json`:
- Line 7: Remove the spurious "w" entry from the first keyboard row array (the
row currently ["q","w","e","ə","r","t","y","u","ü","i","ı","o","ö","p"]) so the
row matches the Azerbaijani QÜERTY layout (keep "ü" in place and do not add any
other characters); locate the array in az_keyboard.json (look for the array
starting with "q") and delete only the "w" string so the keyboard letters match
the official Azerbaijani 32-letter set.

In `@webapp/data/languages/da/da_5words_supplement.txt`:
- Around line 1-2798: The supplement file da_5words_supplement.txt includes
OCR-corrupted entries (e.g., "haiio","hjaip","iiget","ldiot","roiig") that pass
is_valid_word() because characters are alphabetic; add a post-generation
blocklist/deny-list pass after is_valid_word() that filters known-bad patterns
(e.g., sequences of >=3 identical vowels like "iii", suspicious repeated 'i'/'l'
runs, common l/I confusion patterns like leading 'l' where 'i' expected) or load
a curated per-language deny-list to remove these entries from the FrequencyWords
output; implement this as a small filter function (e.g.,
filter_bad_ocr_artifacts()) invoked before finalizing the supplement so the
listed examples are excluded; a sketch follows these inline comments.
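A minimal sketch of that deny-pass, built from the patterns quoted above; the function name filter_bad_ocr_artifacts() and the exact regexes are suggestions, not existing code:

import re

# Known OCR-corruption shapes from the Danish supplement: runs of
# identical letters (e.g. "iii") and word-initial l-to-i confusion
# ("iangt" for "langt", "iykke" for "lykke").
_OCR_PATTERNS = (
    re.compile(r"(.)\1\1"),   # three or more identical characters in a row
    re.compile(r"^i[yæø]"),   # word-initial i before y/æ/ø (likely l-to-i)
    re.compile(r"^ia"),       # word-initial "ia" (likely l-to-i)
)

def filter_bad_ocr_artifacts(words: list[str]) -> list[str]:
    """Drop entries matching known OCR-corruption patterns."""
    return [w for w in words if not any(p.search(w) for p in _OCR_PATTERNS)]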

---

Duplicate comments:
In `@scripts/improve_word_lists.py`:
- Around line 744-747: The current post-call handling of process_language only
exits non-zero on result["status"] == "error", so skipped languages
(result["status"] == "skipped") return 0; change the logic in the block after
process_language (the code referencing result, result["status"], and printing
the message) to treat "skipped" as a failure for CI by printing the
warning/error to stderr and calling sys.exit(1) when result["status"] is not
"ok" (or explicitly when it's "skipped" or "error"), keeping the same message
logic via result.get("reason") so callers receive a non-zero exit code on
skipped/unknown languages.
- Around line 79-80: Change EXCLUDE from a set to a dict mapping language code
to an exclusion reason (e.g., EXCLUDE = {"en":"already high quality",
"ko":"syllable blocks produce zero 5-letter matches", ...}), then update
process_language to check EXCLUDE.get(lang) and set result["reason"] to the dict
value (not the hardcoded "excluded (already high quality)"); finally, remove
"ko" from the batch_process priority list so it isn't listed as "skipped" in
summaries. This touches the EXCLUDE constant, the process_language function
where result["reason"] is assigned, and the batch_process priority list (see
the sketch after this list).

In `@tests/test_word_lists.py`:
- Line 242: The constant SUPPLEMENT_OVERLAP_XFAIL should be annotated with
ClassVar to satisfy RUF012; update its definition in tests/test_word_lists.py to
include a ClassVar type annotation (e.g., SUPPLEMENT_OVERLAP_XFAIL:
ClassVar[set[str]] = {"pl", "ckb"}) and ensure ClassVar is imported from typing
at the top of the module (add "from typing import ClassVar" if missing).
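Taken together, the two scripts/improve_word_lists.py items above amount to roughly this shape; the exclusion reasons are quoted from the comment and the pipeline body is elided:

import sys

# Exclusion reasons per language, replacing the old flat set.
EXCLUDE = {
    "en": "already high quality",
    "ko": "syllable blocks produce zero 5-letter matches",
}

def process_language(lang: str) -> dict:
    reason = EXCLUDE.get(lang)
    if reason is not None:
        return {"status": "skipped", "reason": reason}
    # ... frequency-ranking work happens here ...
    return {"status": "ok"}

if __name__ == "__main__":
    result = process_language(sys.argv[1] if len(sys.argv) > 1 else "en")
    if result["status"] != "ok":
        # Treat skipped/unknown languages as failures so CI exits non-zero.
        print(f"{result['status']}: {result.get('reason', 'unknown')}", file=sys.stderr)
        sys.exit(1)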


- Move wordfreq to dev dependencies (only used in scripts)
- Remove docs/DICTIONARY_IMPROVEMENT_PLAN.md from repo
- Remove 'w' from Azerbaijani keyboard (not in Azerbaijani alphabet)
- Remove danés from Catalan daily words (keep danès, the Catalan form)
- Remove huren/neger/negro from German daily words + add to blocklist
- Remove 20 OCR-corrupted entries from Danish supplement
- Remove duplicate jones from common_names.txt
- Fix README heading level (### → ##)
- Fix ko exclusion reason and process command exit code
- Add ClassVar type annotation for SUPPLEMENT_OVERLAP_XFAIL
- Remove 'dəng:' (colon artifact) from main word list
- Lowercase 3 proper nouns in supplement (Ağyol, Allah, Quran)
- Remove az from CHARACTER_XFAIL and KEYBOARD_COVERAGE_XFAIL
- 5 more tests passing (1972 passed, 28 xfailed)
@Hugo0
Owner Author

Hugo0 commented Feb 21, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Feb 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Data quality fixes:
- Portuguese: remove 29 invalid entries (whitespace, periods, asterisks),
  complete diacritic_map for keyboard coverage
- Palauan: lowercase 24 proper nouns, remove 222 duplicates, 1 non-alpha word
- Quenya: remove 7 words with punctuation (!, -)
- Occitan: NFC-normalize combining accents, remove 1 resulting wrong-length word
- Polish: remove 2,406 supplement/main overlaps
- Kurdish: remove 990 supplement/main overlaps + 1 wrong-length word

Keyboard/diacritic fixes:
- Greek: add diacritic_map (ά→α, έ→ε, etc.)
- Vietnamese: add comprehensive diacritic_map (59 tonal variants); a sketch of how these maps apply follows below
- Latin: remove from xfail (already passing)

Result: 2004 passed, 289 skipped, 2 xfailed (Korean only)
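For context, a diacritic_map folds accented input onto base letters at comparison time. A toy sketch of that application; the entries shown are a tiny illustrative subset of the real per-language JSON:

# Illustrative subset; the real maps live in each language's config JSON.
GREEK_DIACRITIC_MAP = {"ά": "α", "έ": "ε", "ή": "η", "ί": "ι", "ό": "ο", "ύ": "υ", "ώ": "ω"}

def fold_diacritics(word: str, diacritic_map: dict[str, str]) -> str:
    """Replace accented characters with their mapped base forms."""
    return "".join(diacritic_map.get(ch, ch) for ch in word)

assert fold_diacritics("καλά", GREEK_DIACRITIC_MAP) == "καλα"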

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (4)
scripts/improve_word_lists.py (2)

586-604: Ruff S607: subprocess invoked with partial executable path "git"

Ruff flags both subprocess.run calls as S607 because "git" is resolved via PATH, which could be manipulated in non-trusted environments. For a dev script this is low-risk, but the fix is trivial using shutil.which:

♻️ Proposed fix
+import shutil
+
+_GIT = shutil.which("git") or "git"
 
 subprocess.run(
     [
-        "git",
+        _GIT,
         "clone",
         ...
     ],
     ...
 )
 
 subprocess.run(
-    ["git", "sparse-checkout", "set", "content/2018"],
+    [_GIT, "sparse-checkout", "set", "content/2018"],
     ...
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/improve_word_lists.py` around lines 586-604, the subprocess.run
calls that invoke "git" (the two occurrences in scripts/improve_word_lists.py
using subprocess.run([...], cwd=target, ...) and subprocess.run([...],
cwd=repo_dir, ...)) should resolve the git executable with shutil.which and use
that absolute path to avoid PATH injection; locate where "git" is passed as the
first element of the argv lists, call shutil.which("git"), raise a clear
exception if None, and replace the literal "git" entries with the returned path
for both clone and sparse-checkout subprocess.run invocations.

491-491: Ruff RUF003: ambiguous Unicode σ character in source comment

The literal σ (U+03C3, GREEK SMALL LETTER SIGMA) in the comment can render inconsistently across editors and is flagged by Ruff. Spell it out or use the Unicode escape:

♻️ Proposed fix
-    # Greek: normalize final sigma (σ at word end → ς)
+    # Greek: normalize final sigma (U+03C3 at word end → U+03C2 final sigma)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/improve_word_lists.py` at line 491: the inline comment
contains the literal Greek small letter sigma (σ) which Ruff flags as ambiguous;
update the comment to avoid the direct Unicode glyph. Fix: edit the comment that
reads like "# Greek: normalize final sigma (σ at word end → ς)" and replace the
literal σ with either the spelled-out word "sigma" or the Unicode escape
sequence "\u03C3" (e.g. "# Greek: normalize final sigma (sigma at word end → ς)"
or "# Greek: normalize final sigma (\u03C3 at word end → ς)"); keep the rest of
the comment intact and ensure the chosen form matches project linting
preferences.
tests/test_language_config.py (1)

156-156: Ruff RUF012: mutable set as a class attribute — add ClassVar

The narrowing to {"ko"} is a welcome improvement. However, Ruff (RUF012) flags the mutable set literal as a class attribute without ClassVar. Add the annotation to silence the warning and clarify intent:

♻️ Proposed fix
+from typing import ClassVar
+
 class TestKeyboardConfig:
     # Languages with known keyboard coverage gaps (complex scripts, incomplete keyboards)
-    KEYBOARD_COVERAGE_XFAIL = {"ko"}
+    KEYBOARD_COVERAGE_XFAIL: ClassVar[set[str]] = {"ko"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_language_config.py` at line 156, KEYBOARD_COVERAGE_XFAIL is
defined as a mutable set at class scope and triggers Ruff RUF012; update its
annotation to be a ClassVar to indicate it’s not an instance attribute (e.g.,
annotate KEYBOARD_COVERAGE_XFAIL as ClassVar[Set[str]]), and add the necessary
imports (ClassVar and Set) to the module if missing; keep the assigned value as
{"ko"} but change the type to ClassVar[Set[str]] to silence the linter and
clarify intent.
tests/test_daily_word.py (1)

149-149: Annotate with ClassVar to match the fix already applied in test_word_lists.py.

Ruff RUF012 flags this mutable class-level default. The same issue was already addressed for SUPPLEMENT_OVERLAP_XFAIL in test_word_lists.py by adding a ClassVar annotation — apply the same pattern here for consistency.

🔧 Proposed fix
+from typing import ClassVar
+
 class TestWordCycling:
     """Tests for word list cycling behavior."""

     # Languages with duplicate words causing early repeats
-    REPEAT_XFAIL: set[str] = set()
+    REPEAT_XFAIL: ClassVar[set[str]] = set()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_daily_word.py` at line 149, the class-level mutable default
REPEAT_XFAIL should be annotated as a ClassVar like the similar fix in
test_word_lists.py to silence RUF012; update the declaration of REPEAT_XFAIL to
use ClassVar[set[str]] = set() and add/import ClassVar from typing at the top of
tests/test_daily_word.py if it isn't already present so the mutable set is
explicitly a class variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@webapp/data/languages/ca/ca_daily_words.txt`:
- Around line 78-79: Remove the non-standard duplicate "aprés" (with acute
accent) so only the normative "après" (with grave accent) remains in the word
list; locate the two adjacent entries "après" and "aprés" and delete the "aprés"
entry to avoid rejecting correct guesses that use the normative form.

In `@webapp/data/languages/da/da_5words_supplement.txt`:
- Line 490: Remove the four offensive entries ("cunth", "negro", "nigga",
"pussy") from the Danish supplement file and add them to the Danish blocklist:
delete these exact tokens from da_5words_supplement.txt and append them (one per
line, matching existing blocklist formatting and avoiding duplicates) to
da_blocklist.txt so the blocklist infrastructure will catch them.
- Line 1055: Remove the listed OCR-corrupted entries from
webapp/data/languages/da/da_5words_supplement.txt (delete: hjæip, iader,
iagde, iangt, iaver, ieder, iever, iugte, iukke, iyder, iykke, iyset, iyver,
iægen, iægge, iænge, iærer, iøber, iøbet) and delete the four
offensive entries (cunth, negro, nigga, pussy); then update
scripts/improve_word_lists.py to extend the OCR filter (the validation function
that currently filters OCR artifacts—e.g., is_valid_word/filter_word) to reject
words matching the new regexes ^i[yæø] and ^ia so these corrupted l→i forms are
filtered out going forward, and add the four offensive words to
webapp/data/languages/da/da_blocklist.txt.

In `@webapp/data/languages/de/de_blocklist.txt`:
- Around line 27-36: The de_blocklist.txt offensive section is missing four
German offensive terms that remain in de_daily_words.txt; update
de_blocklist.txt (offensive/inappropriate section) to include "nutte", "dildo",
"dirne", and "tunte" so those tokens are blocked by the general blocklist used
outside app.py, ensuring defense-in-depth across code paths (referenced files:
de_blocklist.txt, de_daily_words.txt, and app.py).

In `@webapp/data/languages/de/de_daily_words.txt`:
- Around line 341-345: de_daily_words.txt currently contains offensive German
terms and is bypassing the blocklist, so remove the listed entries from the
daily word file and add them to the blocklist: remove "dildo", "dirne", "fürze",
"nutte", and "tunte" from de_daily_words.txt and append them to de_blocklist.txt
(one per line); verify the daily-word selector now receives the actual blocklist
set rather than an empty set after this change and run the selector to confirm
those words are no longer served.

---

Duplicate comments:
In `@tests/test_word_lists.py`:
- Around line 30-32: Add a ClassVar annotation for each mutable class-level set
to satisfy Ruff RUF012: import ClassVar from typing, then change the type of
each class attribute like LOWERCASE_XFAIL, DUPLICATE_XFAIL,
SUPPLEMENT_LENGTH_XFAIL, SUPPLEMENT_OVERLAP_XFAIL and the other three set[str]
attributes to ClassVar[set[str]] so the mutable sets are explicitly marked as
class variables.


- Remove 7 dead xfail sets and their check blocks (all resolved)
- Deduplicate is_roman_numeral(): tests now import from script
- Fix docstring placement in test_keyboard_covers_all_word_characters
- Fix is_likely_foreign() fallback: return False for unmapped languages
  instead of using the lang code directly (could silently filter all words;
  sketched below)
- Use final_form_map config instead of hardcoded Greek sigma normalization
- Fix SOURCES.md template: don't claim Hunspell for all languages,
  mention wordfreq when used
- Add comment explaining FREQ_LANG_MAP vs WORDFREQ_LANG_MAP differences
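The is_likely_foreign() fallback fix amounts to roughly the following; WORDFREQ_LANG_MAP is named in the commit above, but its contents here and the margin are illustrative, and only wordfreq's zipf_frequency is real API:

from wordfreq import zipf_frequency

# Illustrative subset; the real map covers the wordfreq-supported languages.
WORDFREQ_LANG_MAP = {"pt": "pt", "de": "de", "fr": "fr"}

def is_likely_foreign(word: str, lang: str, margin: float = 1.0) -> bool:
    """True if word is markedly more frequent in English than in lang."""
    wf_lang = WORDFREQ_LANG_MAP.get(lang)
    if wf_lang is None:
        # Unmapped language: no frequency data to compare against, so keep
        # the word rather than silently filtering the entire list.
        return False
    return zipf_frequency(word, "en") > zipf_frequency(word, wf_lang) + margin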
@Hugo0
Owner Author

Hugo0 commented Feb 21, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Feb 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

- Catalan: remove non-standard aprés (keep normative après)
- Danish: remove 19 more OCR l→i artifacts + 4 offensive English words
- Danish: add cunth/negro/nigga/pussy to blocklist
- German: remove dildo/dirne/fürze/nutte/tunte from daily words
- German: add dildo/dirne/fürze/nutte/tunte to blocklist
Blocklist should only prevent words from being daily answers, not from
being valid guesses. Players should be able to type any real word.
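In code terms the split looks like this; the function names are illustrative, not the webapp's actual API:

def daily_answer_pool(daily_words: list[str], blocklist: set[str]) -> list[str]:
    """Blocklisted words never become the daily answer."""
    return [w for w in daily_words if w not in blocklist]

def is_valid_guess(guess: str, all_words: set[str]) -> bool:
    """Guess validation ignores the blocklist: any real word stays typeable."""
    return guess in all_words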
Hugo0 added a commit that referenced this pull request Feb 22, 2026
feat: Frequency-ranked daily words and supplements for 38 languages
@coderabbitai coderabbitai Bot mentioned this pull request Feb 25, 2026
