Skip to content

Optimize corpus tests: mock downloads, separate data validation, suppress CLI output#1291

Merged
bact merged 24 commits intodevfrom
copilot/improve-corpus-test-speed
Feb 6, 2026
Merged

Optimize corpus tests: mock downloads, separate data validation, suppress CLI output#1291
bact merged 24 commits intodevfrom
copilot/improve-corpus-test-speed

Conversation

Copy link
Contributor

Copilot AI commented Feb 6, 2026

What do these changes do

Separates corpus testing concerns: fast mocked unit tests vs comprehensive data validation. Eliminates network dependencies and output pollution from regular test runs.

What was wrong

Unit tests: Downloaded large corpus files (96MB OSCAR, 186MB TNC) on every run, taking 21s and requiring network access. Caused slow development cycles and flaky tests.

CLI tests: Printed verbose catalog listings and corpus metadata to logs, obscuring test failures. Made network requests during testing.

Missing coverage: No tests for corpus catalog operations (get_corpus_db, corpus_db_url, catalog structure validation).

How this fixes it

1. Mock corpus downloads in unit tests (tests/core/test_corpus.py)

# Before: Downloads 96MB, parses 8.6s
def test_oscar(self):
    self.assertIsNotNone(oscar.word_freqs())

# After: Mock data, validates parsing, <0.01s
with patch('pythainlp.corpus.oscar.get_corpus_path', return_value="/mock"):
    with patch('builtins.open', mock_open(read_data=mock_csv)):
        result = oscar.word_freqs()
        # Validates CSV parsing, empty string handling, quote filtering

Result: 21s → 2s (90% faster), no network dependency

2. Separate corpus data workflow (.github/workflows/corpus.yml)

  • Tests actual file loading without mocks
  • Triggers only on pythainlp/corpus/** or tests/corpus/** changes
  • Three categories: catalog, built-in, downloadable

3. Corpus catalog tests (tests/corpus/test_catalog.py)

  • Catalog download and JSON structure validation
  • Known entries verification
  • Version metadata checks
  • Local DB query testing

4. Suppress CLI output (tests/core/test_cli.py, tests/extra/testx_cli.py)

with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
    with patch('pythainlp.corpus.download', return_value=True):
        DataApp(["thainlp", "data", "get", "test"])

Result: Clean logs, 200x faster data CLI tests

5. Simplified naming

  • Directory: tests/corpus/
  • Package: tests.corpus
  • Workflow: "Corpus Test"
  • Clear, concise, easy to type

Test coverage maintained: all parsing logic validated via representative mock data.

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit test
Original prompt

Test cases of pythainlp.corpus downloads slowing down the development-test cycles. Some of the test cases are not reliable too, as sometimes they timed out and failed and make the whole test workflow failed.

Initial thought is that it involves network request (test corpus existence / download / query corpus catalog).

Try to improve the speed and stability of these corpus test cases, while maintain relatively high test coverage of the submodule.


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Copilot AI and others added 2 commits February 6, 2026 11:07
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title [WIP] Improve speed and stability of corpus test cases Mock corpus downloads in core tests for speed and stability Feb 6, 2026
Copilot AI requested a review from bact February 6, 2026 11:12
Copilot AI and others added 2 commits February 6, 2026 11:16
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
@bact bact added the tests Unit test, code coverage, test case label Feb 6, 2026
@bact bact added this to the 5.3 milestone Feb 6, 2026
Co-authored-by: bact <128572+bact@users.noreply.github.com>
@bact bact added the infrastructure build process, configuration, GitHub Actions label Feb 6, 2026
Copilot AI changed the title Mock corpus downloads in core tests for speed and stability Add dedicated corpus integrity testing workflow Feb 6, 2026
@coveralls
Copy link

coveralls commented Feb 6, 2026

Coverage Status

coverage: 65.333% (-0.02%) from 65.355%
when pulling 2a817fd on copilot/improve-corpus-test-speed
into d55e615 on dev.

Copilot AI and others added 2 commits February 6, 2026 11:27
…rence

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title Add dedicated corpus integrity testing workflow Optimize corpus tests with mocking and add dedicated integrity workflow Feb 6, 2026
@bact bact added the refactoring a technical improvement which does not add any new features or change existing features. label Feb 6, 2026
…oads

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title Optimize corpus tests with mocking and add dedicated integrity workflow Suppress CLI test output and mock corpus downloads Feb 6, 2026
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI requested a review from bact February 6, 2026 13:28
@bact
Copy link
Member

bact commented Feb 6, 2026

@copilot apply suggestions about documentation and tests in
#1291 (review)
#1291 (review)
#1291 (review)
#1291 (review)

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title Optimize corpus tests: mock downloads, add catalog tests, suppress CLI output Optimize corpus tests: mock downloads, separate data validation, suppress CLI output Feb 6, 2026
@bact bact marked this pull request as ready for review February 6, 2026 19:44
bact added 7 commits February 6, 2026 19:52
Reworded the description of the corpus test suite for clarity and readability.
Add SPDX license and copyright information to README
Add SPDX license and copyright information to README
Fix formatting of the 'Run corpus test' section in README.
@sonarqubecloud
Copy link

sonarqubecloud bot commented Feb 6, 2026

@bact bact merged commit 31c3b1a into dev Feb 6, 2026
23 of 24 checks passed
@bact bact deleted the copilot/improve-corpus-test-speed branch February 6, 2026 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infrastructure build process, configuration, GitHub Actions refactoring a technical improvement which does not add any new features or change existing features. tests Unit test, code coverage, test case

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments