fix: improve CSV header validation and error messages by Prathamesh9284 · Pull Request #692 · apache/wayang

Prathamesh9284 · 2026-02-17T18:57:59Z

Summary

Closes #690

Improves CSV error handling in the SQL API filesystem source (JavaCSVTableSource) to provide clear, actionable error messages when CSV files are malformed or misconfigured.

Changes

Refactored streamLines to read and validate header in a single file open: The header line is consumed via iterator.next() and validated before streaming data rows, avoiding opening the file twice.
Added validateHeaderLine method (static): Validates the CSV header before data parsing — checks that each column follows the name:type format and that the comma-separated column count matches the number of name:type pairs (detecting wrong separator usage).
Improved error message in parseLine: Distinct data row error showing expected vs actual column count, the separator used, and the offending line.
Added empty file detection in streamLines: Throws a clear error if the CSV file has no lines at all.
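As an illustration of the header check described above, here is a minimal sketch. It is not the code merged in this PR: the class name `HeaderValidationSketch`, the two-argument signature, the colon-counting heuristic, and the use of `IllegalArgumentException` are all assumptions made for the example. Only the message wording is taken from the PR description.

```java
// Hypothetical sketch of a static header check; not the actual Wayang implementation.
final class HeaderValidationSketch {

    // Assumed signature: the file path is used only for the error message.
    static void validateHeaderLine(String path, String header) {
        // The header is always comma-separated, so no instance-level separator is needed.
        String[] columns = header.split(",", -1);

        // Each comma-separated column must look like name:type.
        for (String column : columns) {
            String trimmed = column.trim();
            int colon = trimmed.indexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException(String.format(
                        "CSV file '%s': header column '%s' missing required type. "
                                + "Expected 'name:type' format (e.g., 'id:int'). Header: '%s'.",
                        path, trimmed, header));
            }
        }

        // Assumption: the number of ':' characters approximates the number of name:type pairs.
        // If it differs from the comma-separated column count, the wrong separator was likely used.
        long pairs = header.chars().filter(c -> c == ':').count();
        if (pairs != columns.length) {
            throw new IllegalArgumentException(String.format(
                    "CSV file '%s': column count mismatch. Expected %d comma-separated "
                            + "'name:type' columns but found %d. Header: '%s'.",
                    path, pairs, columns.length, header));
        }
    }

    public static void main(String[] args) {
        // A well-formed header passes silently.
        validateHeaderLine("customers.csv", "id:int,name:string,email:string");
        try {
            // A semicolon-separated header is reported as a column count mismatch.
            validateHeaderLine("customers.csv", "id:int;name:string;email:string;country:string");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the check depends only on the header text, it can stay static, which is what lets it run inside a static method such as streamLines.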

Context

Calcite's CSV adapter requires a typed header row (e.g., id:int,name:string,email:string) using commas, while data rows use Wayang's configurable separator (default ;). Since the header always uses commas, header validation is static and doesn't need the instance-level separator.

Without proper validation, the previous errors were unclear and came too late (during data parsing). The new errors clearly explain the issue at the right stage:

CSV file '...' is empty. Expected a header row (e.g., 'id:int,name:string').
CSV file '...': header column 'NAMEA' missing required type. Expected 'name:type' format (e.g., 'id:int'). Header: 'NAMEA,NAMEB,NAMEC'.
CSV file '...': column count mismatch. Expected 4 comma-separated 'name:type' columns but found 1. Header: 'id:int;name:string;email:string;country:string'.
CSV file '...': data row has 2 columns but expected 3 (separator ';'). Line: 'test1;1'.
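The data row message above could be produced by a check along these lines. This is a hedged sketch rather than the PR's parseLine: the class name `DataRowCheckSketch`, the method `splitDataRow`, its parameters, and the exception type are hypothetical, and the naive split ignores quoted fields.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a data row column-count check; not the actual parseLine.
final class DataRowCheckSketch {

    // Assumed signature: expectedColumns would come from the already validated header.
    static String[] splitDataRow(String path, String line, char separator, int expectedColumns) {
        // Quote the separator so characters such as '|' are not treated as regex syntax;
        // the -1 limit keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(String.valueOf(separator)), -1);
        if (fields.length != expectedColumns) {
            throw new IllegalArgumentException(String.format(
                    "CSV file '%s': data row has %d columns but expected %d (separator '%s'). Line: '%s'.",
                    path, fields.length, expectedColumns, separator, line));
        }
        return fields;
    }

    public static void main(String[] args) {
        try {
            // Two fields in a three-column table triggers the error.
            splitDataRow("customers.csv", "test1;1", ';', 3);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```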

mspruc · 2026-02-18T07:36:05Z

Thanks for your contribution! I left some comments on your PR. I'd prefer that we keep the accessors as-is so we don't accidentally deprecate anything people might be using.

Prathamesh9284 · 2026-02-18T10:46:39Z

Hi @mspruc @zkaoudi

I’ve updated the PR based on your feedback to keep the existing methods and signatures the same. I moved the validateHeaderLine logic to createStream instead of changing streamLines.

Please let me know if anything else needs to be improved.

mspruc

@zkaoudi can you verify this with your example?

...pi/wayang-api-sql/src/main/java/org/apache/wayang/api/sql/sources/fs/JavaCSVTableSource.java

zkaoudi · 2026-02-18T13:51:03Z

I checked with the right file and it works.

I also checked with an incorrect file which has this heading:
id:int;name:string;email:string;country:string

And got this error message: "Column count mismatch in CSV file 'file:///Users/zoi/Work/WAYANG/wayang-examples/src/main/resources/input/customers.csv': expected 1 columns but found 4 (separator ';'). Line: '1;Alice Johnson;alice@example.com;USA'. Ensure the header uses 'name:type' format with commas and data rows use ';' as delimiter."

The line it prints is the second line of my file, not the header line. Can we print the header line?

Prathamesh9284 · 2026-02-18T14:38:18Z

Hi @zkaoudi @mspruc,

I've updated the PR to handle this case. With the CSV file you shared using the header id:int;name:string;email:string;country:string, the error now correctly identifies the header issue:

CSV file 'file:///Users/zoi/Work/WAYANG/wayang-examples/src/main/resources/input/customers.csv': header uses ';' as separator, but Calcite requires commas. Header: 'id:int;name:string;email:string;country:string'. Expected format: 'id:int,name:string,email:string,country:string'.

It now prints the header line instead of the data line and clearly tells the user what to fix.

zkaoudi · 2026-02-19T13:37:02Z

...pi/wayang-api-sql/src/main/java/org/apache/wayang/api/sql/sources/fs/JavaCSVTableSource.java

+     * @param path the filesystem path to the CSV file
+     */
+    private void validateHeaderLine(final String path) {
+        final FileSystem fileSystem = FileSystems.getFileSystem(path).orElseThrow(


Would it be possible to do the check directly while we are reading the file? We are now opening the file twice, which could be costly.

Prathamesh9284 · 2026-02-22

@zkaoudi Since streamLines() is static and is where file opening and iterator creation are already defined, I considered it the appropriate place for header validation. However, because it is static, it cannot access the instance-level separator, and we cannot modify its signature or behavior to pass the separator or expose the header.

Given these constraints, performing header validation within the same file-open operation would require changing streamLines(), which @mspruc wanted to avoid to preserve the existing definition. As a result, the file is currently opened twice.

Is there something we can do here to avoid the double file open while keeping the existing structure intact? I’d appreciate your guidance.

zkaoudi · 2026-02-23

Why do we need to access the separator? The separator for the header should always be a comma, right?

Prathamesh9284 · 2026-02-23

You're right, I got confused here. The header separator is always a comma, so the core validation doesn't actually need the instance-level separator. I was using it only for a more descriptive error message, but that's not essential.

Prathamesh9284 · 2026-02-23T10:05:32Z

Hi @zkaoudi @mspruc,

I've refactored validateHeaderLine to be static and moved it into streamLines(), so the file is only opened once. The header is consumed via the iterator before streaming data rows.

Here are the error messages for each case:

1. Empty CSV file

CSV file 'customers.csv' is empty. Expected a header row (e.g., 'id:int,name:string').

2. Header missing types (e.g., `NAMEA,NAMEB,NAMEC`)

CSV file 'customers.csv': header column 'NAMEA' missing required type. Expected 'name:type' format (e.g., 'id:int'). Header: 'NAMEA,NAMEB,NAMEC'.

3. Header uses wrong separator (e.g., `id:int;name:string;email:string;country:string`)

CSV file 'customers.csv': column count mismatch. Expected 4 comma-separated 'name:type' columns but found 1. Header: 'id:int;name:string;email:string;country:string'.

4. Data row has wrong number of columns (e.g., `test1;1` in a 3-column table)

CSV file 'customers.csv': data row has 2 columns but expected 3 (separator ';'). Line: 'test1;1'.
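The single-open approach described in this comment (consume the header from the iterator, validate it, then stream the remaining rows) might look roughly like the sketch below. It is an illustration under assumptions, not the merged streamLines: the class `SingleOpenStreamSketch`, the method `dataRows`, the simplified `checkHeader`, and the exception type are hypothetical, and an in-memory iterator stands in for the opened file.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical sketch: validate the header and stream data rows from one iterator.
final class SingleOpenStreamSketch {

    // Assumed signature: 'lines' iterates over the file, which has been opened exactly once.
    static Stream<String> dataRows(String path, Iterator<String> lines) {
        if (!lines.hasNext()) {
            throw new IllegalArgumentException(String.format(
                    "CSV file '%s' is empty. Expected a header row (e.g., 'id:int,name:string').", path));
        }
        // Consume the header; the iterator now points at the first data row.
        String header = lines.next();
        checkHeader(path, header);
        // Wrap the remaining lines as a sequential, ordered stream.
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(lines, Spliterator.ORDERED), false);
    }

    // Simplified stand-in for the full header validation: every column needs a ':'.
    private static void checkHeader(String path, String header) {
        for (String column : header.split(",", -1)) {
            if (!column.contains(":")) {
                throw new IllegalArgumentException(String.format(
                        "CSV file '%s': header column '%s' missing required type. Header: '%s'.",
                        path, column.trim(), header));
            }
        }
    }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList("id:int,name:string", "1;Alice", "2;Bob").iterator();
        // Only the two data rows are printed; the header was consumed during validation.
        dataRows("customers.csv", lines).forEach(System.out::println);
    }
}
```

Since the header is read from the same iterator that feeds the stream, no second open of the file is needed to validate it.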

zkaoudi

looks good! thank you :)

fix: improve CSV header validation and error messages

19cbcab

refactor: move validateHeaderLine from streamLines to createStream and revert streamLines to its original form

788e91a

mspruc previously approved these changes Feb 18, 2026

fix: enhance CSV header validation to ensure correct separator usage

cb454b7

Prathamesh9284 dismissed mspruc’s stale review via cb454b7 February 18, 2026 14:32

Prathamesh9284 requested a review from mspruc February 19, 2026 05:15

mspruc previously approved these changes Feb 19, 2026

zkaoudi reviewed Feb 22, 2026

refactor: move header validation into streamLines to avoid opening file twice

455a05a

Prathamesh9284 dismissed mspruc’s stale review via 455a05a February 23, 2026 10:03

Prathamesh9284 requested a review from zkaoudi February 23, 2026 10:06

zkaoudi approved these changes Feb 23, 2026

zkaoudi merged commit 8d8930a into apache:main Feb 23, 2026
4 checks passed

zkaoudi merged 4 commits into apache:main from Prathamesh9284:fix/csv-header-validation
