fix: improve CSV header validation and error messages#692
fix: improve CSV header validation and error messages#692zkaoudi merged 4 commits intoapache:mainfrom
Conversation
|
Thanks for your contribution! Left some comments for your PR, I prefer if we keep the accessors as-is so we don't accidentally deprecate anything people might be using. |
…d revert streamLines to its original form
...pi/wayang-api-sql/src/main/java/org/apache/wayang/api/sql/sources/fs/JavaCSVTableSource.java
Outdated
Show resolved
Hide resolved
...pi/wayang-api-sql/src/main/java/org/apache/wayang/api/sql/sources/fs/JavaCSVTableSource.java
Outdated
Show resolved
Hide resolved
|
I checked with the right file and it works. I also checked with an incorrect file which has this heading: And got this error message: "Column count mismatch in CSV file 'file:///Users/zoi/Work/WAYANG/wayang-examples/src/main/resources/input/customers.csv': expected 1 columns but found 4 (separator ';'). Line: '1;Alice Johnson;alice@example.com;USA'. Ensure the header uses 'name:type' format with commas and data rows use ';' as delimiter." The line it prints is the second line of my file, not the header line. Can we print the header line? |
|
I've updated the PR to handle this case. With the CSV file you shared using the header
It now prints the header line instead of the data line and clearly tells the user what to fix. |
| * @param path the filesystem path to the CSV file | ||
| */ | ||
| private void validateHeaderLine(final String path) { | ||
| final FileSystem fileSystem = FileSystems.getFileSystem(path).orElseThrow( |
There was a problem hiding this comment.
Would it be possible to do the check directly when we are reading the file? We are now opening the file twice which could be costly?
There was a problem hiding this comment.
@zkaoudi since streamLines() is static and is the place where file opening and iterator creation are already defined, I considered it the appropriate location to perform header validation. However, because it is static, it cannot access the instance-level separator, and we cannot modify its signature or behavior to pass the separator or expose the header.
Given these constraints, performing header validation within the same file-open operation would require changing streamLines(), which @mspruc wanted to avoid to preserve the existing definition. As a result, the file is currently opened twice.
Is there something we can do here to avoid the double file open while keeping the existing structure intact? I’d appreciate your guidance.
There was a problem hiding this comment.
why do we need to access the separator? the separator for the header should always be a comma, right?
There was a problem hiding this comment.
You're right, I got confused here the header separator is always a comma, so the core validation doesn't actually need the instance-level separator. I was using it only for a more descriptive error message, but that's not essential.
|
I've refactored Here are the error messages for each case: 1. Empty CSV file2. Header missing types (e.g.,
|
zkaoudi
left a comment
There was a problem hiding this comment.
looks good! thank you :)
Summary
Closes #690
Improves CSV error handling in the SQL API filesystem source (
JavaCSVTableSource) to provide clear, actionable error messages when CSV files are malformed or misconfigured.Changes
streamLinesto read and validate header in a single file open: The header line is consumed viaiterator.next()and validated before streaming data rows, avoiding opening the file twice.validateHeaderLinemethod (static): Validates the CSV header before data parsing — checks that each column follows thename:typeformat and that the comma-separated column count matches the number ofname:typepairs (detecting wrong separator usage).parseLine: Distinct data row error showing expected vs actual column count, the separator used, and the offending line.streamLines: Throws a clear error if the CSV file has no lines at all.Context
Calcite's CSV adapter requires a typed header row (e.g.,
id:int,name:string,email:string) using commas, while data rows use Wayang's configurable separator (default;). Since the header always uses commas, header validation isstaticand doesn't need the instance-levelseparator.Without proper validation, the previous errors were unclear and came too late (during data parsing). The new errors clearly explain the issue at the right stage:
CSV file '...' is empty. Expected a header row (e.g., 'id:int,name:string').CSV file '...': header column 'NAMEA' missing required type. Expected 'name:type' format (e.g., 'id:int'). Header: 'NAMEA,NAMEB,NAMEC'.CSV file '...': column count mismatch. Expected 4 comma-separated 'name:type' columns but found 1. Header: 'id:int;name:string;email:string;country:string'.CSV file '...': data row has 2 columns but expected 3 (separator ';'). Line: 'test1;1'.