Description
(This issue is for a clearly defined, short-term goal, not another generic "improve xxx" issue.)
The original CSV parser was purposefully restrictive and required strict formatting: one line per observation (no new lines in fields), a fixed number of commas per line, etc. These requirements are no longer relevant. At the same time, Gary specifically requested that the CSV ingest handle full text, with escaped new lines, rich punctuation, etc. Example: the file posts_all.tab (csv original) in https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QSZMPD.
One way to define the goal would be to say that any Google/Excel spreadsheet columns exported as CSV should be parseable by our ingest. (I will add more details on how they escape punctuation characters and such).
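To make the goal concrete, here is a minimal sketch of the quoting convention spreadsheet exports commonly follow (as described in RFC 4180): a field containing commas, double quotes, or line breaks is wrapped in double quotes, and embedded double quotes are doubled. The sample data is made up for illustration, and Python's standard `csv` module is used only to demonstrate the behavior; it is not the parser proposed for the ingest.

```python
import csv
import io

# Hypothetical sample (not taken from posts_all.tab): three columns, where the
# "text" field of the first record holds a comma, doubled quotes, and a newline.
sample = (
    'id,text,score\r\n'
    '1,"He said ""hi"", then left.\nSecond line.",0.5\r\n'
    '2,plain,1.0\r\n'
)

# Strict "one line per observation" approach: split into lines, then on commas.
# The embedded newline breaks the first record across two physical lines.
naive_rows = [line.split(",") for line in sample.splitlines()]

# Quote-aware parsing keeps the quoted field together and un-doubles the quotes.
parsed_rows = list(csv.reader(io.StringIO(sample)))

print(len(naive_rows))        # physical lines seen by the naive approach
print(len(parsed_rows))       # logical records (header + 2 observations)
print(repr(parsed_rows[1][1]))
```

The naive split sees four physical lines for what is really a header plus two observations, while the quote-aware reader returns three records of three fields each.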
A sensible way to achieve this would be to switch to some available open source parser (Apache seems like a good candidate), rather than maintaining our own.