Simplified WARC Reader #3

Merged
mfoy merged 1 commit into master from delimiter-read on Oct 18, 2013

Conversation

@gthole gthole commented Oct 17, 2013

Don't use the Content-Length header to read record content in.

Potential issues:

  • It uses the "WARC/1.0" version line as the record delimiter. If a record has "\nWARC/1.0" somewhere in the response body, the reader could split the record there by mistake. That seems like an outlandish outlier, though, compared with the grief we've had with Content-Length reads.
  • The interface isn't as pretty. You have to import and construct the reader explicitly:
        from warc.warc import SimpleWARCReader
        reader = SimpleWARCReader(open('path/to/file.warc', 'r'))
  • It can't read gzipped files (but then the original library can't either, not really).
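
For readers unfamiliar with the approach, here is a minimal sketch of delimiter-based record splitting. It is not the code in this PR: `iter_records` and the sample data are hypothetical names made up for illustration, and it assumes the delimiter is a line consisting solely of `WARC/1.0`.

```python
import io

# Assumption: every record opens with a line that is exactly this version stamp.
DELIMITER = b"WARC/1.0"


def iter_records(stream):
    """Yield raw record bytes, starting a new record at each delimiter line.

    Hypothetical helper for illustration; Content-Length is never consulted.
    """
    current = []
    for line in stream:
        # Flush the record collected so far when the next version line appears.
        if line.rstrip(b"\r\n") == DELIMITER and current:
            yield b"".join(current)
            current = []
        current.append(line)
    if current:
        yield b"".join(current)


sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"Content-Length: 999\r\n"  # deliberately wrong; the splitter ignores it
    b"\r\n"
    b"first body\r\n"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"\r\n"
    b"second body\r\n"
)

records = list(iter_records(io.BytesIO(sample)))
```

The bogus Content-Length in the first record has no effect on where the records are cut, which is the point of the change. The trade-off from the first bullet also shows up here: a body that contains a line reading only `WARC/1.0` would be cut into two records.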

@ghost ghost assigned mfoy Oct 17, 2013
mfoy added a commit that referenced this pull request Oct 18, 2013
@mfoy mfoy merged commit e03af6d into master Oct 18, 2013
@mfoy mfoy deleted the delimiter-read branch October 18, 2013 13:28