Use regular expressions to parse image data text files. #1971

Open
erictzeng wants to merge 1 commit into BVLC:master from erictzeng:convert_imageset_spaces

@erictzeng
Contributor

Fixes #1951.

This pull request consists of two changes:

  1. Image data files are now parsed with regular expressions instead of the brittle ifstream-based method, which makes matching more robust.
  2. Previously, the parsing code was duplicated across two files, tools/convert_imageset.cpp and src/caffe/layers/image_data_layer.cpp. This pull request pulls that common code out into a new function in src/caffe/util/io.cpp for ease of maintenance.

More details follow.

Each line of the input text file is matched against the following regular expression:

\h*("?)(.+?)\1\h+(\d+)\h*

Feel free to play around with an interactive version to test it out and see what it matches. This regular expression handles many cases that would have been difficult to handle with the previous naive approach. It captures whitespace within a filename, and it lets you quote a filename in case, for some insane reason, it has a space at the beginning.

Some concrete examples of really degenerate cases that will parse correctly:

file name with spaces.jpg 1
" file_name_with_leading_space.jpg" 2
file_name_with_"_symbol.jpg 3
" really disgusting " file  ""name  .jpg" 4

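For readers who want to try the pattern without building Caffe, here is a minimal standalone sketch of the line matching described above. It is not the code from this pull request: it uses `std::regex` instead of Boost.Regex, and because the ECMAScript grammar used by `std::regex` has no `\h`, the character class `[ \t]` stands in for it (an approximation that only covers spaces and tabs). The function name `ParseImageLine` is hypothetical.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper for illustration; not the function added to io.cpp.
// Parses one "<filename> <label>" line. On success, stores the filename and
// label in *out and returns true; returns false if the line does not match.
bool ParseImageLine(const std::string& line,
                    std::pair<std::string, int>* out) {
  // Same structure as the pattern in this pull request, with [ \t]
  // substituted for \h (assumption: only spaces and tabs matter here).
  static const std::regex kLineRegex(
      R"re([ \t]*("?)(.+?)\1[ \t]+(\d+)[ \t]*)re");
  std::smatch match;
  // regex_match requires the whole line to match, not just a prefix.
  if (!std::regex_match(line, match, kLineRegex)) return false;
  out->first = match[2].str();              // filename, quotes stripped
  out->second = std::stoi(match[3].str());  // throws if the label overflows int
  return true;
}
```

Because the filename group is lazy and the whole line must match, the filename only ends where the remainder of the line is the optional closing quote, whitespace, and a run of digits. That is why embedded spaces and quote characters end up inside the filename rather than terminating it.
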
One drawback is that this introduces boost_regex as an additional dependency. However, since we already require Boost, this seems like an acceptable tradeoff.

Implementation-wise, this pull request should be complete, but it still lacks tests, which I will write in the near future.

@shelhamer
Member

@erictzeng this looks right -- thanks for fixing the brittle format -- but I think you need to update the travis script to install boost regex: https://github.com/BVLC/caffe/blob/master/scripts/travis/travis_install.sh.

@bchu
Contributor

bchu commented Mar 30, 2016

Any updates on this?

Development

Successfully merging this pull request may close these issues.

convert_imageset doesn't handle file names with spaces