Skip to content

Conversation

@rouzazari
Copy link
Contributor

@rouzazari rouzazari commented Jan 12, 2017

Updates existing to_json methodology by adding is_escaping variable,
which ensures escaped chars are handled correctly.

Bug description: A simple check of whether the prior char is a backslash
is insufficient because the backslash may itself be escaped.

A test is also included (previously included in #14693).

xref #14693
xref #15096

@codecov-io
Copy link

codecov-io commented Jan 12, 2017

Current coverage is 84.75% (diff: 100%)

No coverage report found for master at ab0d236.

Powered by Codecov. Last update ab0d236...92f4e01

pandas/lib.pyx Outdated
"""
cdef:
Py_ssize_t i = 0, num_open_brackets_seen = 0, in_quotes = 0, length
Py_ssize_t is_escaping = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make in_quotes and is_escaping of type bint

@jreback
Copy link
Contributor

jreback commented Jan 12, 2017

pls add a whatsnew note (0.20.0 bug fix section)

@jreback jreback added Bug IO JSON read_json, to_json, json_normalize labels Jan 12, 2017
@rouzazari rouzazari force-pushed the to_json_lines_with_escaping branch from 60bb9da to 92f4e01 Compare January 13, 2017 00:51
@rouzazari
Copy link
Contributor Author

Thanks, @jreback. I made the changes you requested, rebased to master, and squashed the commits.

Please note: I revised the test to include a test for escaped characters in keys and values (i.e. columns and data of dataframe). Previously I was only testing escaped characters in values/data.

- Require at least 0.23 version of cython to avoid problems with character encodings (:issue:`14699`)
- Bug in converting object elements of array-like objects to unsigned 64-bit integers (:issue:`4471`, :issue:`14982`)
- Bug in ``pd.pivot_table()`` where no error was raised when values argument was not in the columns (:issue:`14938`)
- Bug in ``.to_json()`` where lines=True and contents (keys or values) contain escaped characters (:issue:`15096`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double-backticks around lines=True

self.assertEqual(result, expected)
assert_frame_equal(pd.read_json(result, lines=True), df)

# GH15096
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a 1-line comment on what you are testing here

@jreback jreback added this to the 0.20.0 milestone Jan 13, 2017
@jreback
Copy link
Contributor

jreback commented Jan 13, 2017

lgtm, minor comments and some linting issues to fix. ping on green.

Updates existing to_json methodology by adding is_escaping variable, which ensures escaped chars are handled correctly.

- Includes test for escaped characters in keys and values (i.e. columns and data).
- Includes bug fix in whatsnew
- Revised type of in_quotes and is_escaping to bint

xref pandas-dev#14693
xref pandas-dev#15096
@rouzazari rouzazari force-pushed the to_json_lines_with_escaping branch from 92f4e01 to d114455 Compare January 13, 2017 18:20
@jreback jreback closed this in 1a18420 Jan 13, 2017
@jreback
Copy link
Contributor

jreback commented Jan 13, 2017

thanks! nice concise PR! keep em coming

@rouzazari
Copy link
Contributor Author

Thanks, @jreback. I appreciate the reviews and feedback.

FYI: There is no change in runtime performance from the bug fix. This is somewhat obvious, but it was a good exercise for me, so I thought I'd share. I reran the following performance tests from #14408:

Master:

In [5]: df = DataFrame(dict([('float{0}'.format(i), np.random.randn(N)) for i in range(C)]))
In [6]: %timeit df.to_json('foo.json',orient='records',lines=True)
10 loops, best of 3: 135 ms per loop

This branch:

In [5]: df = DataFrame(dict([('float{0}'.format(i), np.random.randn(N)) for i in range(C)]))
In [6]: %timeit df.to_json('foo.json',orient='records',lines=True)
10 loops, best of 3: 135 ms per loop

@jreback
Copy link
Contributor

jreback commented Jan 13, 2017

@rouzazari great!

note that we already have some asv's for perf testing, I am not sure how much they cover for json though, you can have a look in pandas/asv/benchmarks/packers.py

if you think some cases need more, then please PR!

@rouzazari
Copy link
Contributor Author

Got it. I'm not too familiar with this area, but I'll take a look.

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Updates existing to_json methodology by adding is_escaping variable,
which ensures escaped chars are handled correctly.

xref pandas-dev#14693

closes pandas-dev#15096

Author: Rouz Azari <rouz.azari@gmail.com>

Closes pandas-dev#15117 from rouzazari/to_json_lines_with_escaping and squashes the following commits:

d114455 [Rouz Azari] BUG: Fix to_json lines with escaped characters
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug IO JSON read_json, to_json, json_normalize

Projects

None yet

Development

Successfully merging this pull request may close these issues.

to_json() line separation broken by backslash in content

3 participants