Unicode support cleanup by forejtv · Pull Request #868 · diffblue/cbmc

forejtv · 2017-04-25T10:58:32Z

Remove an unused utf16 to utf8 conversion function that is broken anyway, and tidy utf32 to utf8.
Change utf8 to utf16 conversion to not require codecvt.

This should make PR #752 unnecessary.

forejtv · 2017-04-25T12:09:25Z

There was a compilation problem on Windows; it is fixed now.

smowton

Some suggestions, and one significant problem (checking array bounds). It also sucks that we have to write this sort of stuff instead of just using a library, but I guess there's no getting around that.

smowton · 2017-04-25T13:06:56Z

src/util/unicode.cpp

More usual to use |

smowton · 2017-04-25T13:09:00Z

src/util/unicode.cpp

Rather than write "byte-swap least-significant word in dword", might be clearer to write "byte-swap word" and have the caller cast to the smaller type.
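As a rough illustration of this suggestion (not the code from the PR), the helper could take and return a 16-bit word, leaving any narrowing from a wider type to the caller. The name `swap_bytes_16` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: swap the two bytes of a 16-bit word.
// The operands are promoted to int for the shifts, so the result is
// cast back to 16 bits to drop the bits shifted above bit 15.
inline std::uint16_t swap_bytes_16(std::uint16_t x)
{
  return static_cast<std::uint16_t>((x >> 8) | (x << 8));
}
```

A caller holding a 32-bit value would then write something like `swap_bytes_16(static_cast<std::uint16_t>(dword))`, which makes the truncation visible at the call site.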

smowton · 2017-04-25T13:12:09Z

src/util/unicode.cpp

smowton · 2017-04-25T13:14:18Z

src/util/unicode.cpp

Probably worth noting in a comment that D800-DFFF are also passed through without comment
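To make the point concrete, here is a minimal sketch of appending one code point to a UTF-16 string, with the kind of comment being asked for. It is an assumption about the shape of the code, not the PR's implementation: the name `append_utf16` and the use of `std::u16string` are chosen for illustration only, and the input is assumed to be at most 0x10FFFF.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: append code point cp (assumed <= 0x10FFFF) to out.
inline void append_utf16(std::u16string &out, std::uint32_t cp)
{
  if(cp < 0x10000)
  {
    // Note: this range includes 0xD800-0xDFFF (the surrogate range).
    // Such values are not valid code points on their own, but they are
    // passed through unchanged rather than rejected.
    out.push_back(static_cast<char16_t>(cp));
  }
  else
  {
    // Encode as a surrogate pair: high surrogate first, then low.
    cp -= 0x10000;
    out.push_back(static_cast<char16_t>(0xD800 | (cp >> 10)));
    out.push_back(static_cast<char16_t>(0xDC00 | (cp & 0x3FF)));
  }
}
```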

smowton · 2017-04-25T13:20:22Z

src/util/unicode.cpp

Should check array bounds when trying to continue reading a code-point, in case the input is not really UTF-8
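A sketch of what such a check might look like, assuming a decoder that reads one code point at a time from a `std::string`. The function name `decode_utf8_at` and the choice to throw `std::invalid_argument` are assumptions for illustration; the PR may handle malformed input differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical decoder: reads one code point starting at in[i] and
// advances i past it. Throws on truncated or malformed sequences.
// It does not reject overlong encodings or surrogate values.
inline std::uint32_t decode_utf8_at(const std::string &in, std::size_t &i)
{
  if(i >= in.size())
    throw std::invalid_argument("read past end of input");

  const unsigned char lead = static_cast<unsigned char>(in[i++]);
  std::size_t extra;
  std::uint32_t cp;

  if(lead < 0x80)
  {
    extra = 0;
    cp = lead;
  }
  else if((lead & 0xE0) == 0xC0)
  {
    extra = 1;
    cp = lead & 0x1F;
  }
  else if((lead & 0xF0) == 0xE0)
  {
    extra = 2;
    cp = lead & 0x0F;
  }
  else if((lead & 0xF8) == 0xF0)
  {
    extra = 3;
    cp = lead & 0x07;
  }
  else
    throw std::invalid_argument("invalid UTF-8 lead byte");

  // The bounds check: the lead byte promises `extra` more bytes, so
  // make sure they exist before reading them.
  if(in.size() - i < extra)
    throw std::invalid_argument("truncated UTF-8 sequence");

  for(std::size_t k = 0; k < extra; ++k)
  {
    const unsigned char c = static_cast<unsigned char>(in[i++]);
    if((c & 0xC0) != 0x80)
      throw std::invalid_argument("invalid UTF-8 continuation byte");
    cp = (cp << 6) | (c & 0x3F);
  }

  return cp;
}
```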

tautschnig

I'll refrain from marking this approve/request changes, for I've got way too many questions. Either way: Thank you to all that have contributed to this patch. I'll happily close #752 now.

tautschnig · 2017-04-25T16:46:08Z

src/util/unicode.cpp

Should this be marked "static"?

tautschnig · 2017-04-25T16:46:41Z

src/util/unicode.h

The above needs to be renamed or even removed (see my suggestion for marking it static).

I don't follow this comment.

Are you confusing this function, which still exists, with a function that got renamed to utf8_append_code?

I may well be - there is a function that has been renamed and I cannot see a corresponding change in the header file.

Yes, the renamed one was utf32_to_utf8(unsigned int c, std::string &result), this one stayed the same. I have made that one static.

tautschnig · 2017-04-25T16:48:27Z

src/util/unicode.cpp

May I hug you for this change? This obsoletes #752 if I'm right? And thus "COMPILING" should say we can actually build using GCC 4.9 again!? THANK YOU!!!

tautschnig · 2017-04-25T16:48:47Z

src/util/unicode.cpp

Maybe add a blank line after this one?

tautschnig · 2017-04-25T16:51:29Z

src/util/unicode.cpp

There are architectures where bytes (or characters?) aren't 8 bits wide (but 16 instead). That said: does this have to be a compile-time or rather a runtime check? Or is it even a target-analysis architecture property?

I think this should not be the target-analysis architecture, but the architecture on which cbmc runs, because the underlying architecture determines the byte order of integers.

I don't follow the comment on bytes/chars. uint8_t should be 8 bits, or not?

Yes, uint8_t will be 8 bits, but I'm not entirely sure whether a char-is-16bits architecture would yield the expected answer here. You may have to go for uint32_t to make sure endianness affects byte order?
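For reference, a run-time probe along the lines suggested here could inspect the first byte of a `uint32_t`. This is only a sketch under the assumption that the host is either little- or big-endian; the name `is_little_endian_host` is hypothetical and is not the function from the PR.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical run-time probe of the host byte order.
// Copies the lowest-addressed byte of a 32-bit integer holding 1:
// on a little-endian host that byte is 1, on a big-endian host it is 0.
inline bool is_little_endian_host()
{
  const std::uint32_t probe = 1;
  unsigned char first;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}
```

Using `std::memcpy` rather than a pointer cast keeps the probe free of aliasing concerns.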

tautschnig · 2017-04-25T16:51:56Z

src/util/unicode.cpp

This looks like a 4-character indent? Applies below as well.

tautschnig · 2017-04-25T16:53:54Z

unit/Makefile

Lexicographic order of files, please.

tautschnig · 2017-04-25T16:55:50Z

src/util/unicode.cpp

Following up on my own comment above: do we actually have use cases for this function where the operating environment decides endianness, as well as cases where it's the program under analysis that matters (and thus information in config)?

As far as I understand the context (and how the original codecvt-based function was implemented), this has nothing to do with analysis endianness.

In any case, this particular function seems never to be used. I still implemented it because it was quite easy once the other one, which is used, was in place, and I was not sure whether we might need it in the future when we extend string support in cbmc.

git grep tells me:

java_bytecode/java_bytecode_typecheck_expr.cpp: utf8_to_utf16_little_endian(id2string(value)));
solvers/refinement/string_constraint_generator_constants.cpp: str=utf8_to_utf16_little_endian(c_str);
util/file_util.cpp: delete_directory_utf16(utf8_to_utf16_little_endian(path));

The last one is the runtime environment, which I believe we would thus know at compile time.

The first one processes analysis input into an internal format. I do not think that the internal format should vary across architectures, but instead should always be of a single endianness. (I believe internally we're actually big-endian. @kroening ?)

The middle case I'm not entirely sure about...

I lack deeper knowledge of the use cases, and the more I look at them the more confused I am. We use the little-endian function in refinements; is there a reason for this (@smowton)? I thought this had something to do with Java strings, but a quick Google search suggests that Java uses big-endian UTF-16 no matter what the platform is.

The use case I originally wrote this for was to translate UTF-8 strings stored in Java class files into UTF-16 string literals in the endianness of the analysis host. It doesn't matter to us that the JVM would use big-endian UTF-16; what matters is that arithmetic on codepoints corresponds to arithmetic on the host doing the analysis. I suspect the code in https://github.com/diffblue/cbmc/blob/c8c9085f38317359bf175d8bb99ab16e4605c1e1/src/java_bytecode/java_bytecode_typecheck_expr.cpp only works on little-endian architectures as a result.
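To illustrate the distinction being drawn, the sketch below treats UTF-16 code units as ordinary host integers and only fixes a byte order when serializing. Shifts and masks operate on values, so the output is the same on any host. The function name `utf16_to_big_endian_bytes` and the use of `std::u16string` are assumptions for illustration, not cbmc's API.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: write each UTF-16 code unit as two bytes,
// most significant byte first, independent of the host byte order.
inline std::string utf16_to_big_endian_bytes(const std::u16string &in)
{
  std::string out;
  out.reserve(in.size() * 2);
  for(char16_t unit : in)
  {
    // Arithmetic on the code unit's value, not on its memory layout.
    out.push_back(static_cast<char>((unit >> 8) & 0xFF));
    out.push_back(static_cast<char>(unit & 0xFF));
  }
  return out;
}
```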

@smowton Should this be raised as a bug and dealt with in a separate PR? And is there any additional functionality needed for this in util/unicode.cpp?

Probably, yes. There's already a to-big-endian function too so no extra work in unicode.cpp to my knowledge.

Do we thus really need a run-time check of is_little_endian_arch? It seems we could get away with a compile-time check.

forejtv requested review from smowton and tautschnig April 25, 2017 10:58

forejtv added do not merge and removed do not merge labels Apr 25, 2017

smowton suggested changes Apr 25, 2017

tautschnig reviewed Apr 25, 2017

tautschnig mentioned this pull request Apr 25, 2017

Re-enable compilation using GCC version 4.9 #752

Closed

forejtv added 2 commits April 25, 2017 18:23

Remove an unused conversion function that is broken anyway, and tidy …

549e589

…utf32 to utf8.

Change utf8 to utf16 conversion to not require codecvt.

49b77ce

A unit test is added to check (on a few instances) equivalence with the original implementation. Decrease the required gcc version.

kroening merged commit 065810f into diffblue:master Apr 27, 2017

tautschnig mentioned this pull request Apr 28, 2017

Review endianness configuration in unicode.cpp #878

Closed
