Unicode support cleanup#868
Unicode support cleanup#868kroening merged 2 commits intodiffblue:masterfrom forejtv:feature/unicode-cleanup
Conversation
|
There was a problem with compilation on Windows, it is fixed now. |
smowton
left a comment
There was a problem hiding this comment.
Some suggestions, and one significant problem (checking array bounds). It also sucks that we have to write this sort of stuff instead of just use a library, but I guess there's no getting around that.
src/util/unicode.cpp
Outdated
src/util/unicode.cpp
Outdated
There was a problem hiding this comment.
Rather than write "byte-swap least-significant word in dword", might be clearer to write "byte-swap word" and have the caller cast to the smaller type.
src/util/unicode.cpp
Outdated
src/util/unicode.cpp
Outdated
There was a problem hiding this comment.
Probably worth noting in a comment that D800-DFFF are also passed through without comment
src/util/unicode.cpp
Outdated
There was a problem hiding this comment.
Should check array bounds when trying to continue reading a code-point, in case the input is not really UTF-8
tautschnig
left a comment
There was a problem hiding this comment.
I'll refrain from marking this approve/request changes for I've got way to many questions. Either way: Thank you to all that have contributed to this patch. I'll happily close #752 now.
src/util/unicode.cpp
Outdated
There was a problem hiding this comment.
Should this be marked "static"?
src/util/unicode.h
Outdated
There was a problem hiding this comment.
The above needs to be renamed or even removed (see my suggestion for marking it static).
There was a problem hiding this comment.
I don't follow this comment.
Are you confusing this function, which still exist, with a function that got renamed to utf8_append_code?
There was a problem hiding this comment.
I may well be - there is a function that has been renamed and I cannot see a corresponding change in the header file.
There was a problem hiding this comment.
Yes, the renamed one was utf32_to_utf8(unsigned int c, std::string &result), this one stayed the same. I have made that one static.
src/util/unicode.cpp
Outdated
There was a problem hiding this comment.
May I hug you for this change? This obsoletes #752 if I'm right? And thus "COMPILING" should say we can actually build using GCC 4.9 again!? THANK YOU!!!
src/util/unicode.cpp
Outdated
There was a problem hiding this comment.
Maybe add a blank line after this one?
src/util/unicode.cpp
Outdated
There was a problem hiding this comment.
There are architectures where bytes (or characters?) aren't 8 bits wide (but 16 instead). That said: does this have to be a compile-time or rather a runtime check? Or is it even a target-analysis architecture property?
There was a problem hiding this comment.
I think this should not be target-analysis architecture, but the architecture on which cbmc runs, because the underlying architecture determines how integers are endianed.
I don't follow the comment on bytes/chars. uint8_t should be 8 bits, or not?
There was a problem hiding this comment.
Yes, uint8_t will be 8 bits, but I'm not entirely sure whether a char-is-16bits architecture would yield the expected answer here. You may have to go for uint32_t to make sure endianness affects byte order?
src/util/unicode.cpp
Outdated
There was a problem hiding this comment.
This looks like a 4-character indent? Applies below as well.
unit/Makefile
Outdated
There was a problem hiding this comment.
Lexicographic order of files, please.
src/util/unicode.cpp
Outdated
There was a problem hiding this comment.
Following up to my own comment above: do we actually have use cases for this function where the operating environment decides endianness, as well as cases where it's the program under analysis that matters (and thus information in config)?
There was a problem hiding this comment.
As far as I understand the context (and how the original codecvt-based function was implemented), this has nothing to do with analysis endianness.
But anyway, this particular function seems to be never used, but I still implemented it as it was quite easy having the other one, which is used. And I was not sure if we might need this one in the future when we extend string support in cbmc.
There was a problem hiding this comment.
git grep tells me:
java_bytecode/java_bytecode_typecheck_expr.cpp: utf8_to_utf16_little_endian(id2string(value)));
solvers/refinement/string_constraint_generator_constants.cpp: str=utf8_to_utf16_little_endian(c_str);
util/file_util.cpp: delete_directory_utf16(utf8_to_utf16_little_endian(path));
The last one is the runtime environment, which I believe we would thus know at compile team.
The first one processes analysis input into an internal format. I do not think that the internal format should vary across architectures, but instead should always be of a single endianness. (I believe internally we're actually big-endian. @kroening ?)
The middle case I'm not entirely sure about...
There was a problem hiding this comment.
I lack deeper knowledge of the usecases, and the more I look at them the more I am confused. We use the little endian function in refinements, is there a reason for this (@smowton?). I thought this has something to do with Java strings, but a quick google search suggests that Java uses big endian utf16 no matter what the platform is.
There was a problem hiding this comment.
The use case I originally wrote this for was to translate UTF-8 strings stored in Java class files into UTF-16 (in analysis host endian-ness) string literals. It doesn't matter to us that the JVM would use big-endian UTF-16-- rather that arithmetic on codepoints corresponds to arithmetic on the host doing the analysis. I suspect the code in https://github.com/diffblue/cbmc/blob/c8c9085f38317359bf175d8bb99ab16e4605c1e1/src/java_bytecode/java_bytecode_typecheck_expr.cpp only works on little-endian architectures as a result.
There was a problem hiding this comment.
@smowton Should this be raised as a bug and dealt with in a separate PR? And is there any additional functionality needed for this in utils/unicode.cpp?
There was a problem hiding this comment.
Probably, yes. There's already a to-big-endian function too so no extra work in unicode.cpp to my knowledge.
There was a problem hiding this comment.
Do we thus really need to run-time check is_little_endian_arch - it seems we could get away with a compile-time check.
A unit test is added to check (on a few instances) equivalence with the original implementation. Decrease the required gcc version.
This should make the PR #752 unnecessary.