Integer Type System: Refactor by ax3l · Pull Request #337 · openPMD/openPMD-api

ax3l · 2018-09-07T10:05:57Z

Refactor the integer type system to support all fundamental C/C++ integer types: short, int, long, long long plus the unsigned versions of those.

Fixed size and "least size" ints are aliases of the above, leading to issues on MSVC since there is no fixed int alias for long in the old design.
Another detail is that uint64_t is an alias for long on 64bit Linux but an alias for long long on 64bit OSX. On both platforms, long and long long are 64 bit.
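The aliasing described above can be checked at compile time. The following is a minimal sketch, not code from this PR; the constant names are made up for illustration.

```cpp
#include <cstdint>
#include <type_traits>

// Which fundamental type does std::uint64_t alias on this platform?
// (constant names are hypothetical, chosen for this sketch)
constexpr bool u64IsULong =
    std::is_same<std::uint64_t, unsigned long>::value;
constexpr bool u64IsULongLong =
    std::is_same<std::uint64_t, unsigned long long>::value;

// long and long long are always distinct types in C++,
// even on platforms where both are 64 bit wide.
static_assert(!std::is_same<long, long long>::value,
              "long and long long are distinct types");
```

Since the two fundamental types are always distinct, at most one of the two constants can be true on a given platform, which is why a datatype enum keyed on fixed-size names cannot cover both.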

User Changes

Users will see no changes as long as they use helpers such as determineDatatype<T>(). Otherwise, they must be aware that Datatypes such as DT::INT32 are now expressed as one of DT::SHORT, DT::INT, etc. Types such as int32_t can still be used for attributes and data chunks, since they are simple aliases.
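To illustrate why the helper keeps user code unchanged, here is a reduced sketch of how such a helper could look. The enum below is a hypothetical stand-in that only covers the integer types, and the template is an illustration of the idea, not the library's actual implementation.

```cpp
#include <cstdint>

// Hypothetical, reduced stand-in for the library's Datatype enum
enum class Datatype
{
    SHORT, INT, LONG, LONGLONG,
    USHORT, UINT, ULONG, ULONGLONG,
    UNDEFINED
};

// Primary template: anything not listed below is unknown
template <typename T>
constexpr Datatype determineDatatype() { return Datatype::UNDEFINED; }

// One specialization per fundamental integer type
template <> constexpr Datatype determineDatatype<short>() { return Datatype::SHORT; }
template <> constexpr Datatype determineDatatype<int>() { return Datatype::INT; }
template <> constexpr Datatype determineDatatype<long>() { return Datatype::LONG; }
template <> constexpr Datatype determineDatatype<long long>() { return Datatype::LONGLONG; }
template <> constexpr Datatype determineDatatype<unsigned short>() { return Datatype::USHORT; }
template <> constexpr Datatype determineDatatype<unsigned int>() { return Datatype::UINT; }
template <> constexpr Datatype determineDatatype<unsigned long>() { return Datatype::ULONG; }
template <> constexpr Datatype determineDatatype<unsigned long long>() { return Datatype::ULONGLONG; }
```

Because int32_t, uint64_t and friends are aliases of fundamental types, a call such as determineDatatype< std::uint64_t >() resolves to whichever specialization matches on the current platform, without any fixed-size entries in the enum.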

Backend Changes

Instead of implementing types such as INT16, INT32, INT64, ... one now needs to implement the four types short, int, long and long long. Throw runtime errors on missing types, e.g. ADIOS 1.13.1 does not support long long yet.
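A backend that lacks one of the four types could signal this as sketched below. The function name and the returned type names are hypothetical and do not correspond to any real backend API.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical, reduced datatype enum for this sketch
enum class Datatype { SHORT, INT, LONG, LONGLONG };

// Hypothetical backend hook: maps a datatype to the name of a backend type.
// This imaginary backend has no long long support and reports that at runtime.
inline std::string toBackendType(Datatype dt)
{
    switch (dt)
    {
        case Datatype::SHORT: return "backend_short";
        case Datatype::INT: return "backend_int";
        case Datatype::LONG: return "backend_long";
        case Datatype::LONGLONG:
            throw std::runtime_error(
                "long long is not supported by this backend");
    }
    throw std::runtime_error("unknown datatype");
}
```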

The fileformat is responsible for portability, e.g. HDF5 stores automatically platform information with the stored types, making sure a stored long is an int32_t on one platform and an int64_t on another (and will not show up as long on both, potentially).

We're very likely going to have problems with ADIOS in this regard.
From what I can see, they do NOT make the distinction that causes problems with OSX and MSVC. Rather, they treat the bit-widths as I did initially (see adios_types.h, which explicitly lists sizes of the datatypes).
As e.g. int may now be either 16 or 32 bit, and ADIOS assumes a size of 32, this might explode in our face.

Yep, I know but that's ADIOS' portability problem not ours: ornladios/ADIOS#187

(Side note: ADIOS1 only compiles on Linux and OSX, not on MSVC.)

Uh wait, you mean we will not be able to store a (u)int64_t since ADIOS has no long long... Yeah... Well.

that's ADIOS' portability problem not ours

So the baseline here is to rely on the C standard and ignore potentially faulty behaviour in one of our backends. Not ideal, but probably the only thing we ~~can~~ should do.

you mean we will not be able to store a (u)int64_t

We most likely can, unless we're on exotic platforms (MSVC, anyone?) where LONG and ULONG are 32 Bits wide... Which again comes back to the standard compliance stated above.

What I try to say is: if adios_long is defined as 8 byte on all platforms, we will implement it as such and convert properly in the backend from whatever matches (int or long or long long).
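The conversion "from whatever matches" could be sketched as a compile-time lookup by width and signedness. The BackendType tags and function names below are hypothetical and only stand in for fixed-size types a backend might define.

```cpp
#include <climits>
#include <type_traits>

// Hypothetical fixed-width type tags of a backend
enum class BackendType
{
    INT16, INT32, INT64, UINT16, UINT32, UINT64, UNSUPPORTED
};

// Select the tag from a bit width and signedness
constexpr BackendType pick(unsigned bits, bool isSigned)
{
    return bits == 16 ? (isSigned ? BackendType::INT16 : BackendType::UINT16)
         : bits == 32 ? (isSigned ? BackendType::INT32 : BackendType::UINT32)
         : bits == 64 ? (isSigned ? BackendType::INT64 : BackendType::UINT64)
         : BackendType::UNSUPPORTED;
}

// Map a fundamental integer type to the backend tag of the same width
template <typename T>
constexpr BackendType toFixedWidth()
{
    static_assert(std::is_integral<T>::value, "integer types only");
    return pick(sizeof(T) * CHAR_BIT, std::is_signed<T>::value);
}
```

With this approach, long lands on the 32 bit tag where it is 32 bit wide and on the 64 bit tag elsewhere, so the backend never needs to know the fundamental type's name.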

The fileformat is responsible for portability, e.g. HDF5 stores automatically platform information with the stored types

Hi, just so I get this right: When reading e.g. a long from some stored file, I will also have to take into account the platform that wrote the value and use the reading platform's datatype that corresponds with the actual length of the value stored, so possibly not a long but an int for example?
If so, it's good you mention this, because floating point datatypes already have the same issue and I ignored that so far.

I will also have to take into account the platform

My idea is to handle it exactly the other way around. Our backends, such as HDF5, are portable across platforms. If HDF5 stores, for example, a long (aka int64_t) on x86-64 Linux, it automatically records the platform's type information alongside it. If you read this file back on x86-64 Windows, the HDF5 type presented to you will be long long (still aka int64_t), since int and long are both int32_t there. The precision of the data is preserved.
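The read side of this idea can be sketched as a lookup from the stored bit width to the first matching fundamental type on the reading platform. This is only an illustration with hypothetical names; in practice the backend (e.g. HDF5) performs this matching itself.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical, reduced datatype enum for this sketch (signed types only)
enum class Datatype { SHORT, INT, LONG, LONGLONG, UNDEFINED };

// Return the first signed fundamental type on *this* platform
// that has exactly the requested number of bits.
constexpr Datatype nativeSignedFor(std::size_t bits)
{
    return bits == sizeof(short) * CHAR_BIT     ? Datatype::SHORT
         : bits == sizeof(int) * CHAR_BIT       ? Datatype::INT
         : bits == sizeof(long) * CHAR_BIT      ? Datatype::LONG
         : bits == sizeof(long long) * CHAR_BIT ? Datatype::LONGLONG
         : Datatype::UNDEFINED;
}
```

Under the sizes stated in the comment above, a 64 bit value would come back as LONG on x86-64 Linux and as LONGLONG on x86-64 Windows.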

JSON

Let's migrate that discussion here: #65

ax3l · 2018-09-07T13:05:07Z

Note: The leftover HDF5 CI errors on OSX and MSVC are "just" the shape type for constant records.

Note2: we might have to fake types in the ADIOS1 backend on OSX in case it's really assuming fixed sizes behind its adios_int/long/... types.

anokfireball · 2018-09-07T13:08:58Z

src/ParticlePatches.cpp


        using DT = Datatype;
-        if( DT::UINT64 != *dOpen.dtype )
+        if( DT::ULONG != *dOpen.dtype )


Correct me if I'm wrong, but this does not strictly enforce the standard, ULONG may be 32 or 64.

if( determineDatatype< uint64_t >() != *dOpen.dtype )

exactly, that's the leftover from my replace marathon and your suggestion is correct!

anokfireball · 2018-09-07T13:11:36Z

src/Series.cpp

    IOHandler->enqueue(IOTask(this, aRead));
    IOHandler->flush();
-    if( *aRead.dtype == DT::UINT32 )
+    if( *aRead.dtype == DT::UINT )


if( *aRead.dtype == determineDatatype< uint32_t >() )

anokfireball · 2018-09-07T13:15:08Z

src/binding/python/BaseRecordComponent.cpp

+            else if( brc.getDatatype() == DT::UINT )
                return py::dtype("uint32");
-            else if( brc.getDatatype() == DT::UINT64 )
+            else if( brc.getDatatype() == DT::ULONG )


This is now probably not precise enough as ULONG may be 32 or 64.
Numpy provides a number of compatible C types, but not all of the applicable ones:
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.scalars.html#built-in-scalar-types

I see you recognized the problem in RecordComponent.cpp already. 👍

Yep, pls feel free to annotate any line you catch :)

anokfireball · 2018-09-07T13:22:22Z

test/SerialIOTest.cpp

        RecordComponent& e_positionOffset_x = e_positionOffset["x"];
        REQUIRE(e_positionOffset_x.unitSI() == 2.599999993753294e-07);
-        REQUIRE(e_positionOffset_x.getDatatype() == Datatype::INT32);
+        REQUIRE(e_positionOffset_x.getDatatype() == Datatype::INT);


Same problem for all datatypes read back from a file.
INT may be 16, 32 or 64.

REQUIRE(e_positionOffset_x.getDatatype() == determineDatatype< int32_t >());

src/IO/ADIOS/CommonADIOS1IOHandler.cpp

anokfireball · 2018-09-09T14:08:44Z

src/binding/python/BaseRecordComponent.cpp

+                return py::dtype("int");
+            // missing in numpy: covered by uint or ulonglong
+            // else if( brc.getDatatype() == DT::LONG )
+            //     return py::dtype("long");


Now this is weird.
https://docs.scipy.org/doc/numpy/user/basics.types.html

int_ | Default integer type (same as C long; normally either int64 or int32)

https://docs.scipy.org/doc/numpy-1.14.2/reference/arrays.scalars.html

int_ | compatible: Python int | 'l'

https://docs.python.org/3/c-api/long.html
(Note that there are two meanings of long here: (a) C long (b) Python's 'bignum' integer of arbitrary precision)

PyLong_FromLong(long v)

Note the lacking PyLong_FromInt(int v)

So there seems to be an equivalent of long in int_, but no counterpart to unsigned long.

https://docs.scipy.org/doc/numpy/user/basics.types.html#array-types-and-conversions-between-types

Additionally to intc the platform dependent C integer types short, long, longlong and their unsigned versions are defined.

Quick test:

In [1]: import numpy as np
In [2]: np.dtype('long')
Out[2]: dtype('int64')

Haven't figured out how unsigned long works yet.

numpy/numpy#10678 (comment)

Docs are wrong. The C integer type long is called np.int_, and the integer type unsigned long is called np.uint

So there we have it:
C int/ unsigned int is Python intc/uintc
C long/unsigned long is Python int_/ uint

In [1]: import numpy as np
In [2]: np.dtype('short')
Out[2]: dtype('int16')
In [3]: np.dtype('intc')
Out[3]: dtype('int32')
In [4]: np.dtype('int_')
Out[4]: dtype('int64')
In [5]: np.dtype('longlong')
Out[5]: dtype('int64')
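On the C++ side of the binding, the mapping worked out above could be collected in one place, as in the sketch below. The function name and the reduced enum are hypothetical. The dtype names are the ones established in this thread, and their meaning may differ between NumPy versions, so they should be re-checked against the NumPy version in use.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical, reduced datatype enum for this sketch
enum class Datatype
{
    SHORT, INT, LONG, LONGLONG,
    USHORT, UINT, ULONG, ULONGLONG
};

// Name to pass to a NumPy dtype constructor for each C integer type,
// following the mapping from this discussion:
//   C int / unsigned int   -> intc / uintc
//   C long / unsigned long -> int_ / uint
inline std::string numpyTypeName(Datatype dt)
{
    switch (dt)
    {
        case Datatype::SHORT: return "short";
        case Datatype::INT: return "intc";
        case Datatype::LONG: return "int_";
        case Datatype::LONGLONG: return "longlong";
        case Datatype::USHORT: return "ushort";
        case Datatype::UINT: return "uintc";
        case Datatype::ULONG: return "uint";
        case Datatype::ULONGLONG: return "ulonglong";
    }
    throw std::runtime_error("unknown datatype");
}
```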

Wow, as fancy as my float-confusion in Python xD Thanks a ton for checking!

Ah yes, should be fixed now together with my float question in numpy's docs: numpy/numpy#11837

anokfireball · 2018-09-09T14:10:20Z

src/binding/python/RecordComponent.cpp

+            else if( r.getDatatype() == Datatype::SHORT ) dtype = py::dtype("short");
+            else if( r.getDatatype() == Datatype::INT ) dtype = py::dtype("int");
+            // missing in numpy: covered by int or longlong
+            // else if( r.getDatatype() == Datatype::LONG ) dtype = py::dtype("long");


See comment.

anokfireball · 2018-09-09T14:23:51Z

src/binding/python/RecordComponent.cpp

+            else if( a.dtype().is(py::dtype("int")) )
+                r.storeChunk( offset, extent, shareRaw( (int*)a.mutable_data() ) );
+            // missing in numpy: covered by int or longlong
+            // else if( a.dtype().is(py::dtype("long")) )


See comment.

anokfireball · 2018-09-09T14:33:46Z

test/CoreTest.cpp

+        REQUIRE(Datatype::INT == a.dtype);
+        a = Attribute(static_cast< int >(0));
+        REQUIRE(Datatype::LONG == a.dtype);
+    }


Scary thought:
There might be platforms where

sizeof(short) > 2u

making it impossible to write or read 16 Bit wide data (as int16_t does not alias to any of our provided integer types).

Hm, that would be weird. But if a platform does not implement a 2 byte fundamental integer type, there will also be no alias int16_t for it.

Just to throw out this SO answer again, this is present for char on certain (albeit rare) systems.
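Whether this concern applies to a given platform can be checked at compile time. A minimal sketch, with hypothetical constant names, relying on the fact that the INT16_MAX macro is only defined when int16_t exists:

```cpp
#include <climits>
#include <cstdint>
#include <type_traits>

// Width of short in bits on this platform (at least 16 per the standard)
constexpr unsigned shortBits = sizeof(short) * CHAR_BIT;

#ifdef INT16_MAX
// int16_t exists: check whether it aliases short or int
constexpr bool int16CoveredByShortOrInt =
    std::is_same<std::int16_t, short>::value ||
    std::is_same<std::int16_t, int>::value;
#else
// int16_t does not exist on this platform at all
constexpr bool int16CoveredByShortOrInt = false;
#endif
```

On platforms where short is wider than 16 bit and no other 16 bit integer type exists, int16_t is simply not provided, which matches the argument made above.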

anokfireball

First of all, excuse the avalanche of comments.
Python dtypes need adjustment and a few questions need to be resolved.

Refactor the integer type system to support all fundamental C/C++ integer types: `short`, `int`, `long`, `long long` plus the unsigned versions of those. Fixed size ints are aliases of the above, leading to issues on OSX and MSVC since there is no fixed int alias for "long" in the old design.

ax3l · 2018-09-11T07:24:41Z

(just pushing again since travis forgot to report its success state)

Besides returning true for identical types, the comparison also returns true for types that share the same implementation on the current platform, e.g. if long and long long are the same, or double and long double. Affected by https://stackoverflow.com/questions/44515148/why-is-operator-overload-of-enum-ambiguous-in-msvc on MSVC.
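A comparison with this behavior could look like the following sketch. All names are hypothetical and the enum is reduced to a few types; "same implementation" is approximated here by comparing std::numeric_limits digits within the same type family, which is an assumption of this sketch rather than the PR's method.

```cpp
#include <limits>

// Hypothetical, reduced datatype enum for this sketch
enum class Datatype { INT, LONG, LONGLONG, DOUBLE, LONG_DOUBLE };

constexpr bool isFloat(Datatype dt)
{
    return dt == Datatype::DOUBLE || dt == Datatype::LONG_DOUBLE;
}

// Number of value (integers) or mantissa (floats) digits on this platform
constexpr int digitsOf(Datatype dt)
{
    return dt == Datatype::INT      ? std::numeric_limits<int>::digits
         : dt == Datatype::LONG     ? std::numeric_limits<long>::digits
         : dt == Datatype::LONGLONG ? std::numeric_limits<long long>::digits
         : dt == Datatype::DOUBLE   ? std::numeric_limits<double>::digits
         : std::numeric_limits<long double>::digits;
}

// True for identical datatypes and for datatypes of the same family
// that have the same number of digits on this platform.
constexpr bool isSame(Datatype a, Datatype b)
{
    return a == b ||
           (isFloat(a) == isFloat(b) && digitsOf(a) == digitsOf(b));
}
```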

ax3l added bug affects latest release refactoring api: breaking breaking API changes labels Sep 7, 2018

ax3l assigned anokfireball Sep 7, 2018

ax3l requested a review from anokfireball September 7, 2018 10:05


ax3l added frontend: C++17 frontend: Python3 backend labels Sep 7, 2018


anokfireball reviewed Sep 7, 2018


ax3l mentioned this pull request Sep 8, 2018

JSON #65

Closed