Skip to content

Change MfList.fmt_string for floats to use numpy's string formatter#322

Merged
christianlangevin merged 1 commit into
modflowpy:developfrom
mwtoews:formatting
May 2, 2018
Merged

Change MfList.fmt_string for floats to use numpy's string formatter#322
christianlangevin merged 1 commit into
modflowpy:developfrom
mwtoews:formatting

Conversation

@mwtoews
Copy link
Copy Markdown
Contributor

@mwtoews mwtoews commented Apr 18, 2018

Stress data written by flopy inflates floating point precision, rather than preserving the native precision of the data type. This is due to a C-formatter tucked deep within flopy/utils/util_list.py. For example, a flux rate from a well package that is originally -8773.9 is effectively written by flopy as -8773.9004, introducing a difference of 0.0004. The good news is that these differences do not matter to single-precision floats (i.e. REAL or float32), but they are visually different when formatted, or used in double precision (or float64).

import numpy as np
x = np.float32(-8773.9)
print(str(x))  # -8773.9

# Here is what MfList.fmt_string currently does
print(' %15.7E' % (x,))  # '  -8.7739004E+03'

# Suggested fix, keep 1-space + 15-character string, but get numpy to format the float type to a string
print(' %15s' % (x,))  # '         -8773.9'

As shown above (and in this PR), the suggested fix is to just format a string for fixed-width floats. Numpy supports a wide range of different float types with different precisions, and they generally get formatted to string in a way that honors their precision.

There are possible consequences to this PR. For instance, files written by flopy will look different. E.g., a well package file that previously looks like this:

...
         1       135       158  -6.9352997E+01
         1       140       147  -9.3911003E+01
         1       140       150  -8.0982002E+01
         1       142       158  -1.6040001E+01
         1       143       158  -6.4769997E+01
         1       147       150  -2.0000000E-02
         1       151       157  -2.6480000E+03
         1       155       163  -9.4968002E+01
         1       166       167  -2.5591899E+03

would then look like this:

...
         1       135       158         -69.353
         1       140       147         -93.911
         1       140       150         -80.982
         1       142       158          -16.04
         1       143       158          -64.77
         1       147       150           -0.02
         1       151       157         -2648.0
         1       155       163         -94.968
         1       166       167        -2559.19

@coveralls
Copy link
Copy Markdown

coveralls commented Apr 18, 2018

Coverage Status

Coverage increased (+0.003%) to 69.681% when pulling 97ff81c on mwtoews:formatting into d07fdb9 on modflowpy:develop.

@christianlangevin
Copy link
Copy Markdown

Hey Mike, thanks for looking into this. I see there is one failure, which I've looked into a little. I'm not sure how to move forward with this. The following is the first line of the river package for the original secp model, what flopy does now, and what flopy does with your proposed change (I removed some whitespace for clarity):

original
1 36 144 225.5 5778.1167089182 225.39999389648 6 2889.0583544591 1

flopy now
1 36 144 2.2550000E+02 5.7781167E+03 2.2539999E+02 6.0000000E+00 2.8890583E+03 1.0000000E+00

flopy proposed
1 36 144 225.5 5778.1167 225.4 6.0 2889.0583 1.0

It looks like the conductance value (and some others) might be problematic with your proposed changes. But its not entirely clear why, since a single precision version of MODFLOW wouldn't be able to store all those digits anyway.

I like where you are headed with this. Any thoughts?

@jtwhite79
Copy link
Copy Markdown
Contributor

If I remember right, we had a more general formatter but some people were reporting different results after loading writing rerunning. The culprit was the float formatter in MfList...

@mwtoews
Copy link
Copy Markdown
Contributor Author

mwtoews commented Apr 18, 2018

I've been inspecting a few versions of secp too, and there are very small differences in the output solutions. GMS natively uses double precision floats, so the original formatted files have more than sufficient precision. Ideally, rewritten files would allow a model to reproduce an exact flow solution, but with either case I see differences of heads around 1e-6.

The only way to "see" how these REALs are represented are in binary form using numpy's .tostring() method:

import numpy as np
import itertools

strings = ['225.39999389648', '2.2539999E+02', '225.4']

def process(x, dtype):
    valx = dtype(x)
    strx = str(valx)
    hexx = '0x' + valx.tostring().encode('hex')
    return valx, strx, hexx

for a_str, b_str in itertools.combinations(strings, 2):
    a, a_str, a_hex = process(a_str, np.float32)
    b, b_str, b_hex = process(b_str, np.float32)
    eq = '==' if a_hex == b_hex else '!='
    print('A: {} {} {} B: {} {}'.format(a_str, a_hex, eq, b_str, b_hex))
    
# A: 225.4 0x66666143 == B: 225.4 0x66666143
# A: 225.4 0x66666143 == B: 225.4 0x66666143
# A: 225.4 0x66666143 == B: 225.4 0x66666143

this example shows that all number combinations are the same for float32 types, so Fortran's REAL also should see them the same. (Repeating the example with float64 will make them each different.) I'm certain that somewhere in one of the several stress packages for secp there is a minor binary difference in one of these combinations that make a difference to the simulation.

I'll need to dig around a bit more to think up some new ideas...

@mwtoews
Copy link
Copy Markdown
Contributor Author

mwtoews commented May 1, 2018

I've finally made sense of the Travis CI results. There was one success with numpy >= 1.14.0 and most of the other failures were with older numpy versions. A very subtle and good change happened in this release, related to formatting floating point numbers:

Float printing now uses "dragon4" algorithm for shortest decimal representation
The str and repr of floating-point values (16, 32, 64 and 128 bit) are
now printed to give the shortest decimal representation which uniquely
identifies the value from others of the same type. Previously this was only
true for float64 values. The remaining float types will now often be shorter
than in numpy 1.13. Arrays printed in scientific notation now also use the
shortest scientific representation, instead of fixed precision as before.

The longer version of the Dragon4 implementation (primarily a comp-sci read) is described here. Since Steele and White (1990), several low-level floating-point formatters have been implemented and used across most of the compilers and software we use today, and they sometimes have subtle formatting differences.

What this means is that older numpy versions may have formatted float32 values a "bit off". E.g., comparing a value from line 246 of the original secp.chd GMS MODFLOW file:

import numpy as np
value = 3.1353258799746
print('0x' + np.float32(value).tostring().encode('hex'))  # 0x2ea94840

With older numpy versions, here is where it formats a "bad" version that has a different value:

str_val = str(np.float32(value))  # '3.13533'
print('0x' + np.float32(str_val).tostring().encode('hex'))  # 0x3fa94840

And numpy 1.14.0, this is a "good" version, which is the same in both formatted and in binary

str_val = str(np.float32(value))  # '3.135326'
print('0x' + np.float32(str_val).tostring().encode('hex'))  # 0x2ea94840

these differences matter to FORTRAN REAL types, thus the flow solution is different for the secp example.


So what does this mean? We can possibly switch to an improved floating-point precision for numpy >= 1.14, which looks better and preserves formatted floating point precision on a binary level. But older versions of numpy should stick to the current floating-point formatting.

@mwtoews
Copy link
Copy Markdown
Contributor Author

mwtoews commented May 2, 2018

I've rewritten this commit to adjust the formatting string, depending on the numpy version. Versions before numpy 1.14 preserve the '%15.7E' formatting as before, which will safely encode float32. And versions after 1.14.0 use numpy's newish built-in Dragon4 algorithm to format the shortest decimal representation which uniquely identifies the values.

All Travis CI tests (except the usual Python nightly) now pass, where each have a different version of numpy, currently:

  • Python 2.7 with numpy 1.13.3
  • Python 3.4 with numpy 1.12.1
  • Python 3.5 with numpy 1.14.0
  • Python 3.6 with numpy 1.13.3
  • Python 3.6-dev with numpy 1.14.3

@christianlangevin christianlangevin merged commit ec79251 into modflowpy:develop May 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants