performance of loading tabular mf6 datasets

consider this model, which has a well package with 1M fluxes:
```python
import time
import cProfile
import numpy as np
import pandas as pd
import flopy

model_name = 'junk'
workspace = 'tmp'
nper = 100

sim = flopy.mf6.MFSimulation(sim_name=model_name, version='mf6', exe_name='mf6', 
                             sim_ws=workspace)
tdis_rc = [(1.0, 1, 1.0)] * nper
tdis = flopy.mf6.ModflowTdis(sim, 
                             nper=nper, perioddata=tdis_rc)
gwf = flopy.mf6.ModflowGwf(sim, modelname=model_name,
                           model_nam_file='{}.nam'.format(model_name))
dis = flopy.mf6.ModflowGwfdis(gwf, nlay=1, nrow=2, ncol=2)
nwells = 10000
spd = {}
for kper in range(100):
    kspd = flopy.mf6.ModflowGwfwel.stress_period_data.empty(gwf,
                                                      maxbound=nwells,
                                                      boundnames=True)[0]
    kspd['cellid'] = [(0, 0, 0)] * nwells
    kspd['q'] = np.random.randn(nwells)
    kspd['boundnames'] = 'well'
    spd[kper] = kspd
wel = flopy.mf6.ModflowGwfwel(gwf, pname='wel', #maxbound=nwells,
                              stress_period_data=spd,
                              save_flows=True)
sim.write_simulation()
```
if you profile loading the model back in, you can see that almost the entire time (50 sec on my machine) is spent on the `_load_list_line` (1M calls) and `_append_data_list` (2M calls) methods in mffileaccess.py. (the load actually errors out, but the time is similar to what I got for a real-life example, suggesting that it was almost done).

```python
pr = cProfile.Profile()
pr.enable()
flopy.mf6.MFSimulation.load('junk', sim_ws='tmp')
pr.disable()
pr.print_stats(sort='cumtime')
```

50 sec might seem reasonable for 1 million lines, but I can load the same data with pandas in about a half second.
```python
import time
import pandas as pd
t0 = time.time()
df = pd.read_csv('tmp/junk.wel', skiprows=9,
            header=None,
            delim_whitespace=True, error_bad_lines=False).dropna(axis=0)
print("finished in {:.2f}s\n".format(time.time() - t0))
finished in 0.44s
```

In the real model that I am working with, the majority of the load time seems to be consumed by the well package and perhaps SFR, suggesting that speeding up the loading of list-like data could substantially cut the overall load time for transient regional models with lots of wells.

I realize that my pandas example is an overly simple kludge (and that there may not be an easy fix to this issue for all use cases), but I think the greater issue of performance is something to think about. `_load_list_line` and `_append_data_list` have quite a bit of complexity (they are respectively about 300 and 100 lines long, not including calls to other methods), and perhaps for good reason. But is this really needed for most uses? Isn't one of the paradigms of MODFLOW-6 supposed to be simple, clean input structures that are easily readable?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

performance of loading tabular mf6 datasets #707

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

performance of loading tabular mf6 datasets #707

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions