Skip to content

performance of loading tabular mf6 datasets #707

@ghost

Description

consider this model, which has a well package with 1M fluxes:

import time
import cProfile
import numpy as np
import pandas as pd
import flopy

model_name = 'junk'
workspace = 'tmp'
nper = 100

sim = flopy.mf6.MFSimulation(sim_name=model_name, version='mf6', exe_name='mf6', 
                             sim_ws=workspace)
tdis_rc = [(1.0, 1, 1.0)] * nper
tdis = flopy.mf6.ModflowTdis(sim, 
                             nper=nper, perioddata=tdis_rc)
gwf = flopy.mf6.ModflowGwf(sim, modelname=model_name,
                           model_nam_file='{}.nam'.format(model_name))
dis = flopy.mf6.ModflowGwfdis(gwf, nlay=1, nrow=2, ncol=2)
nwells = 10000
spd = {}
for kper in range(100):
    kspd = flopy.mf6.ModflowGwfwel.stress_period_data.empty(gwf,
                                                      maxbound=nwells,
                                                      boundnames=True)[0]
    kspd['cellid'] = [(0, 0, 0)] * nwells
    kspd['q'] = np.random.randn(nwells)
    kspd['boundnames'] = 'well'
    spd[kper] = kspd
wel = flopy.mf6.ModflowGwfwel(gwf, pname='wel', #maxbound=nwells,
                              stress_period_data=spd,
                              save_flows=True)
sim.write_simulation()

if you profile loading the model back in, you can see that almost the entire time (50 sec on my machine) is spent on the _load_list_line (1M calls) and _append_data_list (2M calls) methods in mffileaccess.py. (the load actually errors out, but the time is similar to what I got for a real-life example, suggesting that it was almost done).

pr = cProfile.Profile()
pr.enable()
flopy.mf6.MFSimulation.load('junk', sim_ws='tmp')
pr.disable()
pr.print_stats(sort='cumtime')

50 sec might seem reasonable for 1 million lines, but I can load the same data with pandas in about a half second.

import time
import pandas as pd
t0 = time.time()
df = pd.read_csv('tmp/junk.wel', skiprows=9,
            header=None,
            delim_whitespace=True, error_bad_lines=False).dropna(axis=0)
print("finished in {:.2f}s\n".format(time.time() - t0))
finished in 0.44s

In the real model that I am working with, the majority of the load time seems to be consumed by the well package and perhaps SFR, suggesting that speeding up the loading of list-like data could substantially cut the overall load time for transient regional models with lots of wells.

I realize that my pandas example is an overly simple kludge (and that there may not be an easy fix to this issue for all use cases), but I think the greater issue of performance is something to think about. _load_list_line and _append_data_list have quite a bit of complexity (they are respectively about 300 and 100 lines long, not including calls to other methods), and perhaps for good reason. But is this really needed for most uses? Isn't one of the paradigms of MODFLOW-6 supposed to be simple, clean input structures that are easily readable?

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions