Skip to content

Generalize file up/download; use file-system prefix instead of FileSystemStorage#47

Merged
reddigari merged 10 commits intomasterfrom
dev/mthomas
Apr 11, 2020
Merged

Generalize file up/download; use file-system prefix instead of FileSystemStorage#47
reddigari merged 10 commits intomasterfrom
dev/mthomas

Conversation

@reelmatt
Copy link
Member

@reelmatt reelmatt commented Apr 10, 2020

This PR at least partially addresses some of the bugs @reddigari summarized in #43 dealing with file naming/access.

Workflow still relies on Django's FileSystemStorage API, but is now passed in to the constructor rather than being defined in multiple places: pyworkflow/node.py and workflow/node views.py. Most file operations are still handled in the Workflow class, but the create_node method can now do the following:

json_data["options"][field] = request.pyworkflow.fs.path(opt_value)

when a user uploads a file during node configuration, or when a download is triggered from WriteCsv. If a user enters test.csv for the output file, when the update endpoint is triggered, this is converted to /tmp/test.csv using the Workflow's file system.

This helps solve the "giant security problem" of downloads, if we switch the endpoint to POST with the Node's info. It's still hard-coded for the WriteCsvNode because it looks up the Node's path_or_buf option where the output filename is stored, but this does prevent arbitrary downloads.

The store_node_data and retrieve_node_data static methods in Workflow haven't been updated but I think (?) these can be done easily if this approach seems good.

Copy link
Collaborator

@reddigari reddigari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment in #43 about my concerns with putting fs entirely inside the Workflow, but outside of that this looks great!

Any thoughts on how to avoid the explicit use of path_or_buf which is exclusive to pandas.DataFrame.to_<filetype>()? Since it seems like we're moving away from calling the pandas functions with **kwargs, perhaps we could have an option name dedicated to a to-be-downloaded file (e.g. options["output_file"])?

@reelmatt
Copy link
Member Author

reelmatt commented Apr 10, 2020

One of the ideas I had behind the intermediate classes like IONode and ManipulationNode was it could be useful to store information common to any node of that type (e.g. 'files' for IONodes, and ¯_(ツ)_/¯ for ManipulationNodes). ReadCsv would then treat that 'file' as input and WriteCsv would treat it as output, but both could be read/accessed similarly. And this could avoid having to specify 'input_file' or 'output_file' when configuring the Node; you could just have 'file' and then the concrete Node implementation would figure it out as long as some value is provided.

Likewise, this could probably help when creating nodes too. I think we could avoid checking the OPTION_TYPES completely à la

for field, info in json_data.get("option_types", dict()).items():
    if info["type"] == "file" or info["name"] == "Filename":
        ...

and instead just check DEFAULT_OPTIONS if we knew any file/filepath is stored in a generic "file" attribute.

json_data["options"]["file"]

@reelmatt
Copy link
Member Author

This latest commit—51c4ee5— does change file handling a bit more than the initial PR which moved FileSystemStorage solely to pyworkflow/workflow.py. The API is now not used, an instead settings.MEDIA_ROOT is passed in to the Workflow constructor and regular Python os calls are used.

I think this better reflects the discussion from #43, but is obviously a bigger change with potentially big implications. If this deviates too far from what others are thinking, we can always revert to before entirely, or c664947 from earlier in this PR.

@reelmatt reelmatt requested a review from reddigari April 10, 2020 21:21
@reelmatt reelmatt changed the title Move fs to Workflow object; generalize file upload/download Generalize file up/download; use file-system prefix instead of FileSystemStorage Apr 10, 2020
Copy link
Collaborator

@reddigari reddigari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider a method Workflow.full_path(self, filename) to abstract all the os.path.join() calls.

I'm super happy with this refactor, but as you said, if there are consequences I'm not seeing, reverting to the previous commit is totally reasonable.

"""

def __init__(self, graph=nx.DiGraph(), file_path=None, name='a-name', flow_vars=nx.Graph(), file_system=fs):
def __init__(self, graph=nx.DiGraph(), file_path=None, name='a-name', flow_vars=nx.Graph(), root_dir=settings.MEDIA_ROOT):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried writing a pyworkflow unit test that imported django.conf.settings, and it complains about Django not being configured. So I would either set /tmp as the default, or make it None and then set it to os.getcwd() when checking its existence in __init__. That would allow someone to use pyworkflow as a regular-old-python-module and have all their output where they're working.

Copy link
Collaborator

@diegostruk diegostruk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I ran the code and couldn't find any bugs in the functionality but like you mentioned worse case scenario we can revert.

@reddigari reddigari merged commit a5e7aca into master Apr 11, 2020
@reelmatt reelmatt linked an issue Apr 11, 2020 that may be closed by this pull request
@reelmatt reelmatt deleted the dev/mthomas branch April 11, 2020 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

File naming/access in pyworkflow vs. Django vs. CLI

4 participants