
Create ProcessFlowRaw struct#198

Merged
TinyMarsh merged 10 commits into main from flow-commodity-field on Dec 20, 2024

Conversation

@TinyMarsh (Collaborator) commented Nov 11, 2024

Description

This PR introduces a new struct for raw process flows and modifies the ProcessFlow struct to use a reference-counted Commodity object.

Fixes #166

Type of change

  • Bug fix (non-breaking change to fix an issue)
  • New feature (non-breaking change to add functionality)
  • Refactoring (non-breaking, non-functional change to improve maintainability)
  • Optimization (non-breaking change to speed up the code)
  • Breaking change (whatever its nature)
  • Documentation (improve or add documentation)

Key checklist

  • All tests pass: $ cargo test
  • The documentation builds and looks OK: $ cargo doc

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works

@alexdewar (Collaborator) commented

@TinyMarsh I was just talking with @tsmbland today about adding a ProcessFlowRaw struct as part of #248. I guess you guys should coordinate!

@TinyMarsh TinyMarsh marked this pull request as ready for review December 11, 2024 11:16
@TinyMarsh (Collaborator, Author) commented

@tsmbland take a look at this PR; it adds the ProcessFlowRaw struct and a method for converting to ProcessFlow.

I need help with two things:

  1. What sort of validation is needed?
  2. How to actually implement the into_flow method.

Let me know if you wanna have a chat about this.

@tsmbland (Collaborator) commented Dec 11, 2024

Just spent a while trying to understand the general import/validation process. It varies between tables depending on how much validation is required, but in general I think it looks like this (where X could be parameter, flow, PAC, etc.):

ProcessX

  • final struct that will be used elsewhere in the program
  • relates to a single row of the table
  • no optional types

ProcessXRaw

  • initial struct after reading in the data from the file
  • may contain optional types

ProcessXRaw.into_x

  • applies defaults to optional values
  • calls validate
  • outputs a ProcessX

ProcessXRaw.validate

  • called by into_x
  • performs validation on each individual struct (e.g. checking values are above zero)

read_process_xs

  • calls read_csv to create an iterator of ProcessXRaw
  • passes the iterator to read_process_xs_from_iter

read_process_xs_from_iter

  • calls into_x for each ProcessXRaw in the iterator to create a ProcessX
  • organises data into a hashmap of ProcessX
  • performs some validation on the full dataset (e.g. ensuring no processes are missing)

I'm assuming it's similar for other modules, although I haven't looked closely.

Anyway, I found it helpful to write out the process so figured I'd share it here. I guess this is what we're working towards for ProcessFlow.
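As a rough illustration of that Raw → final pattern, here is a minimal sketch with a made-up ProcessParamRaw/ProcessParam pair and an arbitrary default; the real structs and their fields will differ:

```rust
// Hypothetical minimal sketch of the Raw -> final pattern described above.
// ProcessParamRaw / ProcessParam and the default value are illustrative only.
#[derive(Debug)]
struct ProcessParamRaw {
    process_id: String,
    value: Option<f64>, // may be missing in the input file
}

#[derive(Debug, PartialEq)]
struct ProcessParam {
    process_id: String,
    value: f64, // no optional types in the final struct
}

impl ProcessParamRaw {
    /// Per-row validation (e.g. checking values are above zero).
    fn validate(&self) -> Result<(), String> {
        if matches!(self.value, Some(v) if v <= 0.0) {
            return Err(format!("value for {} must be positive", self.process_id));
        }
        Ok(())
    }

    /// Calls validate, applies defaults to optional values, outputs a ProcessParam.
    fn into_param(self) -> Result<ProcessParam, String> {
        self.validate()?;
        Ok(ProcessParam {
            process_id: self.process_id,
            value: self.value.unwrap_or(1.0), // illustrative default
        })
    }
}

fn main() {
    let raw = ProcessParamRaw { process_id: "proc1".into(), value: None };
    let param = raw.into_param().unwrap();
    assert_eq!(param.value, 1.0); // default applied
    println!("{:?}", param);
}
```

A read_process_params_from_iter-style function would then call into_param on each row and do the whole-dataset checks on the result.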

@TinyMarsh To answer your questions:

  1. into_flow will get called by read_process_flows_from_iter, which doesn't exist yet

  2. I've got #248 (Check that no commodity is both an input and an output of a process) assigned to me, which I guess will be implemented in read_process_flows_from_iter (when it exists). That might be everything, to be honest.

@alexdewar Is that all correct? Anything to add?

@alexdewar (Collaborator) commented

Thanks @tsmbland, that's v comprehensive!

I don't have much to add, but more recently I've been avoiding having separate into_x() methods etc. and I've just put everything directly into the read_x_from_iter() functions, because I found that splitting things up was making the code a bit messy. Totally up to you though.

Your read_process_flows_from_iter function should just take an iterator of ProcessFlowRaws and turn them into a HashMap<Rc<str>, Vec<ProcessFlow>> (where the key is the process ID). See example here: https://github.com/EnergySystemsModellingLab/MUSE_2.0/blob/main/src/process.rs#L329
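As a sketch of that shape, the grouping step might look like the following, with a stripped-down hypothetical Flow struct standing in for ProcessFlow:

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Hypothetical stand-in for ProcessFlow; the real struct has more fields.
#[derive(Debug)]
struct Flow {
    process_id: Rc<str>,
    flow: f64,
}

/// Groups an iterator of flows into a map keyed by process ID.
fn group_by_process(flows: impl Iterator<Item = Flow>) -> HashMap<Rc<str>, Vec<Flow>> {
    let mut map: HashMap<Rc<str>, Vec<Flow>> = HashMap::new();
    for flow in flows {
        // Rc::clone is a cheap pointer copy, not a string copy
        map.entry(Rc::clone(&flow.process_id)).or_default().push(flow);
    }
    map
}

fn main() {
    let id: Rc<str> = Rc::from("proc1");
    let flows = vec![
        Flow { process_id: Rc::clone(&id), flow: 1.0 },
        Flow { process_id: Rc::clone(&id), flow: -0.5 },
    ];
    let map = group_by_process(flows.into_iter());
    assert_eq!(map["proc1"].len(), 2);
}
```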

@codecov bot commented Dec 13, 2024

Codecov Report

Attention: Patch coverage is 98.07692% with 3 lines in your changes missing coverage. Please review.

Project coverage is 95.14%. Comparing base (19b2cf3) to head (c29a99a).
Report is 20 commits behind head on main.

Files with missing lines | Patch % | Lines
src/process.rs | 98.07% | 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #198      +/-   ##
==========================================
+ Coverage   95.10%   95.14%   +0.03%     
==========================================
  Files          13       13              
  Lines        2532     2614      +82     
  Branches     2532     2614      +82     
==========================================
+ Hits         2408     2487      +79     
- Misses         47       50       +3     
  Partials       77       77              


@TinyMarsh (Collaborator, Author) commented

Okay, I think I'm finally starting to understand how this all works. Thanks for your comprehensive summary @tsmbland!

This might be ready for a review now @alexdewar

@TinyMarsh TinyMarsh requested a review from tsmbland December 13, 2024 09:50
@alexdewar (Collaborator) left a comment

Definitely along the right lines, but there are a few minor issues with the current implementation.

  1. We want to return errors rather than panicking
  2. It'd be better to have a separate function that just processes data without doing any I/O
  3. Would you mind writing a little test for this? There's only one kind of error you can get atm (bad commodity ID), so you just need to test this and the success case.

@AdrianDAlessandro (Collaborator) left a comment

One small comment. Agree with the tweaks Alex suggested. Looking good overall though!

@TinyMarsh (Collaborator, Author) commented

Just double-checking some logic here: most of the read_process_xs functions return an iterator whose item type is a single ProcessXRaw object, produced by the read_csv function.

However, with the ProcessFlowRaw logic, instead of calling read_csv we call read_csv_grouped_by_id, which returns not an iterator of ProcessFlowRaw but a HashMap with process_id as the key and a vector of ProcessFlowRaw as the value.

Just double-checking this is the correct behaviour, i.e. there are multiple ProcessFlowRaws per process_id.

@TinyMarsh (Collaborator, Author) commented

Thanks for the feedback all. This should be ready for a re-review now.

@TinyMarsh (Collaborator, Author) commented Dec 16, 2024

@alexdewar something in particular I wanted to get your feedback on is this potential piece of silliness:
https://github.com/EnergySystemsModellingLab/MUSE_2.0/blob/c4fdd9fda27889edc9cd4e7740c6b3451984e0e6/src/process.rs#L339

I don't like the use of remove here, since that function implies functionality beyond what I'm actually using it for. What I'm using it for is to avoid getting a reference to an iterator of type ProcessFlowRaw, which is what I would get if I used get(), e.g.:

let iter = process_flow_raws.get(process_id).unwrap().into_iter();

I wanted to avoid getting and passing a reference to read_process_flows_from_iter() because that involved lots of reference lifetime stuff that seemed unnecessarily complicated to implement. Hopefully that makes sense.

@AdrianDAlessandro helped with this and together we landed on using remove(). Anything to add Adrian?
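For what it's worth, the ownership distinction driving that choice can be seen in a tiny std-only example: get() hands back a reference tied to the map's borrow, while remove() hands back the owned Vec, which can then be consumed with no lifetime plumbing:

```rust
use std::collections::HashMap;

fn main() {
    let mut groups: HashMap<String, Vec<u32>> = HashMap::new();
    groups.insert("proc1".to_string(), vec![1, 2, 3]);

    // get() borrows: we only ever see &Vec<u32>, so a consuming iterator
    // over owned values is not available without cloning.
    let borrowed_sum: u32 = groups.get("proc1").unwrap().iter().sum();

    // remove() transfers ownership: the Vec leaves the map, so it can be
    // moved into a consuming iterator with no borrow of the map remaining.
    let owned: Vec<u32> = groups.remove("proc1").unwrap();
    let owned_sum: u32 = owned.into_iter().sum();

    assert_eq!(borrowed_sum, owned_sum);
    assert!(groups.is_empty()); // the side effect: the entry is gone
}
```

The trade-off is exactly that side effect: remove() empties the map as you go, which is fine if the map is only consumed once.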

@alexdewar (Collaborator) left a comment

Getting there, but I still think you can put more of the processing into read_process_flows_from_iter function.

I think the general approach should be:

  1. read_csv()
  2. Process the results (with map())
  3. Group the results by ID

Atm you're doing the grouping before the processing, which then makes things a bit more fiddly.

For an example, see read_process_pacs_from_iter: https://github.com/EnergySystemsModellingLab/MUSE_2.0/blob/main/src/process.rs#L329

(Though in your case you'll be able to use the into_id_map() helper rather than into_group_map()).
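That read → map → group order can be sketched std-only like this (FlowRaw/Flow and the zero-flow check are hypothetical stand-ins; the real code uses read_csv and the into_id_map() helper):

```rust
use std::collections::HashMap;

// Hypothetical raw row, as deserialised from the CSV.
struct FlowRaw {
    process_id: String,
    flow: f64,
}

// Hypothetical processed row.
#[derive(Debug)]
struct Flow {
    process_id: String,
    flow: f64,
}

/// 1. take the rows, 2. process them with map(), 3. group by ID last.
fn read_flows(
    rows: impl Iterator<Item = FlowRaw>,
) -> Result<HashMap<String, Vec<Flow>>, String> {
    let processed = rows.map(|raw| {
        // per-row validation happens during processing, before grouping
        if raw.flow == 0.0 {
            return Err(format!("zero flow for process {}", raw.process_id));
        }
        Ok(Flow { process_id: raw.process_id, flow: raw.flow })
    });

    let mut map: HashMap<String, Vec<Flow>> = HashMap::new();
    for result in processed {
        let flow = result?; // first bad row aborts the whole read
        map.entry(flow.process_id.clone()).or_default().push(flow);
    }
    Ok(map)
}

fn main() {
    let rows = vec![
        FlowRaw { process_id: "proc1".into(), flow: 1.0 },
        FlowRaw { process_id: "proc1".into(), flow: -1.0 },
    ];
    let map = read_flows(rows.into_iter()).unwrap();
    assert_eq!(map["proc1"].len(), 2);
}
```

Doing the grouping last means the map step only ever deals with one flat row at a time, which keeps the error handling simple.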

@TinyMarsh (Collaborator, Author) commented

Ah okay, apologies for the faff here, I don't think I really understood the assignment initially.

I have refactored the logic as you described @alexdewar, as follows:

fn read_process_flows_from_iter<I>(
    iter: I,
    process_ids: &HashSet<Rc<str>>,
    commodities: &HashMap<Rc<str>, Rc<Commodity>>,
) -> Result<HashMap<Rc<str>, Vec<ProcessFlow>>>
where
    I: Iterator<Item = ProcessFlowRaw>,
{
    iter.map(|flow_raw| {
        let process_id = process_ids.get_id(&flow_raw.process_id)?;
        let commodity = commodities
            .get(flow_raw.commodity_id.as_str())
            .with_context(|| format!("{} is not a valid commodity ID", &flow_raw.commodity_id))?;

        let process_flow = ProcessFlow {
            process_id: flow_raw.process_id,
            commodity: Rc::clone(commodity),
            flow: flow_raw.flow,
            flow_type: flow_raw.flow_type,
            flow_cost: flow_raw.flow_cost,
        };

        Ok((process_id, process_flow))
    })
    .into_id_map(process_ids)
}

fn read_process_flows(
    model_dir: &Path,
    process_ids: &HashSet<Rc<str>>,
    commodities: &HashMap<Rc<str>, Rc<Commodity>>,
) -> Result<HashMap<Rc<str>, Vec<ProcessFlow>>> {
    let file_path = model_dir.join(PROCESS_FLOWS_FILE_NAME);
    let process_flow_csv = read_csv(&file_path)?;
    read_process_flows_from_iter(process_flow_csv, process_ids, commodities)
        .with_context(|| input_err_msg(&file_path))
}

But I don't understand what the issue with using into_id_map is here.


@TinyMarsh (Collaborator, Author) commented

> Ah okay, apologies for the faff here, I don't think I really understood the assignment initially.

Never mind, I wasn't paying attention to what into_id_map is doing. There were some minor conflicts due to #273 but they should be fixed okay now.

@TinyMarsh TinyMarsh requested a review from alexdewar December 19, 2024 18:50
@tsmbland (Collaborator) left a comment

Looks good as far as I can tell! I'd just add an inline comment to read_process_flows_from_iter so it's clear what it's doing.

Also, shall we delete read_csv_grouped_by_id since it's no longer used anywhere?

@alexdewar (Collaborator) left a comment

I agree with @tsmbland's comment. I think there's one other minor tweak to make (see comment), but other than that, we're good to go!

Sorry this has ended up going through so many revisions... I should have been clearer in the issue description 😟

src/process.rs Outdated
Comment on lines 314 to 316
.collect::<Result<Vec<_>>>()?
.into_iter()
.into_id_map(process_ids)

If you do it this way, you can avoid allocating a Vec just to throw it away again:

Suggested change:
- .collect::<Result<Vec<_>>>()?
- .into_iter()
- .into_id_map(process_ids)
+ .process_results(|iter| iter.into_id_map(process_ids))?
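For anyone unfamiliar with process_results (from the itertools crate): it lets a closure consume a Result iterator as if it were infallible, short-circuiting on the first Err, with no intermediate Vec. A rough std-only stand-in for its behaviour (not itertools' actual implementation) looks like:

```rust
/// Rough std-only stand-in for itertools::process_results: runs `f` over the
/// Ok values of a fallible iterator, stopping at the first Err.
fn process_results<T, E, I, R>(
    iter: I,
    f: impl FnOnce(&mut dyn Iterator<Item = T>) -> R,
) -> Result<R, E>
where
    I: Iterator<Item = Result<T, E>>,
{
    let mut error: Option<E> = None;
    let out = {
        // scan stops yielding as soon as an Err appears, stashing the error
        let mut ok_values = iter.scan(&mut error, |err, item| match item {
            Ok(value) => Some(value),
            Err(e) => {
                **err = Some(e);
                None
            }
        });
        f(&mut ok_values)
    };
    match error {
        Some(e) => Err(e),
        None => Ok(out),
    }
}

fn main() {
    let ok: Vec<Result<u32, String>> = vec![Ok(1), Ok(2), Ok(3)];
    assert_eq!(process_results(ok.into_iter(), |it| it.sum::<u32>()), Ok(6));

    let bad: Vec<Result<u32, String>> = vec![Ok(1), Err("bad".into()), Ok(3)];
    assert_eq!(
        process_results(bad.into_iter(), |it| it.sum::<u32>()),
        Err("bad".to_string())
    );
}
```

So in the suggested change, into_id_map sees a plain iterator of Ok values, and any CSV-row error still propagates out of the function.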

Removes the unused read_csv_grouped_by_id function. Adds a comment to read_process_flows_from_iter so it's clear what it's doing. Refactors code to make use of process_results.
@TinyMarsh
Copy link
Collaborator Author

Thanks both. I'll merge at the end of the day unless I hear from you regarding the latest commit.

@TinyMarsh TinyMarsh merged commit 86cfaad into main Dec 20, 2024
7 checks passed
@TinyMarsh TinyMarsh deleted the flow-commodity-field branch December 20, 2024 15:14

Successfully merging this pull request may close these issues.

Make ProcessFlow::commodity_id into Rc<Commodity>

4 participants