
Compute demand map when loading in data #276

Merged

alexdewar merged 21 commits into main from demand-time-slice-map on Jan 17, 2025

Conversation

@alexdewar
Collaborator

Description

The main purpose of this PR is to add code to compute the demand map as we load in data (#231), but along the way I've closed various other issues to do with validating the input data. Unfortunately this has ended up rather bigger than I wanted, but I couldn't think of a sensible way to split it up 😞. V happy to talk through it on Teams for anyone who's interested.

A demand map relates, for a given commodity, the region, milestone year and time slice to the demand for that commodity. There are two input files related to this: demand.csv, which contains the annual demand for each commodity, region and year combination, and demand_slicing.csv, which specifies how the annual demand is shared across different time slices. An additional complexity is that users can specify the demand slicing for a whole season or the whole year rather than just a single time slice. I've created a new DemandMap type, which currently just has a single method to look up the demand for a given region, year and time slice combination. Note that we aren't limited to this API though: while I'm using a HashMap with region + year + time slice as a key, we can still search through the map if, e.g., we want all values for a given region. I was just assuming that this is the way we'll want to look up demand most often, so it made sense to do it this way; we can always add other helpers later if needed.
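Roughly, the shape is something like this (a sketch only: the key struct and its fields are an illustrative guess at one possible internal layout, and only the get signature matches the actual code):

use std::collections::HashMap;
use std::rc::Rc;

use crate::time_slice::TimeSliceID;

/// Illustrative key layout; the real internals may differ
#[derive(PartialEq, Eq, Hash)]
struct DemandMapKey {
    region_id: Rc<str>,
    year: u32,
    time_slice: TimeSliceID,
}

/// A newtype wrapping the map so we control the exported API
pub struct DemandMap(HashMap<DemandMapKey, f64>);

impl DemandMap {
    /// Retrieve the demand for the specified region, year and time slice
    pub fn get(&self, region_id: Rc<str>, year: u32, time_slice: TimeSliceID) -> Option<f64> {
        self.0
            .get(&DemandMapKey { region_id, year, time_slice })
            .copied()
    }
}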

As demand.rs is now getting rather long, I thought about splitting it up, but figured it was worth holding off on that until we refactor all the input code (I've yet to create an issue for that).

Closes #202. Closes #203. Closes #204. Closes #230. Closes #231.

Type of change

  • Bug fix (non-breaking change to fix an issue)
  • New feature (non-breaking change to add functionality)
  • Refactoring (non-breaking, non-functional change to improve maintainability)
  • Optimization (non-breaking change to speed up the code)
  • Breaking change (whatever its nature)
  • Documentation (improve or add documentation)

Key checklist

  • All tests pass: $ cargo test
  • The documentation builds and looks OK: $ cargo doc

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works

@codecov

codecov bot commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 99.14530% with 7 lines in your changes missing coverage. Please review.

Project coverage is 95.95%. Comparing base (859f9c3) to head (69aa5d6).
Report is 51 commits behind head on main.

Files with missing lines                Patch %   Lines
src/input/commodity/demand.rs           99.01%    0 Missing and 3 partials ⚠️
src/input/commodity/demand_slicing.rs   99.20%    1 Missing and 2 partials ⚠️
src/time_slice.rs                       98.78%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #276      +/-   ##
==========================================
+ Coverage   95.29%   95.95%   +0.65%     
==========================================
  Files          26       27       +1     
  Lines        2613     3014     +401     
  Branches     2613     3014     +401     
==========================================
+ Hits         2490     2892     +402     
  Misses         47       47              
+ Partials       76       75       -1     


@dalonsoa
Collaborator

I've tried to review the PR but, frankly, I'm having trouble following the code. It was already complicated as it was - hence the suggested code reorganisation - so adding more structures, more encapsulation and more functions linking everything together just feels over-engineered at this point, especially when we do not yet know what structure the problem we need to solve - dispatch, investment - requires.

I think a (possibly long) meeting to discuss things through is necessary, both to fully understand this PR and to define what mathematical problem we actually need to solve, working backwards from that to the appropriate structures.

src/demand.rs Outdated
}

/// A map relating commodity, region and year to annual demand
type AnnualDemandMap = HashMap<AnnualDemandMapKey, f64>;
Collaborator

Why do you define a proper struct for the DemandMap but here you just declare a sort of type alias?

Collaborator Author

I've defined a few type aliases for things we use exclusively within the demand module, but I thought it made sense to make a new type for DemandMap as we're exporting it and doing this lets us define a more user-friendly interface for accessing the data (i.e. various getters).
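To illustrate the distinction (schematic; DemandMapKey is a hypothetical stand-in for the real key type):

// A type alias is just internal shorthand: it is still a plain HashMap,
// and anyone who can see it gets the entire HashMap API
type AnnualDemandMap = HashMap<AnnualDemandMapKey, f64>;

// A newtype is a distinct type: callers of the exported DemandMap only
// get the methods we choose to define on it (e.g. a `get` helper)
pub struct DemandMap(HashMap<DemandMapKey, f64>);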

src/demand.rs Outdated
year: u32,
}

type CommodityRegionPairs = HashSet<(Rc<str>, Rc<str>)>;
Collaborator

And same here.

@alexdewar
Collaborator Author

@dalonsoa I hear what you're saying and I do think it would be worth having a meeting for anyone who's interested to talk about the changes (I guess it makes sense to do it in the new year now). I probably should have given more explanation in the PR description given the scale of the changes, but ended up rushing it a bit -- sorry!

Basically, though, I think we need to have code that does something like what's in this PR and I'm not sure there's any way around it. We might want to do something slightly different with the final DemandMap data structure (I mentioned HashMaps, but that was probably just a distraction); fundamentally, the code I've added is just necessary if you want to be able to figure out what the actual demand is for a particular combination of commodity + region + year + time slice. The reason this code is complicated is effectively because the interface is complicated: we could instead just mandate that users provide a single CSV file with the actual demand for each combination of these parameters and that would give us the information we need; it just wouldn't be very user friendly, hence the whole business of letting users specify how demand is shared across time slices (or ranges of time slices). I'm not saying it's the wrong input format (it seems pretty reasonable to me), but it is somewhat complicated and so the code that processes it will necessarily be complicated too.

Realistically, there's no model where it would ever make sense to store data in the kind of "raw" format we have now, which lets users specify things like "for commodity X and region Y, 40% of the demand is in winter". No model will care what the fraction of annual demand is, just as no model will want to know what the demand is for a whole season rather than for an individual time slice. So these two calculations -- converting fractional demand to actual demand and expanding time slice ranges -- will always be necessary. And if you put off doing them until you actually need to know what the values are, then a) figuring it out will be really slow and b) you won't be able to validate the data properly (e.g. we need to ensure that users don't specify demand slice entries for overlapping time slice ranges, such as winter and winter.day). There really aren't many approaches that let you do this correctly.
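To make those two calculations concrete, here's a minimal self-contained sketch (the types and the function name are hypothetical stand-ins, not the actual code in this PR):

use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct TimeSlice {
    season: String,
    time_of_day: String,
}

/// Expand a season-level demand fraction into actual demand per time slice,
/// weighting by each slice's share of the season's total duration
fn expand_season_demand(
    slices: &[(TimeSlice, f64)], // (time slice, fraction of the year it covers)
    season: &str,
    annual_demand: f64,
    season_fraction: f64, // e.g. 0.4 for "40% of demand is in winter"
) -> HashMap<TimeSlice, f64> {
    let in_season: Vec<_> = slices
        .iter()
        .filter(|(ts, _)| ts.season == season)
        .collect();
    // Total duration of the selected season, used to normalise the shares
    let total: f64 = in_season.iter().map(|(_, dur)| *dur).sum();
    in_season
        .into_iter()
        .map(|(ts, dur)| (ts.clone(), annual_demand * season_fraction * dur / total))
        .collect()
}

With winter split into winter.day and winter.night, expand_season_demand(&slices, "winter", 1000.0, 0.4) would spread 400.0 across the two slices in proportion to their durations.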

With the last few days before the break I'll have a go at making a plan for how to tackle the rest of the project and it probably will make sense to have a meeting to talk this through in the new year. While it generally makes sense to think of the bigger picture, though, whatever we decide won't change how we calculate demand from the input files.

@alexdewar
Collaborator Author

I'm converting this to draft because we've agreed that it makes sense to do the refactoring (#285) in this branch so that it's more readable and easier to review.

It would have been better to do the refactoring first, but unfortunately I'd already started on this demand map stuff by then 🤷

@alexdewar
Collaborator Author

Right, I've refactored it now so that the demand-related CSV code is in input::commodity::demand and input::commodity::demand_slicing which should hopefully help with readability. I moved the remaining data structures in demand.rs into commodity.rs.

I'll mark this as ready for review now but we've agreed that we'll discuss it in a meeting.

@alexdewar marked this pull request as ready for review on January 13, 2025
@alexdewar linked an issue on Jan 13, 2025 that may be closed by this pull request
@alexdewar
Collaborator Author

@tsmbland I've put the calculation of demand share into a separate, more generic function as you suggested.

I thought about writing the function so that it could take a map as an input argument and it would then fill it using the time slice + value pairs, but then I realised the map types will likely be different in every case (i.e. even though the keys will include a time slice component, they'll also likely include other things like region ID etc.). So instead I've made it return an iterator, which hopefully means it'll be fairly easy to reuse.
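Roughly this shape, as a self-contained sketch (plain strings stand in for the real TimeSliceID, and apply_share is a hypothetical name): the helper returns (time slice, value) pairs, and each caller maps them into whatever key type its own map needs before collecting.

use std::collections::HashMap;

/// Pair each time slice with its portion of `value`
fn apply_share<'a>(
    slices: &'a [(String, f64)], // (time slice, normalised share summing to 1)
    value: f64,
) -> impl Iterator<Item = (&'a str, f64)> + 'a {
    slices.iter().map(move |(ts, share)| (ts.as_str(), value * share))
}

fn main() {
    let slices = vec![
        ("winter.day".to_string(), 0.6),
        ("winter.night".to_string(), 0.4),
    ];
    let region = "GBR".to_string();

    // One caller can key its map by time slice alone...
    let by_slice: HashMap<&str, f64> = apply_share(&slices, 100.0).collect();

    // ...while another builds a compound key including the region
    let by_region_slice: HashMap<(String, String), f64> = apply_share(&slices, 100.0)
        .map(|(ts, v)| ((region.clone(), ts.to_string()), v))
        .collect();

    assert_eq!(by_slice["winter.day"], 60.0);
    assert_eq!(by_region_slice[&("GBR".to_string(), "winter.night".to_string())], 40.0);
}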

@tsmbland
Collaborator

In hindsight, I think it would have been better (for now) to input actual demand data at the timeslice-level, and worry about fractions and time-slice ranges (which is ultimately just a user-experience choice) at a later date. However, since we haven't done that, I agree that something along these lines is necessary, both for validation and giving meaningful numbers to the optimization.

I really don't have any idea whether the HashMap structure with region, year and timeslice as keys is the right choice. I imagine this will become clearer as we go through, but we need a starting point and this seems as good as any.

In terms of the code, it seems reasonable and well tested, but it's very long and complicated, so I'm giving it some benefit of the doubt.


/// Retrieve the demand for the specified region, year and time slice
pub fn get(&self, region_id: Rc<str>, year: u32, time_slice: TimeSliceID) -> Option<f64> {
self.0
Collaborator

Why .0?

Collaborator Author

DemandMap is what's called a newtype. Essentially it's a struct with an unnamed HashMap inside it, and self.0 is how you access it.

value: f64,
) -> impl Iterator<Item = (&'a TimeSliceID, f64)> {
self.iterate_selection_share(selection)
.map(move |(ts, share)| (ts, value * share))
Collaborator

I believe this is the most important line in terms of the maths. Could you just explain in human English what this is doing? E.g. what's move?

Collaborator Author

iterate_selection_share returns an iterator of the relative lengths for each of the time slices (i.e. normalised so they sum to one) along with the time slice. This line just multiplies each of these relative lengths by value. I was in two minds about whether to split the calculation into two functions like this or just have the one tbh.

The move thing is just to do with Rust's ownership rules -- it doesn't affect the calculation. It's just that otherwise we wouldn't be able to return an iterator which uses value, because value goes out of scope at the end of this function. So the move "gives" value to this lambda function.
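As a standalone illustration of the move part (nothing from the PR; the numbers are arbitrary): without move the closure would merely borrow value, which ceases to exist when the function returns, so the compiler would reject it.

// `move` copies `value` (an f64) into the closure, letting the returned
// iterator outlive the function's local variables
fn scale(shares: Vec<f64>, value: f64) -> impl Iterator<Item = f64> {
    shares.into_iter().map(move |share| value * share)
}

fn main() {
    let scaled: Vec<f64> = scale(vec![0.25, 0.75], 100.0).collect();
    assert_eq!(scaled, vec![25.0, 75.0]);
}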

@alexdewar
Collaborator Author

> In hindsight, I think it would have been better (for now) to input actual demand data at the timeslice-level, and worry about fractions and time-slice ranges (which is ultimately just a user-experience choice) at a later date. However, since we haven't done that, I agree that something along these lines is necessary, both for validation and giving meaningful numbers to the optimization.

I agree. In retrospect I wish I hadn't gone so far down this rabbit hole 🙃. Oh well. We would have needed to do this at some point anyway, so at least it's done now.

> I really don't have any idea whether the HashMap structure with region, year and timeslice as keys is the right choice. I imagine this will become clearer as we go through, but we need a starting point and this seems as good as any.

Yeah, we can always fiddle with it later (I think whatever we come up with though, it'll involve some kind of HashMap). I think this data structure kind of makes sense for loading in and validating the data in any case, but we may want to turn it into something else after we've done that.

> In terms of the code, it seems reasonable and well tested, but it's very long and complicated, so I'm giving it some benefit of the doubt.

👍

@alexdewar merged commit f67887d into main on Jan 17, 2025
7 checks passed
@alexdewar deleted the demand-time-slice-map branch on January 17, 2025