Skip to content

Conversation

@mscheltienne
Copy link
Member

@mscheltienne mscheltienne commented Sep 21, 2021

Following the discussion on #9752 and the quick testing in #9757 followed by a git CLI mess up; here is a clean version exposing the keys of Info as attributes in a similar way to Bunch. Instead of inheriting from Bunch; or changing the Bunch from MNE to the version of scikit-learn, I simply brought the methods __setattr__ and __getattr__ from scikit-learn's Bunch to the Info class.

@mscheltienne
Copy link
Member Author

mscheltienne commented Sep 21, 2021

@larsoner I've checked everything I had in mind.
At first, I wasn't sure we could add some check for the first hierarchical level of Info in __setattr__ as discussed; because I thought both the assignment syntax Info['key'] and Info.key called __setattr__. I was mistaken and the first one is calling __setitem__.

This change makes it possible to add input checks for assignments of attributes with Info.key, while all behavior with item assignment is unchanged. It effectively turns the syntax Info.key into the public interface while Info['key'] becomes the private interface used by MNE functions.

@hoechenberger
Copy link
Member

hoechenberger commented Sep 29, 2021

I didn't follow the discussion, but I'm very against offering TWO ways to get the same info value. This will lead to confusion.

If we believe that dict-style indexing with square brackets is not good, we should entirely abandon it, not add an alternative method while keeping the old one in place.

Also I'm absolutely not a fan of Bunch as it doesn't allow static code analysis tools to provide completion suggestions. A dataclass would probably be a much better choice here.

So unless I'm missing something essential here, this is a clear –5 from me.

@larsoner
Copy link
Member

So unless I'm missing something essential here, this is a clear –5 from me.

Did you read the entire discussion in #9752 and the linked issues? I thought the reasons for sticking with Bunch/dict subclass and backward compat with __setitem__/__getitem__ were covered in there.

@hoechenberger
Copy link
Member

Did you read the entire discussion in #9752 and the linked issues?

Reading up now, as I probably should've done earlier 😅

@hoechenberger
Copy link
Member

hoechenberger commented Sep 29, 2021

Did you read the entire discussion in #9752 and the linked issues?

Reading up now, as I probably should've done earlier 😅

Ok now I did, but my concerns are not addressed there:

  • we'll end up with two alternative approaches to do the same thing, and it will be confusing
  • Bunch doesn't work with static analysis tools, which is a hard blocker for me

To address the 2nd issue, if dataclass is not an option: I'd say let's cook up our own class that derives from dict, only supports a specified set of keys, and adds property getters and setters so everything can get accessed via the .attribute syntax too. And we can steal the bit from BunchConst that controls whether and which values can be changed.

Maybe we can discuss this during the next office hour.

My personal preference right now would clearly be a dataclass.

@mscheltienne
Copy link
Member Author

@hoechenberger To build upon your comment, the goal is to provide a way to implement error checking for the attributes, in the future. Note that in this PR, I did not change the inheritance of Info from dict to Bunch as we discussed in #9752, but instead added the __setattr__ and __getattr__ in a similar way to the implementation of Bunch (in the scikit-learn version).

Indeed, you now have 2 alternatives approaches to do the same thing. In my opinion, the .attribute syntax should become the new public API (examples, tutorials would have to be changed) while the dictionary [attribute] syntax would be the private API providing both backward compatibility and direct access to the attributes without error checking (i.e. as it is now), essentially to keep the different property setters working (e.g. change ch_names by using mne.rename_channels). For the .attribute syntax you can now implement error checking/logger warning preventing the user from changing attributes he shouldn't change directly in __setattr__.

I'm not sure about dataclass, mostly because some of the attributes would have to raise an error or log a warning as they should not be directly changed (e.g. chs) and others should not. Is it possible to froze only part of the dataclass attributes?

@mscheltienne
Copy link
Member Author

mscheltienne commented Sep 29, 2021

And you are right, setting the attribute in a similar way to Bunch does not make them appear to static analysis tool/completion tool; so on one hand it was very handy because it would generate automatically all attributes and centralize the error checking, but on the other, it lacks the completion feature. Back to properties and properties setters? It would make the code longer, but it can still be done relatively easily, one attribute at a time.

@hoechenberger
Copy link
Member

Thanks for taking the time to look into this and for your explanations, @mscheltienne! I'll need to think a bit about potential approaches and their pros/cons before I can continue this discussion. But I'd like to say that I appreciate the attempts to rethink and challenge existing designs in order to make things more approachable and modern!

@cbrnr
Copy link
Contributor

cbrnr commented Sep 29, 2021

I haven't thought about this before, but I agree with @hoechenberger that having two ways to access the underlying data will be confusing (and it doesn't matter if one is public and the other private).

@larsoner
Copy link
Member

we'll end up with two alternative approaches to do the same thing, and it will be confusing

I haven't thought about this before, but I agree with @hoechenberger that having two ways to access the underlying data will be confusing (and it doesn't matter if one is public and the other private).

I also agree having two (attribute and __setitem__) ways is worse than having one. However -- and it looks like this point is not made clearly in the other threads -- I think we need to keep __setitem__ for backward compatibility. It is one of the oldest parts of our API (see for example some MNE example code from over 10 years ago). I wouldn't be surprised if, say, 90%+ of users in 50%+ of their scripts/pipelines in the wild do a variant of raw.info['bads'] = .... Deprecation always requires some amount of pain, but forcing this many people to have to potentially change this many scripts should require incredible usability/maintainability benefits in return. Changing from __setitem__ to attribute access does not come particularly close to meeting that criteria for me. So if people really want attribute access, I think we need to keep two methods of access.

With this in mind, given how much people want attribute access, I'm okay with the potential confusion of having two ways of doing things -- I think clear documentation of the class is good enough to mitigate this here, and I think the confusion level here will be small to begin with. We say "attribute access is preferred, but []-based access is equivalently present for backward compat". I don't think it's much mental work on the end of the users, and certainly way less than having to adapt all their existing analysis scripts if we did a deprecation cycle to get rid of __setitem__.

But I'm also okay with dropping the idea of attribute-based access altogether if people see that as not a big enough win in itself to offset the two-way-to-do-one-thing issue.

Bunch doesn't work with static analysis tools, which is a hard blocker for me

This is the same as in main, so in some sense the perfect is becoming the enemy of the good here, assuming you think having attribute-based access is important in and of itself. Again it's about tradeoffs, and I'm not sure it's worth throwing out the usability benefits of attribute access if we can't simultaneously make things work for static code checkers is worth it. I'm not sure a complete overhaul/abandonment of dict is possible, so incremental change might be our best bet here.

Along those lines (in addition to the above) regarding moving to some new base class like dataclass, I again worry we will break people's code. People might already be doing things like raw.info.update(bads=..., description=...) and it's perfectly valid to do so. We'd have to wrap these or deprecate them (pain for us and/or pain for users). We'd also have to update other things -- for just for one example, h5io would need to understand our new class -- maybe if it's dataclass or something we could do it in a generic way but details like this are not straightforward.

FWIW tab completion already works for Bunch in IPython and plain Python, so I think this is less critical than it would seem, especially because there are only a handful of attributes of this class that users should actually need to modify at all. (For something like mne.io.constants.FIFF this is more annoying (VSCode is confused), but 1) in this case developers are mostly annoyed and 2) it's easy enough to look up online or quickly do in a terminal mne.io.constants.FIFF.FIFFV_<tab><tab>.)

@cbrnr
Copy link
Contributor

cbrnr commented Sep 29, 2021

Thanks for summarizing the issues @larsoner. I think the original issue was not to create/add attribute access, but to perform error checking. I am not a fan of set_XXX methods, so if we decide to keep the dict around I would like to be able to just assign to individual keys (e.g. raw.info["bads"] = ...), which has been perfectly legitimate AFAIK. Writing a custom __setitem__ method seems the way to go, and this is completely orthogonal to attribute access, right?

@larsoner
Copy link
Member

I think the original issue was not to create/add attribute access

It comes up from time to time, to list a few:

There are probably others.

Writing a custom setitem method seems the way to go, and this is completely orthogonal to attribute access, right?

Yes, and we probably should at least give this a shot. I don't think this is so controversial because we can try some variant of it regardless of the user-facing access methods.

@mscheltienne
Copy link
Member Author

Correct me if I'm wrong, but writing a custom __setitem__ method is not good enough. The distinction between public/private interface is important because MNE functions such as rename_channels need to be able to change different attributes/items (chs, ch_names, ...) while those attributes/items should be frozen from a user perspective.

Example: (a bit extreme with the raising of an error)

def __setitem__(...):
    if item == 'chs':
        raise ValueError('You can not change that!')

With this simple example, how do you get rename_channels to change ['chs']?

As @larsoner stated, getting rid of the item-based access is not an option, as it will break way too much code and the compensation is too thin. This is why I am leaning towards making the dictionary access the private API for backward compatibility and for internal operations while adding a new public API in the form of attributes or properties.

@hoechenberger
Copy link
Member

With this simple example, how do you get rename_channels to change ['chs']?

You create a new frozen Info object.

@hoechenberger
Copy link
Member

hoechenberger commented Sep 30, 2021

FWIW tab completion already works for Bunch in IPython and plain Python, so I think this is less critical than it would seem

"We" (~Paris-based MNE folks) suggest new users use VS Code, which would use Pylance for static analysis and completion suggestions. So this is an issue for us / our users 😅

@jasmainak
Copy link
Member

Gosh, I should be careful what I post on github :) Can't recall the context of the original discussion. But agree with the feelings expressed here that the API is really old and should not be broken. I may have really old scripts that will stop working if that happens. The only real complain I've seen folks have about MNE at Martinos is that the API changes too often and breaks old scripts.

The dictionary vs class debate is something I've struggled with myself in other packages (e.g., HNN). While class is often neater, it's hard to foresee as we're often inheriting legacy code. The nice thing about Info is that it's well scoped out and follows the structure of the FIF file format. My vote is to keep it the old way but add error checks in relevant places ... e.g., in IO code. If you really don't like the lack of introspection, fix it in upstream IPython maybe? :) Don't like the confusion Bunch will add either. What if in 2 years someone says we should deprecate the dict because of the confusion? Do we need to start looking for our old scripts then ...?

@mscheltienne
Copy link
Member Author

mscheltienne commented Oct 4, 2021

I didn't have time to discuss this further last Friday, and surprisingly no one wanted to engage this polemic subject 😅, but here is a quick recap to either close this or keep it moving:

Goals:

  • Add some degree of error checking/warnings to assignments to Info keys/attributes.
  • (optional) Dictionary key access doesn't work well with code completion engines. Attribute access would improve usability.

Musts:

  • Dictionary key accessed must be preserved (at least for backward compatibility).

Ideas:


  1. Change inheritance from dict to Bunch or implement a Bunch-like structure.
Pros/Cons

Pros:

  • Doesn't break anything.
  • Offers a clear and easy way to add error checking in __setattr__.
  • Adds attribute access with error checking.
  • Preserve dictionary key-base access without error checking for backward compatibility.
  • Very little change to the actual code.

Cons:

  • Doesn't work with all static code analysis/completion engines. e.g. pylance for VS Code.
  • Adds attribute access on top of dictionary key access, offering 2 ways to access the same information.
  • (IMO) Extensive documentation/examples/tutorial modification to clearly state

"attribute access is preferred, but []-based access is equivalently present for backward compact"


  1. Add error checking directly to __setitem__
Pros/Cons

Pros:

  • Keeps a single access method to information in Info.
  • Shouldn't break anything.
  • No modification is needed to the documentation/examples/tutorials.

Cons:

  • (IMO) Push either the private or public API outside the current Info, e.g. with a new frozen Info object.
  • (IMO) Extensive code changes (with a higher risk of breaking something).
  • No improvement for static code analysis/completion engines.

  1. Add error checking elsewhere, e.g. in I/O roundtrip
Pros/Cons

Pros:

  • Keeps a single access method to information in Info.
  • Doesn't break anything.
  • No modification is needed to the documentation/examples/tutorials.

Cons:

  • Doesn't raise/warn at assignment.
  • (IMO) Decentralize the error checking for Info into multiple locations, making maintenance harder.
  • No improvement for static code analysis/completion engines.

  1. Add each valid key as property and use the setters for error checking
Pros/Cons

Pros:

  • Doesn't break anything.
  • Works with static code analysis/completion engines.
  • Adds attribute access with error checking.
  • Preserve dictionary key-base access without error checking for backward compatibility.

Cons:

  • Makes the class Info very long with a long list of property + setter definitions.
  • Adds attribute access on top of dictionary key access, offering 2 ways to access the same information.
  • (IMO) Extensive documentation/examples/tutorial modification to clearly state

"attribute access is preferred, but []-based access is equivalently present for backward compact"

This is not exhaustive, if I forgot anything or made a mistake, please add it/correct it.
What do you want to do next? Keep it going with one of these options or another? Close it? Your call 😉

@cbrnr
Copy link
Contributor

cbrnr commented Oct 5, 2021

Thanks for the nice summary @mscheltienne! I don't understand what "Push either the private or public API outside the current Info, e.g. with a new frozen Info object" for option 2 means – could you explain (with an example)?

Otherwise, it looks like option 3 could be a good compromise – the Info API doesn't change at all, and we can implement error checking wherever we think it is important. Could you give an example how this could be implemented for e.g. setting bad channels?

@mscheltienne
Copy link
Member Author

@cbrnr IMO, and I might be wrong, you need an access door with error checking for users, and another without error checking for MNE functions and internal operations. @hoechenberger suggested creating a new frozen info object (for users?), I am not sure how this would play out, maybe he can explain it further?

@hoechenberger
Copy link
Member

you need an access door with error checking for users, and another without error checking for MNE functions and internal operations.

Why would you need that?

@hoechenberger
Copy link
Member

@hoechenberger suggested creating a new frozen info object (for users?)

This is only if you want read-only properties as you suggested in your example. I don't think we need those, though.

@mscheltienne
Copy link
Member Author

Exactly for that, read-only access from a user perspective. If we don't need it, all the better.

@cbrnr
Copy link
Contributor

cbrnr commented Oct 5, 2021

So what are the options if we want to keep dict access and not add attribute access in addition? I'd be strongly in favor of having only one API, and that needs to be dict access for compatibility. Options 2 and 3?

@mscheltienne
Copy link
Member Author

@cbrnr Yes, 2 and 3 don't add an API.

@larsoner
Copy link
Member

larsoner commented Oct 5, 2021

It might be possible to do (2) in a way that isn't terrible:

class Info(dict):
    def __init__(self, *args, **kwargs):
         with self._unlocked():
             super().__init__(*args, **kwargs)

    def __setitem__(self, key, val)
         if self._locked:
              ... # check
         super().__setitem__(key, val)

    @contextmanager
    def _unlocked(self, check_after=True):
        self._locked = False
        try:
            yield
        else:
            if check_after:
                self._check_consistency()
        finally:
            self._locked = True

Then we internally/privately use with info._unlocked(): (or set locked/unlocked manually as needed) for example in create_info, mne/io, mne/channels, and some other places. We'll need it in probably it in many places, but to me it's worth it because even this code churn makes it more explicit that we're explicitly operating in unsafe/expert mode. We also easily get automatic _check_consistency calls at the re-locking step, currently we do this with manual calls anyway (currently we have 70+ _check_consistency calls in our code).

@drammock
Copy link
Member

drammock commented Oct 5, 2021

It might be possible to do (2) in a way that isn't terrible

I like this proposal, though I would make the method/context manager self._unlock() (a verb) rather than self._unlocked() (a noun → implies a state that might persist)

@mscheltienne
Copy link
Member Author

@hoechenberger @cbrnr @jasmainak What are your thoughts about @larsoner proposal?

@cbrnr
Copy link
Contributor

cbrnr commented Oct 13, 2021

Sounds good to me!

@jasmainak
Copy link
Member

I like it! How are we going to handle keys that are always unlocked? E.g., info['bads']

@mscheltienne
Copy link
Member Author

@jasmainak With a check in __setitem__, if the key is among the list of locked keys, it goes through the 'is_it_locked' process; else it skips it.

@cbrnr cbrnr changed the title Expose Info keys as attributes with scikit-learn's Bunch syntax. Expose Info keys as attributes with scikit-learn's Bunch syntax Oct 14, 2021
@mscheltienne
Copy link
Member Author

Everyone seems to be fine with the proposition by @larsoner #9767 (comment)
Closing and giving a shot to his proposition in a fresh PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants