Skip to content

Improvements to SplitMatrix#91

Closed
Marc-Antoine Schmidt (MarcAntoineSchmidtQC) wants to merge 11 commits intomasterfrom
SplitMatrix-improved-usability
Closed

Improvements to SplitMatrix#91
Marc-Antoine Schmidt (MarcAntoineSchmidtQC) wants to merge 11 commits intomasterfrom
SplitMatrix-improved-usability

Conversation

@MarcAntoineSchmidtQC
Copy link
Copy Markdown
Member

  • Allow SplitMatrix to be constructed from another SplitMatrix.
  • Allow inputs of SplitMatrix to be 1-d
  • Implement __getitem__ for column subset
  • Also had to implement column subsetting for CategoricalMatrix
  • __repr__ uses the __repr__ method of components instead of str()

ToDo:

  • FIX BUG WITH _split_col_subsets (first confirm that it's a bug)
  • Add testing for new features

Checklist

  • Added a CHANGELOG.rst entry

- Allow SplitMatrix to be constructed from another SplitMatrix.
- Allow inputs of SplitMatrix to be 1-d
- Implement __getitem__ for column subset
- Also had to implement column subsetting for CategoricalMatrix
- __repr__ uses the __repr__ function of components instead of str()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Comment thread src/quantcore/matrix/split_matrix.py Outdated
colmap[idx] = [i, j]
return colmap

def _split_col_subsets_unordered(self, cols):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems occasionally to return empty lists when one indexes columns with a list, which causes is_sorted to throw an error later.

MRE:

import pandas as pd
import quantcore.matrix as mx

df = pd.DataFrame({"u": ["a", "b"], "v": ["a", "b"]})
X = mx.from_pandas(df, cat_threshold=False, object_as_cat=True)
X[:, [0, 1]]

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't replicate this on macOS. I assume you tried this on Windows. Is that correct? Can you give me some info on your setup?

Also, you said "occasionally". Does it always throw an error when you try your MRE or not?

Copy link
Copy Markdown
Member

@lbittarello Luca Bittarello (lbittarello) Jul 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it always throw an error when you try your MRE or not?

Yes. But it doesn't throw an error if we reduce the number of rows from two to one, for example.

I assume you tried this on Windows. Is that correct?

Yes. I'm not embarrassed.

Can you give me some info on your setup?

python           : 3.8.3.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19041
machine          : AMD64
processor        : AMD64 Family 23 Model 96 Stepping 1, AuthenticAMD
byteorder        : little
pandas           : 1.3.0
numpy            : 1.20.3

Is this helpful? Do you need something else?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we perhaps add unit tests for indexing? The Windows CI could come in handy.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, unit testing will be very valuable. I'll temporarily add the Windows CI to this PR for all the pushes. And yes, this is helpful.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on indexing unit tests.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is huge!! Thanks Marc for making a really big improvement to SplitMatrix. Let me know if there are any pieces that you would like to hand off. It feels like you started what seemed like a small project and it's creeped outwards a couple times now. So, don't feel like you're committed and stuck finishing this if you have other important stuff going on.

return CategoricalMatrix(self.cat[row])
else:
# return a SparseMatrix if we subset columns
return SparseMatrix(self.tocsr()[row, col], dtype=self.dtype)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite inefficient because we construct the full sparse matrix with self.tocsr() before subsetting it. I'm fine leaving it like this for now, but I think it'd be good to at least leave a TODO comment or add an issue mentioning this performance bug. Fixing this for the single element case is quite easy. For the [:, cols] case, I guess we need to construct a sparse matrix element by element and I'm guessing that it'll be easiest to do that in Cython.

Comment thread src/quantcore/matrix/split_matrix.py
colmap[idx] = [i, j]
return colmap

def _split_col_subsets_unordered(self, cols):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on indexing unit tests.

added docstring

Co-authored-by: Ben Thompson <t.ben.thompson@gmail.com>
This is a big commit with many changes:
- partial support of integer indexing
- removed advanced indexing for densematrix and sparsematrix
- ensure that indexing of splitmatrix generates basematrix type
- partial fix of standardizedmatrix indexing
 - added indexing tests (currently fails)
@waclawkusnierczykqc
Copy link
Copy Markdown

Quick thought: it looks like several loosely related updates within one PR.
Perhaps worth considering pushing them in a few more focused PRs?

@MarcAntoineSchmidtQC
Copy link
Copy Markdown
Member Author

Quick thought: it looks like several loosely related updates within one PR.
Perhaps worth considering pushing them in a few more focused PRs?

Great idea. There are finished features that I would like to merge soon so I'll create more focused PRs with them.

@waclawkusnierczykqc
Copy link
Copy Markdown

Quick thought: it looks like several loosely related updates within one PR.
Perhaps worth considering pushing them in a few more focused PRs?

Great idea. There are finished features that I would like to merge soon so I'll create more focused PRs with them.

Yes, being able to push some of the changes independently of others is one advantage.
Among the others, there is clear focus, easier reviewing, and easier reverting if need be.

@MarcAntoineSchmidtQC
Copy link
Copy Markdown
Member Author

This PR has been separated into chunks. See PR #109, PR #110, and PR #111.

What remains is the big mess that is column indexing with splitmatrix.

@MarcAntoineSchmidtQC
Copy link
Copy Markdown
Member Author

closing. Most changes were implemented in other PRs and we will clearly take another approach to dealing with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants