@mikegraham
Contributor

closes #15224

@mikegraham
Contributor Author

Here's an initial pass at stealing https://github.com/python-git/python/blob/master/Objects/tupleobject.c#L290 for the combining. I am not 100% sure that the problem is my (rather crude) combiner; it may instead be the exact way we're using the bitmixer in hash_array. I'm still trying to think about it, but I think we might be maintaining undesirable linearity.

May I ask, how did you encounter these collisions?
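
For readers who don't have tupleobject.c open, here is a minimal sketch of the combining scheme that CPython's tuple hash used at the time (it changed in Python 3.8). It is written with plain Python integers and an explicit 64-bit mask so the wraparound is visible; the name `combine_hashes` is made up for illustration, the C version's signed arithmetic and its special-casing of -1 are left out, and this is not necessarily the code that ended up in pandas.

```python
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit wraparound


def combine_hashes(item_hashes):
    """Combine per-item hashes in the style of the old CPython tuple hash.

    `combine_hashes` is a hypothetical name used only for this sketch.
    """
    hashes = list(item_hashes)
    remaining = len(hashes)
    mult = 1000003      # initial multiplier
    acc = 0x345678      # initial accumulator
    for h in hashes:
        remaining -= 1
        # XOR the item hash in, then multiply so earlier items get re-mixed.
        acc = ((acc ^ h) * mult) & MASK64
        # The multiplier changes per position, which makes the result
        # depend on item order.
        mult = (mult + 82520 + 2 * remaining) & MASK64
    return (acc + 97531) & MASK64


# Swapping the items changes the combined hash.
print(combine_hashes([1, 2]) != combine_hashes([2, 1]))
```

A vectorized version would apply the same three steps to uint64 arrays, one array per level, instead of to single integers.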

@mikegraham
Contributor Author

If the basic approach looks sound I can add some comments around some of the lazy iterator wackiness.

arrays = itertools.chain([first], arrays)

mult = np.zeros_like(first) + np.uint64(1000003)
out = np.zeros_like(first) + np.uint64(0x345678L)

The `L` suffix does not work in py3 (remove it and it's OK).

@jreback
Contributor

jreback commented Jan 25, 2017

@mikegraham the collisions I found by hashing

In [3]: i = pd.MultiIndex.from_product([np.arange(1000),np.arange(1000)],names=['one','two'])

In [4]: i.to_dataframe(index=False) 

which is basically a cartesian product of 1000 x 1000. nothing special really, just a test case I am using.
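
As a rough illustration of why a cartesian product is a good stress test for a hash combiner, the sketch below counts collisions for two toy combiners over all pairs from `range(n)`. Everything here is an assumption made for the example: the per-item "hashes" are just the integers themselves (not pandas' hash_array), the helper names are hypothetical, and `n` is 100 rather than 1000 to keep it quick.

```python
import itertools

MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit wraparound


def count_collisions(hashes):
    """Number of hashes that duplicate an already-seen value."""
    hashes = list(hashes)
    return len(hashes) - len(set(hashes))


def xor_combine(a, b):
    # Symmetric and linear: (a, b) and (b, a) always collide.
    return a ^ b


def mixing_combine(a, b):
    # Tuple-hash-style mixing for two items; the second multiplier is
    # 1000003 + 82520 + 2 * 1.
    acc = ((0x345678 ^ a) * 1000003) & MASK64
    return ((acc ^ b) * 1082525) & MASK64


n = 100
pairs = list(itertools.product(range(n), repeat=2))

xor_collisions = count_collisions(xor_combine(a, b) for a, b in pairs)
mixing_collisions = count_collisions(mixing_combine(a, b) for a, b in pairs)

print("xor combiner collisions:   ", xor_collisions)
print("mixing combiner collisions:", mixing_collisions)
```

The XOR combiner can produce at most 128 distinct values for inputs below 100, so nearly every pair collides, while the order-sensitive combiner keeps the pairs apart on this input.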

@jreback jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jan 25, 2017
@mikegraham mikegraham force-pushed the emulate_tuple branch 2 times, most recently from 7117b6b to e52c872 Compare January 25, 2017 21:10
@jreback jreback added this to the 0.20.0 milestone Jan 25, 2017
@jreback
Contributor

jreback commented Jan 25, 2017

closing in favor of #15224

thanks @mikegraham

@jreback jreback closed this Jan 25, 2017
jreback added a commit that referenced this pull request Jan 27, 2017
closes #15227

Author: Jeff Reback <jeff@reback.net>
Author: Mike Graham <mikegraham2gmail.com>

Closes #15224 from jreback/mi_hash2 and squashes the following commits:

8b1d3f9 [Jeff Reback] not correctly hashing categorical in a MI
48a2402 [Jeff Reback] support for mixed type arrays
58f682d [Jeff Reback] memory optimization
0c13df7 [Mike Graham] Steal the algorithm used to combine hashes from tupleobject.c
e8dd607 [Jeff Reback] add hash_tuples
44e9c7d [Mike Graham] wipSteal the algorithm used to combine hashes from tupleobject.c
e507c4a [Jeff Reback] ENH: support MultiIndex and tuple hashing
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#15227