Skip to content

Conversation

@lidavidm
Copy link
Member

This expands UnifySchemas to handle more types, e.g. int32 + float64 = float64, by adding flags for various types.

The flags here may be overly fine-grained.

Not implemented here:

@github-actions
Copy link

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some high level questions from a quick look. But this will be a nice improvement to have more flexible schema unification!
(I personally don't have a strong opinion on the need of exposing the fine-grained options or not. It seems fine to me to keep this)

On the python side, I think this can be exposed in concat_tables, which uses ConcatenateTables under the hood which already supports passing those field_merge_options.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, that would mean that int64 cannot be unified with float32? (since that bitwidth is not "equal or greater") Or that it would be unified to float64?

For reference, numpy promotes that specific case to float64.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As implemented, you need also promote_numeric_width to unify int64 and float32 (to float64).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, I'm wrong (awkward). Yes, int64 + float32 = float64.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. or greater bit width?

Eg in numpy uint32 + int64 -> int64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reading clarity, can we list here the ones that are still left to False in a comment? (eg promote_dictionary_ordered is still False, which I think is good)

promote_decimal should probably be added as true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I didn't update this after adding some of the additional options. I'll document what's not enabled as you suggest, thanks.

@lidavidm
Copy link
Member Author

Corrected the doc comments.

For concat_tables, this requires us to have access to Cast() inside table.cc, so we need ARROW-8891 or some #ifdef'ing to implement this.

@lidavidm
Copy link
Member Author

I've used some #ifdefs so that ConcatenateTables can cast columns as necessary now. I still have some TODOs listed up top before I can undraft this.

@pitrou
Copy link
Member

pitrou commented Jul 13, 2022

@lidavidm Do we still want to do this? If so, do you want to push it forward?

@lidavidm
Copy link
Member Author

I think it's still useful to have, but I don't think I'll have time to push this forward anytime soon.

@lidavidm lidavidm closed this Feb 16, 2023
jorisvandenbossche added a commit that referenced this pull request Oct 10, 2023
…36846)

Revival of #12000

### Rationale for this change

It would be great to be able to do promotions when `concat`'ing a table, such as:

```python
def test_concat_tables_with_promotion_int():
    import pyarrow as pa
    t1 = pa.Table.from_arrays(
        [pa.array([1, 2], type=pa.int64())], ["int"])
    t2 = pa.Table.from_arrays(
        [pa.array([3, 4], type=pa.int32())], ["int"])

    result = pa.concat_tables([t1, t2], promote=True)

    assert result.equals(pa.Table.from_arrays([
        pa.array([1, 2, 3, 4], type=pa.int64())
    ], ["int"]))
```

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: #36845

Lead-authored-by: Fokko Driesprong <fokko@tabular.io>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…es` (apache#36846)

Revival of apache#12000

### Rationale for this change

It would be great to be able to do promotions when `concat`'ing a table, such as:

```python
def test_concat_tables_with_promotion_int():
    import pyarrow as pa
    t1 = pa.Table.from_arrays(
        [pa.array([1, 2], type=pa.int64())], ["int"])
    t2 = pa.Table.from_arrays(
        [pa.array([3, 4], type=pa.int32())], ["int"])

    result = pa.concat_tables([t1, t2], promote=True)

    assert result.equals(pa.Table.from_arrays([
        pa.array([1, 2, 3, 4], type=pa.int64())
    ], ["int"]))
```

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#36845

Lead-authored-by: Fokko Driesprong <fokko@tabular.io>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…es` (apache#36846)

Revival of apache#12000

### Rationale for this change

It would be great to be able to do promotions when `concat`'ing a table, such as:

```python
def test_concat_tables_with_promotion_int():
    import pyarrow as pa
    t1 = pa.Table.from_arrays(
        [pa.array([1, 2], type=pa.int64())], ["int"])
    t2 = pa.Table.from_arrays(
        [pa.array([3, 4], type=pa.int32())], ["int"])

    result = pa.concat_tables([t1, t2], promote=True)

    assert result.equals(pa.Table.from_arrays([
        pa.array([1, 2, 3, 4], type=pa.int64())
    ], ["int"]))
```

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#36845

Lead-authored-by: Fokko Driesprong <fokko@tabular.io>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants