Fix support with cuDF 2602. #12140

Merged
trivialfis merged 15 commits into dmlc:master from trivialfis:cudf-2602 on Apr 18, 2026

Conversation

Member

@trivialfis trivialfis commented Apr 3, 2026

Close #12138

This PR drops support for older versions of cuDF in exchange for cleaner code that no longer parses the Arrow C array. The approach was shared by @mroeschke.

  • Fix validity mask handling.
  • Replace the Arrow C device array with the CUDA array interface.

todos:

  • Revert CI tag.

Contributor

Copilot AI left a comment

Pull request overview

Fixes a cuDF 26.02 compatibility break that caused XGBoost to crash when ingesting cuDF categorical columns on GPU by adapting how category columns are converted to pylibcudf / Arrow device arrays.

Changes:

  • Update cudf_cat_inf to call to_pylibcudf() without the removed mode argument on newer cuDF versions.
  • Add a fallback path for older cuDF versions that still require/accept mode="read".
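The version-dependent call described in the list above can be sketched as a small gate on the cuDF version string. This is only an illustration: `parse_version` and `to_pylibcudf_kwargs` are hypothetical helpers, and the 26.02 threshold is an assumption taken from the version named in this PR, not a confirmed API boundary.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '26.02.00'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def to_pylibcudf_kwargs(cudf_version: str) -> dict:
    """Pick the keyword arguments for to_pylibcudf() by cuDF version."""
    # Assumption: the `mode` argument is no longer accepted from 26.02 onward.
    if parse_version(cudf_version) >= (26, 2):
        return {}
    # Older releases are assumed to still accept mode="read".
    return {"mode": "read"}


new_kwargs = to_pylibcudf_kwargs("26.02.00")
old_kwargs = to_pylibcudf_kwargs("25.12.00")
```

A caller could then write something like `cats._column.to_pylibcudf(**to_pylibcudf_kwargs(cudf.__version__))`. The PR description states that support for older cuDF versions is dropped, so a fallback of this kind may not be present in the merged code.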

Comment thread python-package/xgboost/_data_utils.py Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


Comment thread python-package/xgboost/_data_utils.py Outdated
@trivialfis trivialfis requested a review from hcho3 April 3, 2026 11:54
@jameslamb jameslamb mentioned this pull request Apr 6, 2026
Comment thread python-package/xgboost/_data_utils.py Outdated
  # pylint: disable=protected-access
- arrow_col = cats._column.to_pylibcudf(mode="read")
+ if cudf_read_only():
+     arrow_col = cats._column.to_pylibcudf()
Contributor

Just curious about the use case for this call.

IIUC at this point, this function wants to return the cuda array interface for categories that are strings but it appears you are going through the arrow c (device) interface to get there?

Member Author

@trivialfis trivialfis Apr 6, 2026

Thank you for looking into this. Yes, that's the intention of the function call. At the time this was suggested to me in an offline conversation, but if there's a better way to achieve this please share.

For context, XGBoost does the re-coding in CUDA/C++ and needs to extract the data from cuDF Python.
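One way to picture the hand-off from Python to native code is to serialize a buffer's CUDA array interface dictionary as JSON. The following is a minimal sketch under that assumption, not the code in this PR: `FakeDeviceBuffer` and `encode_interface` are hypothetical names, and the pointer value is made up so the example runs without a GPU.

```python
import json


class FakeDeviceBuffer:
    """Hypothetical stand-in for an object exposing __cuda_array_interface__."""

    def __init__(self, ptr: int, nbytes: int) -> None:
        self.__cuda_array_interface__ = {
            "shape": (nbytes,),
            "strides": None,
            "typestr": "|u1",
            "data": (ptr, False),  # (device pointer, read-only flag)
            "version": 3,
        }


def encode_interface(buf) -> bytes:
    """Serialize the CUDA array interface of `buf` to JSON bytes."""
    inf = dict(buf.__cuda_array_interface__)
    return json.dumps(inf).encode("utf-8")


encoded = encode_interface(FakeDeviceBuffer(ptr=4096, nbytes=2))
decoded = json.loads(encoded)
```

JSON has no tuple type, so `shape` and `data` come back as lists after a round trip; a native consumer would read the pointer from the first element of `data`.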

Contributor

Sure thing. The results from the pylibcudf.Column.data()/null_mask() functions expose a __cuda_array_interface__ attribute that you can use when composing jnames below. e.g.

In [1]: import pylibcudf as plc, pyarrow as pa

In [2]: plc_col = plc.Column.from_arrow(pa.array(["a", "b", None]))

# the data of the "values"
In [3]: plc_col.data().__cuda_array_interface__
Out[3]: 
{'shape': (2,),
 'strides': None,
 'typestr': '|u1',
 'data': (126954835543040, False),
 'version': 3}

# the null mask of the "values"
In [5]: plc_col.null_mask().__cuda_array_interface__
Out[5]: 
{'shape': (64,),
 'strides': None,
 'typestr': '|u1',
 'data': (126954835542016, False),
 'version': 3}

# the "offsets"
In [6]: plc_col.children()[0].data().__cuda_array_interface__
Out[6]: 
{'shape': (16,),
 'strides': None,
 'typestr': '|u1',
 'data': (126954835542528, False),
 'version': 3}

You would probably need to perform the string dtype check on the cuDF Python object first since the typestr here doesn't indicate "string" e.g.

if not (cats._column.dtype == np.dtype("object") or isinstance(cats._column.dtype, pd.StringDtype)):
    raise TypeError(
        "Unexpected type for category index. It's neither numeric nor string."
    )
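The buffer shapes in the session above are consistent with the Arrow string layout: 2 bytes of character data, 16 bytes of offsets (assumed here to be four little-endian int32 values for three rows), and a validity bitmask that appears to be padded to 64 bytes. A minimal host-side sketch of how the three buffers fit together, using plain bytes as stand-ins for the device memory; `decode_strings` is a hypothetical helper, not part of cuDF or XGBoost:

```python
import struct


def decode_strings(data: bytes, offsets_raw: bytes, mask: bytes, n: int) -> list:
    """Rebuild Python strings from host copies of Arrow-style string buffers."""
    # Assumption: offsets are little-endian int32, with n + 1 entries for n rows.
    offsets = struct.unpack(f"<{n + 1}i", offsets_raw[: 4 * (n + 1)])
    out = []
    for i in range(n):
        # Validity bits are packed LSB-first: bit (i % 8) of byte (i // 8) covers row i.
        valid = (mask[i // 8] >> (i % 8)) & 1
        if valid:
            out.append(data[offsets[i] : offsets[i + 1]].decode("utf-8"))
        else:
            out.append(None)
    return out


# Host-side stand-ins for the buffers of pa.array(["a", "b", None]).
data = b"ab"
offsets_raw = struct.pack("<4i", 0, 1, 2, 2)
mask = bytes([0b00000011]) + bytes(63)  # rows 0 and 1 valid, row 2 null

result = decode_strings(data, offsets_raw, mask, 3)
```

The null row still occupies an offsets entry (its start and end are equal), which is why the validity mask has to be consulted separately from the offsets.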

Contributor

Tangential question: I also see that DfCatAccessor here indicates you could be passing around a cudf/pd.Series.cat object. Is that intentional?

Member Author

Yes, it's intentional, mostly to get the cat accessor then pass it to these utilities to extract the C data.

Member Author

The results from the pylibcudf.Column.data()/null_mask() functions expose a __cuda_array_interface__ attribute that you can use when composing jnames below

Thank you for sharing, that's really helpful! I will look into it.

@trivialfis trivialfis changed the title from "Fix categorical data support with cuDF 2602." to "Fix support with cuDF 2602." Apr 16, 2026
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


Comment thread python-package/xgboost/_data_utils.py Outdated
Comment thread ops/pipeline/get-image-tag.sh Outdated
## See https://xgboost.readthedocs.io/en/latest/contrib/ci.html#making-changes-to-ci-containers

-IMAGE_TAG=main
+IMAGE_TAG=PR-78
Copilot AI Apr 16, 2026

This script is committing a PR-specific CI container tag (IMAGE_TAG=PR-78). If this merges as-is, downstream CI/users will likely fail once that ephemeral tag is removed or never published. Revert this to the standard tag (e.g., main) before merging, and keep any temporary CI tag override confined to the PR/CI configuration.

Suggested change
-IMAGE_TAG=PR-78
+IMAGE_TAG=main

Comment thread python-package/xgboost/_data_utils.py Outdated
@trivialfis trivialfis marked this pull request as ready for review April 18, 2026 11:10
@trivialfis trivialfis merged commit 5e7b49c into dmlc:master Apr 18, 2026
107 of 117 checks passed
@trivialfis trivialfis deleted the cudf-2602 branch April 18, 2026 14:47
Development

Successfully merging this pull request may close these issues.

xgboost 3.2.0 crashes with cudf 26.02 when there are categorical features

3 participants