accelerated-computing-cudf

Official NVIDIA-authored guidance for NVIDIA cuDF GPU DataFrames, pandas acceleration, dask-cuDF, ETL, joins, groupby, CSV/Parquet I/O, nullable semantics, and multi-GPU DataFrame workloads.

---
name: accelerated-computing-cudf
description: Official NVIDIA-authored guidance for NVIDIA cuDF GPU DataFrames, pandas acceleration, dask-cuDF, ETL, joins, groupby, CSV/Parquet I/O, nullable semantics, and multi-GPU DataFrame workloads.
license: CC-BY-4.0 AND Apache-2.0
metadata:
  author: NVIDIA
  tags:
    - cudf
    - dataframes
    - pandas
    - dask-cudf
    - etl
---

# cuDF & dask-cuDF Implementer's Guide

## Compatibility

- Release tracked by this skill: 26.04.
- Requires NVIDIA Volta or newer on CUDA 12, or Turing or newer on CUDA 13. Release 26.04 supports CUDA 12.2-12.9 with driver 535+ or CUDA 13.0-13.1 with driver 580+, and Python 3.11-3.14. cuDF sweet spot: >100K rows.

## Naming

Use NVIDIA library-first wording in user-facing answers. Keep literal RAPIDS/rapidsai URLs, package names, and release metadata when citing sources.

## Role

You are a cuDF expert helping an implementer work with GPU DataFrames. The user understands pandas and their data — your job is to get them to correct, fast GPU code with minimal friction. Choose the path from the user's intent: `cudf.pandas` for broad compatibility or minimal-change acceleration, explicit cuDF for named DataFrame migrations, hot ETL paths, and parity-sensitive work. Treat source schema, row counts, null placement, ordering, and numeric tolerances as user-visible behavior.

## Critical Rules

1. **Choose the right cuDF path.** Use `cudf.pandas` for broad compatibility or minimal-change acceleration. Use explicit cuDF when the user asks to migrate DataFrame code, inspect parity, optimize a visible ETL hot path, or control unsupported operations.
2. **Size gate: 100K rows minimum.** Below that, host-to-GPU transfer and kernel launch overhead usually outweigh the speedup; use small data for correctness checks and benchmark larger working sets for performance.
3. **Keep conversions at boundaries.** Use `.to_pandas()` or `.to_numpy()` only for display, plotting, CPU-only libraries, or final output boundaries. Note that `.values` on an explicit cuDF object returns a CuPy array that stays on the GPU. Keep intermediate ETL data on GPU.
4. **Float32 is your friend.** float64 columns use twice the memory and bandwidth of float32, and many GPUs have lower float64 throughput; cast early when precision allows.
5. **Validate semantics on representative slices.** For null handling, joins, time series, reshape, or grouped logic, keep a small pandas reference path and compare shape, labels, null counts, ordering, and representative values before claiming parity.
6. **For data > GPU memory**, move to dask-cuDF with `enable_cudf_spill=True`. See `references/dask-cudf-patterns.md`.
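Rule 4 depends on knowing when precision allows a cast. One way to check, sketched here in plain Python so it runs without a GPU, is to round-trip a representative sample through float32 and look at the worst absolute error. The helper name `max_float32_error`, the sample values, and the tolerance are illustrative assumptions, not part of cuDF.

```python
import struct

_F32 = struct.Struct("<f")  # IEEE 754 single precision, little-endian


def max_float32_error(values):
    """Return the largest absolute error introduced by storing values as float32."""
    worst = 0.0
    for v in values:
        # Pack to 4 bytes and unpack again: this is the float32 rounding step.
        as_f32 = _F32.unpack(_F32.pack(v))[0]
        worst = max(worst, abs(as_f32 - v))
    return worst


# Hypothetical samples: small exact fractions survive the cast,
# large currency-like amounts lose whole units.
ratios = [0.5, 0.25, 0.125]
amounts = [16_777_217.0, 123_456_789.12]

tolerance = 0.005  # assumed requirement: half a cent
ratios_ok = max_float32_error(ratios) <= tolerance
amounts_ok = max_float32_error(amounts) <= tolerance
```

If a column fails a check like this, keeping it as float64 is the safer choice; the remaining columns can still be cast.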

## Three Paths to GPU DataFrames

### Path 1: cudf.pandas Accelerator (Compatibility / Minimal Change)

Use when the user needs a small code change, third-party pandas compatibility,
or one code path that can keep running while unsupported operations fall back.

**Jupyter/IPython:**
```python
%load_ext cudf.pandas
import pandas as pd   # now GPU-backed; falls back silently for unsupported ops
```

**Script:**
```bash
python -m cudf.pandas my_script.py
```

**With multiprocessing:**
```python
import cudf.pandas
cudf.pandas.install()   # must come BEFORE pandas import, before Pool creation
from multiprocessing import Pool
```

Confirm acceleration with the cudf.pandas profiler before claiming speedup.
For notebook, CLI, and stats examples, read
`references/cudf-pandas-accelerator.md`. If the profile shows the hot path
running on CPU, use Path 2 for explicit cuDF control.

### Path 2: Explicit cuDF API

For full control, hot-path optimization, named DataFrame migrations, and
parity-sensitive operations:

```python
import cudf

# Read data directly to GPU
df = cudf.read_parquet("data.parquet")

# Operations mirror pandas
result = df.groupby("key")["value"].sum()
merged = df.merge(lookup, on="id", how="left")
filtered = df[df["amount"] > 1000]

# String operations
df["clean"] = df["name"].str.strip().str.lower()

# To check API coverage before committing to migration:
# See references/api-patterns.md for known gaps and workarounds
```

**Keep data on GPU end-to-end.** Only call `.to_pandas()` at the very end, for display or handoff to CPU-only code.

Prefer explicit cuDF for tasks involving `read_csv`/`read_parquet`, joins,
groupby, reshape, nullable types, `fillna`/`where`, time buckets, rolling
windows, or CPU/GPU parity checks. Add a small CPU/GPU validation path when
semantics matter instead of relying on successful execution alone.

For pandas code with null handling, reshape, or time-series behavior, read
`references/api-patterns.md` for the relevant semantic checklist before
rewriting. A `cudf.pandas` bootstrap is enough for a minimal-change request; an
implementation request should make the hot path explicit and observable.

For reshape-heavy pandas code (`pivot_table`, `melt`, `stack`/`unstack`,
`crosstab`), keep the source schema as part of the contract: index labels,
column labels or levels, `fill_value`, `aggfunc`, margins, and normalization.
Use explicit cuDF where the equivalent is supported; use `cudf.pandas` or a
narrow compatibility boundary when exact pandas reshape semantics matter more
than rewriting every operation. Add a small pandas-reference parity check for
shape, labels, and representative values before finalizing. See
`references/api-patterns.md`.

### Path 3: dask-cuDF (Multi-GPU / Large Data)

Use when the dataset exceeds GPU memory. See `references/dask-cudf-patterns.md` for full patterns.

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster(enable_cudf_spill=True)  # one worker per GPU
client = Client(cluster)

ddf = dask_cudf.read_parquet("s3://bucket/data/*.parquet")
result = ddf.groupby("key").agg({"value": "sum"}).compute()
```

## Memory Management

**Enable spill before OOM happens** (not after):
```python
import cudf
cudf.set_option("spill", True)   # spill to host RAM when GPU is full
```

**RMM async pool allocator** (reduces cudaMalloc overhead in pipelines with many allocations):
```python
import rmm

# Must be called BEFORE any cuDF operation allocates GPU memory
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
```

| GPU Free vs Dataset | Strategy |
|---|---|
| Free > 2× dataset | Single GPU cuDF |
| Free 1–2× dataset | cuDF + `cudf.set_option("spill", True)` |
| Dataset > GPU mem | dask-cuDF |
| Dataset > node mem | dask-cuDF + multi-node (see accelerated-computing-mpf) |
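
The table above can be read as a small decision function. The sketch below is one possible encoding, not an NVIDIA-provided API: the name `pick_strategy` is hypothetical, "GPU mem" is approximated by free GPU memory, and how the byte counts are measured is left to the caller.

```python
GIB = 1024 ** 3


def pick_strategy(gpu_free_bytes, dataset_bytes, node_mem_bytes):
    """Map memory headroom to a strategy, following the table row by row."""
    if dataset_bytes > node_mem_bytes:
        return "dask-cuDF multi-node"
    if dataset_bytes > gpu_free_bytes:
        return "dask-cuDF"
    if gpu_free_bytes > 2 * dataset_bytes:
        return "single-GPU cuDF"
    # Free memory is between 1x and 2x the dataset: enable spilling.
    return "cuDF with spill"


# Assumed example: 40 GiB free on the GPU, 256 GiB on the node.
small = pick_strategy(40 * GIB, 10 * GIB, 256 * GIB)
tight = pick_strategy(40 * GIB, 30 * GIB, 256 * GIB)
large = pick_strategy(40 * GIB, 100 * GIB, 256 * GIB)
huge = pick_strategy(40 * GIB, 500 * GIB, 256 * GIB)
```

Intermediate results from joins and groupby can be larger than the input, so the dataset size passed in should reflect the peak working set rather than the file size on disk.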

## Troubleshooting

**No speedup vs pandas:**
- Data < 100K rows? GPU overhead dominates, so treat the run as correctness validation and measure speedup on a larger working set.
- Run `%%cudf.pandas.profile` — high CPU % means many fallbacks. Identify and fix those ops.
- Check `references/api-patterns.md` for known gaps.

**OOM (CUDA out of memory):**
1. Enable spill: `cudf.set_option("spill", True)`
2. If allocator fragmentation or repeated allocation overhead is visible, use the `accelerated-computing-rmm` memory-resource setup guidance before GPU allocations
3. Still failing: move to dask-cuDF

**AttributeError / NotImplementedError:**
- Check `references/api-patterns.md` for the specific operation
- Keep that one operation on CPU at a narrow boundary and continue the supported pipeline on GPU
- Use `.to_pandas()` only for the unsupported op, then `cudf.from_pandas()` to move the result back to the GPU
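
A narrow CPU boundary can be wrapped in a small helper so the round trip is visible in one place. This is a minimal sketch: `apply_on_cpu` and its `to_gpu` parameter are hypothetical names, and the only cuDF calls assumed are `.to_pandas()` and `cudf.from_pandas()`.

```python
def apply_on_cpu(gdf, cpu_op, to_gpu=None):
    """Run one pandas-only operation on the host, then return to the GPU."""
    if to_gpu is None:
        # Imported lazily so the helper can be defined without a GPU present.
        import cudf
        to_gpu = cudf.from_pandas
    pdf = gdf.to_pandas()        # device -> host copy, only for this step
    return to_gpu(cpu_op(pdf))   # host -> device copy, pipeline continues on GPU


# Intended use (the operation shown is a placeholder for the unsupported call):
# gdf = apply_on_cpu(gdf, lambda pdf: pdf.some_unsupported_method())
```

Each call pays two transfers, so it is worth applying after filters and projections have already shrunk the data.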

**Wrong results vs pandas:**
- Null/NaN handling differs: cuDF uses `<NA>` (nullable) by default, pandas uses `NaN`. See `references/api-patterns.md`.
- Sort stability: cuDF sort is not guaranteed stable unless `stable=True` is passed
- Floating-point differences: first try casting to higher precision (e.g. `float64` instead of `float32`). If small differences remain, stop chasing bitwise equality and compare within a tolerance instead. GPU and CPU reductions can accumulate values in different orders, and because floating-point arithmetic is non-associative, the last bits of the result can legitimately differ.
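
The floating-point point can be reproduced without a GPU. The sketch below uses plain Python floats to model a sequential sum versus a chunked sum (a stand-in for a parallel reduction; the function names are illustrative), then compares with a tolerance instead of `==`.

```python
import math


def sequential_sum(values):
    """Add left to right, as a simple CPU loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk):
    """Sum fixed-size chunks first, then combine the partial sums."""
    partials = [
        sequential_sum(values[i:i + chunk]) for i in range(0, len(values), chunk)
    ]
    return sequential_sum(partials)


values = [0.1] * 10
cpu_like = sequential_sum(values)   # one accumulation order
gpu_like = chunked_sum(values, 5)   # a different accumulation order

bitwise_equal = cpu_like == gpu_like
within_tolerance = math.isclose(cpu_like, gpu_like, rel_tol=1e-9)
```

Here the two orders disagree in the last bit but agree within tolerance, which is the outcome a parity check should accept. The tolerance itself is an assumption to set per workload.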

## Nullable and Fill Semantics

When the user explicitly cares about pandas nullable dtypes, `fillna`,
`where`/`mask`, or grouped null behavior, treat parity checks as part of the
implementation. See `references/api-patterns.md` for nullable dtype examples.

- Preserve nullable integer/string columns instead of filling them with sentinel
  values unless the source code already did that.
- Keep `where`/`mask` semantics when they encode a condition. Use broad
  `fillna` only when the condition is exactly null-only.
- Compare with `to_pandas(nullable=True)` when the pandas reference uses
  nullable extension dtypes.
- Put the parity check in a reusable helper next to the GPU path, so future
  changes exercise the same nullable conversion and aggregation checks.
- Validate row counts, null counts, mask truth tables, grouped aggregates, and
  representative dtypes before claiming semantic parity.
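
A reusable parity helper along these lines might collect the same facts from both paths and report mismatches. This is a sketch under assumptions: `summarize` and `diff_summaries` are hypothetical names, `summarize` expects a pandas DataFrame (for the GPU path, the result of `to_pandas(nullable=True)`), and grouped aggregates and mask truth tables would need their own checks.

```python
def summarize(frame):
    """Collect parity-relevant facts from a pandas DataFrame."""
    return {
        "rows": len(frame),
        "columns": [str(c) for c in frame.columns],
        "dtypes": {str(c): str(t) for c, t in frame.dtypes.items()},
        "nulls": {str(c): int(n) for c, n in frame.isna().sum().items()},
    }


def diff_summaries(cpu, gpu):
    """Return readable mismatches; an empty list means the summaries agree."""
    problems = []
    for key in ("rows", "columns", "dtypes", "nulls"):
        if cpu[key] != gpu[key]:
            problems.append(f"{key}: pandas={cpu[key]!r} cuDF={gpu[key]!r}")
    return problems


# Intended use next to the GPU path:
# reference = summarize(pdf)                            # pandas reference
# candidate = summarize(gdf.to_pandas(nullable=True))   # cuDF result
# assert not diff_summaries(reference, candidate)

# Hand-written summaries showing a nullable column that was filled by mistake.
reference = {"rows": 3, "columns": ["id"], "dtypes": {"id": "Int64"}, "nulls": {"id": 1}}
candidate = {"rows": 3, "columns": ["id"], "dtypes": {"id": "int64"}, "nulls": {"id": 0}}
problems = diff_summaries(reference, candidate)
```

Returning a list rather than raising on the first mismatch makes it easier to see whether a dtype change and a null-count change share one cause.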

## Reference Files

- `references/cudf-pandas-accelerator.md` — Profiling, fallback detection, cudf.pandas deep dive
- `references/api-patterns.md` — Known API gaps, workarounds, semantic differences
- `references/dask-cudf-patterns.md` — Multi-GPU patterns, best practices, partition tuning

## External Documentation

Use WebFetch to retrieve detailed API signatures, parameter descriptions, and examples on demand.

- **cuDF Documentation:** https://docs.rapids.ai/api/cudf/stable/
- **dask-cuDF API Reference:** https://docs.rapids.ai/api/dask-cudf/stable/api/
- **GitHub:** https://github.com/rapidsai/cudf
- **CHANGELOG:** https://github.com/rapidsai/cudf/blob/main/CHANGELOG.md

Source

Creator's repository · nvidia/skills
