RAPIDS cuDF for accelerated data science on Google Colab
Update: This blog was written before RAPIDS cuDF was available by default on Colab. RAPIDS cuDF now accelerates pandas code by up to 50x on Colab with zero code changes, and is available in the default runtime environment. Just add %load_ext cudf.pandas at the top of your notebook, before importing pandas. Try acceleration on pandas workloads today with the 10 minute guide.
NVIDIA GPUs have become one of the most effective ways to accelerate computationally intensive machine learning tasks. Now, thanks to RAPIDS cuDF, GPUs can also turbocharge your data analysis work.
What is RAPIDS cuDF?
RAPIDS cuDF is an open-source, GPU-accelerated dataframe library that implements the familiar pandas API for processing and analyzing your data. The Python cuDF interface is built on libcudf, the CUDA/C++ computational core that accelerates fundamental data operations, from ingestion and parsing to joins, aggregations, and more. For some workloads, you will find that switching from import pandas to import cudf speeds up data processing by 10x or more.
For example, a simple join operation can go from 761 ms to 27 ms just by switching to cuDF.
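To make the comparison concrete, here is a minimal sketch of timing a join. The helper name timed_join, the synthetic columns, and the row count are our own assumptions for illustration, not the benchmark behind the numbers above. The sketch uses cuDF when it can be imported and falls back to pandas otherwise, so it also runs on a CPU-only runtime.

```python
import time

import numpy as np

try:
    import cudf as xdf  # GPU dataframe library (needs an NVIDIA GPU runtime)
except ImportError:
    import pandas as xdf  # CPU fallback so the sketch still runs


def timed_join(lib, n_rows=100_000):
    """Inner-join two synthetic frames on `key`; return (result, seconds).

    `lib` is either the cudf or the pandas module. The data is made up
    purely for illustration.
    """
    keys = np.arange(n_rows)
    even_keys = np.arange(0, n_rows, 2)  # only every other key matches
    left = lib.DataFrame({"key": keys, "a": keys * 2})
    right = lib.DataFrame({"key": even_keys, "b": even_keys + 1})

    start = time.perf_counter()
    joined = left.merge(right, on="key", how="inner")
    elapsed = time.perf_counter() - start
    return joined, elapsed


joined, seconds = timed_join(xdf)
print(f"{xdf.__name__}: joined {len(joined)} rows in {seconds * 1000:.1f} ms")
```

Your timings will depend on the GPU you are assigned and on the data size, and the first GPU call usually includes one-time startup overhead, so run the cell more than once before comparing.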
Getting started with RAPIDS on Colab
Now it’s easier than ever to get started with RAPIDS on Colab. With Colab’s default runtime update to Python 3.8 and the new RAPIDS pip packages, you can try out NVIDIA GPU-accelerated data science right in your browser. Running RAPIDS on Colab requires just two quick steps:
- First, select a Colab runtime that uses a GPU accelerator. Navigate to the “Runtime” menu and select “Change runtime type,” then choose “GPU” from the dropdown and click “Save.” The NVIDIA GPU that you receive from Colab may vary across sessions and can be a newer GPU or an older generation. With the new “Pay As You Go” Tier in Colab, you now have the option to upgrade your runtime to “Premium GPUs” with Colab Pro, enabling access to more powerful NVIDIA A100 or V100 Tensor Core GPUs. See Google’s blog post for more information on GPU availability.
- Second, install RAPIDS cuDF in your notebook. With the new RAPIDS pip packages, this step is easier than ever. Execute the following command in a code block and you will be set up to run RAPIDS. Make sure to restart your runtime after the installation completes:
!pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
!rm -rf /usr/local/lib/python3.8/dist-packages/cupy*
!pip install cupy-cuda11x
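The cleanup command above hard-codes the Python 3.8 path. If your runtime uses a different Python version, the following sketch shows one way to find the matching directory. The helper names are hypothetical, and the /usr/local/lib/pythonX.Y/dist-packages layout is an assumption carried over from the command above. The sketch only lists matching paths and does not delete anything.

```python
import glob
import sys


def dist_packages_dir(version_info=sys.version_info):
    """Build the dist-packages path for a given Python version.

    Assumes the /usr/local/lib/pythonX.Y/dist-packages layout used in the
    install commands above.
    """
    return f"/usr/local/lib/python{version_info[0]}.{version_info[1]}/dist-packages"


def stale_cupy_paths(version_info=sys.version_info):
    """List (but do not remove) preinstalled cupy* entries in that directory."""
    return sorted(glob.glob(f"{dist_packages_dir(version_info)}/cupy*"))


print("dist-packages:", dist_packages_dir())
print("cupy entries found:", stale_cupy_paths() or "none")
```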
Finally, check that import cudf completes successfully in a new code block, and then you are ready to go. If you run into any trouble, please reach out in the RAPIDS Slack and we’ll help you get things working correctly.
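Because the GPU you receive can vary across sessions, it can help to confirm what the runtime actually provides before you start. The following is a small sketch using only the Python standard library and the nvidia-smi command-line tool. The helper name gpu_name is our own, and the sketch assumes nvidia-smi is on the PATH whenever a GPU runtime is active.

```python
import shutil
import subprocess


def gpu_name():
    """Return the first GPU name reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this runtime
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    output = proc.stdout.strip()
    if proc.returncode != 0 or not output:
        return None
    return output.splitlines()[0]


name = gpu_name()
print("GPU:", name if name else "none detected; check the runtime type")
```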
Running 10 minutes to cuDF on Colab
Now that you have a working cuDF installation and a GPU, you can run our tutorial notebook, “10 minutes to cuDF.” This notebook is inspired by a similar guide from the pandas community and is a streamlined version of our full notebook, “10 Minutes to cuDF and Dask-cuDF.”
Running through the notebook, you will find examples of dataframe creation, data filtering, transformation, joins, aggregations and more. We’ve also included file reading and writing examples for Parquet, ORC and CSV formats. As you investigate more complex data processing, we hope that you use this as a companion to cuDF’s documentation.
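As a taste of what the notebook covers, here is a short sketch of dataframe creation, filtering, aggregation, and a CSV round trip. The column names and values are invented for illustration and are not taken from the notebook itself. As before, the sketch falls back to pandas when cuDF cannot be imported.

```python
import os
import tempfile

try:
    import cudf as xdf  # GPU dataframe library
except ImportError:
    import pandas as xdf  # CPU fallback with the same API for these calls

# Create a small dataframe (made-up data).
df = xdf.DataFrame(
    {"group": ["a", "b", "a", "b", "c"], "value": [1, 2, 3, 4, 5]}
)

# Filter rows with a boolean mask, then aggregate per group.
filtered = df[df["value"] > 1]
totals = filtered.groupby("group")["value"].sum().sort_index()
print(totals)

# Write the frame to CSV and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    df.to_csv(path, index=False)
    round_trip = xdf.read_csv(path)

print(round_trip.head())
```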
Exploring the rest of RAPIDS
When you are ready to dive deeper, RAPIDS also includes Dask-cuDF for large workflows, cuML for scikit-learn-compatible, accelerated machine learning, and cuGraph for graph data analytics. Update your Colab notebook with the extended installation list, as shown in the following code block, and you’ll be ready to use the complete toolkit.
!pip install cudf-cu11 dask-cudf-cu11 cuml-cu11 cugraph-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
!rm -rf /usr/local/lib/python3.8/dist-packages/cupy*
!pip install cupy-cuda11x
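After the extended install and a runtime restart, you may want to confirm which libraries are visible to Python before importing them. The sketch below is one way to do that with the standard library. The helper name installed_modules is hypothetical, and the module names listed are the usual import names for these packages.

```python
import importlib.util

# Import names for the packages installed above.
RAPIDS_MODULES = ["cudf", "dask_cudf", "cuml", "cugraph"]


def installed_modules(names=RAPIDS_MODULES):
    """Map each module name to True if Python can locate it, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for module, found in installed_modules().items():
    print(f"{module}: {'found' if found else 'missing'}")
```

This only checks that each package can be located; an actual import still needs a working GPU runtime.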
Here are some additional RAPIDS notebooks you can explore to learn more about RAPIDS:
If you’d like to see more real-world examples, here are recent articles illustrating RAPIDS data science tools in action: