
The best Python libraries for parallel processing

Wednesday, October 23, 2024, 11:00 AM, from InfoWorld
Python is powerful, versatile, and programmer-friendly, but it isn’t the fastest programming language around. Some of Python’s speed limitations are due to its default implementation, CPython, being effectively single-threaded: CPython’s Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, so the interpreter doesn’t use more than one hardware thread at once.

And while you can use Python’s built-in threading module to speed things up, threading only gives you concurrency, not parallelism. It’s good for running multiple tasks that aren’t CPU-bound, but does nothing to speed up tasks that each require a full CPU core. Python 3.13 introduced a special ‘free-threading’ or ‘no-GIL’ build of the interpreter to allow full parallelism with Python threads, but it’s still considered an experimental feature. For now, it’s best to assume that threading in Python, by default, won’t give you parallelism.

Python does include another native way to run a workload across multiple CPUs. The multiprocessing module spins up multiple copies of the Python interpreter, each in its own process that the operating system can schedule on a separate core, and provides primitives for splitting tasks across those copies. But sometimes even the multiprocessing module isn’t enough.
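As a minimal sketch of that approach (the square function here is just a stand-in for real CPU-bound work), multiprocessing.Pool fans calls out across every available core:

```python
from multiprocessing import Pool

def square(n):
    # Stand-in for a CPU-bound computation
    return n * n

if __name__ == "__main__":
    # Pool() defaults to one worker process per CPU core
    with Pool() as pool:
        results = pool.map(square, range(100))
    print(results[:5])  # [0, 1, 4, 9, 16]
```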

In some cases, the job calls for distributing work not only across multiple cores but also across multiple machines. That’s where the Python libraries and frameworks discussed in this article come in. We’ll look at seven frameworks you can use to spread an existing Python application and its workload across multiple cores, multiple machines, or both.

The best parallel processing libraries for Python

Ray: Parallelizes and distributes AI and machine learning workloads across CPUs, machines, and GPUs.

Dask: Parallelizes Python data science libraries such as NumPy, Pandas, and Scikit-learn.

Dispy: Executes computations in parallel across multiple processors or machines.

Pandaral·lel: Parallelizes Pandas across multiple CPUs.

Ipyparallel: Enables interactive parallel computing with IPython, Jupyter Notebook, and Jupyter Lab.

Joblib: Executes computations in parallel, with optimizations for NumPy and transparent disk caching of functions and output values.

Parsl: Supports parallel execution across multiple cores and machines, along with chaining functions together into multi-step workflows.

Ray

Developed by a team of researchers at the University of California, Berkeley, Ray underpins a variety of distributed machine learning libraries. But Ray isn’t limited to machine learning tasks alone, even if that was its original intended use. You can break up and distribute any type of Python task across multiple systems with Ray.

Ray’s syntax is minimal, so you don’t need to rework existing applications extensively to parallelize them. Applying the @ray.remote decorator to a function distributes it across any available nodes in a Ray cluster, with the option to specify parameters for how many CPUs or GPUs to use. The results of each distributed function call are returned as Python objects, so they’re easy to manage and store, and the amount of copying across or within nodes is minimal. This last feature comes in handy when dealing with NumPy arrays, for instance.
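Here’s a minimal sketch of that pattern, assuming a local Ray installation (square is a placeholder for real work):

```python
import ray

ray.init()  # starts or connects to a local Ray instance

@ray.remote
def square(n):
    # Placeholder for a real distributed task
    return n * n

# Each .remote() call returns an object reference immediately;
# ray.get() blocks until the distributed results are ready.
futures = [square.remote(i) for i in range(100)]
print(ray.get(futures)[:5])  # [0, 1, 4, 9, 16]
```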

Ray even includes a built-in cluster manager, which can automatically spin up nodes as needed on local hardware or popular cloud computing platforms. Other Ray libraries let you scale common machine learning and data science workloads, so you don’t have to manually scaffold them. For instance, Ray Tune lets you perform hyperparameter tuning at scale for most common machine learning systems (PyTorch and TensorFlow, among others). And if all you want to do is scale your use of Python’s multiprocessing module, Ray can do that too.

Dask

From the outside, Dask looks a lot like Ray. It, too, is a library for distributed parallel computing in Python, with a built-in task scheduling system, awareness of Python data frameworks like NumPy, and the ability to scale from one machine to many.

One key difference between Dask and Ray is the scheduling mechanism. Dask uses a centralized scheduler that handles all tasks for a cluster. Ray is decentralized, meaning each machine runs its own scheduler, so any issues with a scheduled task are handled at the level of the individual machine, not the whole cluster. Dask’s task framework works hand-in-hand with Python’s native concurrent.futures interfaces, so for those who’ve used that library, most of the metaphors for how jobs work should be familiar.

Dask works in two basic ways. The first is by using parallelized data structures—essentially, Dask’s own versions of NumPy arrays, lists, or Pandas DataFrames. Swap in the Dask versions of those constructs for their defaults, and Dask will automatically spread their execution across your cluster. This typically involves little more than changing the name of an import, though it may sometimes require rewriting to work completely.
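For instance, a Dask array is a near drop-in replacement for a NumPy array; a sketch, assuming Dask is installed:

```python
import dask.array as da

# A 10,000 x 10,000 array split into 1,000 x 1,000 chunks,
# each of which can be processed by a different worker
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())  # nothing executes until .compute()
```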

The second way is through Dask’s low-level parallelization mechanisms, including function decorators that parcel out jobs across nodes and return the results synchronously (in “immediate” mode) or asynchronously (“lazy” mode). You can also mix the modes as needed.
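A sketch of the “lazy” mode using the dask.delayed decorator (inc and add are stand-ins for real tasks):

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

# Calling the decorated functions only builds a task graph;
# .compute() triggers the actual (potentially parallel) execution.
total = add(inc(1), inc(2))
print(total.compute())  # 5
```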

One Pythonic convenience Dask offers is a memory structure called a bag—essentially a distributed Python list. Bags provide distributed operations (like map and filter) on collections of Python objects, with whatever optimizations can be provided for them. The downside is that any operation that requires a lot of cross-communication between nodes (for example, groupby) won’t work as well.
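A sketch of a bag in action, assuming Dask is installed:

```python
import dask.bag as db

b = db.from_sequence(range(1_000), npartitions=4)
# filter and map run partition-by-partition, with little
# cross-node communication required
total = b.filter(lambda n: n % 2 == 0).map(lambda n: n * n).sum().compute()
print(total)
```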

Dask also offers a feature called actors. An actor is an object that points to a job on another Dask node. This way, a job that requires a lot of local state can run in-place and be called remotely by other nodes, so the state for the job doesn’t have to be replicated. Dask’s actor model supports more sophisticated job distribution than Ray can manage. However, Dask’s scheduler isn’t aware of what actors do, so if an actor runs wild or hangs, the scheduler can’t intercede. “High-performing but not resilient” is how the documentation puts it, so actors should be used with care.

Dispy

Dispy lets you distribute whole Python programs or just individual functions across a cluster of machines for parallel execution. It uses platform-native mechanisms for network communication to keep things fast and efficient, so Linux, macOS, and Windows machines work equally well. That makes it a more generic solution than others discussed here, so it’s worth a look if you need something that isn’t specifically about accelerating machine-learning tasks or a particular data-processing framework.

Dispy syntax somewhat resembles Python’s multiprocessing module in that you explicitly create a cluster (where multiprocessing would have you create a process pool), submit work to the cluster, then retrieve the results. Modifying jobs to work with Dispy may require a little more work, but you gain precise control over how those jobs are dispatched and returned. For instance, you can return provisional or partially completed results, transfer files as part of the job distribution process, and use SSL encryption when transferring data.
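A sketch following the pattern from Dispy’s documentation (it assumes the dispynode service is running on the cluster’s machines; compute is a stand-in for real work):

```python
import dispy

def compute(n):
    # Runs on a remote node, so imports go inside the function
    import time
    time.sleep(n)
    return n * n

if __name__ == "__main__":
    cluster = dispy.JobCluster(compute)  # discovers nodes on the network
    jobs = []
    for i in range(8):
        job = cluster.submit(i)
        job.id = i  # user-assigned tag for bookkeeping
        jobs.append(job)
    for job in jobs:
        print(job.id, job())  # calling the job blocks until its result arrives
    cluster.close()
```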

Pandaral·lel

Pandaral·lel, as the name implies, is a way to parallelize Pandas jobs across multiple cores. The downside is that Pandaral·lel works only with Pandas. But if Pandas is what you’re using, and all you need is a way to accelerate Pandas jobs across multiple cores on a single computer, Pandaral·lel is laser-focused on the task.
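Usage is deliberately minimal; a sketch, assuming Pandaral·lel is installed:

```python
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=False)  # spawns one worker per core

df = pd.DataFrame({"a": range(1_000_000)})
# parallel_apply is the multi-core, drop-in version of apply
df["b"] = df["a"].parallel_apply(lambda x: x * x)
```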

Note that while Pandaral·lel does run on Windows, it will run only from Python sessions launched in the Windows Subsystem for Linux. Linux and macOS users can run Pandaral·lel as-is. Also note that Pandaral·lel currently does not have a maintainer; its last formal release was in May 2023.

Ipyparallel

Ipyparallel is another tightly focused multiprocessing and task-distribution system, specifically for parallelizing the execution of Jupyter Notebook code across a cluster. Projects and teams already working in Jupyter can start using Ipyparallel immediately.

Ipyparallel supports many approaches to parallelizing code. On the simple end, there’s map, which applies any function to a sequence and splits the work evenly across available nodes. For more complex work, you can decorate specific functions to always run remotely or in parallel.
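A sketch of the map approach, assuming a local cluster of engines has already been started (for example, with ipcluster start -n 4):

```python
import ipyparallel as ipp

rc = ipp.Client()  # connects to the running cluster
view = rc[:]       # a view over all available engines

# map_sync splits the sequence across engines and blocks for the results
results = view.map_sync(lambda x: x * x, range(100))
print(results[:5])  # [0, 1, 4, 9, 16]
```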

Jupyter notebooks support “magic commands” for actions that are only possible in a notebook environment. Ipyparallel adds a few magic commands of its own. For example, you can prefix any Python statement with %px to automatically parallelize it.

Joblib

Joblib has two major goals: run jobs in parallel, and don’t recompute results if nothing has changed. These efficiencies make Joblib well-suited for scientific computing, where reproducible results are sacrosanct. Joblib’s documentation provides plenty of examples for how to use all its features.

Joblib’s syntax for parallelizing work is simple enough—it amounts to a pair of constructs: one that splits jobs across processors, and a decorator that caches results. Parallel jobs can use threads or processes.
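A sketch of the parallel side (square stands in for real work):

```python
from joblib import Parallel, delayed

def square(n):
    # Stand-in for a real computation
    return n * n

# delayed() captures each call; Parallel fans them out across four workers
results = Parallel(n_jobs=4)(delayed(square)(i) for i in range(100))
print(results[:5])  # [0, 1, 4, 9, 16]
```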

Joblib includes a transparent disk cache for Python objects created by compute jobs. This cache not only helps Joblib avoid repeating work, as noted above, but can also be used to suspend and resume long-running jobs, or pick up where a job left off after a crash. The cache is also intelligently optimized for large objects like NumPy arrays. Regions of data can be shared in-memory between processes on the same system by using numpy.memmap. This all makes Joblib highly useful for work that may take a long time to complete, since you can avoid redoing existing work and pause and resume as needed.
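A sketch of the cache in action (expensive is a hypothetical long-running job):

```python
from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)  # on-disk cache directory

@memory.cache
def expensive(n):
    print("computing...")  # only prints on a cache miss
    return n * n

expensive(10)  # computes and writes the result to disk
expensive(10)  # returns the cached result without recomputing
```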

One thing Joblib does not offer is a way to distribute jobs across multiple separate computers. In theory, it’s possible to use Joblib’s pipeline to do this, but it’s probably easier to use another framework that supports it natively.

Parsl

Short for “Parallel Scripting Library,” Parsl lets you take computing jobs and split them across multiple systems using roughly the same syntax as Python’s existing Pool objects. It also lets you stitch together different computing tasks into multi-step workflows, which can run in parallel, in sequence, or via map/reduce operations.

Parsl lets you execute native Python applications, but also run any other external application by way of commands to the shell. Your Python code is written like normal Python code, save for a special function decorator that marks the entry point to your work. The job-submission system also gives you fine-grained control over how things run on the targets—for example, the number of cores per worker, how much memory per worker, CPU affinity controls, how often to poll for timeouts, and so on.
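A sketch using the @python_app decorator and Parsl’s simplest built-in configuration, which runs on local threads (square is a placeholder):

```python
import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)  # swap in a cluster config to target remote resources

@python_app
def square(n):
    # Placeholder for a real workload
    return n * n

# Apps return futures; .result() blocks until each one completes
futures = [square(i) for i in range(100)]
print([f.result() for f in futures][:5])  # [0, 1, 4, 9, 16]
```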

One excellent feature Parsl offers is a set of prebuilt templates to dispatch work to a variety of high-end computing resources. This includes not only staples like AWS or Kubernetes clusters, but also supercomputing resources (assuming you have access) like Blue Waters, ASPIRE 1, Frontera, and so on. (Parsl was co-developed with the aid of many of the institutions that built such hardware.)

Conclusions

If you’re working with existing, popular machine learning libraries you want to run in a distributed way, Ray is your first-round draft choice. For the same kind of work, but with centralized scheduling (for instance, to have top-down control over the submitted jobs), use Dask. If you want to work specifically with Pandas, use Pandaral·lel; for parallel distribution of work in Jupyter notebooks, use Ipyparallel. And, for more generic Python workloads, Dispy, Joblib, and Parsl can provide parallel task distribution without too many extras.

Python’s limitations with threads will continue to evolve, with major changes slated to allow threads to run side by side for CPU-bound work. But those updates are likely years away from being production-ready. Libraries designed for parallelism can help fill the gap in the meantime.
https://www.infoworld.com/article/2257768/the-best-python-libraries-for-parallel-processing.html
