Dataframes explained: The modern in-memory data science format

Wednesday, November 6, 2024, 10:00 AM, from InfoWorld

Most people are familiar with data in the form of a spreadsheet, with labeled columns of different data types such as name, address, age, and so on. Databases work the same way, with each table laid out according to a strict schema.

Dataframes are structures used in data science that work like spreadsheet pages or database tables. Many common data science libraries use dataframes in some form: Spark, Pandas, Polars, and many more. But dataframes are far more efficient and powerful than working with databases through an SQL query or reading an Excel spreadsheet via a library. In fact, you can create dataframes from any of those data sources and more. Then you can use the imported data with far greater speed, thanks to the way dataframes work.

The basics of dataframes

Dataframes are two-dimensional data structures, with rows of data elements organized into named columns that hold specific data types. In that sense they’re closer to the way databases work, as spreadsheets are more intentionally freeform. But dataframes have many of the conveniences of both. For instance, you can access a column by its name rather than by a mere index position.

Each dataframe typically has a schema: a description of the name and perhaps also the data type for each column. The data types supported by dataframes ought to be familiar to most programmers—integers, floating-point numbers, strings, and so on. You can also store empty or null values in a dataframe, in the same way a spreadsheet can hold an empty cell or a database can have a NULL value.

Some dataframes allow you to specify types for a column to keep data consistent. You can’t put string data in a column for integers, for instance, but you can leave a column untyped if you must—again, as a convenience.
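As a rough illustration of schemas, null values, and typed columns, here is a minimal sketch using the Pandas library for Python. The column names and values are hypothetical sample data, and other dataframe libraries expose the same ideas through different calls.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing value.
df = pd.DataFrame({
    "name": ["Ada", "Grace", None],
    "age": [36, 45, 41],
})

# The schema: one data type per named column.
schema = df.dtypes

# Columns are accessed by name rather than by position.
ages = df["age"]

# Null values are tracked per cell, much like an empty spreadsheet cell.
missing_names = int(df["name"].isna().sum())

# Declare a column's type explicitly. The conversion fails
# if the data does not fit the requested type.
df["age"] = df["age"].astype("int64")
try:
    pd.Series(["thirty-six"]).astype("int64")
    conversion_failed = False
except (ValueError, TypeError):
    conversion_failed = True
```

Printing `schema` shows the type Pandas inferred for each column. The exact type names can vary between Pandas versions.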

Dataframes can be created by importing data from an existing source, or through a programmatic interface. For instance, in Python, the Pandas library lets you use Python data structures as the basis for creating dataframes:

import pandas as pd

# Each key becomes a column name; each list holds that column's values
data = {
    'Title': ['Blade Runner', '2001: A Space Odyssey', 'Alien'],
    'Year': [1982, 1968, 1979],
    'MPA Rating': ['R', 'G', 'R']
}
df = pd.DataFrame(data)
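The other route, importing from an existing source, might look like the sketch below. It assumes a throwaway in-memory SQLite database; the `films` table and its rows are made up for illustration. `pd.read_sql_query` is the Pandas function that turns a query result into a dataframe.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database (hypothetical table and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, year INTEGER, rating TEXT)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("Blade Runner", 1982, "R"), ("Alien", 1979, "R")],
)
conn.commit()

# Load the query result into a dataframe.
# Column names are taken from the query itself.
films = pd.read_sql_query("SELECT title, year FROM films ORDER BY year", conn)
conn.close()
```

Pandas offers similar readers for other sources, such as `pd.read_csv` for CSV files and `pd.read_excel` for spreadsheets (the latter needs an extra Excel engine package installed).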

Applications that use dataframes

As I previously mentioned, nearly every data science library or framework supports a dataframe-like structure of some kind. The R language is generally credited with popularizing the dataframe concept (although it existed in other forms before R). Spark, one of the first broadly popular platforms for processing data at scale, has its own dataframe system. The Pandas data library for Python, and its speed-optimized cousin Polars, both offer dataframes. And the analytics database DuckDB combines the conveniences of dataframes with the power of a full-blown database system.

It’s worth noting that a given application may support dataframe data formats specific to that application. For instance, Pandas provides data types for sparse data structures in a dataframe. By contrast, Spark does not have an explicit sparse data type, so any sparse-format data needs an additional conversion step to be used in a Spark dataframe.
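To make the Pandas side of that comparison concrete, here is a small sketch of a sparse column. The data is hypothetical. `pd.SparseDtype` and the `.sparse` accessor are part of Pandas; the final step stands in for the kind of conversion a tool without a sparse type would need.

```python
import pandas as pd

# A mostly-zero column (hypothetical data), stored with a sparse dtype
# so that only the non-zero entries are kept explicitly.
dense = pd.Series([0, 0, 0, 7, 0, 0, 0, 0, 3, 0])
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))

# Fraction of entries that are actually stored.
density = sparse.sparse.density

# Convert back to an ordinary dense column, the extra step needed
# before handing the data to a tool without a sparse type.
roundtrip = sparse.sparse.to_dense()
```

Here two of the ten values are non-zero, so `density` comes out at 0.2.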

So while some libraries with dataframes are more popular than others, there’s no one definitive version of a dataframe. They’re a concept implemented by many different applications. Each implementation of a dataframe is free to do things differently under the hood, and some dataframe implementations vary in the end-user details, too.

For instance, Spark dataframes (or DataFrames, as they’re called in Spark) are not strongly typed at compile time. Each row is treated as a generic, untyped JVM object, so a mistake in a column’s type surfaces only at runtime. Another Spark data type, the Dataset, adds compile-time typing guarantees as a way to catch such errors earlier and enable optimized processing.

Another implementation detail that varies: While dataframes can in theory use any kind of internal data layout, many of them use columnar storage. Data is laid out in a column-wise format, instead of row-wise. This speeds up processing the data, at the cost of making it slower to insert rows (assuming that’s even allowed). If you’re curious about how this works internally, this talk at PyData Seattle 2015 describes how Pandas implements dataframes.
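The trade-off between row-wise and column-wise layouts can be sketched with plain Python structures. This is a conceptual illustration only, not how any particular library stores its data; real implementations use packed, typed arrays rather than Python lists.

```python
# Row-wise layout: each record is stored together.
rows = [
    {"title": "Blade Runner", "year": 1982},
    {"title": "2001: A Space Odyssey", "year": 1968},
    {"title": "Alien", "year": 1979},
]

# Column-wise layout: each column is stored together.
columns = {
    "title": [r["title"] for r in rows],
    "year": [r["year"] for r in rows],
}

# Aggregating one column visits every record in the row layout...
earliest_from_rows = min(r["year"] for r in rows)
# ...but only a single list in the column layout.
earliest_from_columns = min(columns["year"])

# Appending a record is one operation row-wise,
# but one operation per column in the columnar layout.
new_record = {"title": "Solaris", "year": 1972}
rows.append(new_record)
for name, values in columns.items():
    values.append(new_record[name])
```

Both layouts hold the same data. The difference is which operation gets the cheap path: whole-column scans in the columnar case, whole-record inserts in the row-wise case.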

Advantages of dataframes

The features commonly expected of dataframes make them appealing for many reasons:

High-performance operations: Dataframes provide convenience methods for selecting, sorting, and transforming data at scale. Most dataframe implementations come with methods for filtering, transposing, or performing pivot-table-like operations. Most crucially, these methods don’t require the developer to iterate through the dataframe row by row, so each one can be applied as a single fast operation over the whole dataset.

In-memory processing: Dataframes tend to reside in-memory by default for speed. Some dataframe applications like Daft or Dask support dataframes larger than system memory. However, the vast majority of work in dataframes happens in-memory instead of on-disk. Likewise, with DuckDB, data can be kept on disk for persistence, but processed in-memory as with other data science libraries.

A convenient and consistent metaphor: Working with a dataframe is less abstract or cumbersome than working with a simple homogeneous data array (such as a NumPy array). Again, it’s more akin to working with a spreadsheet or a database. Once you’ve worked with a dataframe in one place, the behavior and concepts port easily to other incarnations of dataframes.
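The first item above, operations that avoid explicit iteration, can be sketched in Pandas as follows. The data reuses the film titles from the earlier example; the `Decade` column is a made-up derived value for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Blade Runner", "2001: A Space Odyssey", "Alien"],
    "Year": [1982, 1968, 1979],
    "MPA Rating": ["R", "G", "R"],
})

# Filtering: a boolean mask is computed for the whole column at once.
after_1970 = df[df["Year"] > 1970]

# Sorting by a column.
by_year = df.sort_values("Year")

# Transforming: derive a new column without writing a loop.
df["Decade"] = (df["Year"] // 10) * 10

# A pivot-table-like summary: the number of titles per rating.
per_rating = df.pivot_table(index="MPA Rating", values="Title", aggfunc="count")
```

Each of these calls describes what to do with a whole column, leaving the library to decide how to run it efficiently. Other dataframe libraries offer equivalents under different method names.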

The consistent metaphor provided by dataframes is the most crucial item on this list. As frameworks and libraries for data science proliferate, there’s a growing need for consistency and familiar footing. Dataframes bring conceptual consistency to working with data. The dataframes concept is bigger than any library that uses them. Dataframes will also likely outlast most libraries that implement them.