Apache Arrow Integration in mssql-python: Accelerating Data Loading and Interoperability
Fetching large datasets from SQL Server into DataFrame libraries like Polars has been a performance bottleneck due to row-by-row Python object creation. The latest update to mssql-python changes that by adding native support for Apache Arrow structures. This feature, contributed by community developer Felix Graßl, enables direct, zero-copy data transfer from SQL Server to any Arrow-compatible tool. Below, we answer common questions about this integration, its benefits, and how to get started.
What is Apache Arrow and why is it important for database drivers?
Apache Arrow is an open-source project that defines a standardized columnar in-memory format and a cross-language ABI (Application Binary Interface) called the Arrow C Data Interface. The core idea is zero-copy interoperability: different programming languages can share and manipulate the same data buffers without serialization or copying. For a database driver like mssql-python, this means the database's C++ layer can write query results directly into Arrow buffers, avoiding the creation of one Python object per row and eliminating garbage-collector overhead. A consuming Python library (e.g., Polars, or Pandas with ArrowDtype) then receives a pointer to that memory and can start processing immediately. This is a fundamental shift from traditional row-based drivers that produce millions of temporary objects, and it makes Arrow an ideal foundation for high-throughput data pipelines.

How does mssql-python now support Apache Arrow?
Starting with its latest release, mssql-python includes a new fetch path that delivers query results as Apache Arrow structures. Instead of the classic cursor fetch loop that converts each row's columns into Python objects, the driver uses the Arrow C Data Interface to fill contiguous, typed buffers directly in C++. Users enable this mode by setting a connection option or by calling a dedicated method that returns an Arrow table. The result is that a million-row fetch from SQL Server arrives as a single Arrow table without any intermediate Python row objects, drastically reducing memory and CPU overhead.
What are the concrete benefits of using Arrow with mssql-python?
Four major improvements come with Arrow support:
- Speed: the columnar path avoids per-row Python object creation, which especially benefits temporal types like DATETIME and DATETIMEOFFSET, where Python-side conversions are eliminated.
- Lower memory usage: a million integers occupy a single contiguous C array instead of a million Python int objects, each with its own allocation overhead.
- Seamless interoperability: data can be consumed directly by Polars, Pandas (via ArrowDtype), DuckDB, Hugging Face Datasets, and any other library supporting Arrow's standard interface, with no conversion steps.
- Reduced garbage collection: because millions of temporary Python objects are no longer created, the garbage collector runs far less often, improving overall pipeline stability.
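The memory claim in the list above can be checked with nothing but the standard library. This sketch compares a million 8-byte integers stored in one contiguous buffer (the columnar layout Arrow uses) against the same values as individual Python int objects (the row-by-row layout):

```python
import sys
from array import array

N = 1_000_000

# Columnar layout: one contiguous buffer of 8-byte signed integers,
# analogous to an Arrow int64 array's data buffer.
columnar = array("q", range(N))
columnar_bytes = len(columnar) * columnar.itemsize  # 8 MB flat

# Row-oriented layout: one heap-allocated Python int object per value,
# plus the list of pointers that holds them.
row_objects = list(range(N))
object_bytes = sys.getsizeof(row_objects) + sum(
    sys.getsizeof(v) for v in row_objects
)

print(f"contiguous buffer: {columnar_bytes / 1e6:.1f} MB")
print(f"Python objects:    {object_bytes / 1e6:.1f} MB")
```

On CPython the object-per-value layout is several times larger than the flat buffer, before even counting allocator fragmentation or garbage-collector tracking.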
How does Arrow achieve zero-copy language interoperability?
Arrow’s secret lies in the Arrow C Data Interface—a cross-language ABI specification. This defines a stable, binary-level contract for how columnar data is laid out in memory, including array buffers, offsets, and null bitmaps. Any language can produce an Arrow structure by allocating memory according to this ABI and passing a simple C pointer (a struct ArrowArray). Another library, even written in a different language, can consume that pointer and interpret the bytes without copying or parsing. For example, a C++ database driver and a Python DataFrame library can be compiled separately but exchange data by pointing to the same allocated memory. This is fundamentally different from serialization formats like JSON or Protocol Buffers, which require encoding and decoding steps. Zero-copy makes Arrow extremely fast for data transfer between system components.
How does Arrow handle null values?
Arrow represents nulls using a validity bitmap: a compact bit array where each bit indicates whether a value is present (1) or null (0). This bitmap is stored separately, alongside the actual data buffer. For a column of 1 million integers, the bitmap needs only 125 KB (1 million bits). This is far more memory-efficient than the typical Python approach of storing a None reference for each null cell, which carries per-object overhead. The bitmap can be read and processed in bulk with SIMD operations, making null checking fast. In mssql-python, when a SQL Server column contains NULLs, the Arrow fetch path writes the null bitmap directly into the Arrow buffer, so downstream libraries like Polars or Pandas can handle missing values without any per-row work.
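A validity bitmap is simple enough to build by hand. The sketch below lays one out the way Arrow does (LSB-first, bit i set when value i is present), and shows why 1 million values need only (1,000,000 + 7) // 8 = 125,000 bytes:

```python
# Encode a column with nulls as a data buffer plus a validity bitmap.
values = [10, None, 30, None, 50]

data = [v if v is not None else 0 for v in values]  # nulls get a placeholder
bitmap = bytearray((len(values) + 7) // 8)          # 1 bit per value
for i, v in enumerate(values):
    if v is not None:
        bitmap[i // 8] |= 1 << (i % 8)              # set bit i: value present

def is_valid(i):
    """Check presence with one AND and one shift, as Arrow readers do."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))

# Reassemble the logical column from buffer + bitmap.
decoded = [data[i] if is_valid(i) else None for i in range(len(values))]
print(decoded)  # [10, None, 30, None, 50]
```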

How can I start using Arrow with mssql-python in my projects?
Using the new Arrow feature requires mssql-python version 1.2.0 or later (check the release notes for the exact version). To fetch results as Arrow, you typically create a connection and a cursor, execute a query, and then call a method like fetch_arrow_table() or set a connection property that enables the Arrow data path. The returned object is a PyArrow Table, which can be passed directly to Polars (pl.from_arrow()), Pandas (pa_table.to_pandas(), zero-copy when using ArrowDtype), or DuckDB (duckdb.from_arrow()). For example: table = cursor.execute('SELECT * FROM my_table').fetch_arrow_table(). The driver handles all the C-level memory management, so you work with the Arrow table as you would any other Arrow data.
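Putting the pieces together, an end-to-end fetch might look like the sketch below. This is illustrative only: the connection-string values and table name are placeholders, and the method name fetch_arrow_table() is the one given above; verify the exact API against your installed mssql-python release before relying on it.

```python
# Sketch: fetch SQL Server results as an Arrow table and hand them to Polars.
# Method/option names assumed from the text above; check your driver version.
import mssql_python
import polars as pl

conn = mssql_python.connect(
    "Server=myserver;Database=mydb;Trusted_Connection=yes;"  # placeholder DSN
)
cursor = conn.cursor()
cursor.execute("SELECT id, amount, created_at FROM my_table")

arrow_table = cursor.fetch_arrow_table()  # a pyarrow.Table, no Python rows

df = pl.from_arrow(arrow_table)           # zero-copy handoff into Polars
print(df.head())

conn.close()
```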
What performance improvements can I expect for temporal types like DATETIME?
Temporal types in SQL Server—such as DATETIME, DATETIME2, and DATETIMEOFFSET—are particularly costly in traditional row-by-row fetching. Python must convert each timestamp value from the internal TDS format into a Python datetime object, which involves allocations and timezone handling. With Arrow, the driver can decode these values directly into Arrow’s native timestamp buffers (e.g., timestamp[ns]) without creating Python objects. When the results are consumed by a library like Polars, which natively understands Arrow timestamps, no conversion is needed at all. Benchmarks from early adopters show that fetching a million rows with DATETIME columns can be 2–3× faster than the old path, with memory usage reduced by more than 90% because no Python datetime objects are created.
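The memory side of the temporal-type claim can be sanity-checked with the standard library. This sketch contrasts the row path (one Python datetime object per value) with the columnar representation an Arrow timestamp buffer uses, assuming microsecond resolution stored as 8-byte integers:

```python
import sys
from array import array
from datetime import datetime, timedelta, timezone

N = 100_000
start = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Row path: one heap-allocated Python datetime object per value.
py_datetimes = [start + timedelta(seconds=i) for i in range(N)]
object_bytes = sum(sys.getsizeof(d) for d in py_datetimes)

# Columnar path: epoch microseconds as 8-byte integers in one flat buffer,
# which is what an Arrow timestamp[us] data buffer holds.
epoch_us = array("q", (int(d.timestamp()) * 1_000_000 for d in py_datetimes))
buffer_bytes = len(epoch_us) * epoch_us.itemsize

print(f"datetime objects: {object_bytes / 1e6:.1f} MB")
print(f"timestamp buffer: {buffer_bytes / 1e6:.1f} MB")
```

The flat buffer also skips the per-value construction cost, which is where the fetch-time speedup for DATETIME columns comes from.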
Is the Arrow support backward compatible with existing mssql-python code?
Yes, the Arrow support is additive and does not break existing code. If you do not enable the Arrow fetch path, mssql-python continues to behave exactly as before, returning Python objects row by row. The new feature is opt-in—for example, by calling fetch_arrow_table() or setting a configuration parameter. This means you can gradually adopt Arrow in specific performance-critical queries while leaving legacy code unchanged. The underlying SQL Server communication protocol (TDS) remains the same; Arrow is just an alternative representation for the result data after it is received from the server. All existing cursor methods like fetchone(), fetchall(), and fetchmany() continue to work as before.