Unlocking Engine Interoperability: Writing and Querying Apache Iceberg Tables in Snowflake Open Catalog with PyIceberg and Daft
Apache Polaris is currently undergoing Incubation at the Apache Software Foundation
1. Overview
In today’s rapidly evolving data ecosystem, the ability to seamlessly leverage multiple engines and frameworks on a unified platform is essential for achieving scalability, flexibility, and performance. Snowflake’s Open Catalog addresses this need by providing a robust cataloging solution for Apache Iceberg™ tables. Built on the open-source Apache Polaris™ (incubating) implementation and Apache Iceberg™ REST protocol, it enables efficient metadata management and cross-platform interoperability, streamlining data operations across diverse tools and technologies.
With this in mind, a natural question arises: can you read from and write to the same Iceberg table with different engines when the table is managed by Snowflake Open Catalog? This blog explores how to write Iceberg tables into Snowflake Open Catalog using PyIceberg, and then use a different engine, Daft, to read the table, perform some data operations, and write the results back to the same table.
2. What is Snowflake Open Catalog?
Snowflake Open Catalog is a managed service for Apache Polaris™ (incubating) that enables efficient integration and secure management of open lakehouses. It provides a seamless way to integrate various processing engines like Apache Doris, Apache Flink, Apache Spark, Daft, Presto, Trino, StarRocks, Snowflake, and more. And with PyIceberg, Apache Iceberg’s Python implementation for accessing Iceberg tables, you can read and write Iceberg tables with Snowflake Open Catalog using a variety of other Python libraries and frameworks like Pandas, Polars, and DuckDB.
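As a minimal sketch of what that looks like (reusing the catalog connection and the taxi.taxi_dataset table built later in this post, with placeholder connection details), PyIceberg’s scan API can hand the data straight to Pandas or DuckDB:
from pyiceberg.catalog import load_catalog

# Placeholder connection details; see the full walkthrough below
catalog = load_catalog("demo_catalog", **{
    "uri": "https://xxxxx.snowflakecomputing.com/polaris/api/catalog",
    "warehouse": "demo_catalog",
    "credential": "<CREDENTIAL_DETAILS>",
    "scope": "PRINCIPAL_ROLE:ALL",
})
table = catalog.load_table("taxi.taxi_dataset")

pandas_df = table.scan().to_pandas()        # materialize the table as a Pandas DataFrame
con = table.scan().to_duckdb("taxi")        # register the scan as a DuckDB table named "taxi"
con.execute("SELECT COUNT(*) FROM taxi").fetchall()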
Snowflake Open Catalog allows teams within your organization to collaborate on data lakes with consistent role-based access controls across multiple engines, both for reading and writing. In summary, key features are:
- Discoverability: a single place to discover Iceberg tables across multiple sources
- Interoperability: Leveraging Apache Polaris™ (incubating), Snowflake Open Catalog offers flexibility as your organization scales, making it easier to integrate new engines and apply consistent governance controls as your needs evolve.
3. What is Daft?
Daft is a fast, flexible DataFrame library designed for high-performance data processing and manipulation within Python. It is optimized for modern hardware (e.g., multi-core processors) and can operate on large datasets, supporting distributed execution to scale with data. The Daft library is built to work efficiently with Apache Arrow (a cross-language development platform for in-memory data) and integrates with popular frameworks like PyIceberg, Dask and Ray.
Key features of Daft include:
- Speed: Designed to deliver fast data operations, particularly on large datasets.
- Python-native API: Provides an easy-to-use, Pythonic API for manipulating DataFrames, similar to pandas but optimized for scalability. No JVM dependency hell and no Spark involved at all.
- Distributed Execution: Supports parallel and distributed processing, making it ideal for big data workflows.
- Integration with Apache Arrow: Leverages Arrow for high-performance columnar memory, allowing efficient execution of analytics tasks.
Daft aims to make data processing faster and more accessible for Python developers, and its ability to interface with Snowflake’s Open Catalog makes it a powerful tool for modern data workflows.
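As a quick illustration of the Pythonic API, here is a tiny, self-contained sketch (the toy data and column names are invented just for this example):
import daft

# A toy DataFrame; in practice you would read from Parquet, Iceberg, etc.
df = daft.from_pydict({"city": ["NYC", "NYC", "SF"], "fare": [12.5, 8.0, 20.0]})

# Lazy, expression-based transformations that execute when results are requested
result = (
    df.where(daft.col("fare") > 10)
      .groupby("city")
      .agg(daft.col("fare").mean().alias("avg_fare"))
)
result.show()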
For more details, check out Daft’s official site.
4. How Do We Create Tables in Snowflake Open Catalog Using PyIceberg?
I won’t go into too much detail on how to set up Open Catalog, as there are comprehensive guides available on this topic.
A really good, comprehensive guide is Scott Teal’s walkthrough on integrating PyIceberg with Open Catalog.
To give an overview, in this step we will be leveraging PyIceberg to create tables in Snowflake Open Catalog. Snowflake offers seamless integration with PyIceberg via the Open Catalog interface, enabling you to easily create and manage tables directly within the catalog. We will be:
- Follow Scott’s guide on setting up Open Catalog
- Use PyIceberg to create a table in Open Catalog from a dataset. In this guide I use the well-known yellow_tripdata_2023-01.parquet file, which contains about 3 million records, but you can use any data you prefer.
Steps to Create Tables Using PyIceberg in Snowflake Open Catalog:
a) Read Parquet data with PyIceberg and write into Snowflake Open Catalog
from pyiceberg.catalog import load_catalog
import pyarrow.parquet as pq

# Read the Parquet file into an Arrow table
df = pq.read_table("yellow_tripdata_2023-01.parquet")

# Connect to Snowflake Open Catalog via its Iceberg REST endpoint
catalog = load_catalog("demo_catalog", **{
    "uri": "https://xxxxx.snowflakecomputing.com/polaris/api/catalog",
    "warehouse": "demo_catalog",
    "credential": "<CREDENTIAL_DETAILS>",
    "scope": "PRINCIPAL_ROLE:ALL",
    "client.region": "us-west-2",
})

# Create a namespace and a table that matches the Parquet schema
catalog.create_namespace("taxi")
table = catalog.create_table(
    "taxi.taxi_dataset",
    schema=df.schema,
)

# Append the data and confirm the row count
table.append(df)
len(table.scan().to_arrow())
** Do adjust client.region to the region of the bucket where your files reside. In this example, the files are persisted in an AWS bucket in us-west-2.
You can then verify the records by querying the catalog.
# List the tables in the namespace, then load the table and inspect its metadata
catalog.list_tables("taxi")
table_query = catalog.load_table("taxi.taxi_dataset")
table_query
b) Verify in Open Catalog: You can also verify the table creation by accessing the Snowflake Open Catalog under the designated schema. This allows you to inspect the table, review its metadata, and query it using other engines such as Daft or PyIceberg.
5. Reading from Open Catalog Using Daft
Once you have data available in Snowflake Open Catalog, you can easily access it with Daft to perform fast DataFrame operations in Python. Daft provides a straightforward method to connect to Snowflake, read data, and interact with it efficiently.
Steps to Read Data from Open Catalog Using Daft:
a) Install Daft: First, ensure Daft is installed. You can do this using pip:
pip install getdaft
b) Connect to Snowflake Open Catalog: Configure the connection with the same <CREDENTIAL_DETAILS> used earlier so that Daft can access the data stored in Open Catalog:
catalog = load_catalog("demo_catalog", **{
    "uri": "https://xxxxx.snowflakecomputing.com/polaris/api/catalog",
    "warehouse": "demo_catalog",
    "credential": "<CREDENTIAL_DETAILS>",
    "scope": "PRINCIPAL_ROLE:ALL",
    "client.region": "us-west-2",
})
c) Load Data into Daft: Use Daft’s API to read data directly from Snowflake Open Catalog. Typically, you will create a Daft DataFrame from a Snowflake table:
import daft

# Load the Iceberg table from Open Catalog into a Daft DataFrame
table = catalog.load_table("taxi.taxi_dataset")
df = daft.read_iceberg(table)
df.show()
d) Perform Operations on Data: Once the data is loaded into a Daft DataFrame, you can perform various operations like filtering, aggregating, and transforming the data efficiently. Daft is optimized for speed, so operations on large datasets should be fast and scalable. For example, we can perform some operations on a dataframe such as extracting the year from a column.
df_year = df.with_column("tpep_pickup_datetime_year", df["tpep_pickup_datetime"].dt.year())
df_year.show()
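Beyond a simple column transformation, you can also filter and aggregate. Here is a minimal sketch, assuming the standard yellow-taxi columns passenger_count and total_amount are present in your dataset:
# passenger_count and total_amount are assumed columns from the standard yellow_tripdata schema
df_agg = (
    df.where(daft.col("passenger_count") > 0)
      .groupby("VendorID")
      .agg(daft.col("total_amount").mean().alias("avg_total_amount"))
)
df_agg.show()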
For this walkthrough, we will pick a single record from the existing dataset and increment its VendorID column.
# Take one row and increment its VendorID
df_write = daft.sql("select * from df limit 1")
df_write = df_write.with_column("VendorID", daft.col("VendorID") + 1)
df_write.show()
e) Write Results Back to Open Catalog: After processing the data, you can write the resulting DataFrame back to the same table in Open Catalog. The code below converts the DataFrame back to Arrow first; in the future, Daft plans to support writing to Open Catalog / Apache Polaris directly.
# Convert the Daft DataFrame to Arrow and append it to the Iceberg table
df_write_pyarrow = df_write.to_arrow()
table.append(df=df_write_pyarrow)
You can count the records afterwards to confirm the appended row, as sketched below. Pretty neat.
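A minimal sketch of that check, assuming the same session objects from above (either engine works):
# Via PyIceberg: total rows after the append
len(table.scan().to_arrow())

# Via Daft: re-read the table and count rows
daft.read_iceberg(table).count_rows()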
And there we have it. This integration allows you to leverage dual engines while taking advantage of Snowflake Open Catalog for table operations, management, and governance — all stored in the highly interoperable Iceberg format.
6. Conclusion
Snowflake Open Catalog and Apache Polaris™ (incubating) provide a simple yet powerful foundation for managing and discovering data across multiple engines and frameworks.
By giving teams the choice to integrate various compute engines like Daft (for reading and processing) and others (with both SQL and Python interfaces), users can efficiently manage and process large datasets, creating a scalable and collaborative data workflow that bridges the gap between storage, computation, and analysis.