Talk

Datanomy: Understanding the Anatomy of Arrow and Parquet

Thursday, May 28

16:15 - 16:45
Room: Tortellini
Language: English
Audience level: Intermediate
Elevator pitch

Apache Arrow and Apache Parquet are the de-facto standards for in-memory and on-file columnar data, but their internals are not widely known. This talk will try to demystify them by presenting the formats in a simple, visual way.

Abstract

Apache Arrow and Apache Parquet are the de-facto standards for in-memory and on-file columnar data. We use them in our data workflows daily, sometimes without even noticing. This talk will demystify these formats by going over their specifications and showing some real-world examples of how the data actually looks on our systems.

For Arrow, we will briefly explore some of the included batteries for exchanging data, such as Arrow IPC, Arrow Flight, and ADBC, and how they relate to the core in-memory format.

The main libraries shown will be PyArrow, with its Arrow and Parquet implementations, and datanomy, a new tool created for visualizing data formats.

Tags: Data Engineering, Data Science & Data Visualisation
Participant

Raúl Cumplido

I started working with Python in 2008, with Python 2.5, and it has since become my language of choice. I have been involved in the Spanish Python community as one of the co-founders of the Python Spanish Association, and in the organization of EuroPython in Bilbao, several PyCon ES (Spain) editions, and the Barcelona meetup. A couple of years ago I started working on Apache Arrow, and since then I have become a committer and a PMC member. I want to share what we have done and what we are doing in the project.

Participant

Alenka Frim

My software development journey began with open source and the Apache Arrow project. In 2021, I made my first contribution to the Arrow R package, an experience that sparked my interest in software development and open-source collaboration. During my internship at Quansight, I was introduced to the Python DataFrame API standard, which deepened my understanding of interoperability challenges.

In 2022, after over a year of contributions, I became an Apache Arrow committer, primarily focusing on the Python implementation. I continued my work as a PyArrow maintainer at Voltron Data until mid-2024.

Apache Arrow remains the project I’m most passionate about, and I’m still actively involved in its development as a freelancer.