
DerbyPy Monthly Meetup

Sullivan School of Technology and Design, 3903 Atkinson Square Drive, Louisville, KY 40218

At our June meetup, Alex Hagerman will be leading a talk entitled:

PyArrow: Columnar Anywhere.

Here is an outline of his talk:

How many times have you needed to load a flat file, but you didn’t know the delimiter or the delimiter wasn’t properly escaped? How many times have you had to give Pandas the types for 15+ columns from a file? Or think about the times you needed to read 3 columns from a 50+ column flat file and ran into memory management issues. Good news: I have an easy-to-use library to help you deal with all of this and more. I want to tell you about PyArrow, the Python implementation of the Apache Arrow project.

What is Apache Arrow, Parquet and columnar data?

• Apache Arrow is an open-source, in-memory columnar format

• Apache Parquet is an open-source columnar storage format

• Columnar and row-wise are two different approaches to data storage and
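To make the row-wise versus columnar distinction concrete, here is a minimal sketch using only plain Python containers. The sample records and field names are made up for illustration and are not taken from the talk; real columnar formats such as Arrow store each column in a contiguous typed buffer rather than a Python list.

```python
# Row-wise layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Louisville", "temp_f": 71.5},
    {"id": 2, "city": "Lexington", "temp_f": 68.0},
    {"id": 3, "city": "Frankfort", "temp_f": 75.2},
]

# Columnar layout: all values of one field are kept together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Averaging one field row-wise touches every record in full.
avg_from_rows = sum(row["temp_f"] for row in rows) / len(rows)

# Averaging the same field column-wise touches only that one column.
temps = columns["temp_f"]
avg_from_columns = sum(temps) / len(temps)
```

Both layouts hold the same data; the columnar one lets an analytical query read only the columns it needs, which is the property that Arrow and Parquet build on.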


What does PyArrow do for you?

• A combination of Apache Parquet and Apache Arrow wrapped up for easy use in a Python package.

• Smaller files and memory footprint because of better compression

• Files have a schema encoding

• Schema is enforced in memory and on disk without using a database (not to discourage database use; I love them and spent about 3 years as a SQL programmer)

• Zero serialization, deserialization, and copying across processes

How do you use it?

• pip install pyarrow OR conda install -c conda-forge pyarrow

• pyarrow in-memory objects

• using it with pandas and parquet

What else should you know or what other cool stuff can I do?

• Use it for an efficient in-memory cache

• Interchange between languages with zero cost

• Extensible

• HDFS interactions

How can you contribute?

• This is an Apache Software Foundation (ASF) project


