At our June Meetup, Alex Hagerman will be leading a talk entitled:
PyArrow: Columnar Anywhere.
Here is an outline of his talk:
How many times have you needed to load a flat file without knowing the delimiter, or found that the delimiter wasn’t properly escaped? How many times have you had to give Pandas the types for 15+ columns from a file? Or think about the times you needed to read 3 columns from a 50+ column flat file and ran into memory management issues. Good news: I have an easy-to-use library to help you deal with all of this and more. I want to tell you about PyArrow, the Python implementation of the Apache Arrow project.
What are Apache Arrow, Parquet, and columnar data?
• Apache Arrow is an open source in-memory columnar format
• Apache Parquet is an open source columnar storage format
• Columnar and row-wise are two different approaches to data storage and
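To make the row-wise versus columnar distinction concrete, here is a minimal sketch in plain Python, with no PyArrow involved. The records and field names are made up for illustration; the point is only the difference in layout.

```python
# Three hypothetical records stored row-wise: one dict per record.
rows = [
    {"id": 1, "name": "a", "temp_f": 71.5},
    {"id": 2, "name": "b", "temp_f": 68.0},
    {"id": 3, "name": "c", "temp_f": 70.25},
]

# The same data stored column-wise: one list per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row-wise: averaging one field still means visiting every whole record.
avg_from_rows = sum(row["temp_f"] for row in rows) / len(rows)

# Columnar: the field is already a single sequence of one type,
# so the other fields are never touched.
avg_from_columns = sum(columns["temp_f"]) / len(columns["temp_f"])
```

Both averages are the same number; what changes is how much unrelated data has to be read to compute it, which is why columnar layouts tend to suit analytics on a few columns of a wide table.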
What does PyArrow do for you?
• A combination of Apache Parquet and Apache Arrow wrapped up for easy use in a Python package.
• Smaller files and a smaller memory footprint because of better compression
• Files have their schema encoded in them
• Schema is enforced in memory and on disk without using a database (not to discourage database use; I love them and spent about 3 years as a SQL programmer)
• No serialization, deserialization, or copying across processes
How do you use it?
• pip install pyarrow OR conda install -c conda-forge pyarrow
• PyArrow in-memory objects
• Using it with pandas and Parquet
What else should you know, and what other cool stuff can you do?
• Use it for an efficient in-memory cache
• Interchange between languages with zero cost
• HDFS interactions
How can you contribute?
• This is an Apache Software Foundation (ASF) project