CSV is common in data science because it is human-readable, less verbose than options like JSON and XML, and easy to produce from almost any tool. However, the format is usually underspecified, and CSV files compress poorly and are slow to process. Many file formats are better suited to working with tabular data. This post looks at one of them, Apache Parquet, and shows with examples how it improves on CSV in both compression and performance.
Tuesday, March 26, 2024
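
To make the "underspecified" point concrete, here is a minimal sketch using only Python's standard `csv` module. The sample rows and column names (`id`, `price`, `note`) are hypothetical and not taken from any real dataset. It writes a few typed values to CSV and reads them back, which shows that the file itself carries no type information.

```python
import csv
import io

# Hypothetical sample rows: an integer, a float, a missing value,
# and a text value that contains a comma.
rows = [
    {"id": 1, "price": 9.5, "note": None},
    {"id": 2, "price": 12.0, "note": "red, large"},
]

# Write the rows to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "price", "note"])
writer.writeheader()
writer.writerows(rows)

# Read the same data back without any extra schema information.
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))

# Every value comes back as a string, and None has become an empty string.
for original, restored in zip(rows, round_tripped):
    print(original, "->", restored)
```

Numbers come back as strings and the missing value is indistinguishable from an empty string, so every reader has to guess or be told the schema. A format such as Parquet stores the schema alongside the data, which avoids this guessing.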