Useful resources for using the Parquet format
- Apache Arrow - A library with support for reading and writing Parquet files, with multiple packages for C++, Java, JavaScript, Python, R, Rust, and more.
- DuckDB - An in-process database library that supports reading and writing Parquet files, with multiple packages for C, Java, Python, R, JavaScript (WASM), and more.
- parquet - A Go library for reading and writing Parquet files.
- parquet-carpet - A Java library for serializing and deserializing Parquet files efficiently using Java records.
- parquet-java - A Java implementation of the Parquet format, owned by the Apache Software Foundation.
- hyparquet - A lightweight, dependency-free, pure JavaScript library for parsing Apache Parquet files.
- parquet-wasm - WebAssembly bindings to read and write the Apache Parquet format to and from Apache Arrow using the Rust parquet and arrow crates.
- fastparquet - A Python implementation of the Parquet columnar file format.
- petastorm - A library enabling the use of Parquet storage from TensorFlow, PyTorch, and other Python-based ML training frameworks.
- dask - A flexible parallel computing library for analytics that can efficiently load and process multiple Parquet files as a unified dataset, enabling distributed computations on datasets larger than memory.
- nanoparquet - A reader and writer for a common subset of Parquet files.
- Polars - A DataFrame interface on top of an OLAP Query Engine that supports reading and writing Parquet files, with bindings for Python.
- DuckDB CLI - A single, dependency-free executable that can read and write Parquet files, with a SQL interface.
- parquet-tools - Python-based CLI tool for exploring Parquet files, built on top of Apache Arrow.
- parquet-cli - Java-based CLI tool for exploring Parquet files.
- parquet-cli-standalone - A JAR file for the parquet-cli tool which can be run without any dependencies.
- Spark - A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
- Tabiew - A lightweight TUI application to view and query tabular data files, such as CSV, TSV, and Parquet.
- Pink Parquet - A free and open-source, user-friendly viewer for Parquet files for Windows.
- Tad - An application for viewing and analyzing tabular data sets.
- nf-parquet - A Nextflow plugin able to read and write Parquet files.
- ChatDB - Online tools for viewing and converting from and to Parquet files.
- DataConverter.io - Online tools for viewing, converting, and transforming Parquet files.
- Datasette - A tool to explore datasets, with support for reading Parquet files.
- Onyxia Data Explorer - A web-based tool to explore Parquet files in the browser.
- Quak - A scalable data profiler for quickly scanning large tables.
- icem7 - A blog about data science tools, with in-depth articles on Parquet.
- Hyparquet: The Quest for Instant Data - 6 optimization tricks to read Parquet files faster in the browser.
- Querying Parquet with Precision Using DuckDB - Describes how DuckDB optimizes queries to a Parquet file using projection & filter pushdown.
- Why Parquet Is the Go-To Format for Data Engineers - A graphical description of the Parquet format with optimization and best practices.
- Parquet - The specification for Apache Parquet and Apache Thrift definitions to read and write Parquet metadata.
- Apache Parquet Documentation - The official documentation for Apache Parquet.
- ssphub - A workshop by Insee illustrating the use of French census data 🇫🇷 published in the Parquet format.
Contributions welcome! Read the contribution guidelines first.