severo/awesome-parquet
Awesome Parquet

Parquet Logo

Useful resources for using the Parquet format

Contents

Libraries

C GLib

  • Arrow GLib - A wrapper library for Arrow C++.
  • DuckDB - An in-process database library that supports reading and writing Parquet files.

C++

  • Apache Arrow C++ - A library with support for reading and writing Parquet files.
  • DuckDB C++ API - Internal DuckDB C++ API.
  • libcudf - A GPU-accelerated DataFrame library for tabular data processing.

Dart

Go

  • duckdb-go - DuckDB Go client.
  • parquet - Official Go implementation of Apache Parquet, part of the Apache Arrow project.
  • parsyl/parquet - A Go library for reading and writing Parquet files.

Java

  • cudf - Java bindings for cudf, enabling processing of large amounts of data on a GPU.
  • duckdb-java - DuckDB Java/JDBC API.
  • parquet-carpet - A Java library for serializing and deserializing Parquet files efficiently using Java records.
  • parquet-java - A Java implementation of the Parquet format, owned by the Apache Software Foundation.

JavaScript

  • duckdb-wasm - WebAssembly version of DuckDB.
  • duckdb-node-neo - DuckDB Node.js client.
  • hyparquet - A lightweight, dependency-free, pure JavaScript library for parsing Apache Parquet files.
  • parquet-wasm - WebAssembly bindings to read and write the Apache Parquet format to and from Apache Arrow using the Rust parquet and arrow crates.

Julia

  • DuckDB - Official DuckDB Julia package.
  • Parquet.jl - A Julia reader for the Parquet columnar file format.

.NET

PHP

Python

  • duckdb-python - DuckDB Python client.
  • pyarrow - A Python API for functionality provided by the Arrow C++ libraries, along with tools for Arrow integration and interoperability with Pandas, NumPy, and other software in the Python ecosystem.
  • pylibcudf - A lightweight Cython interface to libcudf that provides near-zero overhead for GPU-accelerated data processing in Python.
  • fastparquet - A Python implementation of the Parquet columnar file format.

R

  • arrow - The arrow package provides an Arrow C++ backend to dplyr, and access to the Arrow C++ library through familiar base R and tidyverse functions, or R6 classes.
  • duckdb-r - DuckDB R package.
  • nanoparquet - A reader and writer for a common subset of Parquet files.

Ruby

  • Red Parquet - The Ruby bindings of Apache Parquet, based on GObject Introspection.

Rust

  • datafusion - An extensible query engine written in Rust that can read/write Parquet files using SQL or a DataFrame API.
  • duckdb-rs - DuckDB Rust client.
  • parquet - The official Native Rust implementation of Apache Parquet, part of the Apache Arrow project.
  • Polars - A DataFrame interface on top of an OLAP Query Engine that supports reading and writing Parquet files, with bindings for Python.

Swift

Tools

Command-line

  • DataFusion CLI - A single, dependency-free executable that can read and write Parquet files, with a SQL interface.
  • DuckDB CLI - A single, dependency-free executable that can read and write Parquet files, with a SQL interface.
  • parquet-tools - Python-based CLI tool for exploring Parquet files (part of Apache Arrow).
  • parquet-cli - Java-based CLI tool for exploring Parquet files.
  • parquet-cli-standalone - A JAR file for the parquet-cli tool which can be run without any dependencies.
  • Spark - A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
  • Tabiew - A lightweight TUI application to view and query tabular data files, such as CSV, TSV, and Parquet.

Desktop applications

  • Pink Parquet - A free and open-source, user-friendly viewer for Parquet files for Windows.
  • Tad - An application for viewing and analyzing tabular data sets.

Plugins

  • nf-parquet - A Nextflow plugin able to read and write parquet files.

Web

  • ChatDB - Online tools for viewing and converting from and to Parquet files.
  • DataConverter.io - Online tools for viewing, converting, and transforming Parquet files.
  • Datasette - A tool to explore datasets, with support for reading Parquet files.
  • Onyxia Data Explorer - A web-based tool to explore Parquet files in the browser.
  • Parquet File Visualizer - A Claude Code-generated Parquet metadata visualizer that runs in your browser.
  • Parquet Viewer - View parquet files online.
  • Quak - A scalable data profiler for quickly scanning large tables.

Resources

Blogs

Documentation

  • Parquet - The specification for Apache Parquet and Apache Thrift definitions to read and write Parquet metadata.
  • Apache Parquet Documentation - The official documentation for Apache Parquet.

Educational resources

  • ssphub - A workshop by Insee (the French national statistics institute) illustrating the use of French 🇫🇷 census data distributed in the Parquet format.

Tests

Related formats

  • F3 - A data file format that is designed with efficiency, interoperability, and extensibility in mind.
  • GeoParquet - Specification for storing geospatial vector data (point, line, polygon) in Parquet.
  • Iceberg - A high-performance format for huge analytic tables, that supports Parquet as one of its storage formats.
  • Lance - Modern columnar data format for ML and LLMs.
  • Nimble - File format for storage of large columnar datasets.
  • ORC - Self-describing type-aware columnar file format designed for Hadoop workloads.
  • Vortex - A columnar file format designed for high-performance data processing.

Contributing

Contributions welcome! Read the contribution guidelines first.
