diff --git a/docs/flatgeobuf.ipynb b/docs/flatgeobuf.ipynb new file mode 100644 index 00000000..3be1e612 --- /dev/null +++ b/docs/flatgeobuf.ipynb @@ -0,0 +1,248 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "64d209be", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "source": [ + "# SedonaDB + FlatGeobuf\n", + "\n", + "This page explains how to read FlatGeobuf files with SedonaDB.\n", + "\n", + "FlatGeobuf is a cloud-optimized binary format for geographic vector data designed for fast streaming and spatial filtering over HTTP. It has a built-in spatial index, is easily compactible, contains CRS information, and is supported by many engines.\n", + "\n", + "SedonaDB is well-suited for reading FlatGeobuf files because it can leverage the FlatGeobuf index to read only a portion of the file.\n", + "\n", + "The examples on this page show you how to query FlatGeobuf files with SedonaDB over HTTP." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "a746c47d", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "import sedona.db\n", + "\n", + "sd = sedona.db.connect()" + ] + }, + { + "cell_type": "markdown", + "id": "87c9bf67-cb6c-445c-8199-727bacbb412e", + "metadata": {}, + "source": [ + "## Read Microsoft Buildings FlatGeobuf data with SedonaDB\n", + "\n", + "The Microsoft buildings dataset is a comprehensive open dataset of building footprints extracted from satellite imagery using computer vision and deep learning.\n", + "\n", + "Here's how to read the Microsoft buildings dataset into a SedonaDB DataFrame and print a few rows." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "397ef4cf", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "┌─────────────────────────────────┐\n", + "│ wkb_geometry │\n", + "│ geometry │\n", + "╞═════════════════════════════════╡\n", + "│ POINT(-97.16154292 26.08759861) │\n", + "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", + "│ POINT(-97.1606625 26.08481) │\n", + "├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤\n", + "│ POINT(-97.16133375 26.08519809) │\n", + "└─────────────────────────────────┘\n" + ] + } + ], + "source": [ + "url = \"https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/microsoft-buildings_point.fgb.zip\"\n", + "df = sd.read_pyogrio(url)\n", + "df.show(3)" + ] + }, + { + "cell_type": "markdown", + "id": "120e8f67-8914-4545-8f31-d38d5b6d6e7e", + "metadata": {}, + "source": [ + "You can see that the Microsoft Buildings dataset contains the building centroids.\n", + "\n", + "Take a look at the schema and see how it contains the `wkb_geometry` column and the CRS." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "5f4256d2-3ecb-41d1-839b-1deeb22a3600", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "SedonaSchema with 1 field:\n", + " wkb_geometry: geometry" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.schema" + ] + }, + { + "cell_type": "markdown", + "id": "f24f6b0d-44af-403d-b64e-6c745612f8b8", + "metadata": {}, + "source": [ + "Now let's see how to read another FlatGeobuf dataset." 
+ ] + }, + { + "cell_type": "markdown", + "id": "d30ab78a-3692-48ea-836c-ed31d497a5fd", + "metadata": {}, + "source": [ + "## Read Vermont boundary FlatGeobuf data with SedonaDB\n", + "\n", + "The Vermont boundary dataset contains the polygon for the state of Vermont.\n", + "\n", + "The following example shows how to read the Vermont FlatGeobuf dataset and plot it." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "81b0558f", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "url = \"https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/example-crs/files/example-crs_vermont-utm.fgb\"\n", + "sd.read_pyogrio(url).to_pandas().plot()" + ] + }, + { + "cell_type": "markdown", + "id": "23ec9af7-ec3b-45c8-a589-4d92d0cb9c02", + "metadata": {}, + "source": [ + "## Read a portion of a large remote FlatGeobuf file\n", + "\n", + "Now let's look at how to read a portion of a 12GB FlatGeobuf file." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "d887c499-a5d9-4f25-9875-851525b5c88d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "┌──────────────────────────────────┐\n", + "│ sum(population_areas.population) │\n", + "│ int64 │\n", + "╞══════════════════════════════════╡\n", + "│ 256251 │\n", + "└──────────────────────────────────┘\n", + "CPU times: user 16 ms, sys: 15.3 ms, total: 31.4 ms\n", + "Wall time: 493 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "url = \"https://flatgeobuf.septima.dk/population_areas.fgb\"\n", + "sd.read_pyogrio(url).to_view(\"population_areas\", True)\n", + "\n", + "wkt = \"POLYGON ((-73.978329 40.767412, -73.950005 40.767412, -73.950005 40.795098, -73.978329 40.795098, -73.978329 40.767412))\"\n", + "sd.sql(\n", + " f\"\"\"\n", + "SELECT sum(population::INTEGER) FROM population_areas\n", + "WHERE ST_Intersects(wkb_geometry, ST_SetSRID(ST_GeomFromWKT('{wkt}'), 4326))\n", + "\"\"\"\n", + ").show()" + ] + }, + { + "cell_type": "markdown", + "id": "ef6cf480-f4f5-4a9f-9f52-6370fc41af29", + "metadata": {}, + "source": [ + "SedonaDB can query the 12GB FlatGeobuf file in about half of a second on a laptop for this area of interest." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/flatgeobuf.md b/docs/flatgeobuf.md new file mode 100644 index 00000000..1943baa1 --- /dev/null +++ b/docs/flatgeobuf.md @@ -0,0 +1,117 @@ +# SedonaDB + FlatGeobuf + +This page explains how to read FlatGeobuf files with SedonaDB. + +FlatGeobuf is a cloud-optimized binary format for geographic vector data designed for fast streaming and spatial filtering over HTTP. It has a built-in spatial index, is easily compactible, contains CRS information, and is supported by many engines. + +SedonaDB is well-suited for reading FlatGeobuf files because it can leverage the FlatGeobuf index to read only a portion of the file. + +The examples on this page show you how to query FlatGeobuf files with SedonaDB over HTTP. + + +```python +import sedona.db + +sd = sedona.db.connect() +``` + +## Read Microsoft Buildings FlatGeobuf data with SedonaDB + +The Microsoft buildings dataset is a comprehensive open dataset of building footprints extracted from satellite imagery using computer vision and deep learning. + +Here's how to read the Microsoft buildings dataset into a SedonaDB DataFrame and print a few rows. 
+ + +```python +url = "https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/microsoft-buildings_point.fgb.zip" +df = sd.read_pyogrio(url) +df.show(3) +``` + + ┌─────────────────────────────────┐ + │ wkb_geometry │ + │ geometry │ + ╞═════════════════════════════════╡ + │ POINT(-97.16154292 26.08759861) │ + ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ + │ POINT(-97.1606625 26.08481) │ + ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ + │ POINT(-97.16133375 26.08519809) │ + └─────────────────────────────────┘ + + +You can see that the Microsoft Buildings dataset contains the building centroids. + +Take a look at the schema and see how it contains the `wkb_geometry` column and the CRS. + + +```python +df.schema +``` + + + + + SedonaSchema with 1 field: + wkb_geometry: geometry + + + +Now let's see how to read another FlatGeobuf dataset. + +## Read Vermont boundary FlatGeobuf data with SedonaDB + +The Vermont boundary dataset contains the polygon for the state of Vermont. + +The following example shows how to read the Vermont FlatGeobuf dataset and plot it. + + +```python +url = "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/example-crs/files/example-crs_vermont-utm.fgb" +sd.read_pyogrio(url).to_pandas().plot() +``` + + + + + + + + + + +![png](flatgeobuf_files/flatgeobuf_8_1.png) + + + +## Read a portion of a large remote FlatGeobuf file + +Now let's look at how to read a portion of a 12GB FlatGeobuf file. 
+ + +```python +%%time + +url = "https://flatgeobuf.septima.dk/population_areas.fgb" +sd.read_pyogrio(url).to_view("population_areas", True) + +wkt = "POLYGON ((-73.978329 40.767412, -73.950005 40.767412, -73.950005 40.795098, -73.978329 40.795098, -73.978329 40.767412))" +sd.sql( + f""" +SELECT sum(population::INTEGER) FROM population_areas +WHERE ST_Intersects(wkb_geometry, ST_SetSRID(ST_GeomFromWKT('{wkt}'), 4326)) +""" +).show() +``` + + ┌──────────────────────────────────┐ + │ sum(population_areas.population) │ + │ int64 │ + ╞══════════════════════════════════╡ + │ 256251 │ + └──────────────────────────────────┘ + CPU times: user 16 ms, sys: 15.3 ms, total: 31.4 ms + Wall time: 493 ms + + +SedonaDB can query the 12GB FlatGeobuf file in about half a second on a laptop for this area of interest. diff --git a/mkdocs.yml b/mkdocs.yml index f169c866..0cf91486 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -51,6 +51,7 @@ nav: - CRS Examples: crs-examples.md - Delta Lake: delta-lake.md - Iceberg: iceberg.md + - FlatGeobuf: flatgeobuf.md - Working with Parquet Files: working-with-parquet-files.md - Working with SQL in SedonaDB: working-with-sql-sedonadb.md - Contributors Guide: contributors-guide.md