Spark data sources are tables or files that can be loaded from a Spark store (e.g. Hive or in-memory). Alternatively, they can be specified by a SQL query.

**New in Feast:** SparkSource now supports advanced table formats, including **Apache Iceberg**, **Delta Lake**, and **Apache Hudi**, enabling ACID transactions, time travel, and schema evolution. See the [Table Formats guide](table-formats.md) for detailed documentation.

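As a sketch of the query-backed option mentioned above (the `driver_stats` table, its columns, and the source name are hypothetical):

```python
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource

# Instead of a table or file path, a SparkSource can be backed by a SQL query
# that Spark evaluates against the session's catalog. Names are illustrative.
my_query_source = SparkSource(
    name="driver_stats_query_source",
    query="SELECT driver_id, avg_daily_trips, event_timestamp FROM driver_stats",
    timestamp_field="event_timestamp",
)
```

Exactly one of `table`, `query`, or `path` should be provided when defining a SparkSource.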
## Disclaimer

The Spark data source does not have full test coverage, so please do not assume complete stability.

## Examples

### Basic Examples

Using a table reference from SparkSession (for example, either in-memory or a Hive Metastore):

```python
my_spark_source = SparkSource(
    ...
)
```

### Table Format Examples

SparkSource supports advanced table formats for modern data lakehouse architectures. For detailed documentation, configuration options, and best practices, see the **[Table Formats guide](table-formats.md)**.

#### Apache Iceberg

```python
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
from feast.table_format import IcebergFormat

iceberg_format = IcebergFormat(
    catalog="my_catalog",
    namespace="my_database"
)

my_spark_source = SparkSource(
    name="user_features",
    path="my_catalog.my_database.user_table",
    table_format=iceberg_format,
    timestamp_field="event_timestamp"
)
```

#### Delta Lake

```python
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
from feast.table_format import DeltaFormat

delta_format = DeltaFormat()

my_spark_source = SparkSource(
    name="transaction_features",
    path="s3://my-bucket/delta-tables/transactions",
    table_format=delta_format,
    timestamp_field="transaction_timestamp"
)
```

#### Apache Hudi

```python
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
from feast.table_format import HudiFormat

hudi_format = HudiFormat(
    table_type="COPY_ON_WRITE",
    record_key="user_id",
    precombine_field="updated_at"
)

my_spark_source = SparkSource(
    name="user_profiles",
    path="s3://my-bucket/hudi-tables/user_profiles",
    table_format=hudi_format,
    timestamp_field="event_timestamp"
)
```

For advanced configuration, including time travel, incremental queries, and performance tuning, see the **[Table Formats guide](table-formats.md)**.

## Configuration Options

The full set of configuration options is available [here](https://rtd.feast.dev/en/master/#feast.infra.offline_stores.contrib.spark_offline_store.spark_source.SparkSource).

### Table Format Options

- **IcebergFormat**: See [Table Formats - Iceberg](table-formats.md#apache-iceberg)
- **DeltaFormat**: See [Table Formats - Delta Lake](table-formats.md#delta-lake)
- **HudiFormat**: See [Table Formats - Hudi](table-formats.md#apache-hudi)

## Supported Types

Spark data sources support all eight primitive types and their corresponding array types.
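For instance, a SparkSource can back a feature view whose schema mixes primitive and array types. The entity, source, and field names below are hypothetical, shown only to illustrate the type declarations:

```python
from feast import Entity, FeatureView, Field
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
from feast.types import Array, Float32, Int64

# Hypothetical entity keyed on a driver ID.
driver = Entity(name="driver", join_keys=["driver_id"])

# Hypothetical Spark table holding the feature data.
source = SparkSource(
    name="driver_stats_source",
    table="driver_stats",
    timestamp_field="event_timestamp",
)

driver_stats_view = FeatureView(
    name="driver_stats",
    entities=[driver],
    schema=[
        Field(name="avg_daily_trips", dtype=Int64),           # primitive type
        Field(name="trip_distances", dtype=Array(Float32)),   # array type
    ],
    source=source,
)
```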