Commit d0f01d4

bountx, Andrzej Pijanowski, and jonhealy1 authored
Implement Validation for Non-Queryable Attributes in Filtering (#532)
**Related Issue(s):**

- #530

**Description:**

Adds query validation that returns `400 Bad Request` for requests whose parameters do not exist in any queryables, without running an unnecessary search. Validation is optional: it is turned off by default and turned on with the environment variable `VALIDATE_QUERYABLES`. To support this, the PR also introduces a local caching system (global variable `_queryables_cache_instance: QueryablesCache`) that collects the set of queryable parameters from all collections and refreshes it periodically. The refresh interval is set in seconds with the environment variable `QUERYABLES_CACHE_TTL` and defaults to `1800` (30 minutes).

**PR Checklist:**

- [x] Code is formatted and linted (run `pre-commit run --all-files`)
- [x] Tests pass (run `make test`)
- [x] Documentation has been updated to reflect changes, if applicable
- [x] Changes are added to the changelog

---------

Co-authored-by: Andrzej Pijanowski <[email protected]>
Co-authored-by: Jonathan Healy <[email protected]>
1 parent a7cdd15 commit d0f01d4

13 files changed (+992, -22 lines)

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -9,8 +9,14 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 ### Added
 
+- Environment variable `VALIDATE_QUERYABLES` to enable/disable validation of queryables in search/filter requests. When set to `true`, search requests will be validated against the defined queryables, returning an error for any unsupported fields. Defaults to `false` for backward compatibility. [#532](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/532)
+
+- Environment variable `QUERYABLES_CACHE_TTL` to configure the TTL (in seconds) for caching queryables. Default is `1800` seconds (30 minutes) to balance performance and freshness of queryables data. [#532](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/532)
+
 - Added optional `/catalogs` route support to enable federated hierarchical catalog browsing and navigation. [#547](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/547)
+
 - Added DELETE `/catalogs/{catalog_id}/collections/{collection_id}` endpoint to support removing collections from catalogs. When a collection belongs to multiple catalogs, it removes only the specified catalog from the collection's parent_ids. When a collection belongs to only one catalog, the collection is deleted entirely. [#554](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/554)
+
 - Added `parent_ids` internal field to collections to support multi-catalog hierarchies. Collections can now belong to multiple catalogs, with parent catalog IDs stored in this field for efficient querying and management. [#554](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/554)
 
 ### Changed
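
The two environment variables described in the changelog entries are read with plain string parsing in the `reload_settings` method shown later in this commit. The sketch below restates that parsing as a standalone function to make one consequence visible: only the string `true` (in any letter case) enables validation, so values such as `1` or `yes` leave it off. The function name `read_settings` and the dictionary argument are assumptions made for illustration; the commit itself reads `os.environ` directly.

```python
from typing import Mapping, Tuple


def read_settings(environ: Mapping[str, str]) -> Tuple[bool, int]:
    """Parse the two settings the same way reload_settings does in this commit.

    `environ` is a hypothetical stand-in for os.environ so the sketch can be
    run without touching the real environment.
    """
    # Only the literal string "true" (case-insensitive) enables validation.
    enabled = environ.get("VALIDATE_QUERYABLES", "false").lower() == "true"
    # The TTL is a number of seconds; a non-numeric value raises ValueError.
    ttl = int(environ.get("QUERYABLES_CACHE_TTL", "1800"))
    return enabled, ttl


print(read_settings({}))
print(read_settings({"VALIDATE_QUERYABLES": "TRUE", "QUERYABLES_CACHE_TTL": "60"}))
print(read_settings({"VALIDATE_QUERYABLES": "1"}))
```

With this parsing, an empty environment gives the documented defaults, and a deployment that sets `VALIDATE_QUERYABLES=1` would silently keep validation disabled.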

README.md

Lines changed: 26 additions & 1 deletion
@@ -469,8 +469,10 @@ You can customize additional settings in your `.env` file:
 | `STAC_INDEX_ASSETS` | Controls if Assets are indexed when added to Elasticsearch/Opensearch. This allows asset fields to be included in search queries. | `false` | Optional |
 | `USE_DATETIME` | Configures the datetime search behavior in SFEOS. When enabled, searches both datetime field and falls back to start_datetime/end_datetime range for items with null datetime. When disabled, searches only by start_datetime/end_datetime range. | `true` | Optional |
 | `USE_DATETIME_NANOS` | Enables nanosecond precision handling for `datetime` field searches as per the `date_nanos` type. When `False`, it uses 3 millisecond precision as per the type `date`. | `true` | Optional |
-| `EXCLUDED_FROM_QUERYABLES` | Comma-separated list of fully qualified field names to exclude from the queryables endpoint and filtering. Use full paths like `properties.auth:schemes,properties.storage:schemes`. Excluded fields and their nested children will not be exposed in queryables. | None | Optional |
+| `EXCLUDED_FROM_QUERYABLES` | Comma-separated list of fully qualified field names to exclude from the queryables endpoint and filtering. Use full paths like `properties.auth:schemes,properties.storage:schemes`. Excluded fields and their nested children will not be exposed in queryables. If `VALIDATE_QUERYABLES` is enabled, these fields will also be considered invalid for filtering. | None | Optional |
 | `EXCLUDED_FROM_ITEMS` | Specifies fields to exclude from STAC item responses. Supports comma-separated field names and dot notation for nested fields (e.g., `private_data,properties.confidential,assets.internal`). | `None` | Optional |
+| `VALIDATE_QUERYABLES` | Enable validation of query parameters against the collection's queryables. If set to `true`, the API will reject queries containing fields that are not defined in the collection's queryables. | `false` | Optional |
+| `QUERYABLES_CACHE_TTL` | Time-to-live (in seconds) for the queryables cache. Used when `VALIDATE_QUERYABLES` is enabled. | `1800` | Optional |
 
 
 > [!NOTE]
@@ -526,6 +528,29 @@ EXCLUDED_FROM_QUERYABLES="properties.auth:schemes,properties.storage:schemes,pro
 - Excluded fields and their nested children will be skipped during field traversal
 - Both the field itself and any nested properties will be excluded
 
+## Queryables Validation
+
+SFEOS supports validating query parameters against the collection's defined queryables. This ensures that users only query fields that are explicitly exposed and indexed.
+
+**Configuration:**
+
+To enable queryables validation, set the following environment variables:
+
+```bash
+VALIDATE_QUERYABLES=true
+QUERYABLES_CACHE_TTL=1800  # Optional, defaults to 1800 seconds (30 minutes)
+```
+
+**Behavior:**
+
+- When enabled, the API maintains a cache of all queryable fields across all collections.
+- Search requests (both GET and POST) are checked against this cache.
+- If a request contains a query parameter or filter field that is not in the list of allowed queryables, the API returns a `400 Bad Request` error with a message indicating the invalid field(s).
+- The cache is automatically refreshed based on the `QUERYABLES_CACHE_TTL` setting.
+- **Interaction with `EXCLUDED_FROM_QUERYABLES`**: If `VALIDATE_QUERYABLES` is enabled, fields listed in `EXCLUDED_FROM_QUERYABLES` will also be considered invalid for filtering. This effectively enforces the exclusion of these fields from search queries.
+
+This feature helps prevent queries on non-queryable fields which could lead to unnecessary load on the database.
+
 ## Datetime-Based Index Management
 
 ### Overview
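
The behavior described in the new README section comes down to a set difference between the fields a request names and the cached set of allowed queryables. The sketch below illustrates that check for the `query` member of a POST search body. The helper name `invalid_query_fields`, the allowed set, and the request body are all hypothetical; the message prefix follows the one built in `QueryablesCache.validate` in this commit.

```python
from typing import Any, Dict, Set


def invalid_query_fields(body: Dict[str, Any], allowed: Set[str]) -> Set[str]:
    """Return the field names under `query` that are not in the allowed set."""
    requested = set(body.get("query", {}).keys())
    return requested - allowed


# Hypothetical queryables and request body, for illustration only.
allowed = {"eo:cloud_cover", "datetime", "platform"}
body = {
    "collections": ["example-collection"],
    "query": {"eo:cloud_cover": {"lt": 10}, "not_a_field": {"eq": 1}},
}

invalid = invalid_query_fields(body, allowed)
if invalid:
    # With validation enabled, the API would answer 400 with a detail like this.
    print(f"Invalid query fields: {', '.join(sorted(invalid))}.")
```

Because the allowed set is the union over all collections, a field that exists in any one collection passes the check even when the request targets a different collection.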

stac_fastapi/core/stac_fastapi/core/base_database_logic.py

Lines changed: 4 additions & 0 deletions
@@ -140,6 +140,10 @@ async def delete_collection(
         pass

     @abc.abstractmethod
+    async def get_queryables_mapping(self, collection_id: str = "*") -> Dict[str, Any]:
+        """Retrieve mapping of Queryables for search."""
+        pass
+
     async def get_all_catalogs(
         self,
         token: Optional[str],

stac_fastapi/core/stac_fastapi/core/core.py

Lines changed: 14 additions & 0 deletions
@@ -24,6 +24,10 @@
 from stac_fastapi.core.base_settings import ApiBaseSettings
 from stac_fastapi.core.datetime_utils import format_datetime_range
 from stac_fastapi.core.models.links import PagingLinks
+from stac_fastapi.core.queryables import (
+    QueryablesCache,
+    get_properties_from_cql2_filter,
+)
 from stac_fastapi.core.serializers import (
     CatalogSerializer,
     CollectionSerializer,
@@ -92,6 +96,10 @@ class CoreClient(AsyncBaseCoreClient):
     title: str = attr.ib(default="stac-fastapi")
     description: str = attr.ib(default="stac-fastapi")

+    def __attrs_post_init__(self):
+        """Initialize the queryables cache."""
+        self.queryables_cache = QueryablesCache(self.database)
+
     def extension_is_enabled(self, extension_name: str) -> bool:
         """Check if an extension is enabled by checking self.extensions.

@@ -844,6 +852,8 @@ async def post_search(
             )

         if hasattr(search_request, "query") and getattr(search_request, "query"):
+            query_fields = set(getattr(search_request, "query").keys())
+            await self.queryables_cache.validate(query_fields)
             for field_name, expr in getattr(search_request, "query").items():
                 field = "properties__" + field_name
                 for op, value in expr.items():
@@ -862,7 +872,11 @@ async def post_search(

         if cql2_filter is not None:
             try:
+                query_fields = get_properties_from_cql2_filter(cql2_filter)
+                await self.queryables_cache.validate(query_fields)
                 search = await self.database.apply_cql2_filter(search, cql2_filter)
+            except HTTPException:
+                raise
             except Exception as e:
                 raise HTTPException(
                     status_code=400, detail=f"Error with cql2 filter: {e}"
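
The CQL2 branch in the last hunk does two things: it collects the property names from the filter, and it adds `except HTTPException: raise` ahead of the generic handler so that the validator's own 400 message is not re-wrapped. The following is a minimal sketch of both steps under stated assumptions: `HTTPException` here is a local stand-in class rather than the FastAPI one, `extract_properties` is a condensed restatement of `get_properties_from_cql2_filter` that uses `str.removeprefix` (Python 3.9+) instead of slicing, and the filter and allowed set are invented for illustration.

```python
from typing import Any, Callable, Dict, Set


class HTTPException(Exception):
    """Stand-in for fastapi.HTTPException so the sketch has no dependencies."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def extract_properties(cql2_filter: Dict[str, Any]) -> Set[str]:
    """Recursively collect property names, dropping a leading 'properties.'."""
    props: Set[str] = set()
    if "op" in cql2_filter and "args" in cql2_filter:
        for arg in cql2_filter["args"]:
            if isinstance(arg, dict):
                if "op" in arg:
                    props.update(extract_properties(arg))
                elif "property" in arg:
                    props.add(arg["property"].removeprefix("properties."))
    return props


def validate(fields: Set[str], allowed: Set[str]) -> None:
    """Raise a 400 if any field is outside the allowed set."""
    invalid = fields - allowed
    if invalid:
        raise HTTPException(400, f"Invalid query fields: {', '.join(sorted(invalid))}.")


def checked(cql2_filter: Dict[str, Any], allowed: Set[str]) -> None:
    """Same except ordering as the hunk: HTTPException passes through untouched."""
    try:
        validate(extract_properties(cql2_filter), allowed)
    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(400, f"Error with cql2 filter: {e}")


def checked_without_reraise(cql2_filter: Dict[str, Any], allowed: Set[str]) -> None:
    """Variant without the first except clause, for comparison only."""
    try:
        validate(extract_properties(cql2_filter), allowed)
    except Exception as e:
        raise HTTPException(400, f"Error with cql2 filter: {e}")


def detail_of(func: Callable[..., None], cql2_filter: Dict[str, Any], allowed: Set[str]) -> str:
    """Return the detail a client would see, or '' when the filter passes."""
    try:
        func(cql2_filter, allowed)
    except HTTPException as exc:
        return exc.detail
    return ""


# Hypothetical filter and allowed set, for illustration only.
sample_filter = {
    "op": "and",
    "args": [
        {"op": "<", "args": [{"property": "properties.eo:cloud_cover"}, 10]},
        {"op": "=", "args": [{"property": "platform"}, "sentinel-2a"]},
    ],
}
allowed = {"eo:cloud_cover"}

fields = extract_properties(sample_filter)
kept = detail_of(checked, sample_filter, allowed)
wrapped = detail_of(checked_without_reraise, sample_filter, allowed)
print(sorted(fields))
print(kept)
print(wrapped)
```

In this sketch the first variant reports the validator's message as is, while the second prefixes it with `Error with cql2 filter:`. The exact wrapped text with the real FastAPI exception class may differ, since its string form is not the same as this stand-in's.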
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
+"""A module for managing queryable attributes."""
+
+import asyncio
+import os
+import time
+from typing import Any, Dict, List, Set
+
+from fastapi import HTTPException
+
+
+class QueryablesCache:
+    """A thread-safe, time-based cache for queryable properties."""
+
+    def __init__(self, database_logic: Any):
+        """
+        Initialize the QueryablesCache.
+
+        Args:
+            database_logic: An instance of a class with a `get_queryables_mapping` method.
+        """
+        self._db_logic = database_logic
+        self._cache: Dict[str, List[str]] = {}
+        self._all_queryables: Set[str] = set()
+        self._last_updated: float = 0
+        self._lock = asyncio.Lock()
+        self.validation_enabled: bool = False
+        self.cache_ttl: int = 1800  # How often to refresh cache (in seconds)
+        self.reload_settings()
+
+    def reload_settings(self):
+        """Reload settings from environment variables."""
+        self.validation_enabled = (
+            os.getenv("VALIDATE_QUERYABLES", "false").lower() == "true"
+        )
+        self.cache_ttl = int(os.getenv("QUERYABLES_CACHE_TTL", "1800"))
+
+    async def _update_cache(self):
+        """Update the cache with the latest queryables from the database."""
+        if not self.validation_enabled:
+            return
+
+        async with self._lock:
+            if (time.time() - self._last_updated < self.cache_ttl) and self._cache:
+                return
+
+            queryables_mapping = await self._db_logic.get_queryables_mapping()
+            all_queryables_set = set(queryables_mapping.keys())
+
+            self._all_queryables = all_queryables_set
+
+            self._cache = {"*": list(all_queryables_set)}
+            self._last_updated = time.time()
+
+    async def get_all_queryables(self) -> Set[str]:
+        """
+        Return a set of all queryable attributes across all collections.
+
+        This method will update the cache if it's stale or has been cleared.
+        """
+        if not self.validation_enabled:
+            return set()
+
+        if (time.time() - self._last_updated >= self.cache_ttl) or not self._cache:
+            await self._update_cache()
+        return self._all_queryables
+
+    async def validate(self, fields: Set[str]) -> None:
+        """
+        Validate if the provided fields are queryable.
+
+        Raises HTTPException if invalid fields are found.
+        """
+        if not self.validation_enabled:
+            return
+
+        allowed_fields = await self.get_all_queryables()
+        invalid_fields = fields - allowed_fields
+        if invalid_fields:
+            raise HTTPException(
+                status_code=400,
+                detail=f"Invalid query fields: {', '.join(sorted(invalid_fields))}. "
+                "These fields are not defined in the collection's queryables. "
+                "Use the /queryables endpoint to see available fields.",
+            )
+
+
+def get_properties_from_cql2_filter(cql2_filter: Dict[str, Any]) -> Set[str]:
+    """Recursively extract property names from a CQL2 filter.
+
+    Property names are normalized by stripping the 'properties.' prefix
+    if present, to match queryables stored without the prefix.
+    """
+    props: Set[str] = set()
+    if "op" in cql2_filter and "args" in cql2_filter:
+        for arg in cql2_filter["args"]:
+            if isinstance(arg, dict):
+                if "op" in arg:
+                    props.update(get_properties_from_cql2_filter(arg))
+                elif "property" in arg:
+                    prop_name = arg["property"]
+                    # Strip 'properties.' prefix if present
+                    if prop_name.startswith("properties."):
+                        prop_name = prop_name[11:]
+                    props.add(prop_name)
+    return props
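
`QueryablesCache` combines a TTL check with an `asyncio.Lock` and repeats the freshness check inside the lock, so that coroutines arriving together on a stale cache cause a single database read. The sketch below isolates that pattern. It is not the class from the commit: `TTLQueryables` and `FakeDatabase` are hypothetical names, the fake database returns invented fields, and `time.monotonic()` is used in place of the `time.time()` calls in the diff.

```python
import asyncio
import time
from typing import Dict, Set


class FakeDatabase:
    """Hypothetical stand-in for the database logic; counts how often it is hit."""

    def __init__(self) -> None:
        self.calls = 0

    async def get_queryables_mapping(self) -> Dict[str, str]:
        self.calls += 1
        await asyncio.sleep(0)  # yield once, as a real network call would
        return {"eo:cloud_cover": "properties.eo:cloud_cover", "id": "id"}


class TTLQueryables:
    """Condensed sketch of the TTL plus lock pattern used by the cache."""

    def __init__(self, db: FakeDatabase, ttl: float) -> None:
        self._db = db
        self._ttl = ttl
        self._fields: Set[str] = set()
        self._last_updated = 0.0
        self._lock = asyncio.Lock()

    def _is_fresh(self) -> bool:
        age = time.monotonic() - self._last_updated
        return bool(self._fields) and age < self._ttl

    async def get(self) -> Set[str]:
        if not self._is_fresh():
            async with self._lock:
                # Re-check inside the lock: another coroutine may have
                # refreshed the cache while this one was waiting.
                if not self._is_fresh():
                    mapping = await self._db.get_queryables_mapping()
                    self._fields = set(mapping.keys())
                    self._last_updated = time.monotonic()
        return self._fields


async def main() -> int:
    db = FakeDatabase()
    cache = TTLQueryables(db, ttl=60)
    # Ten concurrent lookups on a cold cache should trigger one refresh.
    await asyncio.gather(*(cache.get() for _ in range(10)))
    return db.calls


calls = asyncio.run(main())
print(calls)
```

An `asyncio.Lock` serializes coroutines on one event loop; it does not protect against access from multiple OS threads, which is worth keeping in mind when reading the "thread-safe" wording in the class docstring.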

stac_fastapi/sfeos_helpers/stac_fastapi/sfeos_helpers/database/mapping.py

Lines changed: 83 additions & 10 deletions
@@ -3,14 +3,62 @@
 This module provides functions for working with Elasticsearch/OpenSearch mappings.
 """

-from typing import Any, Dict
+import os
+from collections import deque
+from typing import Any, Dict, Set
+
+
+def _get_excluded_from_queryables() -> Set[str]:
+    """Get fields to exclude from queryables endpoint and filtering.
+
+    Reads from EXCLUDED_FROM_QUERYABLES environment variable.
+    Supports comma-separated list of field names.
+
+    For each exclusion pattern, both the original and the version with/without
+    'properties.' prefix are included. This ensures fields are excluded regardless
+    of whether they appear at the top level or under 'properties' in the mapping.
+
+    Example:
+        EXCLUDED_FROM_QUERYABLES="properties.auth:schemes,storage:schemes"
+
+        This will exclude:
+        - properties.auth:schemes (and children like properties.auth:schemes.s3.type)
+        - auth:schemes (and children like auth:schemes.s3.type)
+        - storage:schemes (and children)
+        - properties.storage:schemes (and children)
+
+    Returns:
+        Set[str]: Set of field names to exclude from queryables
+    """
+    excluded = os.getenv("EXCLUDED_FROM_QUERYABLES", "")
+    if not excluded:
+        return set()
+
+    result = set()
+    for field in excluded.split(","):
+        field = field.strip()
+        if not field:
+            continue
+
+        result.add(field)
+
+        if field.startswith("properties."):
+            result.add(field.removeprefix("properties."))
+        else:
+            result.add(f"properties.{field}")
+
+    return result


 async def get_queryables_mapping_shared(
-    mappings: Dict[str, Dict[str, Any]], collection_id: str = "*"
+    mappings: Dict[str, Dict[str, Any]],
+    collection_id: str = "*",
 ) -> Dict[str, str]:
     """Retrieve mapping of Queryables for search.

+    Fields listed in the EXCLUDED_FROM_QUERYABLES environment variable will be
+    excluded from the result, along with their children.
+
     Args:
         mappings (Dict[str, Dict[str, Any]]): The mapping information returned from
             Elasticsearch/OpenSearch client's indices.get_mapping() method.
@@ -20,19 +68,44 @@ async def get_queryables_mapping_shared(

     Returns:
         Dict[str, str]: A dictionary containing the Queryables mappings, where keys are
-            field names and values are the corresponding paths in the Elasticsearch/OpenSearch
-            document structure.
+            field names (with 'properties.' prefix removed) and values are the
+            corresponding paths in the Elasticsearch/OpenSearch document structure.
     """
     queryables_mapping = {}
+    excluded = _get_excluded_from_queryables()
+
+    def is_excluded(path: str) -> bool:
+        """Check if the path starts with any excluded prefix."""
+        return any(
+            path == prefix or path.startswith(prefix + ".") for prefix in excluded
+        )

     for mapping in mappings.values():
-        fields = mapping["mappings"].get("properties", {})
-        properties = fields.pop("properties", {}).get("properties", {}).keys()
+        mapping_properties = mapping["mappings"].get("properties", {})
+
+        stack: deque[tuple[str, Dict[str, Any]]] = deque(mapping_properties.items())
+
+        while stack:
+            field_fqn, field_def = stack.popleft()
+
+            nested_properties = field_def.get("properties")
+            if nested_properties:
+                stack.extend(
+                    (f"{field_fqn}.{k}", v)
+                    for k, v in nested_properties.items()
+                    if v.get("enabled", True) and not is_excluded(f"{field_fqn}.{k}")
+                )
+
+            field_type = field_def.get("type")
+            if (
+                not field_type
+                or not field_def.get("enabled", True)
+                or is_excluded(field_fqn)
+            ):
+                continue

-        for field_key in fields:
-            queryables_mapping[field_key] = field_key
+            field_name = field_fqn.removeprefix("properties.")

-        for property_key in properties:
-            queryables_mapping[property_key] = f"properties.{property_key}"
+            queryables_mapping[field_name] = field_fqn

     return queryables_mapping
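
The rewritten `get_queryables_mapping_shared` replaces the old two-level lookup with a breadth-first walk over the index mapping that keeps only typed, enabled, non-excluded fields. The sketch below runs a condensed version of that walk on a small mapping fragment. The function names `expand_exclusions` and `flatten` are hypothetical, the exclusion list is passed as an argument instead of being read from `EXCLUDED_FROM_QUERYABLES`, and the sample mapping is invented for illustration.

```python
from collections import deque
from typing import Any, Dict, Set


def expand_exclusions(raw: str) -> Set[str]:
    """Add each pattern both with and without the 'properties.' prefix."""
    result: Set[str] = set()
    for field in (part.strip() for part in raw.split(",")):
        if not field:
            continue
        result.add(field)
        if field.startswith("properties."):
            result.add(field.removeprefix("properties."))
        else:
            result.add(f"properties.{field}")
    return result


def flatten(properties: Dict[str, Any], excluded: Set[str]) -> Dict[str, str]:
    """Breadth-first walk that keeps typed, enabled, non-excluded fields."""

    def is_excluded(path: str) -> bool:
        return any(path == p or path.startswith(p + ".") for p in excluded)

    result: Dict[str, str] = {}
    queue = deque(properties.items())
    while queue:
        path, definition = queue.popleft()
        # Enqueue children first; object fields carry 'properties' but no 'type'.
        for key, child in definition.get("properties", {}).items():
            child_path = f"{path}.{key}"
            if child.get("enabled", True) and not is_excluded(child_path):
                queue.append((child_path, child))
        has_type = bool(definition.get("type"))
        if has_type and definition.get("enabled", True) and not is_excluded(path):
            # Key: name without the prefix. Value: full path in the document.
            result[path.removeprefix("properties.")] = path
    return result


# Hypothetical index mapping fragment, for illustration only.
sample = {
    "id": {"type": "keyword"},
    "properties": {
        "properties": {
            "eo:cloud_cover": {"type": "float"},
            "auth:schemes": {
                "properties": {"s3": {"properties": {"type": {"type": "keyword"}}}}
            },
        }
    },
}

queryables = flatten(sample, expand_exclusions("auth:schemes"))
print(queryables)
```

Here the exclusion `auth:schemes` is expanded to also cover `properties.auth:schemes`, so the nested `s3.type` field never enters the queue, and only `id` and `eo:cloud_cover` remain.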
