Add security threat model (THREAT_MODEL.md) + SECURITY.md/AGENTS.md discoverability #6535
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,52 @@ | ||
| <!-- | ||
| Licensed to the Apache Software Foundation (ASF) under one or more | ||
| contributor license agreements. See the NOTICE file distributed with | ||
| this work for additional information regarding copyright ownership. | ||
| The ASF licenses this file to you under the Apache License, Version 2.0 | ||
| (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| --> | ||
|
|
||
| # AGENTS.md | ||
|
|
||
| ## Security | ||
|
|
||
| You are helping a security researcher find and report vulnerabilities in | ||
| Apache Hive. Before drafting any report or reaching any conclusion, complete | ||
| these steps. | ||
|
|
||
| ### Step 1 — Read the threat model | ||
| Read **[THREAT_MODEL.md](THREAT_MODEL.md)**: the trust boundaries (the | ||
| HiveServer2 SQL front door, the Metastore, the query/UDF execution layer), the | ||
| adversaries in and out of scope, and what Hive upholds versus what it leaves to | ||
| the operator. | ||
|
|
||
| ### Step 2 — Read the security policy | ||
| Read **[SECURITY.md](SECURITY.md)** for how to report. | ||
|
|
||
| ### Key scoping facts (see THREAT_MODEL.md) | ||
| - The **HiveServer2** SQL front door is the primary untrusted boundary; the | ||
| Metastore and execution cluster are assumed to run inside an | ||
| operator-controlled perimeter. | ||
| - **UDFs, SerDes, custom InputFormats, and `TRANSFORM` scripts are | ||
| code-execution by design**, not a sandbox — running authorized code is a | ||
| feature, not a vulnerability. | ||
| - Transport security (TLS), the choice of authorization model (Ranger / | ||
| SQL-standard / storage-based), and network isolation are **operator** | ||
| responsibilities, not engine invariants. | ||
| - Hive does **not** defend against an operator with `root`, the Hadoop | ||
| superuser, or direct HDFS / metastore-DB access. | ||
|
|
||
| ### Step 3 — Route the finding | ||
| Route the finding to exactly one disposition in **THREAT_MODEL.md §13** | ||
| (VALID, or one of the `OUT-OF-MODEL` / `BY-DESIGN` dispositions) and cite the | ||
| section that justifies the call. This model is **v0** — open questions for the | ||
| PMC are in §14. |
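
Step 3 above, read together with the rule in THREAT_MODEL.md §13 of this PR, can be sketched in Python. This is only an illustration, not part of the PR: the `Finding` fields and the `route` helper are hypothetical names, and the sketch assumes the triager supplies the non-VALID disposition, since §13 lists the options without mapping each failed condition to one of them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Disposition(Enum):
    # Labels copied from THREAT_MODEL.md §13.
    VALID = "VALID"
    ADVERSARY_NOT_IN_SCOPE = "OUT-OF-MODEL: adversary-not-in-scope"
    EQUIVALENT_HARM = "OUT-OF-MODEL: equivalent-harm"
    UNSUPPORTED_COMPONENT = "OUT-OF-MODEL: unsupported-component"
    NON_DEFAULT_BUILD = "OUT-OF-MODEL: non-default-build"
    PROPERTY_DISCLAIMED = "BY-DESIGN: property-disclaimed"


@dataclass(frozen=True)
class Finding:
    # Hypothetical triage answers; the field names are illustrative only.
    property_claimed: bool   # the violated property is one Hive claims (§5)
    attacker_in_scope: bool  # the attacker is in scope (§3)
    in_model_surface: bool   # the code is on an in-model surface (§2/§4)
    untrusted_input: bool    # that surface is reached by untrusted input


def is_valid(finding: Finding) -> bool:
    """VALID only when every §13 condition holds."""
    return all((
        finding.property_claimed,
        finding.attacker_in_scope,
        finding.in_model_surface,
        finding.untrusted_input,
    ))


def route(
    finding: Finding,
    cited_section: str,
    otherwise: Optional[Disposition] = None,
) -> Tuple[Disposition, str]:
    """Return exactly one disposition plus the section that justifies it."""
    if not cited_section.strip():
        raise ValueError("cite the section that justifies the call")
    if is_valid(finding):
        return Disposition.VALID, cited_section
    if otherwise is None or otherwise is Disposition.VALID:
        raise ValueError("a non-VALID finding needs one OUT-OF-MODEL or BY-DESIGN disposition")
    return otherwise, cited_section
```

For example, a report that assumes an attacker with direct HDFS access fails the in-scope check, so `route(finding, "§3", Disposition.ADVERSARY_NOT_IN_SCOPE)` returns that disposition together with the cited section.
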
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| <!-- | ||
| Licensed to the Apache Software Foundation (ASF) under one or more | ||
| contributor license agreements. See the NOTICE file distributed with | ||
| this work for additional information regarding copyright ownership. | ||
| The ASF licenses this file to you under the Apache License, Version 2.0 | ||
| (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| --> | ||
|
|
||
| # Security Policy | ||
|
|
||
| ## Reporting a Vulnerability | ||
|
|
||
| Please report suspected security vulnerabilities in Apache Hive **privately** | ||
| to the Hive security list at `security@hive.apache.org`, following the | ||
| [Apache Software Foundation security process](https://www.apache.org/security/). | ||
| Do **not** open public GitHub issues or pull requests for security reports — a | ||
| private report lets the issue be investigated and fixed before disclosure. | ||
|
|
||
| ## Threat Model | ||
|
|
||
| A threat model for Apache Hive is maintained in | ||
| [THREAT_MODEL.md](THREAT_MODEL.md). It describes the trust boundaries (the | ||
| HiveServer2 SQL front door, the Metastore, the query/UDF execution layer), the | ||
| adversaries in and out of scope, the security properties Hive upholds given its | ||
| deployment assumptions versus those left to the operator (transport security, | ||
| authorization-model choice, network isolation, UDF vetting), and the recurring | ||
| non-findings. Triagers of scanner, fuzzer, or AI-generated findings should | ||
| route each through `THREAT_MODEL.md` §13. | ||
|
|
||
| The threat model is **v0** and carries open questions for the Hive PMC in | ||
| `THREAT_MODEL.md` §14. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,223 @@ | ||
| <!-- | ||
| Licensed to the Apache Software Foundation (ASF) under one or more | ||
| contributor license agreements. See the NOTICE file distributed with | ||
| this work for additional information regarding copyright ownership. | ||
| The ASF licenses this file to you under the Apache License, Version 2.0 | ||
| (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| --> | ||
|
|
||
| # Apache Hive — Threat Model (v0 draft) | ||
|
|
||
| > **Status:** v0 draft produced by the ASF Security team (Michael Scovetta | ||
| > rubric, run with Claude Opus) for the Apache Hive PMC to review, correct, | ||
| > and own. Every non-trivial claim is provenance-tagged | ||
| > *(documented)* / *(maintainer)* / *(inferred)*; the *(inferred)* tags are | ||
| > the producer's hypotheses and each has a matching question in §14. Written | ||
| > against `master`; revise on a new public-facing surface (a new server | ||
| > endpoint, auth mechanism, file format, or execution engine), not on | ||
| > internal refactors. | ||
|
|
||
| ## §1 — Purpose and consumers | ||
|
|
||
| This document describes the **implicit security contract** between Apache | ||
| Hive and its downstream operators: what Hive assumes about its environment, | ||
| what it upholds given those assumptions, what it leaves to the operator, and | ||
| which "syntactically possible" misuses fall outside the intended design. It | ||
| serves the **integrator/operator** (which threats they own) and the | ||
| **triager** (classifying a scanner/AI/CVE-style finding as valid, out of | ||
| model, or disclaimed by design — cite the section). | ||
|
|
||
| ## §2 — What Hive is | ||
|
|
||
| Apache Hive is a **SQL data-warehouse layer over Apache Hadoop** *(documented: | ||
| README)*. The in-scope component families: | ||
|
|
||
| | Component | Role | Primary surface | | ||
| | --- | --- | --- | | ||
| | **HiveServer2 (HS2)** | the SQL front door — accepts queries over a Thrift/binary or HTTP transport, authenticates the session, compiles + authorizes + executes | network (the highest-value untrusted boundary) *(inferred — §14 Q1)* | | ||
| | **Hive Metastore (HMS)** | a Thrift service holding table/partition/schema metadata + storage locations | network (intra-cluster) *(inferred — §14 Q1)* | | ||
| | **Query compiler + execution** | parse → plan → run on Tez / MapReduce / Spark; reads/writes HDFS, HBase, object stores | depends on the configured engine *(documented: README)* | | ||
| | **UDF / SerDe / file-format layer** | user-supplied or built-in functions and (de)serializers invoked during execution | in-JVM code execution *(inferred — §14 Q2)* | | ||
| | **JDBC/ODBC drivers + Beeline** | client-side connectors | client trust domain | | ||
|
|
||
| Hive is **not** a standalone secured appliance: it is a clustered service | ||
| that depends on Hadoop (HDFS, YARN), a metastore RDBMS, and typically an | ||
| external authorization service (Apache Ranger or SQL-standard authorization) | ||
| and a KDC for Kerberos *(inferred — §14 Q3)*. | ||
|
|
||
| ## §3 — Adversaries in and out of scope | ||
|
|
||
| **In scope** *(inferred — §14 Q4)*: | ||
|
|
||
| 1. A **SQL client** connecting to HiveServer2 with valid or attempted-invalid | ||
| credentials, trying to read/modify data outside their authorization, or to | ||
| reach the host through query features. | ||
| 2. A **network adversary** between client and HS2 / between HS2 and HMS, where | ||
| transport security is not configured. | ||
|
|
||
| **Out of scope** *(inferred — §14 Q5)*: | ||
|
|
||
| 3. **An operator with `root` / the Hadoop superuser / direct HDFS or metastore-DB | ||
| access.** Anyone who already controls the storage layer or the cluster | ||
| processes is not an adversary Hive defends against → `OUT-OF-MODEL: | ||
| adversary-not-in-scope`. | ||
| 4. **A trusted authenticated admin** performing an authorized action (creating a | ||
| function, changing config, granting a role). A new path to a privilege the | ||
| principal already holds is `OUT-OF-MODEL: equivalent-harm`. | ||
| 5. **Bugs in the dependencies Hive orchestrates** — Hadoop/HDFS, YARN, Tez, | ||
| the metastore RDBMS, Ranger, the KDC, the JVM. Report upstream → | ||
| `OUT-OF-MODEL: unsupported-component`. | ||
|
|
||
| ## §4 — Trust boundaries | ||
|
|
||
| - **Client → HiveServer2** is the primary boundary. The session is | ||
| authenticated (Kerberos / LDAP / PAM / custom / none) and every statement is | ||
| authorization-checked before execution *(inferred — §14 Q6)*. Whether the SQL | ||
| text, JDBC connection properties, and session-configuration overrides are | ||
| treated as untrusted at this boundary is the load-bearing question for | ||
| triage. | ||
| - **HiveServer2 → Metastore** and **HS2 → execution engine / HDFS** are | ||
| intra-cluster boundaries assumed to run inside an operator-controlled, | ||
| network-isolated perimeter *(inferred — §14 Q3)*. | ||
| - **`doAs` impersonation:** when enabled, HS2 executes work as the connected | ||
| end user against HDFS rather than as the Hive service principal; when | ||
| disabled, all access runs as the Hive principal and authorization is fully | ||
| delegated to the SQL-layer authorizer *(inferred — §14 Q7)*. The two modes | ||
| have materially different blast radii and which is "the" supported posture | ||
| is a §14 question. | ||
|
|
||
| ## §5 — What Hive upholds (given §3/§4 assumptions) | ||
|
|
||
| *(all inferred — §14 Q8)* | ||
|
|
||
| - **Authentication** of the HS2 session via the configured mechanism before any | ||
| statement runs. | ||
| - **Authorization** of each statement against the configured model (Ranger / | ||
| SQL-standard / storage-based), scoped to the principal's granted privileges | ||
| on the named objects. | ||
| - **Memory safety on well-formed input** to the extent the JVM provides it; | ||
| Hive is Java, so classic memory-corruption is out of the language model. | ||
|
|
||
| ## §6 — What Hive leaves to the operator | ||
|
|
||
| *(inferred — §14 Q9)* | ||
|
|
||
| - **Transport security (TLS)** on the HS2 and Metastore endpoints, and the KDC | ||
| / LDAP server's own security. | ||
| - **Choosing and configuring an authorization model.** Storage-based, | ||
| SQL-standard, and Ranger give materially different guarantees; the default | ||
| posture (and whether "no authorization configured" is a supported production | ||
| mode) is the operator's call. | ||
| - **Network isolation** of the Metastore, the metastore RDBMS, and the | ||
| execution cluster from untrusted networks. | ||
| - **Vetting UDFs / SerDes / aux JARs.** Adding a function or SerDe is adding | ||
| code to the execution JVM (see §9). | ||
|
|
||
| ## §7 — Properties Hive does *not* uphold (by design) | ||
|
|
||
| *(inferred — §14 Q10)* | ||
|
|
||
| - **A sandbox around UDFs, SerDes, custom InputFormats, or `TRANSFORM`/script | ||
| operators.** Code a principal is authorized to register or invoke runs with | ||
| the privileges of the execution process; this is a feature, not a | ||
| containment boundary. `BY-DESIGN: property-disclaimed`. | ||
| - **Protection against an operator who controls the underlying storage, | ||
| metastore DB, or cluster processes** (see §3 item 3). | ||
| - **Resource fairness / DoS protection as a hard guarantee.** A sufficiently | ||
| expensive query can exhaust cluster resources; per-pool/queue limits | ||
| (YARN, HS2 query limits) are the operator's lever, not an engine invariant | ||
| *(inferred — §14 Q11)*. | ||
|
|
||
| ## §8 — Key configuration levers (load-bearing) | ||
|
|
||
| *(all inferred — §14 Q12; confirm names + defaults)* | ||
|
|
||
| | Lever | Why it matters | | ||
| | --- | --- | | ||
| | Authentication mode (`hive.server2.authentication`) | `NONE` vs `KERBEROS`/`LDAP` decides whether the front door is open. | | ||
| | Authorization model (Ranger / SQL-std / storage-based / none) | decides whether statements are access-controlled at all. | | ||
| | `doAs` (`hive.server2.enable.doAs`) | decides whether HDFS access runs as the end user or the Hive principal. | | ||
| | TLS on HS2 / HMS transports | decides whether sessions + metadata are on the wire in clear. | | ||
| | UDF/SerDe allow-listing, `hive.security.authorization.sqlstd.confwhitelist` | decides which session config + functions an untrusted client may set/use. | | ||
|
|
||
| ## §11a — Known non-findings (seed for scanner/AI triage) | ||
|
|
||
| These are the dispositions the PMC most likely wants pre-recorded; confirm and | ||
| extend *(inferred — §14 Q13)*: | ||
|
|
||
| - **"A UDF / `TRANSFORM` script / custom SerDe can run arbitrary code."** | ||
| By design — registering or invoking code is an authorized operation, not a | ||
| sandbox escape. `BY-DESIGN`. | ||
| - **"HiveServer2 with `authentication=NONE` accepts anyone."** That is a | ||
| non-default / operator-chosen insecure configuration, not a Hive defect. | ||
| `OUT-OF-MODEL: non-default-build` (confirm the shipped default). | ||
| - **"The Metastore Thrift port has no authorization."** In-model only if the | ||
| PMC asserts HMS is meant to enforce caller authorization; if HMS is an | ||
| intra-cluster trusted service behind the perimeter, reports against direct | ||
| HMS access are `OUT-OF-MODEL: adversary-not-in-scope`. (§14 Q1/Q3.) | ||
| - **Dependency-tail CVEs** (Hadoop, Log4j, a transitive JAR) surfaced by an | ||
| SCA scanner against Hive's build — triage upstream unless Hive's own code | ||
| reaches the vulnerable path with untrusted input. | ||
|
|
||
| ## §13 — Triage dispositions | ||
|
|
||
| A finding is **VALID** only when all hold: the violated property is one Hive | ||
| claims (§5), the attacker is in scope (§3), and the affected code is on an | ||
| in-model surface (§2/§4) reached by untrusted input. Otherwise route to one | ||
| of: `OUT-OF-MODEL: adversary-not-in-scope` · `OUT-OF-MODEL: equivalent-harm` · | ||
| `OUT-OF-MODEL: unsupported-component` · `OUT-OF-MODEL: non-default-build` · | ||
| `BY-DESIGN: property-disclaimed`. | ||
|
|
||
| ## §14 — Open questions for the Hive PMC | ||
|
|
||
| Grouped in waves; answer inline (a few at a time is fine). Each promotes an | ||
| *(inferred)* tag to *(maintainer)* once confirmed. | ||
|
|
||
| **Wave 1 — scope & intended use** | ||
| 1. Is the in-scope surface "HiveServer2 (the SQL front door) + the artifacts it | ||
| compiles/executes", with the Metastore treated as an intra-cluster trusted | ||
| service? Or is direct Metastore access in scope? | ||
| 2. Are UDFs / SerDes / custom InputFormats / `TRANSFORM` scripts in scope as | ||
| *code-execution-by-design* (not a sandbox), per §7? | ||
|
Contributor
Yes, they are, but we may elaborate on details.

**Prerequisite**
We assume authentication and authorization are configured.

**Built-in UDF**
Basically, built-in UDFs should be safe. Apache Hive exceptionally includes some insecure UDFs (e.g., […]

**Custom UDF**
A Hive administrator must configure access policies (e.g., via Ranger) to allow only trusted users to add UDF jars or register UDFs. They are responsible for implementing safe UDFs, and Hive systems trust them. Hive can't guarantee safety when a trusted user adds compromised UDFs.

**SerDe/InputFormat/OutputFormat**
Only Hive administrators can put jars. They are responsible for deploying secure SerDe/InputFormat/OutputFormat, and Hive systems trust them. Hive can't guarantee safety when a Hive administrator adds a compromised SerDe/InputFormat/OutputFormat.

**TRANSFORM**
In a secure deployment, it must be prohibited. Major authorization plugins add […]
||
| 3. Confirm the assumed deployment: clustered, behind an operator-controlled | ||
| perimeter, with Hadoop + a metastore RDBMS + (Ranger or SQL-std auth) + KDC | ||
| as trusted dependencies. | ||
| 4. Is the in-scope adversary "a SQL client at the HS2 boundary" (+ a network | ||
| MITM where TLS is off)? Anything to add? | ||
| 5. Confirm operators with storage/metastore-DB/cluster-process access, and | ||
| trusted admins doing authorized actions, are out of model. | ||
|
|
||
| **Wave 2 — trust boundaries & auth** | ||
| 6. At the client→HS2 boundary, are SQL text, JDBC connection properties, and | ||
| session-config overrides all treated as untrusted (subject to the conf | ||
| whitelist)? | ||
| 7. Which `doAs` posture is the supported/recommended one, and how does it | ||
|
Contributor
When authentication and authorization, I expect […]
||
| change the authorization story? | ||
| 8. What properties does Hive claim to uphold given valid input (auth, authz | ||
| scoping, others)? | ||
|
|
||
| **Wave 3 — disclaimed properties & defaults** | ||
| 9. Confirm the operator-owned list in §6 (TLS, authz-model choice, network | ||
|
Contributor
As stated above, Hive Metastore should be protected at the application level, not at the network level.
||
| isolation, UDF vetting). Anything mis-assigned? | ||
| 10. Confirm the by-design non-guarantees in §7. | ||
| 11. Is super-linear resource use / a hang on a pathological query a bug, or is | ||
| bounding it the operator's job (YARN queues / HS2 limits)? | ||
| 12. Confirm the real names + shipped defaults of the §8 levers (especially | ||
| `hive.server2.authentication` and the default authorization model). | ||
|
Contributor
I'm checking them. We have parameters to configure TLS on the Hive side.
||
| 13. What do scanners/fuzzers/researchers most often report that you consider a | ||
| non-finding? (Feeds §11a.) | ||
|
|
||
| **Wave 4 — meta** | ||
| 14. Hive has no in-repo `SECURITY.md`/`THREAT_MODEL.md` today; this PR adds | ||
| them and wires `AGENTS.md → SECURITY.md → THREAT_MODEL.md`. Confirm this | ||
| in-repo model is canonical (vs the cwiki security pages), how it should | ||
| reference those pages, and who owns revisions. | ||
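
The §8 levers in the diff above lend themselves to a small configuration check. The following is a minimal Python sketch, assuming the levers are set in a Hadoop-style `hive-site.xml` (`<property>` elements with `<name>` and `<value>` children). The property names are taken from §8, which itself marks them as unconfirmed (§14 Q12), so the sketch reports unset properties instead of guessing their defaults. `read_levers`, `review_notes`, and `SAMPLE` are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Property names as listed in THREAT_MODEL.md §8; §14 Q12 asks the PMC to confirm them.
LEVERS = (
    "hive.server2.authentication",
    "hive.server2.enable.doAs",
    "hive.security.authorization.sqlstd.confwhitelist",
)

# Made-up configuration used only to exercise the functions below.
SAMPLE = """<configuration>
  <property><name>hive.server2.authentication</name><value>NONE</value></property>
  <property><name>hive.server2.enable.doAs</name><value>false</value></property>
</configuration>"""


def read_levers(hive_site_xml: str) -> dict:
    """Map each lever to its explicitly configured value, or None if absent."""
    root = ET.fromstring(hive_site_xml)
    explicit = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in LEVERS:
            explicit[name] = (prop.findtext("value") or "").strip()
    return {name: explicit.get(name) for name in LEVERS}


def review_notes(levers: dict) -> list:
    """Collect points an operator may want to look at; no defaults are assumed."""
    notes = []
    for name, value in levers.items():
        if value is None:
            notes.append(f"{name}: not set explicitly; shipped default unconfirmed (§14 Q12)")
    auth = levers["hive.server2.authentication"]
    if auth is not None and auth.upper() == "NONE":
        notes.append("hive.server2.authentication=NONE: the front door is open (§8)")
    return notes
```

Calling `review_notes(read_levers(SAMPLE))` yields two notes for the sample: one for the unset conf whitelist and one for `NONE` authentication. TLS and the authorization model are left out of the sketch because §8 does not name their properties.
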
Some external services, such as Apache Spark, can directly access Hive Metastore. The direct metastore access should be in scope.
https://gist.github.com/okumin/f33574e96efd014fa6c1f5d4e13531ef
Off-topic: Should we have a separate `THREAT_MODEL.md` for Hive Metastore if it is more convenient for AI? HiveServer2 and Hive Metastore have different security models and different parameters.