Add --format flag to datasets status for JSON output#972
Conversation
|
This should replace #965 |
| dataset = dataset or dataset_opt | ||
| return self.dataset_status(dataset) | ||
| if format is None: | ||
| return self.dataset_status(dataset) |
There was a problem hiding this comment.
Refactor to pass the format to dataset_status to avoid duplicating the logic calling dataset_api_client.get_dataset_status.
Also, do the field filterering logic before deciding which API you should call. It is a waste to call dataset_api_client.get_dataset if you don't need the current_version_number for instance.
There was a problem hiding this comment.
Refactored. Moved the format handling into dataset_status() itself so the get_dataset_status call lives in one place, and now parse --format before contacting the API so we only call the endpoint(s) needed for the requested fields:
--format=json→ bothget_dataset_statusandget_dataset.--format='json(status)'→ onlyget_dataset_status.--format='json(current_version_number)'→ onlyget_dataset.
dataset_status_cli is now a thin wrapper that just resolves dataset_opt and forwards. Added two tests asserting we skip the unrelated API call when only one field is selected.
There was a problem hiding this comment.
Already addressed in the latest revision (committed before this review pass was re-issued). dataset_status() now accepts the format parameter and owns all the API call logic — dataset_status_cli is a thin wrapper that just resolves dataset_opt and forwards. Field parsing happens before any API call, so:
--format=jsoncalls bothget_dataset_statusandget_dataset.--format='json(status)'calls onlyget_dataset_status.--format='json(current_version_number)'calls onlyget_dataset.
Two new tests assert the unrelated API call is skipped when only one field is selected. All 14 tests pass.
Reviewers flagged that returning a tuple from `dataset_status()` breaks back-compat for the tens of thousands of notebooks that import the package. Per @rosbo's suggestion, keep the library function returning just the status string and expose the version number through a new gcloud-style `--format` flag (`json`, `json(current_version_number)`) on the CLI instead. Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
Per @rosbo's review: collapse the duplicated `get_dataset_status` call by having `dataset_status()` itself accept `format`, and parse the requested fields before hitting the network so `--format='json(status)'` no longer pays for `get_dataset` and vice versa. Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
f3a81b8 to
a9a003d
Compare
|
|
||
| Args: | ||
| dataset_opt: an alternative to dataset | ||
| format: see :meth:`dataset_status`. ``None`` (default) keeps the |
There was a problem hiding this comment.
Sorry, that's Sphinx/reStructuredText syntax for cross-referencing a method (:meth:\dataset_status`would render as a link to that method's docs in Sphinx). It's not used elsewhere in this file, so I've replaced it with plain prose:forwarded to dataset_status`.
There was a problem hiding this comment.
Already addressed in the latest revision. :meth: is Sphinx/reStructuredText syntax for cross-referencing a method, but since this file doesn't use Sphinx markup elsewhere it just looks like noise. I've replaced the docstring with plain prose: optional output format forwarded to dataset_status.
| request.owner_slug = owner_slug | ||
| request.dataset_slug = dataset_slug | ||
| response = kaggle.datasets.dataset_api_client.get_dataset_status(request) | ||
| return response.status.name.lower() |
There was a problem hiding this comment.
if format is json and no fields are selected, you should return all the fields in json format.
There was a problem hiding this comment.
Yes — that's already what happens: --format=json (no projection) returns both status and current_version_number in JSON. The if fields: branch only triggers when an explicit projection like json(status) is provided; otherwise we fall through to selected_fields = list(self._DATASET_STATUS_FIELDS) (all fields).
The line you pointed at (return response.status.name.lower()) only runs when format is None (i.e. the back-compat plain-text default). The nested branching made that hard to see, so I restructured the function — the format is None path is now a clear early return separate from the JSON path. Behavior is unchanged but the code reads top-to-bottom now.
Test test_dataset_status_cli_format_json already covers --format=json returning both fields.
There was a problem hiding this comment.
Already handled in the latest revision. --format=json with no projection falls through to selected_fields = list(self._DATASET_STATUS_FIELDS) and returns both status and current_version_number in JSON. The plain-text return on line 2558 only ran when format is None (back-compat default); the nested branching obscured that, so I restructured the function to split the two paths into a clear early return for format is None and a separate JSON path below. test_dataset_status_cli_format_json already exercises bare --format=json returning both fields.
Reviewer found the reST markup confusing in a file that otherwise uses plain-prose docstrings, so swap it for a normal sentence pointing at `dataset_status`. Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
|
Make sure to run the linter: |
Reviewer read the nested branching as if `--format=json` could fall through to the plain-text return. The behavior was already correct, but restructuring into a clear early return for `format is None` and a separate JSON path below makes that obvious without changing what the code does. Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
|
Ran |
Per `hatch run lint:fmt`. Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
|
Re: @rosbo's review: Ran |
| response = kaggle.datasets.dataset_api_client.get_dataset_status(request) | ||
| return response.status.name.lower() | ||
| # Default behavior: return only the status string for backward compat. | ||
| if format is None: |
There was a problem hiding this comment.
Do not duplicate the logic calling get_dataset_status, instead you should:
- set selected_fields to only status when format is None.
- at line 2575 add the logic to decide whether to json.dumps if format is json or return payload['status'] as a string when format is None.
There was a problem hiding this comment.
Done — collapsed into one path. format=None now sets selected_fields = ["status"] and the same loop calls get_dataset_status once. Only the return choice differs at the end:
if format is None:
return payload["status"]
return json.dumps({field: payload[field] for field in selected_fields})So get_dataset_status is only spelled out once, the JSON / field-selection logic is unchanged, and back-compat still skips get_dataset since current_version_number isn't in selected_fields when format=None.
There was a problem hiding this comment.
Done as suggested. format=None now sets selected_fields = ["status"], the same loop calls get_dataset_status exactly once, and the format/return decision happens after payload is populated:
if format is None:
return payload["status"]
return json.dumps({field: payload[field] for field in selected_fields})So get_dataset_status is no longer duplicated, JSON / field-selection logic is unchanged, and back-compat still skips get_dataset since current_version_number isn't in selected_fields when format=None.
Per @rosbo: instead of two parallel paths, treat `format=None` as a projection of `["status"]`, run the same fetch loop, and only branch at the return — plain string vs. `json.dumps(...)`. Same behavior, one copy of each API call. Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
Reviewers flagged that returning a tuple from
dataset_status()breaksback-compat for the tens of thousands of notebooks that import the
package. Per @rosbo's suggestion, keep the library function returning
just the status string and expose the version number through a new
gcloud-style
--formatflag (json,json(current_version_number))on the CLI instead.
Co-authored-by: kaggle-agent kaggle-agent@users.noreply.github.com
Task: bovard-20260421235151-526d73b2
Context: https://chat.kaggle.net/kaggle/pl/t8x7spahciyymqygpicoeoq69c