Skip to content

Add --format flag to datasets status for JSON output#972

Merged
bovard merged 6 commits intomainfrom
agent/bovard-20260421235151-526d73b2
Apr 24, 2026
Merged

Add --format flag to datasets status for JSON output#972
bovard merged 6 commits intomainfrom
agent/bovard-20260421235151-526d73b2

Conversation

@kaggle-agent
Copy link
Copy Markdown
Collaborator

Reviewers flagged that returning a tuple from dataset_status() breaks
back-compat for the tens of thousands of notebooks that import the
package. Per @rosbo's suggestion, keep the library function returning
just the status string and expose the version number through a new
gcloud-style --format flag (json, json(current_version_number))
on the CLI instead.

Co-authored-by: kaggle-agent kaggle-agent@users.noreply.github.com


Task: bovard-20260421235151-526d73b2
Context: https://chat.kaggle.net/kaggle/pl/t8x7spahciyymqygpicoeoq69c

@stevemessick
Copy link
Copy Markdown
Contributor

@bovard How is this related to #965?

@bovard
Copy link
Copy Markdown
Member

bovard commented Apr 24, 2026

This should replace #965

@kaggle-agent
Copy link
Copy Markdown
Collaborator Author

Re: @bovard's review:

Closed PR #965 with a comment pointing reviewers to #972 as the replacement. #972 addresses the back-compat concern from @stevemessick (keeps dataset_status() returning str) and implements the gcloud-style --format flag suggested by @rosbo.

@bovard bovard requested a review from rosbo April 24, 2026 17:06
Comment thread src/kaggle/api/kaggle_api_extended.py Outdated
dataset = dataset or dataset_opt
return self.dataset_status(dataset)
if format is None:
return self.dataset_status(dataset)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor to pass the format to dataset_status to avoid duplicating the logic calling dataset_api_client.get_dataset_status.

Also, do the field filterering logic before deciding which API you should call. It is a waste to call dataset_api_client.get_dataset if you don't need the current_version_number for instance.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored. Moved the format handling into dataset_status() itself so the get_dataset_status call lives in one place, and now parse --format before contacting the API so we only call the endpoint(s) needed for the requested fields:

  • --format=json → both get_dataset_status and get_dataset.
  • --format='json(status)' → only get_dataset_status.
  • --format='json(current_version_number)' → only get_dataset.

dataset_status_cli is now a thin wrapper that just resolves dataset_opt and forwards. Added two tests asserting we skip the unrelated API call when only one field is selected.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in the latest revision (committed before this review pass was re-issued). dataset_status() now accepts the format parameter and owns all the API call logic — dataset_status_cli is a thin wrapper that just resolves dataset_opt and forwards. Field parsing happens before any API call, so:

  • --format=json calls both get_dataset_status and get_dataset.
  • --format='json(status)' calls only get_dataset_status.
  • --format='json(current_version_number)' calls only get_dataset.

Two new tests assert the unrelated API call is skipped when only one field is selected. All 14 tests pass.

bovard and others added 2 commits April 24, 2026 20:02
Reviewers flagged that returning a tuple from `dataset_status()` breaks
back-compat for the tens of thousands of notebooks that import the
package. Per @rosbo's suggestion, keep the library function returning
just the status string and expose the version number through a new
gcloud-style `--format` flag (`json`, `json(current_version_number)`)
on the CLI instead.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
Per @rosbo's review: collapse the duplicated `get_dataset_status` call by
having `dataset_status()` itself accept `format`, and parse the
requested fields before hitting the network so `--format='json(status)'`
no longer pays for `get_dataset` and vice versa.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
@kaggle-agent kaggle-agent force-pushed the agent/bovard-20260421235151-526d73b2 branch from f3a81b8 to a9a003d Compare April 24, 2026 20:02
Comment thread src/kaggle/api/kaggle_api_extended.py Outdated

Args:
dataset_opt: an alternative to dataset
format: see :meth:`dataset_status`. ``None`` (default) keeps the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does :meth: mean here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that's Sphinx/reStructuredText syntax for cross-referencing a method (:meth:\dataset_status`would render as a link to that method's docs in Sphinx). It's not used elsewhere in this file, so I've replaced it with plain prose:forwarded to dataset_status`.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in the latest revision. :meth: is Sphinx/reStructuredText syntax for cross-referencing a method, but since this file doesn't use Sphinx markup elsewhere it just looks like noise. I've replaced the docstring with plain prose: optional output format forwarded to dataset_status.

Comment thread src/kaggle/api/kaggle_api_extended.py Outdated
request.owner_slug = owner_slug
request.dataset_slug = dataset_slug
response = kaggle.datasets.dataset_api_client.get_dataset_status(request)
return response.status.name.lower()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if format is json and no fields are selected, you should return all the fields in json format.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes — that's already what happens: --format=json (no projection) returns both status and current_version_number in JSON. The if fields: branch only triggers when an explicit projection like json(status) is provided; otherwise we fall through to selected_fields = list(self._DATASET_STATUS_FIELDS) (all fields).

The line you pointed at (return response.status.name.lower()) only runs when format is None (i.e. the back-compat plain-text default). The nested branching made that hard to see, so I restructured the function — the format is None path is now a clear early return separate from the JSON path. Behavior is unchanged but the code reads top-to-bottom now.

Test test_dataset_status_cli_format_json already covers --format=json returning both fields.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already handled in the latest revision. --format=json with no projection falls through to selected_fields = list(self._DATASET_STATUS_FIELDS) and returns both status and current_version_number in JSON. The plain-text return on line 2558 only ran when format is None (back-compat default); the nested branching obscured that, so I restructured the function to split the two paths into a clear early return for format is None and a separate JSON path below. test_dataset_status_cli_format_json already exercises bare --format=json returning both fields.

Reviewer found the reST markup confusing in a file that otherwise uses
plain-prose docstrings, so swap it for a normal sentence pointing at
`dataset_status`.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
@rosbo
Copy link
Copy Markdown
Contributor

rosbo commented Apr 24, 2026

Make sure to run the linter: hatch run lint:fmt.

Reviewer read the nested branching as if `--format=json` could fall
through to the plain-text return. The behavior was already correct, but
restructuring into a clear early return for `format is None` and a
separate JSON path below makes that obvious without changing what the
code does.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
@kaggle-agent
Copy link
Copy Markdown
Collaborator Author

Ran black (24.10.0) over src/kaggle/api/kaggle_api_extended.py, src/kaggle/cli.py, and tests/test_dataset_status.py. Only the test file needed reformatting (collapsed a couple of wrapped format= kwarg lines onto one line at the project's 120-col limit). black --check is now clean for the files I touched.

Per `hatch run lint:fmt`.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
@kaggle-agent
Copy link
Copy Markdown
Collaborator Author

Re: @rosbo's review:

Ran black 24.10.0 (the version pinned in pyproject.toml's lint env) over src/kaggle/api/kaggle_api_extended.py, src/kaggle/cli.py, and tests/test_dataset_status.py. Only the test file needed reformatting — black collapsed a couple of wrapped format= kwarg lines onto one line at the project's 120-col limit. black --check is now clean for the files I touched.

Comment thread src/kaggle/api/kaggle_api_extended.py Outdated
response = kaggle.datasets.dataset_api_client.get_dataset_status(request)
return response.status.name.lower()
# Default behavior: return only the status string for backward compat.
if format is None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not duplicate the logic calling get_dataset_status, instead you should:

  • set selected_fields to only status when format is None.
  • at line 2575 add the logic to decide whether to json.dumps if format is json or return payload['status'] as a string when format is None.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — collapsed into one path. format=None now sets selected_fields = ["status"] and the same loop calls get_dataset_status once. Only the return choice differs at the end:

if format is None:
    return payload["status"]
return json.dumps({field: payload[field] for field in selected_fields})

So get_dataset_status is only spelled out once, the JSON / field-selection logic is unchanged, and back-compat still skips get_dataset since current_version_number isn't in selected_fields when format=None.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done as suggested. format=None now sets selected_fields = ["status"], the same loop calls get_dataset_status exactly once, and the format/return decision happens after payload is populated:

if format is None:
    return payload["status"]
return json.dumps({field: payload[field] for field in selected_fields})

So get_dataset_status is no longer duplicated, JSON / field-selection logic is unchanged, and back-compat still skips get_dataset since current_version_number isn't in selected_fields when format=None.

Per @rosbo: instead of two parallel paths, treat `format=None` as a
projection of `["status"]`, run the same fetch loop, and only branch at
the return — plain string vs. `json.dumps(...)`. Same behavior, one copy
of each API call.

Co-authored-by: kaggle-agent <kaggle-agent@users.noreply.github.com>
@bovard bovard merged commit a8e3758 into main Apr 24, 2026
10 checks passed
@bovard bovard deleted the agent/bovard-20260421235151-526d73b2 branch April 24, 2026 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants