Update NMDC ingest script to include Study name in data_collections field #30

@eecavanna

Description

As shown in the code snippet below, the ingest script currently includes only the study's ID and the URL to the study's page on the NMDC data portal, both of which it derives from information in the Biosample. Retrieving additional details about the study, such as its name and description, will require fetching data from the study_set collection (via some Runtime API endpoint, such as GET /studies).

data/contrib/nmdc/ingest.py

Lines 152 to 169 in 87fab60

def get_part_of_collection(self) -> list[bertron.DataCollection]:
    """Returns a list of `DataCollection` instances, each describing one of the Biosample's associated studies.

    References:
    - https://ber-data.github.io/bertron-schema/DataCollection/
    - https://microbiomedata.github.io/nmdc-schema/associated_studies/

    TODO: Retrieve the name and description of the Study from the NMDC Runtime API, then include it here.
    """
    data_collections = []
    if self.associated_studies is not None and len(self.associated_studies) > 0:
        for study_id in self.associated_studies:
            data_collection = bertron.DataCollection(
                id=study_id,
                url=f"https://api.microbiomedata.org/studies/{study_id}",
            )
            data_collections.append(data_collection)
    return data_collections
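
A rough sketch of what resolving the TODO might look like, under several assumptions: `DataCollectionSketch` is a hypothetical stand-in for `bertron.DataCollection` (whether the real class accepts `name` and `description` would need checking against the BERtron schema), `fetch_study` is a hypothetical helper, and the Runtime API is assumed to return a JSON object with `name` and `description` keys for a single study. The fetcher is passed in as a parameter so the method can be exercised without network access.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.request import urlopen

# Same host as in the snippet above.
API_BASE_URL = "https://api.microbiomedata.org"


@dataclass
class DataCollectionSketch:
    """Hypothetical stand-in for `bertron.DataCollection`.

    The `name` and `description` fields are assumptions, not confirmed schema slots.
    """
    id: str
    url: str
    name: Optional[str] = None
    description: Optional[str] = None


def fetch_study(study_id: str) -> dict:
    """Fetches one Study as a dict (assumes the endpoint returns a JSON object)."""
    with urlopen(f"{API_BASE_URL}/studies/{study_id}") as response:
        return json.load(response)


def get_part_of_collection(
    associated_studies: Optional[list[str]],
    fetch: Callable[[str], dict] = fetch_study,
) -> list[DataCollectionSketch]:
    """Builds one data collection per associated study, including name and description."""
    data_collections = []
    for study_id in associated_studies or []:
        study = fetch(study_id)
        data_collections.append(
            DataCollectionSketch(
                id=study_id,
                url=f"{API_BASE_URL}/studies/{study_id}",
                # `.get` leaves the field as None when the study lacks that key.
                name=study.get("name"),
                description=study.get("description"),
            )
        )
    return data_collections
```

Fetching each study individually would mean one request per study ID, so in practice the lookups would probably be served from cached data rather than from live requests.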

I think this will be a straightforward change to make, but it may require renaming some variables and wrapping the cached data in a higher-level JSON object (e.g. one that has a biosamples property and a studies property).
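
One possible shape for that wrapped cache, as a minimal sketch. The function names, the file layout, and the assumption that the existing cache is a bare JSON list of biosamples are all hypothetical; only the `biosamples` and `studies` property names come from the description above.

```python
import json
from pathlib import Path


def save_cache(path: str, biosamples: list[dict], studies: list[dict]) -> None:
    """Writes both collections into one higher-level JSON object."""
    cache = {"biosamples": biosamples, "studies": studies}
    Path(path).write_text(json.dumps(cache, indent=2))


def load_cache(path: str) -> dict:
    """Reads the cache and always returns a dict with both properties."""
    raw = json.loads(Path(path).read_text())
    if isinstance(raw, list):
        # Assumed legacy layout: a bare list of biosamples with no studies.
        return {"biosamples": raw, "studies": []}
    return {
        "biosamples": raw.get("biosamples", []),
        "studies": raw.get("studies", []),
    }


def index_studies_by_id(studies: list[dict]) -> dict[str, dict]:
    """Builds an ID-to-study lookup so each Biosample can resolve its studies quickly."""
    return {study["id"]: study for study in studies}
```

With a lookup like this, the code that builds each data collection could read the study's name from the cached `studies` entries instead of calling the API once per Biosample.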
