Add Rucio/Chaining examples for opening full datasets by DID #22

Draft
Copilot wants to merge 2 commits into main from copilot/create-python-scripts-rucio-datasets

Copilot AI commented Mar 10, 2026

No existing example showed how to open a full Rucio dataset (all files) by DID; only single-file access was demonstrated. This PR adds three self-contained scripts under Rucio/Chaining/ that use the Rucio Python client API directly.

Scripts

  • uproot_example.py — lists dataset files, resolves PFNs, reads EventHeader.eventNumber across all files with uproot, plots histogram via matplotlib
  • tchain_example.py — same PFN resolution, chains all files with ROOT.TChain
  • rdataframe_example.py — same PFN resolution, processes all files with ROOT.RDataFrame
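
All three scripts repeat the same file-listing and PFN-resolution steps. A minimal sketch of factoring that step into one helper is shown here. The function name `resolve_dataset_pfns` and the `StubClient` class are hypothetical (they are not part of this PR or of Rucio); the stub only mimics the shape of the `list_files` and `list_replicas` results that the scripts rely on, so the sketch runs without a Rucio server. The dataset name and URLs are made up.

```python
def resolve_dataset_pfns(client, dataset_did, rse_expression="isopenaccess=true"):
    """Return one PFN URL per file of a dataset DID given as "scope:name"."""
    scope, name = dataset_did.split(":", 1)
    dids = [{"scope": f["scope"], "name": f["name"]}
            for f in client.list_files(scope, name)]
    if not dids:
        return []
    return [
        next(iter(replica["pfns"]))  # dict keys are the PFN URLs
        for replica in client.list_replicas(dids, rse_expression=rse_expression)
        if replica["pfns"]  # skip files without a matching replica
    ]


class StubClient:
    """Hypothetical stand-in for rucio.client.Client, for illustration only."""

    def list_files(self, scope, name):
        yield {"scope": scope, "name": "file_0001.root"}
        yield {"scope": scope, "name": "file_0002.root"}

    def list_replicas(self, dids, rse_expression=None):
        for did in dids:
            pfns = {}
            # Pretend the second file has no replica matching the RSE expression
            if did["name"] != "file_0002.root":
                pfns = {f"root://example.org//{did['name']}": {"rse": "EXAMPLE_RSE"}}
            yield {"scope": did["scope"], "name": did["name"], "pfns": pfns}


paths = resolve_dataset_pfns(StubClient(), "epic:/RECO/example_dataset")
print(paths)
```

With a real client, the same call would presumably read `resolve_dataset_pfns(Client(), dataset_did)`, and each script could then pass the returned list to uproot, `ROOT.TChain`, or `ROOT.RDataFrame`.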

Key pattern: getting all PFNs

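Passing nrandom=1 to list_replicas selects a single file from the entire dataset, not one replica per file, so it cannot be used to get one PFN for every file. The correct approach iterates over all returned replicas and takes the first key of each replica['pfns'] dict, whose keys are the PFN URLs: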

file_paths = [
    next(iter(replica['pfns']))  # dict keys are the PFN URLs
    for replica in client.list_replicas(dids, rse_expression='isopenaccess=true')
    if replica['pfns']
]

This yields one PFN per file across the full dataset, skipping any files with no open-access replicas.
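
To make the shape of the data concrete, the sketch below applies the same selection to mock replica entries and also records which files were skipped. The mock entries are made up and carry only the `name` and `pfns` fields; real entries from `list_replicas` contain more. The scheme preference (`root://` first) is an illustrative extension that is not part of this PR, and `pick_pfns` is a hypothetical helper name.

```python
def pick_pfns(replicas, preferred_scheme="root"):
    """Pick one PFN per file, preferring URLs with the given scheme.

    Returns (chosen_urls, names_of_files_without_any_pfn).
    """
    chosen, skipped = [], []
    prefix = preferred_scheme + "://"
    for replica in replicas:
        urls = list(replica["pfns"])  # dict keys are the PFN URLs
        if not urls:
            skipped.append(replica["name"])
            continue
        preferred = [u for u in urls if u.startswith(prefix)]
        # Fall back to the first listed URL when no preferred scheme is offered
        chosen.append(preferred[0] if preferred else urls[0])
    return chosen, skipped


# Mock data shaped like the entries iterated in the snippet above (assumed shape)
replicas = [
    {"name": "a.root", "pfns": {"https://example.org/a.root": {},
                                "root://example.org//a.root": {}}},
    {"name": "b.root", "pfns": {}},
    {"name": "c.root", "pfns": {"https://example.org/c.root": {}}},
]

chosen, skipped = pick_pfns(replicas)
print(chosen)
print(skipped)
```

Reporting the skipped names makes it visible when the resulting file list covers only part of the dataset, which the plain list comprehension hides.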

Original prompt

Create Python scripts demonstrating how to open and process full Rucio datasets by their DID (Data Identifier) using both uproot and ROOT.

Directory Structure

Create a new directory Rucio/Chaining/ containing three Python scripts.

Files to Create

1. Rucio/Chaining/uproot_example.py

from rucio.client import Client
import uproot
import matplotlib.pyplot as plt
import numpy as np
import awkward as ak

# Initialize Rucio client
client = Client()

# Define the dataset DID
dataset_did = "epic:/RECO/26.02.0/epic_craterlake/SINGLE/e+/500MeV/3to50deg"
scope, name = dataset_did.split(':', 1)

# Get the list of files in the dataset
files = list(client.list_files(scope, name))
dids = [{'scope': f['scope'], 'name': f['name']} for f in files]

# Get one replica PFN for each file in the dataset
file_paths = [
    next(iter(replica['pfns']))  # Get first PFN URL (dict keys are the URLs)
    for replica in client.list_replicas(dids, rse_expression='isopenaccess=true')
    if replica['pfns']
]

# Collect EventHeader.eventNumber from all files
event_number_chunks = []
for file_path in file_paths:
    with uproot.open(file_path) as f:
        tree = f["events"]  # Replace "events" with the actual tree name
        values = tree["EventHeader.eventNumber"].array()
        # Flatten in case the branch holds a variable-length list per event
        event_number_chunks.append(ak.to_numpy(ak.flatten(values, axis=None)))

# Combine the per-file arrays into a single flat NumPy array
event_numbers = np.concatenate(event_number_chunks) if event_number_chunks else np.array([])

# Create histogram
plt.hist(event_numbers, bins=50)
plt.xlabel("Event Number")
plt.ylabel("Count")
plt.title("EventHeader.eventNumber Distribution")
plt.savefig("event_number_distribution.png")
print("Saved histogram to event_number_distribution.png")

2. Rucio/Chaining/tchain_example.py

from rucio.client import Client
import ROOT

# Initialize Rucio client
client = Client()

# Define the dataset DID
dataset_did = "epic:/RECO/26.02.0/epic_craterlake/SINGLE/e+/500MeV/3to50deg"
scope, name = dataset_did.split(':', 1)

# Get the list of files in the dataset
files = list(client.list_files(scope, name))
dids = [{'scope': f['scope'], 'name': f['name']} for f in files]

# Get one replica PFN for each file in the dataset
file_paths = [
    next(iter(replica['pfns']))  # Get first PFN URL (dict keys are the URLs)
    for replica in client.list_replicas(dids, rse_expression='isopenaccess=true')
    if replica['pfns']
]

# Create a TChain to process all files as a single dataset
# Replace "events" with the actual tree name in your files
chain = ROOT.TChain("events")

for file_path in file_paths:
    chain.Add(file_path)

# Create histogram of EventHeader.eventNumber
canvas = ROOT.TCanvas("c1", "Event Number Distribution", 800, 600)
chain.Draw("EventHeader.eventNumber>>h_eventNumber(100)")
canvas.SaveAs("event_number_distribution.png")

# Print total entries
print(f"Total entries in dataset: {chain.GetEntries()}")
print("Saved histogram to event_number_distribution.png")

3. Rucio/Chaining/rdataframe_example.py

from rucio.client import Client
import ROOT

# Initialize Rucio client
client = Client()

# Define the dataset DID
dataset_did = "epic:/RECO/26.02.0/epic_craterlake/SINGLE/e+/500MeV/3to50deg"
scope, name = dataset_did.split(':', 1)

# Get the list of files in the dataset
files = list(client.list_files(scope, name))
dids = [{'scope': f['scope'], 'name': f['name']} for f in files]

# Get one replica PFN for each file in the dataset
file_paths = [
    next(iter(replica['pfns']))  # Get first PFN URL (dict keys are the URLs)
    for replica in client.list_replicas(dids, rse_expression='isopenaccess=true')
    if replica['pfns']
]

# Create an RDataFrame with all files in the dataset
# Replace "events" with the actual tree name
rdf = ROOT.RDataFrame("events", file_paths)

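# Create histogram of EventHeader.eventNumber
# The 0-100 axis range is a placeholder; adjust it to the event numbers in your files
h_eventNumber = rdf.Histo1D(
    ("h_eventNumber", "Event Number Distribution;Event Number;Count", 100, 0, 100),
    "EventHeader.eventNumber"
)

# Book the entry count before the event loop is triggered, so the histogram
# and the count are filled in a single pass over the files
n_entries = rdf.Count()

# Draw and save the histogram (this triggers the event loop)
canvas = ROOT.TCanvas("c1", "Event Number Distribution", 800, 600)
h_eventNumber.Draw()
canvas.SaveAs("event_number_distribution.png")

# Print total entries
print(f"Total entries in dataset: {n_entries.GetValue()}")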
print("Saved histogram to event_number_distribution.png")

Requirements

  • All scripts should be executable Python files
  • Scripts demonstrate opening a full Rucio dataset by DID
  • Each uses a different approach: uproot with matplotlib, ROOT TChain, and ROOT RDataFrame
  • All scripts create a histogram of EventHeader.eventNumber and save it as PNG
  • Use the Rucio Python client API (not subprocess calls)
  • Include informative print statements

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: This page includes examples of how to open a single file, but no example of how a full dataset can be loaded. Develop a python example for both uproot and ROOT that opens a full dataset by its DID.

Assistant: I'll help you create Python examples for opening a full dataset by its DID using both uproot and ROOT. These examples will use Rucio to get the list of files in the...

This pull request was created from Copilot chat.



Co-authored-by: wdconinc <4656391+wdconinc@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Create Python scripts for processing Rucio datasets using uproot and ROOT" to "Add Rucio/Chaining examples for opening full datasets by DID" on Mar 10, 2026
@wdconinc
Contributor

@sjdkay FYI re: file access tutorial. May be useful to integrate some rucio file access patterns to avoid people downloading lots of files locally.


2 participants