
Use UVParameter forms in UVData.__add__ and UVData.fast_concat to prevent missing parameters. #1606

Draft

bhazelton wants to merge 17 commits into main from add_concat_use_axis

Conversation

@bhazelton
Member

Description

This is following some of the ideas in UVBase._select_along_axis. This could probably be generalized to UVBase, but I wanted to get UVData's version done to work out the ideas before moving it up to UVBase. And I wanted to at least get a draft up for @rlbyrne to try out.

Motivation and Context

fixes #1595

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation change (documentation changes only)
  • Version change
  • Build or continuous integration change
  • Other

Checklist:

Bug fix checklist:

  • My fix includes a new test that breaks as a result of the bug (if possible).
  • I have updated the CHANGELOG.

@codecov

codecov bot commented Aug 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.93%. Comparing base (65efcce) to head (347cfdd).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1606      +/-   ##
==========================================
- Coverage   99.93%   99.93%   -0.01%     
==========================================
  Files          67       67              
  Lines       22688    22670      -18     
==========================================
- Hits        22674    22656      -18     
  Misses         14       14              


@bhazelton
Member Author

bhazelton commented Aug 18, 2025

@kartographer this is a slightly different approach than you took in the Telescope.__add__ method. Before I start thinking about moving it up to UVBase I'd like to get your thoughts about whether there is something that could be done better.

@bhazelton force-pushed the add_concat_use_axis branch 3 times, most recently from f6b1427 to 33b50d9 on August 19, 2025
@bhazelton requested a review from kartographer on August 25, 2025
@bhazelton force-pushed the add_concat_use_axis branch 5 times, most recently from 945b3a6 to 34449d2 on September 24, 2025
@bhazelton force-pushed the add_concat_use_axis branch 2 times, most recently from a1f87dc to 26ed06f on December 11, 2025
@steven-murray
Contributor


Thanks @bhazelton and sorry for the very slow review! I have a few comments. Mostly I think we can make some of these functions a bit clearer about what they are doing. They seem like they will become quite central to how UVxxx objects operate, so we should make them as clear as possible for our own future reference.

It might also be useful to do some quick profiling of these functions, as I expect they might be bottlenecks in some applications (especially in terms of memory).


def _get_param_axis(self, axis_name: str, single_named_axis: bool = False):
    """
    Get a mapping of parameters that have a given axis to the axis number.
Contributor

Can we use "attributes" instead of "parameters" here? I was confused what this function was doing until I realized you were talking about attributes.

Probably this will be even more easily cleared up by including a single simple example.

Member Author

There is a problem here in that the attributes are UVParameter objects. I'll work on the wording and an example.

Member Author

Ok, I updated the name and the docstring. Let me know if it's more clear.
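
As an illustration of the idea (this is not the actual implementation; the helper name axis_map and the exact return structure are assumptions), a mapping from the UVParameter attributes that carry a given named axis to that axis's position could look roughly like this:

from pyuvdata import UVData


def axis_map(obj, axis_name):
    """Map attribute names to the position of ``axis_name`` in each parameter's form.

    Hypothetical sketch: array-like UVParameter objects carry a ``form`` attribute
    that is a tuple of named axes such as ("Nblts", "Nfreqs", "Npols").
    """
    mapping = {}
    for attr in obj:  # UVBase iteration yields UVParameter attribute names like "_data_array"
        param = getattr(obj, attr)
        form = getattr(param, "form", None)
        if isinstance(form, tuple) and axis_name in form:
            mapping[attr.lstrip("_")] = form.index(axis_name)
    return mapping


uvd = UVData()
# Something like {"data_array": 0, "flag_array": 0, "time_array": 0, ...}
print(axis_map(uvd, "Nblts"))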

new_array = np.concatenate(
    [
        getattr(self, param),
        getattr(other, "_" + param).get_from_form(other_form_dict),
Contributor

Is there a reason we couldn't just use np.take(getattr(other, param), other_inds, axis) here?

Member Author

Maybe. I didn't write get_from_form; it looks like it tries to do slicing if possible, and if it can't it falls back to np.take, so I think it's already somewhat optimized? But it's probably worth doing some performance testing.
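
To make the slice-versus-take trade-off concrete, here is a rough sketch of that kind of optimization (this is not the get_from_form source, just an illustration under the assumption that it prefers a slice when the requested indices are evenly spaced):

import numpy as np


def take_or_slice(arr, inds, axis):
    """Index ``arr`` along ``axis``, preferring a plain slice when possible.

    If the indices are evenly spaced and increasing, basic slicing is used
    (a view, no fancy-indexing copy); otherwise this falls back to np.take.
    """
    inds = np.asarray(inds)
    if inds.size > 1:
        steps = np.diff(inds)
        if steps[0] > 0 and np.all(steps == steps[0]):
            slicer = [slice(None)] * arr.ndim
            slicer[axis] = slice(inds[0], inds[-1] + 1, steps[0])
            return arr[tuple(slicer)]
    return np.take(arr, inds, axis=axis)


rng = np.random.default_rng(0)
arr = rng.standard_normal((100, 64, 4))
evenly_spaced = take_or_slice(arr, np.arange(0, 100, 2), axis=0)  # slice path (view)
arbitrary = take_or_slice(arr, np.array([3, 1, 7]), axis=0)       # np.take path (copy)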

Comment on lines +923 to +924
if param not in multi_axis_params:
    continue
Contributor

Why are we not also padding single-axis objects? This is confusing.

Member Author

It is confusing! Among other things, this is handling the case where we have data divided into chunks along more than one axis and we're trying to combine it. If you combine first along one axis and then start adding in the first chunk along the next axis, you have to pad out the multi-dimensional arrays with zeros (and flags) for the corners where you don't have actual information yet. That doesn't come up for single-axis parameters.

Imagine a 2D array split into quadrants. The first two (along either axis) add fine. When you add in the third, you need to pad the fourth quadrant with zeros (and flags). Then when we add in the fourth, if all the data there are zero and flagged we allow it to overwrite that quadrant. Does that make sense? Happy to add some more comments or docstrings if you think it'd be helpful.
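
To make that concrete, a toy numpy sketch of the quadrant picture (simplified shapes, not the actual UVData arrays or helper code):

import numpy as np

# Pretend a 2D parameter (e.g. blts x freqs) arrives split into four quadrants.
q1 = np.full((2, 3), 1.0)  # rows 0-1, cols 0-2
q2 = np.full((2, 3), 2.0)  # rows 0-1, cols 3-5
q3 = np.full((2, 3), 3.0)  # rows 2-3, cols 0-2
q4 = np.full((2, 3), 4.0)  # rows 2-3, cols 3-5

# The first two chunks combine cleanly along the column axis.
top = np.concatenate([q1, q2], axis=1)       # shape (2, 6)
top_flags = np.zeros_like(top, dtype=bool)   # all unflagged

# Adding q3 along the row axis: there is no data yet for the bottom-right
# corner, so pad it with zeros and mark those samples as flagged.
bottom = np.concatenate([q3, np.zeros((2, 3))], axis=1)
bottom_flags = np.concatenate(
    [np.zeros((2, 3), dtype=bool), np.ones((2, 3), dtype=bool)], axis=1
)
data = np.concatenate([top, bottom], axis=0)   # shape (4, 6)
flags = np.concatenate([top_flags, bottom_flags], axis=0)

# When q4 finally arrives, the padded corner is all zero and fully flagged,
# so it is safe to let the real values overwrite it.
assert np.all(data[2:, 3:] == 0) and np.all(flags[2:, 3:])
data[2:, 3:] = q4
flags[2:, 3:] = False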

Comment on lines +951 to +953
order_dict : dict
    dict giving the final sort indices for each axis (keys are axes, values
    are index arrays for sorting).
Contributor

Since this sorting occurs in multiple functions, might it be better to just have a standalone "arbitrary axis sorting helper"?

Member Author

Hmm, possibly. The reason I did it this way is that each UVParameter is only affected by one of these functions, so the sorting is currently only done once per parameter. It's just done in a different function depending on whether it's a single-axis or multi-axis array. For a single-axis array, the sorting can be done at the same time as the indexing with the np.take call, which is why I structured it this way. But I'm open to refactoring if it's not a performance hit.
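
A small illustration of that single-axis path (toy arrays, not the actual code): one argsort per axis goes into the order_dict-style mapping, and a single np.take then both selects and sorts every parameter that shares that axis.

import numpy as np

# Toy single-axis parameter: times from two chunks, concatenated out of order.
times = np.concatenate([np.array([2.0, 4.0, 6.0]), np.array([1.0, 3.0, 5.0])])

# One argsort per axis; the resulting indices play the role of an order_dict entry.
order = np.argsort(times)

# np.take indexes and sorts in one pass, so no separate sorting step is needed.
sorted_times = np.take(times, order, axis=0)

# Any other parameter sharing the same axis reuses the same index array.
data = np.arange(12.0).reshape(6, 2)
sorted_data = np.take(data, order, axis=0)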

@bhazelton force-pushed the add_concat_use_axis branch from 26ed06f to 347cfdd on January 27, 2026
@bhazelton
Member Author

bhazelton commented Feb 20, 2026

@steven-murray I finally got memray up and running. I started with a 9 GB MWA uvfits file, which I split along the frequency axis into two objects that I saved out to uvh5 files (each file was ~3 GB -- not sure why that's so much less than half the uvfits size). Then I used memray to run two scripts on this branch and on main. The two scripts just read the two files into a UVData object using from_file, passing a list containing the two file names; the only difference was that one script specified axis="freq" so it used the fast_concat method rather than the __add__ method.

I'm attaching the memray flame graph html files for all 4 runs as well as the pngs of the memory graph for all 4 runs (the png is just made from the html; use the html to get tooltip readings). My takeaway is that this refactor gives a small improvement in memory usage for the __add__ method. There is very little difference for fast_concat, though this refactor might be using very slightly more memory. I'm interested in your read of these:

memray-flamegraph-output_add_main.html
memray-flamegraph-output_add_refactor.html
memray-flamegraph-output_concat_main.html
memray-flamegraph-output_concat_refactor.html

main __add__: memray_add_main

this branch __add__: memray_add_refactor

main fast_concat: memray_concat_main

this branch fast_concat: memray_concat_refactor

I also did a similar test but with adding/concatenating over the blt axis (I split the input data in half by time) and saw very similar results.
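
For reference, the two scripts were essentially of this form (the file names are placeholders; the only difference between them is the axis keyword, which routes the read through fast_concat instead of __add__):

from pyuvdata import UVData

files = ["mwa_low_freqs.uvh5", "mwa_high_freqs.uvh5"]  # placeholder file names

# Combined via UVData.__add__ (the default when reading a list of files).
uvd_add = UVData.from_file(files)

# Combined via UVData.fast_concat by naming the concatenation axis.
uvd_concat = UVData.from_file(files, axis="freq")

Each script was then run under memray (e.g. memray run read_script.py followed by memray flamegraph on the resulting output file) to produce the flame graphs attached above.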

@steven-murray
Contributor

So it's good that this does not hurt memory performance, and maybe improves it a little.

It is weird that reading and concatenating two 3GB files results in 15 GB of peak memory usage at best (with fast_concat). I think we should have a good look at that as a team, but not in this PR.

@steven-murray
Contributor

It looks like we're allocating enough to read the two files first (~6GB), then soon after allocating another amount equivalent to that (so up to ~12GB), and then a little later another ~3-4GB (not sure where that comes from).

Since peak memory can be a huge limitation, we should think about how we might get away with the peak memory being close to the final size of the arrays.

@bhazelton
Member Author

@steven-murray I totally agree, now that I have memray working I am digging into memory usage more and we should definitely work to reduce it. But for this PR I am content that it doesn't make it worse.

@steven-murray
Contributor

Cool. I think you answered my other points -- though this is still in draft so I can't approve it yet :-)



Development

Successfully merging this pull request may close these issues.

Parameter scan_number_array is not updated during concatenation
