Description
We are using FETCH as part of the transient search pipeline for the SPOTLIGHT project (a commensal survey for FRBs/pulsars at the GMRT). We are currently facing an issue where FETCH often hangs in the middle of classification. This happens even though:
- The number of candidates is not very large.
- The model is run on an NVIDIA A100, with 80 GB of GPU memory.
Unfortunately, we have not been able to reproduce the bug reliably. So far it seems to happen at random and does not seem to be triggered by a particular candidate. We verified the latter by rerunning FETCH on the same candidate, and it ran successfully. Any idea what could be causing the issue?
I am using tensorflow v2.15.0.post1 and keras v2.15.0 with Python 3.10.14, since higher versions of tensorflow and keras just do not work.
I am aware that the bug will be difficult to solve without a reproducible case (as far as we can see), but I thought I would still open an issue so that we can at least discuss the possible causes.
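One way to get more information the next time it hangs would be to wrap the classification call in a watchdog that dumps the Python tracebacks of all threads once a timeout is exceeded. Below is a minimal sketch using the standard-library `faulthandler` module. Note that `run_with_watchdog`, `classify_batch`, the log file name, and the 600 s timeout are hypothetical placeholders for illustration and are not part of FETCH; the sketch also assumes the hang is visible from the Python side (a stall inside native GPU code would only show the Python frame that called into it).

```python
import faulthandler


def run_with_watchdog(func, timeout_s, log_path, *args, **kwargs):
    """Run func(*args, **kwargs) under a traceback watchdog.

    If func has not returned after timeout_s seconds, the tracebacks of
    all threads are written to log_path. The process is NOT killed
    (exit=False), so it can still be inspected afterwards.
    """
    with open(log_path, "w") as log_file:
        # Arm the watchdog; the file must stay open until it is cancelled.
        faulthandler.dump_traceback_later(timeout_s, file=log_file, exit=False)
        try:
            return func(*args, **kwargs)
        finally:
            # Disarm the watchdog whether func returned or raised.
            faulthandler.cancel_dump_traceback_later()


def classify_batch(candidates):
    # Hypothetical stand-in for the real classification call
    # (assumption: not a FETCH function).
    return [0 for _ in candidates]


if __name__ == "__main__":
    labels = run_with_watchdog(
        classify_batch, 600, "fetch_hang_traceback.log",
        ["cand_0.h5", "cand_1.h5"],
    )
    print(labels)
```

If the log file contains a traceback after a hang, the innermost frames should show whether the process is stuck in data loading, in the model's predict call, or somewhere else.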