Thanks for bringing this over here. I completely missed this issue when I continued our thread over at mhahsler/TSP#22 (comment): You also wrote (mhahsler/TSP#22 (comment)):
That's very useful information. How many CPU cores does your WSL2 system have/see, e.g. what does the following report?

> parallelly::availableCores(which = "all")

One hypothesis now is that it's over-parallelizing, but still, that should not cause it to freeze. Looking at your
-
When I use Then it hangs as well. It seems to work fine up to 8 workers.
-
Informative, but really odd. Did you run this in RStudio or from the terminal?

FYI,

nworkers <- parallelly::availableCores()
cl <- parallelly::makeClusterPSOCK(nworkers)
plan(cluster, workers = cl)

which is nearly the same as:

nworkers <- parallelly::availableCores()
cl <- parallel::makeCluster(nworkers)
plan(cluster, workers = cl)

The corresponding doParallel setup would be:

nworkers <- parallelly::availableCores()
cl <- parallel::makeCluster(nworkers)
doParallel::registerDoParallel(cl)

You're saying "doparallel works on WSL2", but it's not clear to me exactly how you set it up and with how many workers. doParallel actually provides two completely different types of parallel workers. Knowing the details will help narrow in on the cause. I suspect your problem can be reproduced with vanilla parallel code, but it's hard to tell right now.

PS. Please try to update your R packages; it helps rule things out.
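For illustration, here is a minimal base-R sketch of those two worker types. It assumes that doParallel's `registerDoParallel(cores = n)` corresponds to forked workers (`parallel::mclapply()` on Unix-like systems such as WSL2) and that `registerDoParallel(cl)` corresponds to PSOCK workers. The worker count and workload are illustrative only. Also, `mclapply()` cannot fork on Windows itself.

```r
library(parallel)

n_workers <- 2L  # illustrative; the thread goes up to 14

slow_fcn <- function(x) {
  Sys.sleep(0.1)  # emulate work
  x^2
}
xs <- 1:8

# Forked workers (Unix-like systems only): children are forked from this R session
ys_fork <- mclapply(xs, slow_fcn, mc.cores = n_workers)

# PSOCK workers: separate R sessions that talk to this session over sockets
cl <- makeCluster(n_workers)
ys_psock <- parLapply(cl, xs, slow_fcn)
stopCluster(cl)

# Both approaches should produce the same results
stopifnot(identical(ys_fork, ys_psock))
```

Suppose the forked variant finishes while the PSOCK variant freezes on the same machine. That would point towards the socket-based PSOCK workers rather than the packages layered on top of them.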
-
I tried it with both. I wonder whether it is a configuration issue specific to my machine.
-
I see. It's really hard for me to help out any further here if I don't fully understand your setup (e.g. if you use RStudio or vanilla R, if your packages are up-to-date, and exactly how you use Another thing to try is with forked workers, e.g.

library(futurize)
plan(multicore) ## forked workers
slow_fcn <- function(x) {
  Sys.sleep(0.1) # emulate work
  x^2
}
xs <- 1:1000
ys <- lapply(xs, slow_fcn) |> futurize()

There's also One can also see if it can be reproduced without futurize (which I doubt matters here) and future.apply (which I give a slight probability of being involved), and see whether it happens with just future, e.g.

library(future)
plan(multisession)
futureSessionInfo()

library(future)
plan(multisession, workers = 8)
futureSessionInfo()

library(future)
plan(multisession, workers = 9)
futureSessionInfo()
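The setup details asked about here could be collected in one go: R version, whether RStudio is in use, package versions, and core count. A small helper along these lines would do it. The name `report_setup()` is hypothetical, and so are the package list and the RStudio check (which assumes RStudio sets the `RSTUDIO` environment variable to "1"). None of this is part of any Futureverse package.

```r
# Hypothetical helper: gather the setup details asked about in this thread
report_setup <- function(pkgs = c("future", "future.apply", "futurize", "parallelly")) {
  # Installed version of each package, or NA if it is not installed
  versions <- vapply(pkgs, function(p) {
    if (requireNamespace(p, quietly = TRUE)) {
      as.character(utils::packageVersion(p))
    } else {
      NA_character_
    }
  }, character(1))

  list(
    r_version   = R.version.string,
    platform    = R.version$platform,
    interactive = interactive(),
    # Assumption: RStudio sets the RSTUDIO environment variable to "1"
    rstudio     = identical(Sys.getenv("RSTUDIO"), "1"),
    cores       = parallel::detectCores(),
    packages    = versions
  )
}

str(report_setup())
```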
-
I think the issue is that setting up the session gets stuck. I get: It stops right there and does not report the other nodes. I waited for more than a minute. It works reliably with 2 or 4 workers and sometimes with 6. With any more, it gets stuck. Maybe it's a weird interaction between the Windows host and the Ubuntu WSL2 instance? -Michael
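A sweep like the following could pin down the worker-count threshold. This sketch exercises only base-R PSOCK start-up and a trivial task. It prints a message before each attempt, so if it freezes, the last message shows which count got stuck. The counts are illustrative; to cover the behavior described above they would need to be extended to 6, 8 and beyond.

```r
library(parallel)

# Illustrative worker counts; extend towards the failing range, e.g. c(2L, 4L, 6L, 8L, 14L)
counts <- c(2L, 4L)
n_pids <- integer(0)

for (n in counts) {
  # Report *before* each attempt, so a freeze leaves the stuck count on screen
  message(sprintf("Trying %d PSOCK workers ...", n))
  secs <- system.time({
    cl <- makeCluster(n)
    pids <- unlist(clusterEvalQ(cl, Sys.getpid()))
    stopCluster(cl)
  })[["elapsed"]]
  # Record how many distinct worker processes answered
  n_pids[as.character(n)] <- length(unique(pids))
  message(sprintf("  OK: %d distinct worker PIDs in %.1f s", n_pids[[as.character(n)]], secs))
}
```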
-
Btw:
-
Thanks. This is all very helpful and definitely a big step forward.
We've now excluded both futurize and future.apply from the equation (and RStudio). I suspect that this is also independent of the future package, but before we can rule that out, the next thing to test would be:

library(future)
cl <- parallel::makeCluster(14) ## Set up 14 vanilla PSOCK cluster workers
print(cl)
plan(cluster, workers = cl)
futureSessionInfo()

What is the output, and does that also freeze? If not, what about:

library(future)
cl <- parallel::makeCluster(14, type = parallelly::RPSOCK) ## Set up 14 "enhanced" PSOCK cluster workers
print(cl)
plan(cluster, workers = cl)
futureSessionInfo()
-
Both versions freeze when I request more than 2 workers. It looks like the connections are open, but then
-
Perfect. To peel off more layers, the next step would be to see if we can reproduce it with manually created futures. Does the following also hang?

library(future)
cl <- parallel::makeCluster(14)
plan(cluster, workers = cl)
fs <- lapply(seq_along(cl), FUN = function(ii) {
  future({
    Sys.sleep(1.0)
    data.frame(worker = ii, pid = Sys.getpid(), r = getRversion(), as.list(Sys.info()))
  })
})
vs <- value(fs)
str(vs)

If so, what about:

library(future)
cl <- parallel::makeCluster(14)
plan(cluster, workers = cl)
fs <- lapply(seq_along(cl), FUN = function(ii) {
  future({
    Sys.sleep(1.0)
    data.frame(worker = ii, pid = Sys.getpid(), r = getRversion(), as.list(Sys.info()))
  }, stdout = FALSE, conditions = character(0), globals = list(ii = ii))
})
vs <- value(fs)
str(vs)
-
Both freeze. Using parallel directly (clusterApply) works as expected: However, if I add a
-
Wow, interesting. There might be a bug in R here around how To rule out more things, does the following also freeze on you?

cl <- parallel::makeCluster(14)
y <- parallel::clusterApply(cl, seq_along(cl), function(ii) {
  Sys.sleep(1.0)
})
str(y)

PS. You don't need If that freezes, what about:

cl <- parallel::makeCluster(14)
y <- parallel::clusterEvalQ(cl, Sys.sleep(1.0))
str(y)

BTW, what you found means that your original
-
Hello, I haven't heard back from you here. Did you resolve this?
-
Hi, I was traveling and had no time to look into this. I wonder if it is related to a weird WSL2 setup that my university uses. I think we need to see whether someone with a vanilla installation of Windows/WSL2 has the same issue.
-
Yes, maybe. But if you could try what I wrote in my last comment, we can narrow in on the problem further, which in turn simplifies asking others in your organization to check what they get. So, the first goal is to identify a minimal reproducible example.
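A minimal reproducible example is easier to share if it cannot block a terminal indefinitely. One option, sketched below, is to give the PSOCK cluster a finite socket `timeout` in seconds (the default is 30 days). This assumes that a read stalling for longer than that raises an error instead of waiting forever. The two workers and the 60-second limit are arbitrary choices.

```r
library(parallel)

# Assumption: `timeout` (seconds) bounds how long a socket read may block;
# with the default of 30 days, a stalled worker effectively hangs forever
cl <- makeCluster(2L, timeout = 60)

y <- tryCatch(
  clusterEvalQ(cl, {
    Sys.sleep(1.0)
    Sys.getpid()
  }),
  error = function(e) {
    message("No result within the timeout: ", conditionMessage(e))
    NULL
  }
)
stopCluster(cl)
str(y)
```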
-
and hang. If I set the sleep to less than 0.4 seconds, then they both return after exactly the set time. Without the sleep, it works.
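Sleeps below roughly 0.4 seconds return while longer ones hang, so a sweep over sleep durations could narrow the threshold further. This base-R sketch reuses one cluster and prints a message before each call. The worker count and durations are illustrative and should be set to values that hang on the affected machine. Once a call freezes, the cluster is presumably unusable and R has to be interrupted.

```r
library(parallel)

n_workers <- 2L                       # illustrative; use the count that hangs for you
sleeps <- c(0.2, 0.3, 0.4, 0.5, 1.0)  # bracket the ~0.4 s threshold
elapsed <- numeric(0)

cl <- makeCluster(n_workers)
for (s in sleeps) {
  # Report before each call, so a freeze shows which duration got stuck
  message(sprintf("Sys.sleep(%.1f) on %d workers ...", s, n_workers))
  secs <- system.time(clusterCall(cl, Sys.sleep, s))[["elapsed"]]
  elapsed[as.character(s)] <- secs
  message(sprintf("  returned after %.2f s", secs))
}
stopCluster(cl)
```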
-
(I transferred your original issue to a Futureverse discussion topic.) I think it's excellent that we've now narrowed in on a reproducible example that only involves base R. As a next step, can you please cut'n'paste the full output of:

nworkers <- 14
cl <- parallel::makeCluster(nworkers, outfile = "")
trace(parallel:::sendData)
trace(parallel:::recvData)
y <- parallel::clusterEvalQ(cl, { pid <- Sys.getpid(); print(pid); Sys.sleep(1.0); pid })

For example, when I do this with This will help us to see if it stalls at the
-
Great, we're getting closer. That clearly shows that all parallel workers are started, that they all get a task sent to them, and that they start executing the tasks. I realized I could have come up with slightly better test code, but I've fixed that below. At this point, I strongly suspect there is a bug in base R around this, and I have a hunch where it might be. If we can move swiftly, I hope that we can get it fixed in time for the code freeze for R 4.6.0 on April 17.

This is another good clue. Let's see if it happens more frequently for four workers if you increase the sleep time, e.g. to 5 seconds. First, create a script named parallel-sleep-bug.R with:

nworkers <- 14 # <= feel free to play with this
cl <- parallel::makeCluster(nworkers, outfile = "")
print(cl)
trace(parallel:::sendData, tracer = quote(message("parent: sendData()")), print = FALSE)
trace(parallel:::recvData, tracer = quote(message("parent: recvData()")), print = FALSE)
y <- parallel::clusterEvalQ(cl, {
  pid <- Sys.getpid()
  trace(parallel:::sendData, tracer = quote(message(sprintf("worker %d: sendData()", pid))), print = FALSE)
  trace(parallel:::recvData, tracer = quote(message(sprintf("worker %d: recvData()", pid))), print = FALSE)
  print(pid)
  Sys.sleep(5.0) # <= updated duration
  print(pid)     # <= new
  pid
})
str(y)
print(cl)
parallel::stopCluster(cl)

Then, start R and source that script:

$ R
...
> source("parallel-sleep-bug.R")
...

That should still hang with 14 workers. You can then lower it to 10, 8, and 4 to see whether they all hang. If they do, that is more evidence for my current guess of what's going on. Second, and importantly, let's try running the same script non-interactively from the WSL2 terminal, i.e.

$ Rscript parallel-sleep-bug.R

Does it hang now, e.g. with 14, 10, 8, or 4 workers?
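To try the different worker counts without editing the script each time, its first line (`nworkers <- 14`) could be replaced by the hypothetical header below. It reads the count from the command line and defaults to 14. It also prints `interactive()`, which is exactly the difference between sourcing the script and running it with Rscript.

```r
# Hypothetical header for parallel-sleep-bug.R: read the worker count from the
# command line (e.g. `Rscript parallel-sleep-bug.R 10`), defaulting to 14
args <- commandArgs(trailingOnly = TRUE)
nworkers <- if (length(args) >= 1L) as.integer(args[[1L]]) else 14L
stopifnot(!is.na(nworkers), nworkers >= 1L)

# Record whether this run is interactive, since that is what is being compared
message(sprintf("interactive() = %s, nworkers = %d", interactive(), nworkers))
```

A shell loop such as `for n in 14 10 8 4; do Rscript parallel-sleep-bug.R "$n"; done` would then run each count in turn, blocking at the first one that hangs. When the script is sourced from an interactive session, there are normally no trailing arguments, so the default of 14 applies.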
-
I cannot install R-devel on this machine. Were you able to replicate the issue on a separate Windows machine to make sure the issue is not just due to my installation?
-
I looked in the WSL2 config and saw that it contains: After starting WSL without that line, running
-
This is quite exciting and might also be useful for my seriation package. I have tried some simple futurize examples on WSL2. Here is an example:
It first spawns several R sessions but then hangs and does not finish the job. It falls back to a single process with little CPU utilization but never finishes. After Esc/Ctrl-C, I get:
doParallel works on WSL2. Any ideas?
Version used: