Commit c2a9b82

EuphoricThinking authored and bb-ur committed
add support for batched queue submissions (#19769)
Adding a new feature: batched queue submissions. Batched queues submit operations to the driver in batches, which reduces the overhead of submitting every operation individually. Like command buffers in L0v2, they use regular command lists (referred to below as 'batches'). Operations enqueued on a regular command list are not executed immediately, but only after that regular command list is enqueued on an immediate command list. In contrast to command buffers, however, batched queues do not only collect enqueued operations: they also handle the submission of batches (regular command lists), using an internal immediate command list.

Batched queues introduce:

- batch_manager stores the current batch, the command list manager with an immediate command list for batch submissions, the vector of submitted batches, and the generation number of the current batch.
- The current batch is a command list manager with a regular command list; operations requested by users are enqueued on the current batch. The current batch may be submitted for execution on the immediate command list, replaced by a new regular command list, and stored in the vector of submitted batches until its execution completes.
- The number of regular command lists stored for execution is limited.
- The generation number of the current batch is assigned to events associated with operations enqueued on that batch. It is incremented every time the current batch is replaced. When an event created by a batched queue appears in an eventWaitList, the batch assigned to that event might not have been executed yet, and the event might never be signalled. Comparing generation numbers determines whether the current batch should be submitted for execution. If the generation number of the current batch is higher than the number assigned to the given event, the batch associated with the event has already been submitted for execution and an additional submission of the current batch is not needed.
- Regular command lists use the regular pool cache type, whereas immediate command lists use the immediate pool cache type. Since user-requested operations are enqueued on regular command lists and immediate command lists are only used internally by the batched queue implementation, events are not created for immediate command lists (in most cases; see below).
- When a user requests the command list manager to enqueue a command buffer, the regular command list from the command buffer is appended to the command list of the given command list manager. Since regular command lists cannot be enqueued on other regular command lists, but only on immediate command lists, command buffers must be enqueued on an immediate command list. Therefore, an additional event pool with the immediate cache type is introduced to provide events for operations requested by users and enqueued directly on an immediate command list.
- wait_list_view is modified. Previously, it only stored the waitlist (as a ze_event_handle buffer created from events) and the corresponding event count in a single container, which could be passed as an argument to the driver API. Now, the constructor also ensures that all associated operations will eventually be executed. Since regular command lists are not executed immediately, but only after being enqueued on immediate lists, the regular command list associated with the given event must be enqueued. Otherwise, the event would never be signalled.

Additionally, support for UR_QUEUE_INFO_FLAGS in urQueueGetInfo has been added for native CPU, which is required by the enqueueTimestampRecording tests. Currently, enqueueTimestampRecording is not supported by batched queues.

Batched queues can be enabled by setting UR_QUEUE_FLAG_SUBMISSION_BATCHED in ur_queue_flags_t or globally, through the environment variable UR_L0_V2_FORCE_BATCHED=1.

Batched queues are intended to improve performance on platforms where eager submission is not efficient due to driver limitations. Such hardware includes Xe (and older GPUs) on Windows. There are also workloads which benefit from batched submissions (e.g., dl-cifar). SYCL graphs should be preferred for new software, since they allow better control over how commands are grouped for submission.

Benchmark results for default in-order queues (sycl branch, commit hash: b76f12e554760c3fcfc55f1f815a76b0d8b208ad) and batched queues:

api_overhead_benchmark_ur SubmitKernel in order: 20.839 μs
api_overhead_benchmark_ur SubmitKernel batched: 12.183 μs
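The generation-number rule from the commit message can be illustrated with a small, self-contained sketch. Every name here (sketch_event, sketch_batch_manager, needsSubmission, prepareWaitList) is hypothetical and chosen for illustration only; this is not the adapter's code, and it leaves out command lists, locking and the limit on stored batches.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; none of these names come from the adapter sources.
using generation_t = std::uint64_t;

struct sketch_event {
  // Generation of the batch on which the event's operation was enqueued.
  generation_t batchGeneration;
};

struct sketch_batch_manager {
  generation_t currentGeneration = 0;
  // Generations of batches already handed over for execution.
  std::vector<generation_t> submitted;

  // Models "submit the current batch and replace it with a new one":
  // the generation number is incremented on every replacement.
  void submitCurrentBatch() {
    submitted.push_back(currentGeneration);
    ++currentGeneration;
  }

  // If the current generation is higher than the event's generation, the
  // event's batch was already submitted and no extra submission is needed.
  bool needsSubmission(const sketch_event &e) const {
    assert(e.batchGeneration <= currentGeneration);
    return e.batchGeneration == currentGeneration;
  }

  // Walks a wait list and submits the current batch at most once.
  // Returns true if a submission was triggered.
  bool prepareWaitList(const std::vector<sketch_event> &waitList) {
    for (const sketch_event &e : waitList) {
      if (needsSubmission(e)) {
        submitCurrentBatch();
        return true;
      }
    }
    return false;
  }
};
```

In this sketch, an event recorded at generation 0 triggers one submission the first time it appears in a wait list; afterwards the current generation is 1, so the same event no longer causes a flush.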
1 parent e64444b commit c2a9b82

23 files changed: +2376 -471 lines changed

scripts/templates/queue_api.hpp.mako

Lines changed: 2 additions & 1 deletion
@@ -25,8 +25,9 @@ from templates import helper as th
 #pragma once

 #include <ur_api.h>
+#include "queue_extensions.hpp"

-struct ur_queue_t_ {
+struct ur_queue_t_ : ur_queue_extensions {
   virtual ~ur_queue_t_();

 %for obj in th.get_queue_related_functions(specs, n, tags):

source/adapters/level_zero/CMakeLists.txt

Lines changed: 2 additions & 0 deletions
@@ -171,6 +171,7 @@ if(UR_BUILD_ADAPTER_L0_V2)
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/memory.hpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/lockable.hpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_api.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_batched.hpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_immediate_in_order.hpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_immediate_out_of_order.hpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/usm.hpp
@@ -187,6 +188,7 @@ if(UR_BUILD_ADAPTER_L0_V2)
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/kernel.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/memory.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_api.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_batched.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_create.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_immediate_in_order.cpp
     ${CMAKE_CURRENT_SOURCE_DIR}/v2/queue_immediate_out_of_order.cpp

source/adapters/level_zero/v2/command_buffer.cpp

Lines changed: 62 additions & 23 deletions
@@ -12,6 +12,7 @@
 #include "../command_buffer_command.hpp"
 #include "../helpers/kernel_helpers.hpp"
 #include "../ur_interface_loader.hpp"
+#include "command_list_manager.hpp"
 #include "logger/ur_logger.hpp"
 #include "queue_handle.hpp"

@@ -328,9 +329,12 @@ ur_result_t urCommandBufferAppendKernelLaunchExp(
   auto eventsWaitList = commandBuffer->getWaitListFromSyncPoints(
       syncPointWaitList, numSyncPointsInWaitList);

+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendKernelLaunch(
       hKernel, workDim, pGlobalWorkOffset, pGlobalWorkSize, pLocalWorkSize,
-      nullptr, numSyncPointsInWaitList, eventsWaitList,
+      nullptr, waitListView,
       commandBuffer->createEventIfRequested(retSyncPoint)));

   return UR_RESULT_SUCCESS;
@@ -353,8 +357,11 @@ ur_result_t urCommandBufferAppendUSMMemcpyExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);

+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendUSMMemcpy(
-      false, pDst, pSrc, size, numSyncPointsInWaitList, eventsWaitList,
+      false, pDst, pSrc, size, waitListView,
       hCommandBuffer->createEventIfRequested(pSyncPoint)));

   return UR_RESULT_SUCCESS;
@@ -380,9 +387,12 @@ ur_result_t urCommandBufferAppendMemBufferCopyExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);

+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferCopy(
-      hSrcMem, hDstMem, srcOffset, dstOffset, size, numSyncPointsInWaitList,
-      eventsWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      hSrcMem, hDstMem, srcOffset, dstOffset, size, waitListView,
+      hCommandBuffer->createEventIfRequested(pSyncPoint)));

   return UR_RESULT_SUCCESS;
 } catch (...) {
@@ -407,9 +417,12 @@ ur_result_t urCommandBufferAppendMemBufferWriteExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);

+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferWrite(
-      hBuffer, false, offset, size, pSrc, numSyncPointsInWaitList,
-      eventsWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      hBuffer, false, offset, size, pSrc, waitListView,
+      hCommandBuffer->createEventIfRequested(pSyncPoint)));

   return UR_RESULT_SUCCESS;
 } catch (...) {
@@ -432,9 +445,12 @@ ur_result_t urCommandBufferAppendMemBufferReadExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);

+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferRead(
-      hBuffer, false, offset, size, pDst, numSyncPointsInWaitList,
-      eventsWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      hBuffer, false, offset, size, pDst, waitListView,
+      hCommandBuffer->createEventIfRequested(pSyncPoint)));

   return UR_RESULT_SUCCESS;
 } catch (...) {
@@ -461,10 +477,13 @@ ur_result_t urCommandBufferAppendMemBufferCopyRectExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);

+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferCopyRect(
       hSrcMem, hDstMem, srcOrigin, dstOrigin, region, srcRowPitch,
-      srcSlicePitch, dstRowPitch, dstSlicePitch, numSyncPointsInWaitList,
-      eventsWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      srcSlicePitch, dstRowPitch, dstSlicePitch, waitListView,
+      hCommandBuffer->createEventIfRequested(pSyncPoint)));

   return UR_RESULT_SUCCESS;
 } catch (...) {
@@ -491,10 +510,12 @@ ur_result_t urCommandBufferAppendMemBufferWriteRectExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);

+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferWriteRect(
       hBuffer, false, bufferOffset, hostOffset, region, bufferRowPitch,
-      bufferSlicePitch, hostRowPitch, hostSlicePitch, pSrc,
-      numSyncPointsInWaitList, eventsWaitList,
+      bufferSlicePitch, hostRowPitch, hostSlicePitch, pSrc, waitListView,
       hCommandBuffer->createEventIfRequested(pSyncPoint)));

   return UR_RESULT_SUCCESS;
@@ -522,10 +543,12 @@ ur_result_t urCommandBufferAppendMemBufferReadRectExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);

+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferReadRect(
       hBuffer, false, bufferOffset, hostOffset, region, bufferRowPitch,
-      bufferSlicePitch, hostRowPitch, hostSlicePitch, pDst,
-      numSyncPointsInWaitList, eventsWaitList,
+      bufferSlicePitch, hostRowPitch, hostSlicePitch, pDst, waitListView,
       hCommandBuffer->createEventIfRequested(pSyncPoint)));

   return UR_RESULT_SUCCESS;
@@ -548,9 +571,12 @@ ur_result_t urCommandBufferAppendUSMFillExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);

+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendUSMFill(
-      pMemory, patternSize, pPattern, size, numSyncPointsInWaitList,
-      eventsWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      pMemory, patternSize, pPattern, size, waitListView,
+      hCommandBuffer->createEventIfRequested(pSyncPoint)));
   return UR_RESULT_SUCCESS;
 } catch (...) {
   return exceptionToResult(std::current_exception());
@@ -572,9 +598,12 @@ ur_result_t urCommandBufferAppendMemBufferFillExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);

+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendMemBufferFill(
-      hBuffer, pPattern, patternSize, offset, size, numSyncPointsInWaitList,
-      eventsWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      hBuffer, pPattern, patternSize, offset, size, waitListView,
+      hCommandBuffer->createEventIfRequested(pSyncPoint)));

   return UR_RESULT_SUCCESS;
 } catch (...) {
@@ -598,8 +627,11 @@ ur_result_t urCommandBufferAppendUSMPrefetchExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);

+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendUSMPrefetch(
-      pMemory, size, flags, numSyncPointsInWaitList, eventsWaitList,
+      pMemory, size, flags, waitListView,
       hCommandBuffer->createEventIfRequested(pSyncPoint)));

   return UR_RESULT_SUCCESS;
@@ -622,8 +654,11 @@ ur_result_t urCommandBufferAppendUSMAdviseExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);

+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
   UR_CALL(commandListLocked->appendUSMAdvise(
-      pMemory, size, advice, numSyncPointsInWaitList, eventsWaitList,
+      pMemory, size, advice, waitListView,
       hCommandBuffer->createEventIfRequested(pSyncPoint)));

   return UR_RESULT_SUCCESS;
@@ -672,15 +707,19 @@ ur_result_t urCommandBufferAppendNativeCommandExp(
   auto eventsWaitList = hCommandBuffer->getWaitListFromSyncPoints(
       pSyncPointWaitList, numSyncPointsInWaitList);

-  UR_CALL(commandListLocked->appendEventsWaitWithBarrier(
-      numSyncPointsInWaitList, eventsWaitList, nullptr));
+  wait_list_view waitListView =
+      wait_list_view(eventsWaitList, numSyncPointsInWaitList);
+
+  UR_CALL(
+      commandListLocked->appendEventsWaitWithBarrier(waitListView, nullptr));

   // Call user-defined function immediately
   pfnNativeCommand(pData);

+  wait_list_view emptyWaitList = wait_list_view(nullptr, 0);
   // Barrier on all commands after user defined commands.
   UR_CALL(commandListLocked->appendEventsWaitWithBarrier(
-      0, nullptr, hCommandBuffer->createEventIfRequested(pSyncPoint)));
+      emptyWaitList, hCommandBuffer->createEventIfRequested(pSyncPoint)));

   return UR_RESULT_SUCCESS;
 }
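
The hunks in command_buffer.cpp all apply the same refactoring: a separate (count, pointer) argument pair is replaced by a single wait_list_view argument. A minimal sketch of that parameter-object pattern follows. The names sketch_event_handle, sketch_wait_list_view and appendSketchBarrier are hypothetical, and the sketch deliberately omits what the commit message says the real constructor also does (ensuring that the batch behind each event gets submitted).

```cpp
#include <cassert>
#include <cstdint>

// Placeholder for a driver event handle type; an assumption for illustration.
using sketch_event_handle = int *;

// Hypothetical view bundling the event buffer and its length into one value.
struct sketch_wait_list_view {
  sketch_event_handle *handles;
  std::uint32_t num;

  // Normalises the pair: a null buffer or a zero count gives an empty view.
  sketch_wait_list_view(sketch_event_handle *list, std::uint32_t count)
      : handles(count > 0 ? list : nullptr), num(list ? count : 0) {}
};

// Hypothetical append-style function taking the view instead of two loose
// arguments; it only reports how many events it would wait on.
std::uint32_t appendSketchBarrier(const sketch_wait_list_view &waitList) {
  assert(waitList.num == 0 || waitList.handles != nullptr);
  return waitList.num;
}
```

With a single view argument, a call site cannot pass a count that disagrees with the buffer, and the empty case is spelled out as a view built from (nullptr, 0), mirroring the emptyWaitList in the last hunk.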
