LLFIO
v2.00
|
Work group within the global dynamic thread pool. More...
#include "dynamic_thread_pool_group.hpp"
Classes | |
class | io_aware_work_item |
A work item which paces when it next executes according to i/o congestion. More... | |
class | work_item |
An individual item of work within the work group. More... | |
Public Member Functions | |
virtual result< void > | submit (span< work_item * > work) noexcept=0 |
Threadsafe. Submit one or more work items for execution. Note that you can submit more later. More... | |
result< void > | submit (work_item *wi) noexcept |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
template<class T , typename std::enable_if<(!std::is_pointer< T >::value), bool >::type = true, typename std::enable_if<(std::is_base_of< work_item, T >::value), bool >::type = true> | |
result< void > | submit (span< T > wi) noexcept |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
virtual result< void > | stop () noexcept=0 |
Threadsafe. Cancel any remaining work previously submitted, but without blocking (use wait() to block). | |
virtual bool | stopping () const noexcept=0 |
Threadsafe. True if a work item reported an error, or stop() was called, but work items are still running. | |
virtual bool | stopped () const noexcept=0 |
Threadsafe. True if all the work previously submitted is complete. | |
virtual result< void > | wait (deadline d={}) const noexcept=0 |
Threadsafe. Wait for work previously submitted to complete, returning any failures by any work item. | |
template<class Rep , class Period > | |
result< bool > | wait_for (const std::chrono::duration< Rep, Period > &duration) const noexcept |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
template<class Clock , class Duration > | |
result< bool > | wait_until (const std::chrono::time_point< Clock, Duration > &timeout) const noexcept |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
Static Public Member Functions | |
static const char * | implementation_description () noexcept |
A textual description of the underlying implementation of this dynamic thread pool group. More... | |
static size_t | current_nesting_level () noexcept |
Returns the work item nesting level which would be used if a new dynamic thread pool group were created within the current work item. | |
static work_item * | current_work_item () noexcept |
Returns the work item the calling thread is running within, if any. | |
static uint32_t | ms_sleep_for_more_work () noexcept |
Returns the number of milliseconds that a thread is without work before it is shut down. Note that this will be zero on all but on Linux if using our local thread pool implementation, because the system controls this value on Windows, Grand Central Dispatch etc. | |
static uint32_t | ms_sleep_for_more_work (uint32_t v) noexcept |
Sets the number of milliseconds that a thread is without work before it is shut down, returning the value actually set. More... | |
Friends | |
class | dynamic_thread_pool_group_impl |
Work group within the global dynamic thread pool.
Some operating systems provide a per-process global kernel thread pool capable of dynamically adjusting its kernel thread count to how many of the threads in the pool are currently blocked. The platform will choose the exact strategy used, but as an example of a strategy, one might keep creating new kernel threads so long as the total threads currently running and not blocked on page faults, i/o or syscalls, is below the hardware concurrency. Similarly, if more threads are running and not blocked than hardware concurrency, one might remove kernel threads from executing work. Such a strategy would dynamically increase concurrency until all CPUs are busy, but reduce concurrency if more work is being done than CPUs available.
Such dynamic kernel thread pools are excellent for CPU bound processing, you simply fire and forget work into them. However, for i/o bound processing, you must be careful as there are gotchas. For non-seekable i/o, it is very possible that there could be 100k handles upon which we do i/o. Doing i/o on 100k handles using a dynamic thread pool would in theory cause the creation of 100k kernel threads, which would not be wise. A much better solution is to use an byte_io_multiplexer
to await changes in large sets of i/o handles.
For seekable i/o, the same problem applies, but worse again: an i/o bound problem would cause a rapid increase in the number of kernel threads, which by definition makes i/o even more congested. Basically the system runs off into pathological performance loss. You must therefore never naively do i/o bound work (e.g. with memory mapped files) from within a dynamic thread pool without employing some mechanism to force concurrency downwards if the backing storage is congested.
Instances of this class contain zero or more work items. Each work item is asked for its next item of work, and if an item of work is available, that item of work is executed by the global kernel thread pool at a time of its choosing. It is NEVER possible that any one work item is concurrently executed at a time, each work item is always sequentially executed with respect to itself. The only concurrency possible is across work items. Therefore, if you want to execute the same piece of code concurrently, you need to submit a separate work item for each possible amount of concurrency (e.g. std::thread::hardware_concurrency()
).
You can have as many or as few items of work as you like. You can dynamically submit additional work items at any time, except when a group is currently in the process of being stopped. The group of work items can be waited upon to complete, after which the work group becomes reset as if back to freshly constructed. You can also stop executing all the work items in the group, even if they have not fully completed. If any work item returns a failure, this equals a stop()
, and the next wait()
will return that error.
Work items may create sub work groups as part of their operation. If they do so, the work items from such nested work groups are scheduled preferentially. This ensures good forward progress, so if you have 100 work items each of which do another 100 work items, you don't get 10,000 slowly progressing work. Rather, the work items in the first set progress slowly, whereas the work items in the second set progress quickly.
work_item::next()
may optionally set a deadline to delay when that work item ought to be processed again. Deadlines can be relative or absolute.
As with elsewhere in LLFIO, as a low level facility, we don't implement https://wg21.link/P0443 Executors, but it is trivially easy to implement a dynamic equivalent to std::static_thread_pool
using this class.
On Microsoft Windows, the Win32 thread pool API is used (https://docs.microsoft.com/en-us/windows/win32/procthread/thread-pool-api). This is an IOCP-aware thread pool which will dynamically increase the number of kernel threads until none are blocked. If more kernel threads are running than twice the number of CPUs in the system, the number of kernel threads is dynamically reduced. The maximum number of kernel threads which will run simultaneously is 500. Note that the Win32 thread pool is shared across the process by multiple Windows facilities.
Note that the Win32 thread pool has built in support for IOCP, so if you have a custom i/o multiplexer, you can use the global Win32 thread pool to execute i/o completions handling. See CreateThreadpoolIo()
for more.
No dynamic memory allocation is performed by this implementation outside of the initial make_dynamic_thread_pool_group()
. The Win32 thread pool API may perform dynamic memory allocation internally, but that is outside our control.
Overhead of LLFIO above the Win32 thread pool API is very low, statistically unmeasurable.
If not on Linux, you will need libdispatch which is detected by LLFIO cmake during configuration. libdispatch is better known as Grand Central Dispatch, originally a Mac OS technology but since ported to a high quality kernel based implementation on recent FreeBSDs, and to a lower quality userspace based implementation on Linux. Generally libdispatch should get automatically found on Mac OS without additional effort; on FreeBSD it may need installing from ports; on Linux you would need to explicitly install libdispatch-dev
or the equivalent. You can force the use in cmake of libdispatch by setting the cmake variable LLFIO_USE_LIBDISPATCH
to On.
Overhead of LLFIO above the libdispatch API is very low, statistically unmeasurable.
On Linux only, we have a custom userspace implementation with superior performance. A similar strategy to Microsoft Windows' approach is used. We dynamically increase the number of kernel threads until none are sleeping awaiting i/o. If more kernel threads are running than three more than the number of CPUs in the system, the number of kernel threads is dynamically reduced. Note that all the kernel threads for the current process are considered, not just the kernel threads created by this thread pool implementation. Therefore, if you have alternative thread pool implementations (e.g. OpenMP, std::async
), those are also included in the dynamic adjustment.
As this is wholly implemented by this library, dynamic memory allocation occurs in the initial make_dynamic_thread_pool_group()
and per thread creation, but otherwise the implementation does not perform dynamic memory allocations.
After multiple rewrites, eventually I got this custom userspace implementation to have superior performance to both ASIO and libdispatch. For larger work items the difference is meaningless between all three, however for smaller work items I benchmarked this custom userspace implementation as beating (non-dynamic) ASIO by approx 29% and Linux libdispatch by approx 52% (note that Linux libdispatch appears to have a scale up bug when work items are small and few, it is often less than half the performance of LLFIO's custom implementation).
|
inlinestaticnoexcept |
A textual description of the underlying implementation of this dynamic thread pool group.
The current possible underlying implementations are:
Which one is chosen depends on what was detected at cmake configure time, and possibly what the host OS running the program binary supports.
|
inlinestaticnoexcept |
Sets the number of milliseconds that a thread is without work before it is shut down, returning the value actually set.
Note that this will have no effect (and thus return zero) on all but on Linux if using our local thread pool implementation, because the system controls this value on Windows, Grand Central Dispatch etc.
|
pure virtualnoexcept |
Threadsafe. Submit one or more work items for execution. Note that you can submit more later.
Note that if the group is currently stopping, you cannot submit more work until the group has stopped. An error code comparing equal to errc::operation_canceled
is returned if you try.