Multithreading in the dataxform Utility
The dataxform utility is almost always I/O bound. You can reduce end-to-end run time by configuring the utility to transform multiple streams of data concurrently, in separate kernel threads. dataxform can be multi-threaded in two dimensions:
- File concurrency: dataxform can transform up to 32 files in concurrent execution threads. Each time a kernel thread finishes transforming a file, it informs the user component, which responds with a command to transform the next file in its work list. Set the number of threads with the --thd option.
- File chunking: You can also configure the kernel component to divide individual files into chunks and transform up to 16 chunks concurrently. The chunk size defaults to 128 KB, but you can adjust it using the --buf_size option.
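The work-list dispatch described under file concurrency can be pictured with a short Python sketch. This is an illustration only, not dataxform's implementation: transform_file and run_work_list are hypothetical names, and an ordinary thread pool stands in for the kernel threads and the user component.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_FILE_THREADS = 32  # upper limit on concurrent files stated in the text


def transform_file(path):
    """Hypothetical stand-in for transforming one file; returns the path."""
    return path


def run_work_list(work_list, threads):
    """Process every entry in work_list with at most `threads` workers.

    The pool's internal queue plays the role of the work list: as soon as
    a worker finishes one file, it is handed the next pending one.
    """
    if not 1 <= threads <= MAX_FILE_THREADS:
        raise ValueError(f"threads must be between 1 and {MAX_FILE_THREADS}")
    finished = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(transform_file, path) for path in work_list]
        # Collect results in completion order, not submission order.
        for future in as_completed(futures):
            finished.append(future.result())
    return finished
```

Calling run_work_list(paths, 8) would keep at most eight files in flight at once. Because results arrive in completion order, the returned list may be ordered differently from the input.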
File concurrency is useful for transforming large numbers of files, but less so for file sets that consist of a few large files. For the latter, chunking is typically more advantageous.
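To make the chunking dimension concrete, the following Python sketch computes how a single file would split at a given chunk size and how many rounds of 16 concurrent chunks that implies. It is arithmetic for illustration only: chunk_ranges and rounds_needed are hypothetical helpers, and treating 128 KB as 128 × 1024 bytes is an assumption.

```python
DEFAULT_CHUNK_SIZE = 128 * 1024  # assumes 128 KB means 131072 bytes
MAX_CHUNK_THREADS = 16           # upper limit on concurrent chunks from the text


def chunk_ranges(file_size, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


def rounds_needed(file_size, chunk_size=DEFAULT_CHUNK_SIZE,
                  concurrency=MAX_CHUNK_THREADS):
    """Number of rounds if `concurrency` chunks are processed at a time."""
    chunk_count = len(chunk_ranges(file_size, chunk_size))
    return -(-chunk_count // concurrency)  # ceiling division
```

Under these assumptions, a 10 MiB file splits into 80 chunks at the default size, which is five rounds of 16 chunks. A file smaller than one chunk yields a single range, so chunking offers it no extra parallelism.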
Concurrent transformation reduces run time, and therefore the period during which protected files are unavailable to applications. On the other hand, files that are undergoing transformation at the moment of a system crash must be recovered from a backup, so the more files that are active at once, the more time-consuming post-run recovery becomes. Moreover, more concurrent transformation activity consumes more processing, memory, and I/O resources, which are then unavailable to other applications running at the same time.