Considerations with Bulk Data Transformation
For large file sets (hundreds of gigabytes or more), bulk transformation is time-consuming. Managing transformation time is important, because file set content must be frozen (inaccessible to applications) throughout the transformation process. Once transformation starts, it must continue until complete, so transformation time determines the window of data unavailability.
Two major components contribute to transformation time:
- Number of blocks of file data: Because CTE must read, transform, and rewrite each block of file data, this component can be estimated by multiplying the number of file blocks to be transformed by the average time to read, transform, and write one block.
- Number of files: Because the CTE Agent transforms data file by file, each file must be "looked up," opened, and closed during transformation, using underlying file system mechanisms. This typically requires multiple disk accesses. Therefore, for file sets that consist of many small files, the per-file overhead can actually exceed the file block transformation time.
Other factors, such as file system fragmentation and load from concurrent applications, may also affect transformation time. However, the number of blocks and the number of files to be transformed are the fundamental factors, because they cannot be reduced or eliminated.
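
The two components described above can be combined into a rough back-of-the-envelope estimate. The following Python sketch is illustrative only and is not part of CTE: the function name, the per-block and per-file timings, and the file set sizes are all hypothetical assumptions, and real values should be measured on the target system. It ignores fragmentation and concurrent load.

```python
from dataclasses import dataclass


@dataclass
class TransformEstimate:
    block_us: int  # time to read, transform, and rewrite all blocks (microseconds)
    file_us: int   # time for per-file lookup, open, and close overhead (microseconds)

    @property
    def total_us(self) -> int:
        return self.block_us + self.file_us


def estimate_transform(num_blocks: int, num_files: int,
                       per_block_us: int, per_file_us: int) -> TransformEstimate:
    """Estimate transformation time as block work plus per-file overhead.

    All inputs are assumptions supplied by the caller; nothing here is
    measured or reported by CTE itself.
    """
    return TransformEstimate(
        block_us=num_blocks * per_block_us,
        file_us=num_files * per_file_us,
    )


if __name__ == "__main__":
    # Hypothetical timings: 100 us per block, 10 ms per file (several disk accesses).
    PER_BLOCK_US = 100
    PER_FILE_US = 10_000

    # Same amount of data (1,000,000 blocks) stored as few large files or many small files.
    few_large = estimate_transform(1_000_000, 100, PER_BLOCK_US, PER_FILE_US)
    many_small = estimate_transform(1_000_000, 1_000_000, PER_BLOCK_US, PER_FILE_US)

    for label, est in (("few large files", few_large), ("many small files", many_small)):
        print(f"{label}: blocks {est.block_us / 1e6:.0f} s, "
              f"files {est.file_us / 1e6:.0f} s, total {est.total_us / 1e6:.0f} s")
```

With these assumed timings, the per-file term is negligible for the file set of large files but dominates the total for the file set of small files, even though both contain the same number of blocks.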