First stress test: Transferring 6 million records from a flat file to a CSV file (Version 2)
A simple change in one of the stages on the canvas would put DataStage in the lead on time, just by editing this one interface. However, it would be unfair to the other tools and would break the rules defined for the tests. I will explain why:
This simple test involves transferring data from one flat file to another on the same filesystem. The physical file must not be modified, so that we can evaluate how easily the tool imports it. As a second premise, the Target must keep the flat structure, without transformations, to preserve interoperability in case it needs to be imported into another ETL tool.
In DataStage, using the Sequential File stage as both Source and Target keeps us within the rules, but it limits performance. When handling huge volumes of data, the Sequential File stage can become one of the major bottlenecks, because reading from and writing to this stage is slow. There is another option: the "Data set", which lets us apply parallelism but stores the data with a different "structure" and "format", making it difficult for any other ETL tool to read and moving us away from the interoperability goals of our test.
A "Data set" is composed of a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system. A Data set is organized in partitions and segments. Each partition is stored on a single processing node. Each data segment contains all the records written by a single WebSphere DataStage job. Therefore, a segment can contain files from many partitions, and a partition has files from many segments.
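The relationship between segments and partitions can be pictured with a small sketch. This is only a toy model, not DataStage's actual on-disk layout: it assumes one data file per (segment, partition) pair, and the class name `DataSetLayout` and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSetLayout:
    """Toy model of a data set: one data file per (segment, partition) pair."""
    partitions: int
    segments: int = 0
    # Maps (segment_index, partition_index) -> hypothetical data file name.
    files: dict = field(default_factory=dict)

    def append_segment(self):
        """Simulate one job run adding a new segment across every partition."""
        seg = self.segments
        for part in range(self.partitions):
            self.files[(seg, part)] = f"segment{seg}.partition{part}.data"
        self.segments += 1
        return seg

    def files_in_segment(self, seg):
        """All files written by one job run, spread over the partitions."""
        return [name for (s, _), name in sorted(self.files.items()) if s == seg]

    def files_in_partition(self, part):
        """All files held by one partition, accumulated over the segments."""
        return [name for (_, p), name in sorted(self.files.items()) if p == part]


layout = DataSetLayout(partitions=4)
layout.append_segment()  # first job run
layout.append_segment()  # second job run appends to the same data set
print(layout.files_in_segment(0))
print(layout.files_in_partition(2))
```

With 4 partitions and two job runs, each segment lists four files (one per partition) and each partition lists two files (one per segment), which is the many-to-many relationship described above.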
The descriptor file for a data set contains the following information:
For each segment, the descriptor file contains:
For all these reasons, this test includes images of both executions, showing the remarkable improvement obtained by using the tool's features (the version with Data sets). However, the final time reported is the one from the execution that follows the rules.
Sequential File to Sequential File
The best-performing execution using these stages was achieved by setting the Number of Readers Per Node to 4, so that the file is partitioned as it is read.
Execution time: 36 secs
To read faster from the Sequential File stage, the number of readers per node can be increased (the default value is one).
This is an optional property and only applies to files containing fixed-length records.
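One plausible reason for the fixed-length restriction is that record boundaries can then be found by arithmetic alone, so each reader can be handed a byte range without scanning for delimiters. The sketch below illustrates that idea only. It is not DataStage's implementation; the function name `reader_ranges` and the 100-byte record length are assumptions made for the example.

```python
def reader_ranges(file_size, record_length, readers):
    """Split a file of fixed-length records into contiguous byte ranges.

    Returns one (start, end) pair per reader, with `end` exclusive.
    Every range starts and ends on a record boundary.
    """
    if file_size % record_length != 0:
        raise ValueError("file size is not a multiple of the record length")
    total_records = file_size // record_length
    # Spread the records as evenly as possible; the first `extra`
    # readers take one additional record each.
    base, extra = divmod(total_records, readers)
    ranges = []
    next_record = 0
    for i in range(readers):
        count = base + (1 if i < extra else 0)
        start = next_record * record_length
        ranges.append((start, start + count * record_length))
        next_record += count
    return ranges


# 6 million records of a hypothetical 100 bytes each, shared by 4 readers.
for start, end in reader_ranges(6_000_000 * 100, 100, 4):
    print(f"reader covers bytes {start} to {end}")
```

With variable-length records the same split would need a scan to locate the next record boundary inside each range, which is the extra work a fixed-length layout avoids.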
Dataset to Dataset OR Sequential to Dataset
Using Datasets brings a remarkable improvement in execution time. Throughput increases from 160,000 to 500,000 rows per second.
The picture above shows 2 versions at different times, with dedicated resources, using both Sequential File and Datasets as targets to show the differences. The configuration file is set with 4 nodes.
Execution time: 11 secs
The same execution time was obtained in both tests where the Sequential File was not used as Target:
a) Sequential File (Source) - Dataset (TARGET)
b) Dataset (Source) - Dataset (TARGET)
To sum up, when building a DW process in production environments, Sequential Files would not be recommended for linking processes across Jobs, because of performance, storage, management and other issues. However, in this particular test, and to be fair to all the other tools, the evaluation will be performed with Sequential Files.
Times: (Version Seq File Stage to Seq)
- Environment: infrastructure composed of 3 nodes
- 1) ESXi 5.0:
1.a) Physical Datastore 1: VM ETLClover (12 GB RAM - 2 Cores * 2 Sockets)
1.b) Physical Datastore 2: VM Database Server MySQL/Oracle (4 GB RAM - 2 Cores * 2 Sockets)
- 2) Monitor Performance: VM Monitor ESXi + SQL Server 2008 (with 4 GB RAM)
- 3) Operator ETL: ESXi Client (with 3 GB RAM)
CASE: SEQUENTIAL FILES vs DATASETS
- Measure the elapsed time for reading and writing 6 million rows from a flat file to a .CSV file.
- Compare performance in the 2 environments.
- Analyze the use of resources.
|ETL Tool||IBM DataStage 8.1|
|Design & Run||
Execution log:
|Elapsed time (s)||36 secs vs 114 secs (V1)|
160,000 rows/sec vs 52,840 rows/sec (Test1_V1)
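As a rough cross-check, throughput can be estimated by dividing the row count by the elapsed time. The sketch below assumes exactly 6,000,000 rows and uses the elapsed times reported in this test. The results land close to, but not exactly on, the rates quoted above, which were presumably read from the tool's monitor. The run labels are descriptive names chosen for this example.

```python
ROWS = 6_000_000  # assumed exact row count for the estimate


def rows_per_second(rows, elapsed_seconds):
    """Average throughput over the whole run."""
    return rows / elapsed_seconds


# Elapsed times in seconds, as reported in this test.
runs = {
    "Sequential to Sequential (V1)": 114,
    "Sequential to Sequential (V2)": 36,
    "Dataset as Target": 11,
}

for name, seconds in runs.items():
    print(f"{name}: {rows_per_second(ROWS, seconds):,.0f} rows/sec")

# Speed-up of V2 over V1 for the rule-compliant Sequential File runs.
print(f"V2 is {runs['Sequential to Sequential (V1)'] / runs['Sequential to Sequential (V2)']:.2f}x faster than V1")
```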
How to Improve
- Adjust the parameters:
- Readers per Node.
- Implementing Data sets greatly increases performance, eliminating the bottlenecks caused by Sequential Files.
USE OF RESOURCES:
CPU Monitoring, "Passive and Active state" in different executions. Last Execution: 23:49 - 23:50
Memory Monitoring: Last Execution: 23:49 - 23:50