First stress test: Transfer 6 million records from a flat file to a CSV file. Version 2.
In this second round of tests on the new hardware, both Clover and Talend showed a remarkable reduction in elapsed time. However, Pentaho took the lead among the three tools initially tested. The data flow (in rows/sec) between SOURCE and TARGET is similar for all three, at around 150.000 per second. Nevertheless, PDI has an advantage over the other two options: the CSV Input step has a configuration parameter that allows it "to run in parallel", so, depending on the number of copies configured, you get that many data flows, each of them reading a part of the input file.
As shown in the picture above, in our particular case the 150.000 (approx.) rows per second, multiplied by the four configured copies, result in a quarter of the elapsed time compared to the other two tools.
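The arithmetic behind that claim can be checked with a short sketch. It assumes that every copy sustains the same rate and that the copies scale linearly, which is an idealization; the constant and function names are only illustrative.

```python
# Figures quoted in this test: 6 million rows, ~150.000 rows/sec per data flow,
# 4 copies of the CSV Input step.
TOTAL_ROWS = 6_000_000
ROWS_PER_SEC_PER_COPY = 150_000
COPIES = 4


def estimated_seconds(total_rows, rows_per_sec_per_copy, copies):
    """Ideal elapsed time, assuming the copies scale linearly (an assumption)."""
    return total_rows / (rows_per_sec_per_copy * copies)


single_flow = estimated_seconds(TOTAL_ROWS, ROWS_PER_SEC_PER_COPY, 1)
four_flows = estimated_seconds(TOTAL_ROWS, ROWS_PER_SEC_PER_COPY, COPIES)

print(f"1 copy  : {single_flow:.0f} s")
print(f"{COPIES} copies: {four_flows:.0f} s")
print(f"ratio   : {four_flows / single_flow:.2f}")
```

Under these assumptions a single flow would need about 40 seconds and four flows about 10 seconds, which is in line with the 9 seconds measured in this test.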
Bear in mind that this competitive advantage also exists in the other two tools (with different implementations), but it is not available in their CE versions. In Clover, for example, parallel reading is done with the "Parallel Reader" component, which is not included in Clover ETL Designer Community, where we only have "Universal Data Reader".
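The general idea of several copies each reading its own part of one file can be sketched as follows. This is a minimal illustration, not the actual implementation of PDI or Clover: the function names are hypothetical, the workers only count rows, and it assumes no field contains an embedded line break.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, copies):
    """Divide [0, size) into `copies` contiguous byte ranges."""
    chunk = size // copies
    bounds = [i * chunk for i in range(copies)] + [size]
    return list(zip(bounds[:-1], bounds[1:]))


def count_rows_in_range(path, start, end):
    """Count the lines whose first byte lies in [start, end)."""
    rows = 0
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and consume up to the next newline, so that
            # a line already begun in the previous range is skipped here.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            if not f.readline():
                break  # end of file
            rows += 1
    return rows


def count_rows_parallel(path, copies=4):
    """Let each of `copies` workers process its own part of the file."""
    ranges = split_ranges(os.path.getsize(path), copies)
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return sum(pool.map(lambda r: count_rows_in_range(path, *r), ranges))
```

Calling `count_rows_parallel("input.csv", copies=4)` on a hypothetical file returns the same row count as a single sequential pass, because every line is assigned to exactly one byte range. A real reader would parse and forward the rows instead of counting them.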
- Environment: Infrastructure composed of 3 nodes
- 1) ESXi 5.0:
1.a) Physical Datastore 1: VM ETL Clover (12GB RAM - 2 Cores * 2 Sockets)
1.b) Physical Datastore 2: VM Database Server MySQL/Oracle (4GB RAM - 2 Cores * 2 Sockets)
- 2) Monitor Performance: VM Monitor ESXi + SQL Server 2008 (with 4 GB RAM)
- 3) Operator ETL: ESXi Client (with 3 GB RAM)
CASE 1: CSV Input + Lazy Conversion + 4 Copies (x4) + Fast Data Dump + NIO Buffer Size 1.500.000
- Measure the elapsed time for reading and writing 6 million rows from a flat file to a .CSV file.
- Compare performance in the 2 environments.
- Analyze resource usage.
|ETL Tool||Pentaho (Spoon) 4.1|
|Design & Run||Execution log:|
|Elapsed time (s)||9 secs.|
|Rows p/s (avg)||150.000 rows/sec (x4) vs. 10.000 rows/sec (x4) (Test1_V1)|
VERSION 1: Rows per second in the first version of test 1, with an average of 10.000 rows per second.
VERSION 2: In this second version of test 1, the data flow hovers around 150.000 rows/sec, with a total duration of 9 seconds (bearing in mind that it ran with 4 copies).
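As a cross-check of these figures, the sketch below derives the per-copy speedup between the two versions and the throughput implied by the measured 9 seconds. It only uses numbers quoted in this test; the even split across the 4 copies is an assumption, and the variable names are illustrative.

```python
TOTAL_ROWS = 6_000_000
COPIES = 4
V1_ROWS_PER_SEC = 10_000    # average per flow, first version of test 1
V2_ROWS_PER_SEC = 150_000   # average per flow, second version of test 1
V2_ELAPSED_SECONDS = 9      # measured in this second version

# How much faster each flow is in version 2 than in version 1.
speedup = V2_ROWS_PER_SEC / V1_ROWS_PER_SEC

# Throughput implied by the measured elapsed time, in total and per copy
# (assuming the work is spread evenly over the copies).
implied_total = TOTAL_ROWS / V2_ELAPSED_SECONDS
implied_per_copy = implied_total / COPIES

print(f"speedup per flow      : {speedup:.0f}x")
print(f"implied rows/sec total: {implied_total:.0f}")
print(f"implied rows/sec/copy : {implied_per_copy:.0f}")
```

The implied per-copy rate lands somewhat above 150.000 rows/sec, which fits a data flow that "hovers around" that value.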
How to Improve
- Adjust the parameters:
- Use the CSV Input step
- Use Lazy Conversion
- Use Fast Data Dump
- Set the NIO buffer size to 1.5M
- Run 4 copies of the step (x4)
USE OF RESOURCES:
Important: Memory Swap: 0 / Network usage: 0
CPU/Datastore: CPU usage (MHz) / Datastore usage between 21:36 and 21:40
Memory: After several executions, memory consumption remains stable at 2,7 GB
CPU Monitoring: "Passive and Active state" in different executions. Last execution: 21:36 - 21:40
Memory Monitoring: Last execution: 21:36 - 21:40