
Talend Job Design - Performance Optimization Tips



I am going to share a few of the performance tuning tips that I follow while designing Talend Jobs. Let me know your comments on them, and also let me know if there are any other performance optimization methods you follow that have proved helpful.

Here we go...



1. Remove Unnecessary fields/columns ASAP using the tFilterColumns component.









It is very important to remove data that is not required from the Job flow as soon as possible. For example, suppose we have a huge lookup file with more than 20 fields, but we need only two of them (Key, Value) to perform the lookup. If we do not filter the columns before the join, the whole file will be read into memory to perform the lookup, occupying unnecessary space. However, if we filter the fields and keep only the two required columns, the memory occupied by the lookup data is much smaller: in this example, roughly 10 times smaller.
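To make the idea concrete, here is a minimal plain-Java sketch (this is not Talend's generated code; the delimiter and field positions are assumptions) of caching only the two columns needed for a join instead of all 20+ fields per row:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    public class LookupSketch {
        // Build the lookup cache from only the two needed columns; the other
        // 18+ fields of each record are dropped before they occupy memory.
        public static Map<String, String> loadLookup(String path) throws Exception {
            Map<String, String> lookup = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split(";"); // assumed field delimiter
                    lookup.put(f[0], f[1]);       // keep Key and Value only
                }
            }
            return lookup;
        }
    }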




2. Remove Unnecessary data/records ASAP using the tFilterRow component.







Similarly, it is necessary to remove records that are not required from the Job flow as early as possible. Having less data in your Job flow will always allow your Talend Job to perform better.
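For example, tFilterRow's advanced mode accepts a plain Java boolean condition (the status column and the 'ACTIVE' literal below are hypothetical), so only the records you actually need continue downstream:

    // Condition typed into tFilterRow advanced mode: keep active records only
    input_row.status != null && input_row.status.equals("ACTIVE")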




3. Use a Select Query to retrieve data from the database - When retrieving data from a database, it is recommended to write a select query in the t<DB>Input component (e.g. tMysqlInput, tOracleInput, etc.) that selects only the required data. In the select query itself you can specify the fields to fetch and provide a where condition to keep only the required rows. This way, only the required data is fetched into the Job flow rather than unloading the complete table.
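The Query field of a t<DB>Input component is an ordinary Java string. A minimal example (table and column names are hypothetical) that pushes both the column list and the row filter down to the database:

    // Query field of tMysqlInput: only two columns and the matching rows
    // are fetched, instead of unloading the complete table.
    "SELECT customer_id, customer_name FROM customers WHERE status = 'ACTIVE'"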









4. Use Database Bulk components - When loading huge datasets into a database from a Talend Job, it is recommended to use the Bulk components that Talend provides for almost all databases. For more details and a demonstration of performance optimization using Bulk components, click here.
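For example (MySQL): a row-by-row tMysqlOutput can usually be replaced by tMysqlOutputBulk (which writes the data to a flat file) followed by tMysqlBulkExec (which loads that file in a single bulk operation), or by the combined tMysqlOutputBulkExec component; equivalent Bulk components exist for Oracle, PostgreSQL, and most other supported databases.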









5. Store on Disk Option - There can be several possible reasons for low Job performance. The most common include:




Running a Job that contains a number of buffer components such as tSortRow, tFilterRow, tMap, tAggregateRow, or tHashOutput
Running a Job that processes a very large amount of data.




In Jobs that contain buffer components such as tSortRow and tMap, you can change the basic configuration to store temporary data on disk rather than in memory. For example, in tMap, select the Store on disk option so that lookup data is stored at a defined path. This avoids loading the whole dataset into memory, which keeps memory available for operations while the temp data is fetched from disk.









6. Allocate more memory to the Job - If you cannot optimize the Job design any further, you can at least allocate more memory to the Job; this alone will often allow it to perform better.




-Xms specifies the initial heap size of the Job.
-Xmx specifies the maximum size to which the heap can grow (the maximum memory allocated to the Job).
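In Talend Open Studio these values can be set from the Run view (Advanced settings > Use specific JVM arguments); the same flags also appear in the .sh/.bat scripts of an exported Job. The values below are purely illustrative and should be sized to your data and hardware:

    -Xms1024M
    -Xmx4096M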








7. Parallelism - Most of the time we need to run a few Jobs/Subjobs in parallel to maximize performance and reduce overall execution time. However, Talend does not automatically execute Subjobs in parallel. For example, if a Job loads two different tables from two different files and there is no dependency between the two loads, Talend will still not run them in parallel: it executes one of the Subjobs (chosen arbitrarily) and starts the second only when the first has finished. You can achieve parallelization in the following two ways:




Using the tParallelize component of Talend. (only available in Talend Integration Suite)
Running Subjobs in parallel using the Multi thread execution option. This option is also available in Talend Open Studio, but it is disabled by default; you can enable it from the Job view. Visit the article “Parallel Execution Sub Jobs in Talend Open Studio” for more details and a demonstration of parallel execution of Subjobs in Talend Open Studio.




8. Use Talend ELT components when required - ELT components are very handy and help to optimize Job performance when transformations have to be performed on data within a single database. There are a couple of scenarios where ELT components fit well, e.g. performing a join between data in different tables of the same database. The benefit of using ELT components is that the data is not unloaded from the database tables into the Job flow to perform the transformations. Instead, Talend automatically creates Insert/Select statements that run directly on the database server. So if the database tables are properly indexed and the data volume is huge, the ELT method can prove to be a much better option in terms of Job performance. For more details on ELT components click here.
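Conceptually, the statement pushed to the database looks like the sketch below (an illustration only: the real SQL is generated by the ELT components such as tELTMysqlMap, and all table/column names here are hypothetical):

    // Shape of the generated statement; the join runs entirely on the
    // database server, so no rows travel through the Talend Job.
    "INSERT INTO order_summary (order_id, customer_name, amount) "
    + "SELECT o.order_id, c.customer_name, o.amount "
    + "FROM orders o JOIN customers c ON c.customer_id = o.customer_id"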









9. Use the SAX parser over Dom4J whenever possible - When parsing huge XML files, try using the SAX parser via the Generation mode setting in the Advanced settings of the tFileInputXML component. The SAX parser does come with a few downsides: only basic XPath expressions are supported, so you cannot use expressions such as last() or array selection of data with [ ]. But if your requirement can be accomplished using the SAX parser, you should prefer it over Dom4J.
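The same principle can be seen in plain Java (this is only an illustration of SAX's streaming behavior, not Talend's internal code; the order element name is hypothetical): SAX fires a callback per element instead of building the whole document tree in memory, so its footprint stays flat however large the file grows.

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxCountDemo {
        public static void main(String[] args) throws Exception {
            final int[] count = {0}; // element counter shared with the handler
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new File(args[0]), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    if ("order".equals(qName)) { // hypothetical element name
                        count[0]++;
                    }
                }
                @Override
                public void endDocument() {
                    System.out.println("orders seen: " + count[0]);
                }
            });
        }
    }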




Visit the article “Handling Huge XML files in Talend” for a demonstration of the performance gain from the SAX parser.
Visit the article “Difference between Dom4J and SAX parser in Talend” for a detailed comparison of Dom4J and the SAX parser.




10. Index Database Table columns - When updating data in a table through a Talend Job, it is recommended to create an index on the database table columns that correspond to the fields defined as the Key in the Talend database output component. Having an index on the key allows the Job to run much faster than with non-indexed keys.
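For example (hypothetical names): if customer_id is defined as the Key in tMysqlOutput, creating an index such as CREATE INDEX idx_customers_customer_id ON customers (customer_id); lets each generated UPDATE ... WHERE customer_id = ? locate its target row without a full table scan.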














11. Split a complex Talend Job into smaller Subjobs - Whenever possible, split a complex Talend Job into smaller Subjobs. Talend uses pipeline parallelism, i.e. after processing a few records a component passes them on to downstream components even before it has finished processing all records. Hence, if we design a Job with a large number of operations in a single Subjob, the performance of the Job will suffer. It is advisable to break the complex Talend Job into smaller Subjobs and then control the flow of the Job using Triggers in Talend.


Thanks, guys, for reading this post. I am looking forward to your expert comments.
