Procedural Update in Talend Open Studio

For optimal performance, have the database do most of the work when updating. Sometimes, this won't be feasible and you'll need to create loops and parameterize queries in a Talend Open Studio job.

SQL Update or Delete

To execute a single SQL UPDATE or DELETE statement, use the t*Row components in the Databases family: tOracleRow, tPostgresqlRow, tMysqlRow. In the Query box in "Basic settings", enter a SQL string joined with variables taken from the globalMap of the job.

The t*Row components can be called as standalone subjobs, in an iteration, or as the target of a flow.

Procedural Coding

When it isn't practical to render a data transformation in SQL, use a procedure to more cleanly express the algorithm. In Talend Open Studio, this is implemented using components like tLoop and tFlowToIterate. These component break a data flow up -- say from an Access datasource -- into distinct iterations. Each iteration can drive additional processing like keeping a counter of the records processed in a current iteration.

In this post, a table "UpdateTestSourceTable" is loaded with 5 records. Each record contains a RECORD_ID which is a business record key that is not unique. An auto-increment DB_RECORD_ID is used for uniqueness. SEC_ROW is a column that defines ordering within a RECORD_ID. Initially, this is empty. The job in this post will set the value as in the screenshot.

Data Used in Example

This job implements the following procedural algorithm. RECORD_ID is a business record key with meaning, but is not unique in the table. DB_RECORD_ID is a surrogate key that guarantees uniqueness.

Get a list of distinct RECORD_IDS from a table
For each RECORD_ID
Initialize a COUNTER variable
Find a list of records matching RECORD_IDS (Get "DB_RECORD_ID")
Map the fields
Increment the COUNTER variable
Update the table using COUNTER constrained by DB_RECORD_ID

The job uses a tAccessInput "Get Rec Ids" to start the processing. tAccessInput goes into a tFlowToIterate. For each iteration, the next five components are called.

Job Loading Records by Iteration

Lookup Rec is a second tAccessInput that runs a query returning the distinct DB_RECORD_IDs. It uses row1.record_id as a parameter. Here is the SQL with the parameter.

SQL Parameterized with a tFlowToIterate's Variable

The Update component is a tAccessOutput which has "Action on data" set to Update. The schema is a subset of UpdateTestSourceTable: only db_record_id (key) and seq_row.

For learning about how to set up a counter in a Talend job, read Running Count in Talend Open Studio. "Set Sec Num" is set to 1 at the stat of each RECORD_ID iteration; this sets the variable "counter". The code behind the tJavaRow increments the counter and maps each input field to the output.

output_row.db_record_id = input_row.db_record_id;
output_row.sec_row = input_row.sec_row;

int counter = ((Integer)globalMap.get("counter"));
globalMap.put("counter", ++counter);

For a database like Oracle RAC, nothing beats a set-based operation like MERGE. However, it may be too difficult if not impossible to render a particular requirement in SQL. In these cases, use the procedural facilities (tLoop, etc.) in Talend Open Studio to better interpret the algorithm. Watch for performance! If you do too much work in a single iteration, your job might not scale.

Additional Screen Shots

The following screen shots are for (in order) the configuration of "Get Rec Ids", "Lookup Rec", and "Update".

tAccessInput "Get Rec Ids" Config

tAccessInput "Lookup Rec" Config

See the above screenshot "SQL Parameterized" for the embedded SQL statement in "Lookup Rec".

tAccessOutput "Update" Config

The following is the tMap used in the job.

tMap Config

Excel Input (not Database) Walkthrough

In most cases, it's best to get the data into an RDBMS as quickly as possible and leave the flat files -- including Excel files -- saved for historical records. The RDBMS gives the ultimate in flexibility and performance. However, it is possible to do complex processing using Talend components like tAggregateRow or tFilterRow in place of SQL.

Excel Output

This screenshow shows a job using a tExcelFileOutput. It was created using 4.2.1. The older format (excel2007 unchecked) was also tested. The Component View is showing the configuration for the tExcelFileOutput component.

Job Using a tExcelFileOutput

TechWare

Search This Blog

Procedural Update in Talend Open Studio

For optimal performance, have the database do most of the work when updating. Sometimes, this won't be feasible and you'll need to create loops and parameterize queries in a Talend Open Studio job.

Comments

Post a Comment

Popular posts from this blog

Synchronous and Asynchronous execution in ODI

ODI KM Adding Order by Option

Oracle Data Integrator Tools:OdiSqlUnload, OdiFileCopy ,OdiZip