Skip to main content

Procedural Update in Talend Open Studio

For optimal performance, have the database do most of the work when updating.  Sometimes, this won't be feasible and you'll need to create loops and parameterize queries in a Talend Open Studio job.


SQL Update or Delete

To execute a single SQL UPDATE or DELETE statement, use the t*Row components in the Databases family: tOracleRow, tPostgresqlRow, tMysqlRow.  In the Query box in "Basic settings", enter a SQL string joined with variables taken from the globalMap of the job.

The t*Row components can be called as standalone subjobs, in an iteration, or as the target of a flow.

Procedural Coding

When it isn't practical to render a data transformation in SQL, use a procedure to more cleanly express the algorithm.  In Talend Open Studio, this is implemented using components like tLoop and tFlowToIterate.  These component break a data flow up -- say from an Access datasource -- into distinct iterations.  Each iteration can drive additional processing like keeping a counter of the records processed in a current iteration.

In this post, a table "UpdateTestSourceTable" is loaded with 5 records.  Each record contains a RECORD_ID which is a business record key that is not unique.  An auto-increment DB_RECORD_ID is used for uniqueness.  SEC_ROW is a column that defines ordering within a RECORD_ID.  Initially, this is empty.  The job in this post will set the value as in the screenshot.

Data Used in Example

This job implements the following procedural algorithm.  RECORD_ID is a business record key with meaning, but is not unique in the table.  DB_RECORD_ID is a surrogate key that guarantees uniqueness.

Get a list of distinct RECORD_IDS from a table
For each RECORD_ID 
    Initialize a COUNTER variable 
    Find a list of records matching RECORD_IDS (Get "DB_RECORD_ID")
    Map the fields
    Increment the COUNTER variable
    Update the table using COUNTER constrained by DB_RECORD_ID

The job uses a tAccessInput "Get Rec Ids" to start the processing.  tAccessInput goes into a tFlowToIterate.  For each iteration, the next five components are called.
Job Loading Records by Iteration

 Lookup Rec is a second tAccessInput that runs a query returning the distinct DB_RECORD_IDs.  It uses row1.record_id as a parameter.  Here is the SQL with the parameter.

SQL Parameterized with a tFlowToIterate's Variable
 The Update component is a tAccessOutput which has "Action on data" set to Update.  The schema is a subset of UpdateTestSourceTable: only db_record_id (key) and seq_row.

For learning about how to set up a counter in a Talend job, read Running Count in Talend Open Studio. "Set Sec Num" is set to 1 at the stat of each RECORD_ID iteration; this sets the variable "counter".  The code behind the tJavaRow increments the counter and maps each input field to the output.

output_row.db_record_id = input_row.db_record_id;
output_row.sec_row = input_row.sec_row;

int counter = ((Integer)globalMap.get("counter"));
globalMap.put("counter", ++counter);


For a database like Oracle RAC, nothing beats a set-based operation like MERGE.  However, it may be too difficult if not impossible to render a particular requirement in SQL.  In these cases, use the procedural facilities (tLoop, etc.) in Talend Open Studio to better interpret the algorithm.  Watch for performance!  If you do too much work in a single iteration, your job might not scale.

Additional Screen Shots

The following screen shots are for (in order) the configuration of "Get Rec Ids", "Lookup Rec", and "Update".

tAccessInput "Get Rec Ids" Config

tAccessInput "Lookup Rec" Config
 See the above screenshot "SQL Parameterized" for the embedded SQL statement in "Lookup Rec".

tAccessOutput "Update" Config
  The following is the tMap used in the job.

tMap Config
Excel Input (not Database) Walkthrough

In most cases, it's best to get the data into an RDBMS as quickly as possible and leave the flat files -- including Excel files -- saved for historical records.  The RDBMS gives the ultimate in flexibility and performance.  However, it is possible to do complex processing using Talend components like tAggregateRow or tFilterRow in place of SQL.

Excel Output

This screenshow shows a job using a tExcelFileOutput.  It was created using 4.2.1.  The older format (excel2007 unchecked) was also tested.  The Component View is showing the configuration for the tExcelFileOutput component.

Job Using a tExcelFileOutput

Comments

Popular posts from this blog

ODI KM Adding Order by Option

You can add Order by statement to queries by editing KM.I have edited IKM SQL Control Append to provide Order by.  1) Add an option to KM named USE_ORDER_BY, its type is Checkbox and default value is False. This option determines you want an order by statement at your query. 2)Add second option to KM named ORDER_BY, type is Text. You will get order by values to your query by this option. 3) Editing Insert New Rows detail of KM. Adding below three line code after having clause. That's it! <% if (odiRef.getOption("USE_ORDER_ BY").equals("1")) { %> ORDER BY <%=odiRef.getOption("ORDER_BY" )%> <%} %>  If USE_ORDER_BY option is not used, empty value of ORDER_BY option get error. And executions of KM appears as such below; At this execution, I checked the KM to not get errors if ORDER_BY option value is null. There is no prove of ORDER BY I'm glad.  Second execution to get  Ord...

Creating Yellow Interface in ODI

Hello everyone! In Oracle data integrator (ODI), an  interface  is an object which populates one datastore, called the  target , with data coming from one or more other datastores, known as  sources . The fields of the source datastore are linked to those in the target datastore using the concept of  Mapping . Temporary interfaces used in ODI are popularly known as  Yellow Interfaces . It is because ODI generates a yellow icon at the time of creation of a yellow interface as opposed to the blue icon of a regular interface. The advantage of using a yellow interface is to avoid the creation of  Models each time you need to use it in an interface. Since they are temporary, they are not a part of the data model and hence don’t need to be in the Model. So let’s begin and start creating our yellow interface! Pre-requisites : Oracle 10g Express Edition with *SQL Plus, Oracle Data Integrator 11g. Open *SQL Plus and create a new table  Sales ...

Synchronous and Asynchronous execution in ODI

In data warehouse designing, an important step is to deciding which step is before/after. Newly added packages and required DW data must be analyzed carefully. Synchronous addings can lengthen ETL duration. Interfaces, procedures without generated scenario cannot be executed in parallel. Only scenario executions can be parallel in ODI. Default scenario execution is synch in ODI. If you want to set a scenario to executed in parallel then you will write “-SYNC_MODE=2″ on command tab or select Synchronous / Asynchronous option Asynchronous in General tab. I have created a package as interfaces executes as; INT_JOBS parallel  INT_REGIONS synch  INT_REGIONS synch  INT_COUNTRIES synch  INT_LOCATIONS parallel  INT_EMPLOYEES parallel (Interfaces are independent.) Selecting beginning and ending times and durations from repository tables as ODI 11g operator is not calculating these values. It is obvious in ODI 10g operator. SELECT    sess_no...