Skip to main content

Posts

Showing posts from August, 2014

Parsing a String using Talend Open Studio's tExtractRegexFields Component

A reader asked how to extract the bond type "CD" from the following input string: "CD Corporation du 20/12/2010 4.5% à 26 semaines".  Although it's easy to grab the first two characters in a tMap using a substring function, there is an off-the-shelf component tExtractRegexFields that can handle varying lengths. This job uses a tFixedFlowInput to provide data.  The tFixedFlowInput is run into a tExtractRegexFields which breaks the input into two strings: investment type and remainder.  The tExtractRegexFields is connected to a tMap which filters the columns.  The result is output to tLogRow. Input Data for Regex-parsing Job The test data consists of four records representing four French investment types: CD, OAT, BTF, BTAN.  Types vary in character length: 2, 3, 4.  While a simple substring() in a tMap is a quick way to pluck the first two characters off of a string, that solution won't work for the varying characters.  Regular expressions can be used t

Manipulating a tHashOutput in Talend Open Studio

The Talend Open Studio tHashOutput and tHashInput allow you to save your input in RAM, offering potential performance gains.  The basic usage defines a single tHashOutput which gathers input and a tHashInput which will direct the input to a data flow.  This post describes two expanded configurations. tHashOutput and tHashInput worked with input stored in internal memory and do so in a way consistent with other Talend components.  The Hash components allow you to define flows to retrieve data throughout a map that has been stored by some other part of the job.  In a simple scenario, this is done with a single input/output pair. Multiple Sources This screenshot shows a job that will merge two data sources -- a tRowGenerator and a tFileInputDelimited -- into a single Hash data structure using two tHashOutputs.  The first tHashOutput will be referenced by subsequent tHashOutputs in the "Link with a tHashOutput" control. Configuration of Linked tHashOutput This tHashOutput r

Talend Open Studio Tip: Naming a Subjob

When selecting a subjob in Talend Open Studio, the Basic Settings tab provides a text box that can display a title associated with the subjob. Up to this point, I've been adding a Note above or to the right of a subjob to document the subjob's purpose.  However, you can provide a short description without the additional component using the Basic Settings tab.  Select a subjob (click in the shaded area surround the subjob components) and enter some text. Naming a Subjob

Calling a Stored Procedure from Talend Open Studio

Writing to a set of related tables may present a timing challenge to Talend Open Studio developers.  This post shows how to use a stored procedure to write to two related MS SQL Server tables. Databases like MS SQL Server and MySQL use special identity columns to provide a unique auto-incremented value for a row.  This value is often used as a primary key.  Working with these columns presents a special challenge to Talend Open Studio developers because there may not be a business key that supports a lookup on the generated id. Related Tables The tables Business and Contact are releated by a foreign key 'businessId' which is the primary key of the Business table.  A Contact record can't be inserted without a valid foreign key.  'businessName' is not unique, so a lookup cannot be established after-the-fact. Two Tables Related by an Auto Increment Primary Key Get Last ID Talend provides components for retrieving the last auto-generated id: tMSSqlLastInsertId, tMy

Procedural Update in Talend Open Studio

For optimal performance, have the database do most of the work when updating.  Sometimes, this won't be feasible and you'll need to create loops and parameterize queries in a Talend Open Studio job. SQL Update or Delete To execute a single SQL UPDATE or DELETE statement, use the t*Row components in the Databases family: tOracleRow, tPostgresqlRow, tMysqlRow.  In the Query box in "Basic settings", enter a SQL string joined with variables taken from the globalMap of the job. The t*Row components can be called as standalone subjobs, in an iteration, or as the target of a flow. Procedural Coding When it isn't practical to render a data transformation in SQL, use a procedure to more cleanly express the algorithm.  In Talend Open Studio, this is implemented using components like tLoop and tFlowToIterate.  These component break a data flow up -- say from an Access datasource -- into distinct iterations.  Each iteration can drive additional processing like keeping a c

To Iterate or Flow in Talend Open Studio

When working with RDBMS or Web Services components, Talend Open Studio jobs use Flows.  For systems programming -- file operations, etc. -- jobs use Iterators.  Use a Flow where possible, but keep things as Iterate if most of the components are Iterate connection-based. In Talend Open Studio, looping or repeating actions are formed using either Flows or Iterators.  To bridge the two, there are adapter components: tIterateToFlow, tFlowToIterate. Take the following job as an example Two Forms of Repeating Actions in TOS Iterate Two tFileLists are used to drive the processing (a System.out.println).  The first tFileList uses the Iterate connector fed into a tJava.  For each file found in the tFileList, the tJava is invoked.  No data is sent to the tJava.  The tJava accesses a global variable available after each tFileList iteration. tJava Code Iterate Connector Flow To convert the list of files into a Flow where each file is a row, use the tIterateToFlow adapter.  This requires

Efficient Lookups with Talend Open Studio's Hash Components

If you are using the same data source in several Talend Open Studio subjobs, consider loading the data data into an internal data structure for use throughout a job. A HashMap is a Java data structure maintained in RAM.  Talend Open Studio lets you load manipulate data in HashMaps using the tHashInput and tHashOutput components.  Because the data exists in RAM, lookups are faster, especially if secondary storage is based on spinning hard disks. Without a HashMap This Talend Open Studio job loads data from two spreadsheets, EmployeeHires and EmployeeTerminations, into a target table, EmployeeActions.  The spreadsheet sources contain a data (hireDate and terminationDate) that is used as a key into a table called BusinessDates.  Although the date could simply be carried over into the target table (without the lookup), many data warehouses maintain date information in a separate table.  This is because calendar-related business information is merged with the timestamp. This data includes