Skip to main content

The Regex Lookup Table with Talend Open Studio

 In Talend Open Studio, most matching and joining will be done using well-defined keys like integers (surrogate keys) or business keys (username) using equality.  Sometimes, it's useful to match on regular expressions.  Instead of embedding regular expressions in your tMaps, consider using a more general solution based on java.util.HashMap.

When mapping in Talend Open Studio, you may want to use regular expressions to match a lookup table.  You can apply a regular expression in a tMap, but often there is a need to apply several expressions.  This can lead to a cumbersome structure that isn't easy to reuse in a Job.

For example, the following code shows an ever-increasing tangle of logic that becomes embedded in a component like tMap.

// accruing more and more logic / subject to change
if( input_row.SYSTEM_FUND_SOURCE_NM.matches("Fred.*") ) {
   output_row.VALUE = "LOCAL_GOVERNMENT";
} if( input_row.SYSTEM_FUND_SOURCE_NM.matches("Washington") ) {
   output_row.VALUE = "LOCAL_GOVERNMENT";
} else if (input_row.SYSTEM_FUND_SOURCE_NM.matches("[Ff][Ee][Dd].*") ) {
   output_row.VALUE = "FEDERAL_GOVERNMENT";
}
// ... etc.


It's easier to maintain this type of mapping in a database table or spreadsheet like this.

Lookup Table of Regular Expression Mappings

The typical tMap or tJoin component works well if you want to use a single value with the equality operator.  However, there's no way to compare a regular expression.  You can't switch the "=" out for a "matches()".

Using custom Java code, you can set up a Java data structure called a Map (implemented as a HashMap).  The key of the Map will be a regular expression and the value will be the looked-up value.  The Map will be loaded with the contents of the database table or spreadsheet.  As the source data is read in, more custom Java code will be used to examine the Map and find the appropriate value.

Regex Lookup Job
Regex Lookup Job contains three subjobs.  The first subjob has a single tJava component that creates the Java Map and imports Java libraries.  The second subjob reads an Excel spreadsheet into the Map using a tFileInputExcel and tJavaRow component.  Lastly, a third subjob reads the data -- from an Excel spreadhseet -- and merges it with the regex-powered lookups with another tJavaRow; the data is then printed with tLogRow.

Subjob 1 - Initialize

The following code is added to the Basic and Advanced Settings tab of a tJava component.  The first block creates a Java data structure called a Map using Java generics.  The Advanced Settings include the required imports.

// Basic settings
Map<String, String> regexMap =
new HashMap<String, String>();

globalMap.put("regexMap", regexMap);

// Advanced settings
import java.util.Map;
import java.util.HashMap;


Subjob 2 Load Lookup Table

Subjob 2 uses a tFileInputExcel and a tJavaRow to read the Excel spreadsheet into the Java Map.  The tFileInputExcel is the lookup spreadsheet shown at the start of this post and its schema is identical to what is shown.

The tJavaRow component will put a FUND_SOURCE_NM in the Map for each of the regular expressions.  The regular expression will serve as a key.


Map<String, String> regexMap =
(Map<String, String>)globalMap.get("regexMap");
 if( row1.REGEXP != null ) {
  regexMap.put( row1.REGEXP, row1.FUND_SOURCE_NM );
}


Subjob 3 Load the Data

The last subjob will take the data and look up the data using the Map created in the first two subjobs.  The subjob starts with a tFileInputExcel, which is the source data.  That feeds into a tJavaRow which will map all of its fields to the output using the Generate Code feature.  An additional field is defined that will contain the value pulled from the Map.

This is the schema behind the tJavaRow component in the third subjob.

tJavaRow Schema with Extra Value Column
 The code behind the tJavaRow will use the generated mapping of input to output combined with the extra column which will hold the looked up value, the result of matching a regular expression.  This functionality will select the first match; this could be adjusted for a more complicated or variable algorithm.

output_row.SYSTEM_FUND_SOURCE_NM = input_row.SYSTEM_FUND_SOURCE_NM;
output_row.FUND_NM = input_row.FUND_NM;
output_row.RESTRICTED = input_row.RESTRICTED;
output_row.REVENUES = input_row.REVENUES;

output_row.VALUE = "";

Map<String, String> regexMap = (Map<String, String>)globalMap.get("regexMap");
for( String key : regexMap.keySet() ) {
 if( input_row.SYSTEM_FUND_SOURCE_NM != null ) {
  if( input_row.SYSTEM_FUND_SOURCE_NM.matches(key) ) {
   output_row.VALUE = regexMap.get(key);
   break;
  }
 }
}


Here's the output from running the job.

Run of Regex Lookup Table Job
If you have a simple String to match in Talend Open Studio, embed a regular expression in components like tMap.  However, as the matching becomes more complicated -- particularly if a long list of matches starts to form -- consider maintaining a table of expressions and lookup values.  This example uses a few simple subjobs to set up a tables that can be maintained separate from the processing.

Comments

Popular posts from this blog

ODI KM Adding Order by Option

You can add Order by statement to queries by editing KM.I have edited IKM SQL Control Append to provide Order by.  1) Add an option to KM named USE_ORDER_BY, its type is Checkbox and default value is False. This option determines you want an order by statement at your query. 2)Add second option to KM named ORDER_BY, type is Text. You will get order by values to your query by this option. 3) Editing Insert New Rows detail of KM. Adding below three line code after having clause. That's it! <% if (odiRef.getOption("USE_ORDER_ BY").equals("1")) { %> ORDER BY <%=odiRef.getOption("ORDER_BY" )%> <%} %>  If USE_ORDER_BY option is not used, empty value of ORDER_BY option get error. And executions of KM appears as such below; At this execution, I checked the KM to not get errors if ORDER_BY option value is null. There is no prove of ORDER BY I'm glad.  Second execution to get  Ord

Creating Yellow Interface in ODI

Hello everyone! In Oracle data integrator (ODI), an  interface  is an object which populates one datastore, called the  target , with data coming from one or more other datastores, known as  sources . The fields of the source datastore are linked to those in the target datastore using the concept of  Mapping . Temporary interfaces used in ODI are popularly known as  Yellow Interfaces . It is because ODI generates a yellow icon at the time of creation of a yellow interface as opposed to the blue icon of a regular interface. The advantage of using a yellow interface is to avoid the creation of  Models each time you need to use it in an interface. Since they are temporary, they are not a part of the data model and hence don’t need to be in the Model. So let’s begin and start creating our yellow interface! Pre-requisites : Oracle 10g Express Edition with *SQL Plus, Oracle Data Integrator 11g. Open *SQL Plus and create a new table  Sales  in Oracle. You can use any existing ta

Running Count in Talend Open Studio

Most Talend components keep a count of the records processed using variables like NB_LINE or NB_LINE_OK.  But these are only available after all processing is completed.  Define your own counter variable to keep a running count for use in a tMap. Variables like tFilterRow.NB_LINE or tAccessOutput.NB_LINE_INSERTED can be used to report the number of affected lines after a subjob's processing.  However, it may be of use to get the current line index for use in a tMap.  The index variables used to form NB_LINE aren't available during processing; they're only written out the globalMap at the end of processing. In this example, staging records are loaded from Excel to Access.  The order in which the Excel records are read is preserved in a database column called DISPLAY_SEQ_NB.  Note that there is an auto-increment column used for record ID in the Access table.  This could be used to infer a loading order, but this job uses a separate column to keep the ID as a meaningless surr