A reader asked how to extract the bond type "CD" from the following input string: "CD Corporation du 20/12/2010 4.5% à 26 semaines". Although it's easy to grab the first two characters in a tMap using a substring function, there is an off-the-shelf component tExtractRegexFields that can handle varying lengths.
This job uses a tFixedFlowInput to provide data. The tFixedFlowInput is run into a tExtractRegexFields which breaks the input into two strings: investment type and remainder. The tExtractRegexFields is connected to a tMap which filters the columns. The result is output to tLogRow.
The test data consists of four records representing four French investment types: CD, OAT, BTF, BTAN. Types vary in character length: 2, 3, 4. While a simple substring() in a tMap is a quick way to pluck the first two characters off of a string, that solution won't work for the varying characters. Regular expressions can be used to handle the variety. A regular expression keys off of the repeating structure of the input data rather than the fixed positions.
Regular expression syntax can be intimidating, and is best learned by breaking down examples. Fortunately, nearly every toolkit from Javascript to Perl has an easy way to work with regular expression. In the Regex supporting this example, the investment type (CD, OAT, etc) is identified as a sequence of word characters (A-Z, a-z, 0-9). The regular expression will match the characters until it encounters one not in the set (in this case, the space character). The following ".*" will grab everything at the end.
The parenthesis characters denote the column match groups. Everything in the first set of parens -- the investment type w+ -- is the first group. The remainder is the second group. Here is the outbound schema used in the tExtractRegexFields component.
The tMap isn't essential. I'm using it to filter the remainder column.
Finally, here is the output generated when the job is run. tLogRow provides the writing.
Regular expressions are enormously power. Thankfully, they're available in most languages. The best way to learn the regular expressions is to break down examples using an interpreter like Perl or a Javascript program. Although many regular expression implementations will be the same in different environments, there is sometimes a syntactic different between a Perl and a Java regular expression. The most notable is the need to escape the important backslash ('\') character with a second backslash to form valid Java strings.
This job uses a tFixedFlowInput to provide data. The tFixedFlowInput is run into a tExtractRegexFields which breaks the input into two strings: investment type and remainder. The tExtractRegexFields is connected to a tMap which filters the columns. The result is output to tLogRow.
Input Data for Regex-parsing Job |
tExtractRegexFields |
The parenthesis characters denote the column match groups. Everything in the first set of parens -- the investment type w+ -- is the first group. The remainder is the second group. Here is the outbound schema used in the tExtractRegexFields component.
Schema Used by tExtractRegexFields |
A tMap Filtering a Column |
Input String Parsed to Define an Investment Type |
Do you know if there's a way to utilize tExtractRegexFields that won't throw out a whole record if the input doesn't match the regex? For example, I have the component used for separating city/state/zip, but occasionally that input field will be blank. If you still work with Talend and have any insight on preserving rows with non-matching input, I'd love to hear it.
ReplyDeleteI surrounded the expression with a ? operator, and that mostly fixed the problem. Unchecking the die-on-error box did not change how records were rejected.
Delete