Here are 10 timesaving tips for Talend Open Studio for Data Integration that I use regularly.
#10 Drag and drop a component on a connector
From the Palette, select a component and drag it over a connection. When the connection is highlighted, release the mouse and the component is inserted into the flow. This works particularly well with tFilterRow and tLogRow, and it avoids having to disconnect and reconnect several links.
#9 Copy and paste
Sometimes you may want to take a second pass over a data file to handle another case. Select a subjob, right-click "Copy", then right-click "Paste"; this produces an identical subjob with the components renamed so that it compiles and runs immediately. You can then tweak the copy to handle the second pass. Use this with caution: too much copy-paste-and-tweak leads to maintenance problems and makes the job harder to change.
#8 Chain multitable inserts
With good business keys, you can take multiple passes over the data to write into normalized tables. Without good keys, or if you want to streamline your Talend job, you can chain DB output components together to write to multiple tables -- for example, two DB writes joined by a tMSSqlLastInsertId component.
For more detail on using tMSSqlLastInsertId, see this post: tMSSqlLastInsertId Returns 0.
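If you think in code rather than components, here is a minimal JDBC sketch of the pattern the chained outputs implement: insert a parent row, capture its generated identity, then use it as the foreign key for a child row. The connection string, tables, and columns are hypothetical.

```java
import java.sql.*;

public class ChainedInsertSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=demo", "etl_user", "changeme")) {
            conn.setAutoCommit(false);

            // 1. Write the parent row and capture its generated identity --
            //    the value tMSSqlLastInsertId surfaces between the two outputs.
            long customerId;
            try (PreparedStatement parent = conn.prepareStatement(
                    "INSERT INTO customer (name) VALUES (?)",
                    Statement.RETURN_GENERATED_KEYS)) {
                parent.setString(1, "Acme Corp");
                parent.executeUpdate();
                try (ResultSet keys = parent.getGeneratedKeys()) {
                    keys.next();
                    customerId = keys.getLong(1);
                }
            }

            // 2. Use that identity as the foreign key for the child row.
            try (PreparedStatement child = conn.prepareStatement(
                    "INSERT INTO address (customer_id, city) VALUES (?, ?)")) {
                child.setLong(1, customerId);
                child.setString(2, "Springfield");
                child.executeUpdate();
            }

            conn.commit();
        }
    }
}
```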
#7 Drag jobs on canvas to create tRunJobs
You can drag a tRunJob component from the Palette to the canvas and then set the child job. A faster way is to drag the job itself from the Repository onto the canvas, which creates the tRunJob already pointed at that job and saves you from looking it up in the Component view.
#6 Manage documentation through Talend
Project documentation and configuration files can be managed through the Talend Repository. This makes them exportable from a single place (Export Items), which is useful for sharing among workspaces; note that documentation is not included by the Export Job function.
#5 Re-use connections
Each RDBMS has a connection component, such as tOracleConnection or tMSSqlConnection. Add one to your job and reference it with the "Use an existing connection" option in other DB components like tMSSqlOutput or tOracleInput. This centralizes the connection configuration, including items like username/password, auto-commit settings, and JDBC properties.
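The shared connection is also reachable from your own code. A small sketch for a tJava component, assuming a connection component with the default name tMSSqlConnection_1; the "conn_" key prefix reflects Talend's usual generated-code convention for registering connections in globalMap.

```java
// Inside a tJava component: retrieve the connection opened by
// tMSSqlConnection_1. The "conn_tMSSqlConnection_1" key is an
// assumption based on Talend's generated-code naming convention.
java.sql.Connection conn =
    (java.sql.Connection) globalMap.get("conn_tMSSqlConnection_1");
System.out.println("Shared connection: " + conn);
```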
#4 Properties files
When you're managing different environments, particularly a production environment, a text-based properties file is a convenient way to configure your jobs. A properties file can be versioned, is easily readable, and can be compared across environments with Linux commands like diff.
There's also a video on using properties files in Talend Open Studio.
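As a sketch, an environment-specific properties file might look like the following (all names are hypothetical). Each environment gets its own copy with the same keys, and the job can read it at startup -- for example with tFileInputProperties feeding tContextLoad -- to populate its context variables.

```properties
# prod.properties -- hypothetical example; dev.properties and
# test.properties would carry the same keys with different values
db_host=prod-db.example.com
db_port=1433
db_user=etl_user
db_password=changeme
input_dir=/data/incoming
```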
#3 Context groups
The standard way to parameterize a set of Talend Open Studio jobs is through context groups. These are sets of context variables grouped by environment (dev, test, prod) that can be switched at export time or in the Run view.
This post describes using Contexts and exporting them in more detail: Applying Contexts to Talend Open Studio Jobs.
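Once a job is exported, the context can also be chosen at launch time. A hedged example, assuming an exported job named MyJob and Talend's standard launcher flags:

```sh
# Run the exported job against the prod context, overriding one
# variable on the command line (the job name is hypothetical).
./MyJob_run.sh --context=prod --context_param db_password=changeme
```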
#2 Use queries
While Talend Open Studio will generate queries based on a table for input components like tOracleInput, you can save your own queries and reference them throughout your jobs. This has two advantages. The first is to allow for queries that span multiple tables and exceed the query-generation capability of Talend Open Studio (think Oracle set-based operations). The second is to produce a more robust job by leaving out irrelevant columns that may be removed later.
For example, if a lookup involves only a name and an id field, there's no need to select other fields that may be dropped before the job goes to production. If a column is dropped and the query doesn't reference it, the job won't break.
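For instance, a saved lookup query might span two tables yet return only the id/name pair the lookup needs (the table and column names here are made up):

```sql
-- Joins two tables but selects only the two columns the lookup
-- uses, so changes to unrelated columns can't break the job.
SELECT c.customer_id, c.customer_name
FROM customer c
JOIN account a ON a.customer_id = c.customer_id
WHERE a.status = 'ACTIVE'
```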
#1 Durable schemas
Schemas should be based on the Repository rather than Built-in wherever possible. In some cases, components like tMSSqlOutput can be adjusted in the Advanced settings tab to ignore columns for a write operation. That way, a complete set of columns can still be referenced in a Repository schema without any contention over auto-generated fields.
This tip also works with #2 to support more robust jobs. If a subset of fields is used repeatedly -- say an id/name pair -- define it as a Generic Schema (or other repository schema) and store it in the Repository. That way, the field list never falls out of sync with the database (as long as the lookup fields themselves remain valid).