Writing to a set of related tables may present a timing challenge to Talend Open Studio developers. This post shows how to use a stored procedure to write to two related MS SQL Server tables.
Databases like MS SQL Server and MySQL use special identity columns to provide a unique auto-incremented value for a row. This value is often used as a primary key. Working with these columns presents a special challenge to Talend Open Studio developers because there may not be a business key that supports a lookup on the generated id.
Related Tables
The tables Business and Contact are releated by a foreign key 'businessId' which is the primary key of the Business table. A Contact record can't be inserted without a valid foreign key. 'businessName' is not unique, so a lookup cannot be established after-the-fact.
Get Last ID
Talend provides components for retrieving the last auto-generated id: tMSSqlLastInsertId, tMysqlLastInsertId. These components are used in the middle of a sequence of output components. For example, the following job shows a tMSSqlLastInsertId in between tMSSqlOutput components "Business" and "Contact". Business writes to the Business table. Contact writes to the Contact table. tMSSqlLastInsertId adds a special 'last_insert_id' field in the processing flow which is used to carry the generated businessId into the INSERT statement for Contact.
Pass-Through Schemas
When using tMSSqlOutput components, you'll generally use a schema from a repository which Talend Open Studio can keep in sync with the database. This improves maintenance. In the above job however, the schema used for Business needs to be expanded in order to carry the Contact fields over to the Contact component.
Because the fields like Primary_Contact_FirstName aren't part of the Business table insert, they are excluded using the Advanced settings tab / Field options. Note the unchecked "Use batch" which is required for the last_insert_id to be set.
With a Stored Procedure
The above example works, but the approach can become more difficult to maintain if there is another layer of dependency, say a load that will write to a dependent of Contact the requires the auto-generated contactId. Stored procedures provide a way to encapsulate the call. A single Talend component is used -- tMSSqlSP -- that is based on a composite schema of all the fields for all the tables, as was used in the Pass-through schema for Business in the previous example.
Such a job looks like the following with a single tMSSqlSP replacing the Business/last_insert_id/Contact processing chain.
A composite schema containing all the fields for all the table inserts is used. This set of fields matches the number of parameters for the stored procedure.
And the corresponding configuration uses the following parameters. Note the green double-plus button will take the schema and create a full set of parameters, one for each field in the schema.
Stored Procedure Source
Like the original job, the stored procedure writes to both the Business and the Contact tables. For the Contact write, a SELECT @@IDENTITY is used to set a variable. This is the same SELECT executed by a tMSSqlLastInsertId component.
CREATE PROC dbo.InsertBusiness
@businessName AS VARCHAR(50),
@streetAddr1 AS VARCHAR(50),
@streetAddr2 AS VARCHAR(50),
@city AS VARCHAR(50),
@usState AS VARCHAR(2),
@zip AS VARCHAR(10),
@phoneNumber AS VARCHAR(25),
@faxNumber AS VARCHAR(25),
@websiteURL AS VARCHAR(255),
@firstName AS VARCHAR(50),
@lastName AS VARCHAR(50),
@contactPhoneNumber AS VARCHAR(25),
@email AS VARCHAR(100)
AS
DECLARE @businessId INT
INSERT INTO Business (businessName, streetAddr1, streetAddr2, city,
state, zip, phoneNumber, faxNumber, websiteURL)
VALUES (
@businessName, @streetAddr1, @streetAddr2, @city, @usState, @zip,
@phoneNumber, @faxNumber, @websiteURL
)
SELECT @businessId = @@IDENTITY
INSERT INTO Contact (businessId, firstName, lastName, phoneNumber, email)
VALUES (
@businessId, @firstName, @lastName, @contactPhoneNumber, @email
)
Talend Open Studio's management of schemas is a big productivity boost. Both examples presented in this post would benefit with a composite schema -- one that combines Business and Contact -- saved in the Repository. The tMSSqlLastInsertId technique can be used reliably, but if there are more related writes, consider encapsulating the data with a stored procedure call. Also, if performance is critical, you can move certain lookups into the database in a stored procedure where the retrieval is much faster. Note that in both cases, the inserts are made row-by-row with the input. This is because batching is turned off in the tMSSqlLastInsertId example, though the commit may be deferred.
Databases like MS SQL Server and MySQL use special identity columns to provide a unique auto-incremented value for a row. This value is often used as a primary key. Working with these columns presents a special challenge to Talend Open Studio developers because there may not be a business key that supports a lookup on the generated id.
Related Tables
The tables Business and Contact are releated by a foreign key 'businessId' which is the primary key of the Business table. A Contact record can't be inserted without a valid foreign key. 'businessName' is not unique, so a lookup cannot be established after-the-fact.
Two Tables Related by an Auto Increment Primary Key |
Talend provides components for retrieving the last auto-generated id: tMSSqlLastInsertId, tMysqlLastInsertId. These components are used in the middle of a sequence of output components. For example, the following job shows a tMSSqlLastInsertId in between tMSSqlOutput components "Business" and "Contact". Business writes to the Business table. Contact writes to the Contact table. tMSSqlLastInsertId adds a special 'last_insert_id' field in the processing flow which is used to carry the generated businessId into the INSERT statement for Contact.
Job Using last_insert_id |
When using tMSSqlOutput components, you'll generally use a schema from a repository which Talend Open Studio can keep in sync with the database. This improves maintenance. In the above job however, the schema used for Business needs to be expanded in order to carry the Contact fields over to the Contact component.
A Pass-Through Schema Carries Fields for Later Processing |
Ignore Pass-Through Fields |
The above example works, but the approach can become more difficult to maintain if there is another layer of dependency, say a load that will write to a dependent of Contact the requires the auto-generated contactId. Stored procedures provide a way to encapsulate the call. A single Talend component is used -- tMSSqlSP -- that is based on a composite schema of all the fields for all the tables, as was used in the Pass-through schema for Business in the previous example.
Such a job looks like the following with a single tMSSqlSP replacing the Business/last_insert_id/Contact processing chain.
Job Using a Stored Procedure for a Multitable Write |
Schema Used in tMSSqlSP Procedure |
Configuration of a tMSSqlSP Component |
Stored Procedure Source
Like the original job, the stored procedure writes to both the Business and the Contact tables. For the Contact write, a SELECT @@IDENTITY is used to set a variable. This is the same SELECT executed by a tMSSqlLastInsertId component.
CREATE PROC dbo.InsertBusiness
@businessName AS VARCHAR(50),
@streetAddr1 AS VARCHAR(50),
@streetAddr2 AS VARCHAR(50),
@city AS VARCHAR(50),
@usState AS VARCHAR(2),
@zip AS VARCHAR(10),
@phoneNumber AS VARCHAR(25),
@faxNumber AS VARCHAR(25),
@websiteURL AS VARCHAR(255),
@firstName AS VARCHAR(50),
@lastName AS VARCHAR(50),
@contactPhoneNumber AS VARCHAR(25),
@email AS VARCHAR(100)
AS
DECLARE @businessId INT
INSERT INTO Business (businessName, streetAddr1, streetAddr2, city,
state, zip, phoneNumber, faxNumber, websiteURL)
VALUES (
@businessName, @streetAddr1, @streetAddr2, @city, @usState, @zip,
@phoneNumber, @faxNumber, @websiteURL
)
SELECT @businessId = @@IDENTITY
INSERT INTO Contact (businessId, firstName, lastName, phoneNumber, email)
VALUES (
@businessId, @firstName, @lastName, @contactPhoneNumber, @email
)
Talend Open Studio's management of schemas is a big productivity boost. Both examples presented in this post would benefit with a composite schema -- one that combines Business and Contact -- saved in the Repository. The tMSSqlLastInsertId technique can be used reliably, but if there are more related writes, consider encapsulating the data with a stored procedure call. Also, if performance is critical, you can move certain lookups into the database in a stored procedure where the retrieval is much faster. Note that in both cases, the inserts are made row-by-row with the input. This is because batching is turned off in the tMSSqlLastInsertId example, though the commit may be deferred.
Comments
Post a Comment