Skip to main content

Slowly Changing Dimension (SCD) with ODI

SCD Types
1) Overwrite old data with new
2) Adding a new row with surrogate key
3) Adding columns represents first, last and previous data


Other: If data is big, you can take monthly snapshots to save difference. Data is smaller and you want to track multiple columns, then you can take daily snapshots.


SCD type 2 is mostly prefered as keeps old versions. So let's roll on ODI for SCD type 2.

Our example will be sh.channels table comes default on Oracle db;

CREATE TABLE CHANNELS
(
  CHANNEL_ID        NUMBER                      NOT NULL,
  CHANNEL_DESC      VARCHAR2(20 BYTE)           NOT NULL,
  CHANNEL_CLASS     VARCHAR2(20 BYTE)           NOT NULL,
  CHANNEL_CLASS_ID  NUMBER                      NOT NULL,
  CHANNEL_TOTAL     VARCHAR2(13 BYTE)           NOT NULL,
  CHANNEL_TOTAL_ID  NUMBER                      NOT NULL
);


INSERT INTO SH.CHANNELS ( CHANNEL_ID, CHANNEL_DESC, CHANNEL_CLASS, CHANNEL_CLASS_ID, CHANNEL_TOTAL,
CHANNEL_TOTAL_ID ) VALUES (
3, 'Direct Sales', 'Direct', 12, 'Channel total', 1);
INSERT INTO SH.CHANNELS ( CHANNEL_ID, CHANNEL_DESC, CHANNEL_CLASS, CHANNEL_CLASS_ID, CHANNEL_TOTAL,
CHANNEL_TOTAL_ID ) VALUES (
9, 'Tele Sales', 'Direct', 12, 'Channel total', 1);
INSERT INTO SH.CHANNELS ( CHANNEL_ID, CHANNEL_DESC, CHANNEL_CLASS, CHANNEL_CLASS_ID, CHANNEL_TOTAL,
CHANNEL_TOTAL_ID ) VALUES (
5, 'Catalog', 'Indirect', 13, 'Channel total', 1);
INSERT INTO SH.CHANNELS ( CHANNEL_ID, CHANNEL_DESC, CHANNEL_CLASS, CHANNEL_CLASS_ID, CHANNEL_TOTAL,
CHANNEL_TOTAL_ID ) VALUES (
4, 'Internet', 'Indirect', 13, 'Channel total', 1);
INSERT INTO SH.CHANNELS ( CHANNEL_ID, CHANNEL_DESC, CHANNEL_CLASS, CHANNEL_CLASS_ID, CHANNEL_TOTAL,
CHANNEL_TOTAL_ID ) VALUES (
2, 'Partners', 'Others', 14, 'Channel total', 1);
COMMIT;


Channels_hist table is created for SCD;
CREATE TABLE CHANNELS_HIST
(
  CHANNEL_ID        NUMBER                      NOT NULL,
  CHANNEL_DESC      VARCHAR2(20 BYTE)           NOT NULL,
  CHANNEL_CLASS     VARCHAR2(20 BYTE)           NOT NULL,
  CHANNEL_CLASS_ID  NUMBER                      NOT NULL,
  CHANNEL_TOTAL     VARCHAR2(13 BYTE)           NOT NULL,
  CHANNEL_TOTAL_ID  NUMBER                      NOT NULL,
  ETL_START_DATE DATE,
  ETL_END_DATE DATE,
  CURRENT_FLAG NUMBER
);

After table is created, switch to ODI and open model tab, reverse channels_hist table, and select OLAP type: Slowly changing dimension option checked.


On the side window, scrool to Channel_hist model, expand model columns and double-click to channel_id, go to description tab and select Slowly Changing Behavior:Natural Key option.


Slowly Changing Behavior: Add row on change option is selected for CHANNEL_DESC, CHANNEL_CLASS, CHANNEL_CLASS_ID, CHANNEL_TOTAL, CHANNEL_TOTAL_ID columns.
Following Fields and Selected Slowly Changing Behaviors:
ETL_START_DATE - Starting timestamp,
ETL_END_DATE - Ending timestamp,
CURRENT_FLAG - Current record flag


IKM Slowly Changing Dimension is selected on flow tab of interface INT_CHANNELS_HIST.



When interface is executed, etl_end_date field gets 01/01/2400 as hard-coded in KM even I mapped it to to_date('29990101', 'YYYYMMDD').




First version of Channels_hist table is right below:
select * from CHANNELS_HIST

CHANNEL_ID CHANNEL_DESC CHANNEL_CLASS CHANNEL_CLASS_ID CHANNEL_TOTAL
CHANNEL_TOTAL_ID ETL_START_DATE ETL_END_DATE CURRENT_FLAG

3 Direct Sales Direct 12 Channel total 1 18.10.2011 21:53:17 1
9 Tele Sales Direct 12 Channel total 1 18.10.2011 21:53:17 1
5 Catalog Indirect 13 Channel total 1 18.10.2011 21:53:17 1
4 Internet Indirect 13 Channel total 1 18.10.2011 21:53:17 1
2 Partners Others 14 Channel total 1 18.10.2011 21:53:17 1


And now I will update data:
UPDATE CHANNELS
SET CHANNEL_DESC='Telephone'
where channel_id=9;

UPDATE CHANNELS
SET CHANNEL_DESC='Direct'
where channel_id=3;

commit;

ODI generates two new records holds new data after Re-executing the interface. Old versions flagged as current_flag=0. ETL_END_DATE updated to 18.10.2011 23:32:07. And new data shows us updated versions as below:
CHANNEL_ID CHANNEL_DESC CHANNEL_CLASS CHANNEL_CLASS_ID CHANNEL_TOTAL CHANNEL_TOTAL_ID ETL_START_DATE ETL_END_DATE CURRENT_FLAG
--Old records
3 Direct SalesDirect 12 Channel total 1 18.10.2011 23:11:30 18.10.2011 23:32:07 0
9 Tele Sales Direct 12 Channel total 1 18.10.2011 23:11:30 18.10.2011 23:32:07 0

--New records
3 Direct Direct 12 Channel total 1 18.10.2011 23:32:07 01.01.2400 1
9 Telephone Direct 12 Channel total 1 18.10.2011 23:32:07 01.01.2400 1

Instead of etl_start_date and etl_end_date fields, we can use insert and update dates of source table if existed. But we have created our insert-update dates in SCD system.

Actual, last, current records can be selected as below: 

select * from CHANNELS_HIST
where current_flag=1

Now one more test, add a new record:

insert into sh.channels
values(10, 'Brochure', 'Indirect', 13, 'Channel total', 1);

commit;

Re-execute the interface and select new record with channel_id.

select * from channels_hist
where channel_id=10

CHANNEL_ID CHANNEL_DESC CHANNEL_CLASS CHANNEL_CLASS_ID CHANNEL_TOTAL CHANNEL_TOTAL_ID ETL_START_DATE ETL_END_DATE CURRENT_FLAG

10 Brochure Indirect 13 Channel total 1 18.10.2011 23:45:11 01.01.2400 1


Second more test :D
Delete the new record and see how is the change on SCD.

delete from channels
where channel_id=10;

commit;

Nothing changed!!!
Explanation on IKM Slowly changing dimension definition mentions about this restriction:
- As this Knowledge Module is dedicated to Datawarehouse Projects (where data should not be removed), when using it with a journalized Source Datastore, it is not possible to synchronize deletions. Therefore, data should be manually filtered to exclude any source delete events (add a filter on JRN_FLAG = 'I').

Hope this will be helpfull. See ya! : D

Comments

Popular posts from this blog

ODI KM Adding Order by Option

You can add Order by statement to queries by editing KM.I have edited IKM SQL Control Append to provide Order by.  1) Add an option to KM named USE_ORDER_BY, its type is Checkbox and default value is False. This option determines you want an order by statement at your query. 2)Add second option to KM named ORDER_BY, type is Text. You will get order by values to your query by this option. 3) Editing Insert New Rows detail of KM. Adding below three line code after having clause. That's it! <% if (odiRef.getOption("USE_ORDER_ BY").equals("1")) { %> ORDER BY <%=odiRef.getOption("ORDER_BY" )%> <%} %>  If USE_ORDER_BY option is not used, empty value of ORDER_BY option get error. And executions of KM appears as such below; At this execution, I checked the KM to not get errors if ORDER_BY option value is null. There is no prove of ORDER BY I'm glad.  Second execution to get  Ord

Creating Yellow Interface in ODI

Hello everyone! In Oracle data integrator (ODI), an  interface  is an object which populates one datastore, called the  target , with data coming from one or more other datastores, known as  sources . The fields of the source datastore are linked to those in the target datastore using the concept of  Mapping . Temporary interfaces used in ODI are popularly known as  Yellow Interfaces . It is because ODI generates a yellow icon at the time of creation of a yellow interface as opposed to the blue icon of a regular interface. The advantage of using a yellow interface is to avoid the creation of  Models each time you need to use it in an interface. Since they are temporary, they are not a part of the data model and hence don’t need to be in the Model. So let’s begin and start creating our yellow interface! Pre-requisites : Oracle 10g Express Edition with *SQL Plus, Oracle Data Integrator 11g. Open *SQL Plus and create a new table  Sales  in Oracle. You can use any existing ta

Running Count in Talend Open Studio

Most Talend components keep a count of the records processed using variables like NB_LINE or NB_LINE_OK.  But these are only available after all processing is completed.  Define your own counter variable to keep a running count for use in a tMap. Variables like tFilterRow.NB_LINE or tAccessOutput.NB_LINE_INSERTED can be used to report the number of affected lines after a subjob's processing.  However, it may be of use to get the current line index for use in a tMap.  The index variables used to form NB_LINE aren't available during processing; they're only written out the globalMap at the end of processing. In this example, staging records are loaded from Excel to Access.  The order in which the Excel records are read is preserved in a database column called DISPLAY_SEQ_NB.  Note that there is an auto-increment column used for record ID in the Access table.  This could be used to infer a loading order, but this job uses a separate column to keep the ID as a meaningless surr