DataStage Basic Designer Guide: Stages
The core of the ETL development relies on the capability of our ETL tool to perform modifications to the treated data using its components.
The most of the ETL tools use some components that have specific code inside that allows developer extract, manipulate, conform, clean and load data.
DataStage has a lot of this components named ‘stages’. The are a lot of section inside the palette of stages in DataStage, in my opinion this are the sections and stages more popular for ETL Developers.
I will try to explain how these stages work and add common questions you might be asked in a technical interview.
DB2 Connector / ODBC Connector / Oracle Connector :
- The database connectors allow you to read, write, and look up data in DB2/ODBC/Oracle database tables.
- Import metadata of table and column definitions, lists of databases, tables, views, and database aliases into a job.
- Manage rejected data.
Peek (Development/Debug)
It can have a single input link and any number of output links.
The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets.
‘For what purposes is the peek stage mostly used?’
For debugging purposes since it doesn't need to store data and it works like a target needed for any job.
Sequential File (File)
It’s a file stage that allows you to read data from or write data one or more flat files.
‘What kind of files we can use in a sequential file stage?’
You can open any kind of file that doesnt need any particular decodification like a .csv or .txt file
DataSet (File)
Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds.
‘What’s the difference between a sequential file stage and a dataset stage?’
It is made base on a specific codification way used by IBM, so it is basically used by DataStage only.
Unstructured Data (File)
Unstructured data is information that does not have a predefined data model or does not fit well into relational tables. Unstructured data can be text from books, journals, metadata, audio, video files, the body of word processor documents, web pages, and presentation charts. In this release, the Unstructured Data stage supports only Microsoft Excel files as data sources.
‘How can you manipulate an excel file in DataStage?’
Using the Unstructured Data Stage to determine the excel sheet and range of cells you want to import from an excel file.
Aggregator (Processing)
It classifies data rows from a single input link into groups and computes totals or other aggregate functions for each group.
The summed totals for each group are output from the stage via an output link.
‘What kind of calculation an aggregation stage can performs?’
It can gets from a group of data the following output:
Maximum value, minimum value, range,sum, variance and more
Remove Duplicates (Processing)
The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set.
Copy (Processing)
Copy the data from a source to every output link without modification.
Filter (Processing)
It can have a single input link and a any number of output links and, optionally, a single reject link. It works like a router that sends data thru different outputs that can have (optionally) a filter condition.
Funnel (Processing)
It copies multiple input data sets to a single output data set. This operation is useful for combining separate data sets into a single large data set.
The stage can have any number of input links and a single output link.
‘Whats the difference between a Funnel Stage and Copy Stage?’
A funnel stage works very similar to a filter stage but without filter conditions. So a funnel and copy are inverse stages, since the funnel stage take the data from multiple sources and put that into a output, in the other hand, copy stage takes the data from a source and put that data into multiples outputs.
Join (Processing)
The Join stage is a processing stage. It performs join operations on two or more data sets input to the stage and then outputs the resulting data set.
The Join stage is one of three stages that join tables based on the values of key columns. The other two are: Lookup Stage, Merge Stage
‘What type of join operations can perform a join stage?’
Inner, Left Outer, Right Outer, Full Outer
Lookup (Processing)
It is a stage that allows you to perform lookup operations, that is, from two or more sources where one of them is our single output link and the rest are our reference links, we can perform right outer join or inner join operations and the result of that operation will reach our target out. Lookups can also be used for validation of a row. If there is no corresponding entry in a lookup table to the key’s values, the row is rejected.
‘When to use a lookup stage or a join stage?’
The reference data sets is a concern. If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a Lookup stage might thrash because the reference data sets might not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically does, cause a page fault and an I/O operation. So, if the reference data sets are big enough to cause trouble, use a join
Merge (Processing)
The Merge stage combines a master data set with one or more update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record that are required
Pivot Enterprise (Processing)
The Pivot Enterprise stage is a processing stage that pivots data horizontally and vertically. Horizontal pivoting maps a set of columns in an input row to a single column in multiple output rows. The output data of the horizontal pivot action typically has fewer columns, but more rows than the input data. With vertical pivoting, you can map several sets of input columns to several output columns. Vertical pivoting maps a set of rows in the input data to single or multiple output columns. The array size determines the number of rows in the output data. The output data of the vertical pivot action typically has more columns, but fewer rows than the input data.
Slowly Changing Dimension (Processing)
The SCD stage reads source data on the input link, performs a dimension table ookup on the reference link, and writes data on the output link. The output link can pass data to another SCD stage, to a different type of processing stage, or to a fact table. The dimension update link is a separate output link that carries changes to the dimension. You can perform these steps in a single job or a series of jobs, depending on the number of dimensions in your database and your performance requirements.
‘What type of SCD can you perform using this stage?’
SCD Type 1
Overwrites an attribute in a dimension table.
SCD Type 2
Adds a new row to a dimension table.
Sort (Processing)
Sort, an active stage, sorts a variety of data. It sorts small amounts of data efficiently in memory when there is enough main memory available.
It sorts large amounts of data using temporary disk storage, not virtual memory swap space.
Transformer (Processing)
Transformer stages allow you to create transformations to apply to your data. These transformations can be simple or complex and can be applied to individual columns in your data. Transformations are specified using a set of functions.
Hierarchical Data (Real Time)
Use the Hierarchical Data stage to create powerful hierarchical transformations, parse and compose JSON/XML data, and invoke REST web services with high performance and scalability.
WebSphere MQ Connector (Real Time)
You can use the IBM® WebSphere® MQ connector in your jobs to read messages from and write messages to message queues in IBM WebSphere MQ enterprise applications. You can use the IBM WebSphere MQ connector in any of the following ways:
- As an intermediary that enables applications to communicate by exchanging messages.
- As a path for the transmission of older data to a message queue.
- As a message queue reader for transmission to a non-messaging target.