
Tuesday, 21 May 2013

What is Lookup stage? Uses of Look up stage?

The Lookup stage performs a lookup on a dataset that has been read into memory from any other parallel job stage that can output data.

The main use of the Lookup stage is to map short codes in the input dataset onto expanded information from a lookup table, which is then joined to the incoming data. For example, suppose we receive data containing customer names and addresses, and the data identifies each location with a two- or three-letter code, such as "mel" for Melbourne or "syd" for Sydney, but we want the data to carry the full name instead. We define the code as the key column. In this case the Lookup stage is very useful: it reads each input row, uses the key to find the matching entry in the lookup table, and adds the full name to a new column defined on the output link. In this way the full name is added to each row based on the code it carries.

If a code is not found in the lookup table, the record is rejected (sent down the reject link).

The Lookup stage can also be used to validate rows.
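The behaviour described above can be pictured with a small, hypothetical sketch in plain Python (this is not DataStage code; the codes, column names, and reject handling are made up to mirror the example):

# Illustrative sketch: how a Normal lookup behaves. The reference table is
# held in memory as a dict keyed on the code column; each input row is
# matched on the key, enriched with a new column, or sent to the reject link.
lookup_table = {"mel": "Melbourne", "syd": "Sydney"}      # reference data in memory

input_rows = [
    {"customer": "John", "city_code": "mel"},
    {"customer": "Mary", "city_code": "syd"},
    {"customer": "Ravi", "city_code": "xyz"},             # no match -> reject
]

output_rows, reject_rows = [], []
for row in input_rows:
    full_name = lookup_table.get(row["city_code"])
    if full_name is None:
        reject_rows.append(row)                           # behaves like the reject link
    else:
        output_rows.append({**row, "city_name": full_name})  # new output column

print(output_rows)   # enriched rows
print(reject_rows)   # unmatched rows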



The Lookup stage is a processing stage that performs horizontal combining.

The Lookup stage supports:
N inputs (for a Normal lookup)
2 inputs (for a Sparse lookup)
1 output
and 1 reject link



Up to DataStage version 7 there were only two types of lookup:
a) Normal lookup and b) Sparse lookup

In DataStage version 8, enhancements were added. These are:
c) Range lookup and d) Caseless lookup




Normal lookup: In a Normal lookup, all the reference records are copied into memory and the primary records are cross-verified against those reference records.

Sparse lookup: In a Sparse lookup, each primary record is sent to the reference source and cross-verified against the reference records there.

We use a Sparse lookup when the reference data is too large to hold comfortably in memory and the number of primary records is relatively small compared to the reference data.

Range lookup: A Range lookup performs range checking on selected columns.

For example, if we want to check salary ranges in order to find the grades of employees, we can use a Range lookup.
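To illustrate the idea of a Range lookup, here is a minimal Python sketch (not DataStage code); the salary bands and grades are assumed values chosen to match the grade example:

# Each reference record defines a low/high range; the input salary is matched
# to the band whose range contains it.
grade_bands = [
    {"low": 0,  "high": 50,  "grade": "B"},
    {"low": 51, "high": 999, "grade": "A"},
]

def range_lookup(salary):
    # Return the grade of the first band whose range contains the salary.
    for band in grade_bands:
        if band["low"] <= salary <= band["high"]:
            return band["grade"]
    return None  # unmatched salaries would go to the reject link in DataStage

for salary in (10, 60, 100):
    print(salary, range_lookup(salary))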

How to do Sorting without the Sort stage

You can do it as a normal process first, as follows.

If we want to read the data using a Sequential File stage, design the job as follows:

Sequential File ------------------------ Dataset

To read the data in the Sequential File stage, open the properties of the Sequential File stage and give the file name. You can give the file path by clicking Browse for the file. In the Options section, set "First Line is Column Names" to True (if the first line contains column names). You can leave the rest of the options as they are.

Now go to the Columns tab, click Load, and select the table definition for the file you want to read. (This should be the same file you specified in the properties.)

Now, in the target Dataset stage, give the file name.

Now for the sorting process:

In the target Dataset stage, open the properties and go to Partitioning. Select the partitioning type as Hash. In Available Columns, select the key column to sort on (E_Id, for example). Click Perform Sort, then click OK.

Compile and run. The data will be sorted in the target.
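Under the hood this combines two steps: hash partitioning on the key and sorting each partition on that same key, so each partition is written out in key order. A rough Python sketch of the idea, assuming four nodes and a numeric E_Id key (both made up for illustration):

rows = [{"E_Id": i} for i in (5, 2, 9, 1, 7, 4)]
num_nodes = 4

# 1. Hash partition: rows with the same key value land on the same node.
partitions = {n: [] for n in range(num_nodes)}
for row in rows:
    node = hash(row["E_Id"]) % num_nodes
    partitions[node].append(row)

# 2. Perform sort: each partition is sorted on the key column.
for node in partitions:
    partitions[node].sort(key=lambda r: r["E_Id"])

print(partitions)  # each node's data is now in E_Id order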

Create a DataStage Parallel Job


Now that you understand how data is processed in DataStage, let us create a sample DataStage job.

Step 1:

Create a new job and save it

Step 2:

Understanding the requirement of the job
Requirement: Let's say that the customer wants a DataStage job that reads employee data from the file Emp.txt (which is comma separated), and this data has to be written to another file, Emp_Grade.txt.
Input data for the file → Emp.txt
Emp_Name,Emp_Salary,Emp_Id,Emp_Grade
ABC,10,1,
DEF,30,2,
GHI,20,3,
JKL,60,4,
MNO,100,5,
Transformation: If the employee's salary is greater than 50, the grade is 'A'; otherwise it is 'B'.
Now that we have the requirement, let’s create the job

Step 3:

Adding Stages to the job canvas
Since we are reading data from a text file which is comma separated, we will use the Sequential File stage (details about how to use each stage are elaborated in later lessons).
All the stages available in DataStage are listed in the Palette. The Palette contains stages (icons) that can be dragged onto the DataStage canvas.
To add a stage, click on the stage we need (in our case a Sequential File stage) and then click on the job canvas.
As per our requirement we need to perform a transformation on the data that is extracted, so let's take the Transformer stage and add it to the canvas.
Our output as per the requirement is again a text file, so let's drag another Sequential File stage to the canvas.
Now that we have the stages in place, we have to create LINKS between each of these stages so that data flows from one stage to another. Links can be added to a job in two ways:
  1. By using the link stage in the General tab of the Palette
  2. Right-click on the source stage (Sequential_File_0), hold the mouse button, and drag the link that is generated to the target stage (Transformer_1). – the easier way
We need 2 links in our job:
  • One from the source Sequential File Stage (Sequential_File_0) to the Transformer Stage (Transformer_1) – DSLink3
  • The 2nd from the Transformer Stage to the Target Sequential File Stage (Sequential_File_2) – DSLink4
A good practice is to make sure that the stages are aligned in such a way that data always flows from Left to Right

Step 4:

Editing the Stages to perform Extraction, Transformation & Loading operations
For the stages to read-write/transform data, we have to edit them and let them know what to do with the data.
To edit a stage, we have to double-click on the stage in the job design. Each stage will have a unique properties window.
As per our requirement, we have to first read data from the file Emp.txt. So let us edit the sequential file stage (Sequential_File_0) to make it read the data in the file.
Editing a Sequential File Stage to read data from a CSV file
  1. Double-click on the stage and the following window opens
  2. The window that pops up has all the properties that can be defined for a particular stage
    It has two pages
    1. Stage
    2. Output
    In the Stage page, we will be able to define advanced properties like Combinability Mode, Preserve Partitioning, Node Map Constraint, etc. → These options are used while debugging or during performance tuning, so for now we will leave them as they are.
    The Properties in the Output Page are the ones that are important for reading the required data. We have 5 tabs in the Output page
    1. General: We can specify the description of the stage. This is basically to maintain readability so that future developers will know what this stage is doing
    2. Properties: In this tab we have to specify the path of the file that we intend to read and select the appropriate options that apply to our requirement
    3. Format: In Format tab, we specify the record and field delimiters and other properties like null handling etc.
    4. Columns: Here we define the column names and the corresponding meta data for each column
    5. Advanced: Will have details of the buffering mode and is usually left as it is.
    So, first let us enter the path of the source file under the Property “File“. Also notice that the first line of the input file has column names. So under the options section in Properties tab, we will have the Property “First Line is Column Names” set to True. We will leave the other properties as they are as we have nothing in the requirement that warrants a change in them.
    Now let's go to the "Format" tab. If you notice below, we have two sections – Record level and Field defaults
    Record level – settings that apply to each record in the input data
    Field defaults – settings that apply to each column/field in a record
    For our requirement, the record-level delimiter is the line end (so we leave it as it is) and the field delimiter is a comma (which we also leave as it is). Note that in "Field defaults" we have another property called "Quote", which is set to double. This is used when our input data has double quotes around each field. Since we don't have them in our input data, we select the property and press the Delete key on the keyboard to remove it.
    One additional property that we need to set at the field level is "Null field value" (available under the Field defaults options). This is because in our input file the column Emp_Grade is NULL, which may result in the Sequential File stage dropping the record. To avoid such a scenario, we usually select the "Null field value" property and set it to empty ('') i.e., one open single quote and one close single quote without any space between them.
    Now that the file and the delimiters are set up, let us go to the Columns tab to enter the metadata for reading the input file. Below is a screenshot showing the column names and the data types for each of them, entered as per our requirement.
    Now go back to the "Properties" tab and you will find that the "View Data" button in the right corner of the stage is enabled. Click on the button to see whether the data is being fetched correctly from the file. If it is, then the following should be displayed
    Since the data looks fine, click "Close" on the popup and then click OK on the stage properties to save the changes.
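Outside DataStage, the equivalent of what this Sequential File stage is configured to do looks roughly like the following Python sketch (the file content is the assumed Emp.txt sample shown earlier; empty Emp_Grade values simply come through as empty strings):

import csv
import io

# The assumed contents of Emp.txt (taken from the sample data above).
emp_txt = """Emp_Name,Emp_Salary,Emp_Id,Emp_Grade
ABC,10,1,
DEF,30,2,
GHI,20,3,
JKL,60,4,
MNO,100,5,
"""

reader = csv.DictReader(io.StringIO(emp_txt))  # first line -> column names
for row in reader:
    print(row)  # Emp_Grade comes through as an empty string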
Moving to the Transformer Stage:
Since the scope of this lesson is limited to creating a simple job, I will not delve into the details of the Transformer stage. As per our requirement, the grade of the employee is "A" if the salary is greater than 50 and "B" otherwise.
Here is a screen shot of the transformer stage
The left half is the input section and the right half is the output section. If you observe, the columns that we defined in the Sequential File stage are visible in the left half. This is because of the link that we had between the stages. To have the data flow from input to output, we have to select all the columns and drag and drop them onto the output link, i.e., the DSLink4 box. Here is a screenshot
In the output part of the Transformer stage we have derivations (explained in detail under the lesson "Transformer Stage"). Double-click on the derivation of Emp_Grade and replace "DSLink3.Emp_Grade" with If DSLink3.Emp_Salary > 50 Then 'A' Else 'B'
If there are any errors in the derivation, it will be shown in red; otherwise it will be normal (blue).
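In ordinary code the derivation is just a conditional expression; here is a small Python sketch of the same rule (column and sample values taken from the requirement above):

# Sketch of the derivation "If DSLink3.Emp_Salary > 50 Then 'A' Else 'B'".
def derive_grade(emp_salary):
    return "A" if emp_salary > 50 else "B"

for name, salary in [("ABC", 10), ("JKL", 60), ("MNO", 100)]:
    print(name, salary, derive_grade(salary))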
Editing the Output Sequential File:
In the "Properties" tab, enter the path and name of the output file and change the property "First Line is Column Names" to "True".
In the Format tab, remove the Quote property and click "OK" to save the property changes to the stage.
We are now done with the coding part. So now we have to save our job as "FirstJob" (explained in the lesson "DataStage Job"). Once the job is saved, like in any other programming language, we have to compile the job.
Click on the Compile button on the menu to compile the job. If successful the following message will be displayed

Step 5:

Running the Job – DataStage Director
Open DataStage Director and navigate to the folder in which the job is saved.
Select the job that we just created and click on the Run button, and the following popup will be displayed
In the popup, click on the Run button to start running the DataStage job. When we run the job, the extraction, transformation, and loading activities that we developed in the Designer will be executed, giving us the required output data.
Note that in case of any issue with the job the status in the Director will be “Aborted”. For successful runs it will be “Finished” or “Finished (see log)”.
Our job finished successfully, so here is the output file that was generated

Parallel Processing & Partition Techniques

Before jumping to designing the job let us understand some basic processing features in DataStage -

Parallel Processing

Let us now see how DataStage Parallel jobs are able to process multiple records simultaneously. Parallelism in DataStage is achieved in two ways, Pipeline parallelism and Partition parallelism
Pipeline Parallelism executes transform, clean and load processes simultaneously. It works like a conveyor belt moving rows from one stage to another. Downstream processes start while the upstream processes are still running. This reduces disk usage for staging area and also reduces the idle time on the processors
Let us look at the example below to better understand the concept.
Let us assume that the input file has only one column i.e.,
Customer_Name
Clark
Sonali
Michael
Sharapova
After the 1st record (Clark) is extracted by the 1st stage (Sequential File stage) and moved to the second stage for processing, the 2nd record (Sonali) is extracted immediately, even before the 1st record reaches the final stage (Peek stage). Thereby, by the time the 1st record reaches the Peek stage, the 3rd record (Michael) would already have been extracted by the Sequential File stage.
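A rough way to picture pipeline parallelism in plain Python is with generators, where each stage hands records downstream before the upstream stage has finished (the stage names mirror the example; the transformation applied is just a placeholder):

def sequential_file_stage():
    # Upstream stage: extracts one record at a time.
    for name in ["Clark", "Sonali", "Michael", "Sharapova"]:
        print("extract:", name)
        yield name

def transformer_stage(rows):
    # Middle stage: processes records as soon as they arrive.
    for name in rows:
        print("transform:", name)
        yield name.upper()   # placeholder transformation

def peek_stage(rows):
    # Final stage: consumes records as they flow in.
    for name in rows:
        print("peek:", name)

# Records move one at a time through the whole pipeline (conveyor belt),
# rather than each stage waiting for the previous stage to finish all rows.
peek_stage(transformer_stage(sequential_file_stage()))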
Partition Parallelism divides the incoming stream of data into subsets that will be processed separately by a separate node/processor. These subsets are called partitions and each partition is processed by the same operation.
Let us understand this in layman terms:
As we know DataStage can be implemented in an SMP or MPP architecture. This provides us with additional processors for performing operations.
To leverage this processing capability, partition parallelism was introduced in Information Server (DataStage).
Let us assume that in our current setup there are 4 processors available for use by DataStage. The details of these processors are defined in the DataStage configuration file (to be dealt with in later topics).
Sample Configuration File
{
 node "node1"
 {
  fastname "newton"
  pools ""
  resource disk "/user/development/datasets" {pools ""}
  resource scratchdisk "/user/development/scratch" {pools ""}
 }
 
 node "node2"
 {
  fastname "newton"
  pools ""
  resource disk "/user/development/datasets" {pools ""}
  resource scratchdisk "/user/development/scratch" {pools ""}
 }
 
 node "node3"
 {
  fastname "newton"
  pools ""
  resource disk "/user/development/datasets" {pools ""}
  resource scratchdisk "/user/development/scratch" {pools ""}
 }
 
 node "node4"
 {
  fastname "newton"
  pools ""
  resource disk "/user/development/datasets" {pools ""}
  resource scratchdisk "/user/development/scratch" {pools ""}
 }
 
}
Using the configuration file, DataStage can identify the 4 available processors and can utilize them to perform operations simultaneously.
For the same example,
Customer_Name
Clark
Sonali
Michael
Sharapova
if we have to add "_Female" to the end of the string for names that start with 'S' and "_Male" for names that don't, we will use the Transformer stage to perform the operation. By selecting the 4-node configuration file, we will be able to perform the required operation on the names four times faster than before.
The required operation is replicated on each processor, i.e., "node", and each customer name will be processed on a separate node simultaneously, thereby greatly increasing the performance of DataStage jobs.
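Here is a hypothetical sketch of the same idea in Python, using a process pool to stand in for the four DataStage nodes (the suffix rule is the one from the example; the pool is only an analogy for node-level parallelism):

from multiprocessing import Pool

names = ["Clark", "Sonali", "Michael", "Sharapova"]

def add_suffix(name):
    # The Transformer logic that is replicated on every node.
    return name + ("_Female" if name.startswith("S") else "_Male")

if __name__ == "__main__":
    with Pool(processes=4) as pool:          # 4 worker processes stand in for 4 nodes
        print(pool.map(add_suffix, names))   # each name is processed in parallel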

Partitioning Techniques

DataStage provides options to partition the data, i.e., send specific data to a single node, or send records in round-robin fashion to the available nodes. The various partitioning techniques available in DataStage are:
Auto: – default option
It chooses the best partitioning method depending on:
The mode of execution of the current stage and the preceding stage.
The number of nodes available in the configuration file.
DB2: – rarely used
Partitions an input dataset in the same way that DB2 would partition it.
For example, if this method is used to partition an input dataset containing information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node.
Entire: – less frequent use
Every node receives the complete set of input data, i.e., from the above example, all the records are sent to all four nodes.
We mostly use this partitioning method with stages that create lookup tables from their input.
Hash: – frequently used
Once we select this partition type we are also required to select the key column. Based on the key column values, data is sent to the available nodes, i.e., records/rows with the same value for the defined key field will go to the same processing node.
In our example, if we say that the first character of the name is the key field, then Sonali and Sharapova will go to the same processing node and the other two names can be processed on any other node.
Note: When Hash Partitioning, hashing keys that create a large number of partitions should be selected.
Reason: For example, if you hash partition a dataset based on a zip code field, where a large percentage of records are from one or two zip codes, it can lead to bottlenecks because some nodes are required to process more records than other nodes.
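As a sketch of how this key-based routing behaves (plain Python, with the first character of the name assumed as the key, as in the name example above):

names = ["Clark", "Sonali", "Michael", "Sharapova"]
num_nodes = 4

partitions = {n: [] for n in range(num_nodes)}
for name in names:
    key = name[0]                    # partitioning key: first character of the name
    node = hash(key) % num_nodes     # same key value -> same node
    partitions[node].append(name)

print(partitions)  # "Sonali" and "Sharapova" always land on the same node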
Modulus: – frequently used
Performs the same function as the Hash partition, but the key field(s) in a Modulus partition can only be numeric. The modulus of the numeric field is calculated and partitioning is done based on that value.
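For instance, here is a minimal Python sketch of modulus partitioning, assuming four nodes and a numeric Emp_Id key (values made up):

records = [{"Emp_Id": i} for i in (1, 2, 3, 4, 5, 6, 7, 8)]
num_nodes = 4

partitions = {n: [] for n in range(num_nodes)}
for rec in records:
    partitions[rec["Emp_Id"] % num_nodes].append(rec)  # node = key value mod node count

print(partitions)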
Random: – less frequent use
Records are randomly distributed across all processing nodes.
Like round robin, random partitioning can rebalance the partitions of an input data set to guarantee that each processing node receives an approximately equal-sized partition.
The random partitioning has a slightly higher overhead than round robin because of the extra processing required to calculate a random value for each record.
Range: – rarely used
It divides a dataset into approximately equal-sized partitions, each of which contains records with key columns within a specified range. It guarantees that all records with the same partitioning key values are assigned to the same partition.
Note: In order to use a Range partitioner, a range map has to be made using the ‘Write range map’ stage.
Round robin: – frequently used
The first record goes to the first processing node, the second to the second processing node, and so on. When DataStage reaches the last processing node in the system, it starts over.
This method is useful for resizing partitions of an input data set that are not equal in size.
The round robin method always creates approximately equal-sized partitions. This method is the one normally used when DataStage initially partitions data.
Same: – frequently used
In this partitioning method, records stay on the same processing node as they were in the previous stage; that is, they are not redistributed. Same is the fastest partitioning method.
This is normally the method DataStage uses when passing data between stages in your job (when using "Auto" partitioning).
Note – It implements the same Partitioning method that is used in the previous stage.