DataStage lab: Parallel Processing & Partition Techniques

Before jumping to designing the job let us understand some basic processing features in DataStage -

Parallel Processing

Let us now see how DataStage Parallel jobs are able to process multiple records simultaneously. Parallelism in DataStage is achieved in two ways, Pipeline parallelism and Partition parallelism

Pipeline Parallelism executes transform, clean and load processes simultaneously. It works like a conveyor belt moving rows from one stage to another. Downstream processes start while the upstream processes are still running. This reduces disk usage for staging area and also reduces the idle time on the processors
Let us take see the below example to better understand the concept.

Let us assume that the input file has only one column i.e.,

Customer_Name

Clark

Sonali

Michael

Sharapova

After the 1st record(Clark) is extracted by the 1st Stage (Sequential File Stage) and moved to the second stage for processing, the 2nd record(Sonali) is immediately even before the 1st record reaches the final stage(Peek Stage). Thereby, by the time the 1st record reaches he peek stage the 3rd record(Michael) would have been extracted in the Sequential file stage.

Partition Parallelism divides the incoming stream of data into subsets that will be processed separately by a separate node/processor. These subsets are called partitions and each partition is processed by the same operation.

Let us understand this in layman terms:
As we know DataStage can be implemented in an SMP or MPP architecture. This provides us with additional processors for performing operations.
To leverage this processing capability, Partition Parallelism was introduced in the Information Server(DataStage).

Let us assume that in our current set up there are 4 processors available for use by DataStage. The details of these processors are to be defined in the DataStage Configuration File(to be dealt with in later topics).
Sample Configuration File

{
	node "node1"
	{
		fastname "newton"
		pools ""
		resource disk "/user/development/datasets" {pools ""}
		resource scratchdisk "/user/development/scratch" {pools ""}
	}
 
	node "node2"
	{
		fastname "newton"
		pools ""
		resource disk "/user/development/datasets" {pools ""}
		resource scratchdisk "/user/development/scratch" {pools ""}
	}
 
	node "node3"
	{
		fastname "newton"
		pools ""
		resource disk "/user/development/datasets" {pools ""}
		resource scratchdisk "/user/development/scratch" {pools ""}
	}
 
	node "node4"
	{
		fastname "newton"
		pools ""
		resource disk "/user/development/datasets" {pools ""}
		resource scratchdisk "/user/development/scratch" {pools ""}
	}
 
}

Using the configuration file, DataStage can identify the 4 available processors and can utilize them to perform operations simultaneously.
For the same example,

Customer_Name

Clark

Sonali

Michael

Sharapova

if we have to add “_Female” to the end of the string for names that starts ‘S’ and “_Male” for names that don’t. We will use the Transformer Stage to perform the operation. By selecting the 4 node configuration file, we will be able to perform the required operation on the names four times faster then before.

The required operation is replicated on each processor i.e “Node” and each Customer name will processed on a separate node simultaneously, thereby greatly increasing the performance of DataStage Jobs.

Partitioning Techniques

DataStage provides the options to Partition the data i.e send specific data to a single node or also send records in round robin fashion to the available nodes. There are various partitioning techniques available on DataStage and they are

Auto: – default option
It chooses the best partitioning method depending on:
The mode of execution of the current stage and the preceding stage.
The number of nodes available in the configuration file.

DB2: – rarely used
Partitions an input dataset in the same way that DB2 would partition it.
For example, if this method is used to partition an input dataset containing information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node.

Entire: – less frequent use
Every node receives the complete set of input data i.e., form the above example, all the records are sent to all four nodes
We mostly use this partitioning method with stages that create lookup tables from their input.

Hash: – frequently used
Once we select this partition we are also required to select the key column. Based on the key column values, data is sent to the available nodes, i.e., records/rows with the same value for the defined key field will go the same processing node
In our example if say that the first character on the name is the key field, then Sonali and Sharapova will go to the same processing node and the other two names can be processed on any other node.
Note: When Hash Partitioning, hashing keys that create a large number of partitions should be selected.
Reason: For example, if you hash partition a dataset based on a zip code field, where a large percentage of records are from one or two zip codes, it can lead to bottlenecks because some nodes are required to process more records than other nodes.

Modulus – frequently used
Performs the same functionality as the Hash partition but key field(s) in modulus partition can only be a numeric field. The Modulus of the numeric field is calculated and partitioning is done based on that value.

Random: – less frequent use
Records are randomly distributed across all processing nodes.
Like round robin, random partitioning can rebalance the partitions of an input data set to guarantee that each processing node receives an approximately equal-sized partition.
The random partitioning has a slightly higher overhead than round robin because of the extra processing required to calculate a random value for each record.

Range: – rarely used
It divides a dataset into approximately equal-sized partitions, each of which contains records with key columns within a specific range. It guarantees that all records with same partitioning key values are assigned to the same partition.
Note: In order to use a Range partitioner, a range map has to be made using the ‘Write range map’ stage.

Round robin: – frequently used
The first record goes to the first processing node, the second to the second processing node, and so on. When DataStage reaches the last processing node in the system, it starts over.
This method is useful for resizing partitions of an input data set that are not equal in size.
The round robin method always creates approximately equal-sized partitions. This method is the one normally used when DataStage initially partitions data.

Same: – frequently used
In this partitioning method, records stay on the same processing node as they were in the previous stage; that is, they are not redistributed. Same is the fastest partitioning method.
This is normally the method DataStage uses when passing data between stages in your job(when using “Auto partition”).
Note – It implements the same Partitioning method that is used in the previous stage.

DataStage lab

Tuesday, 21 May 2013

Parallel Processing & Partition Techniques

Parallel Processing

Partitioning Techniques

No comments:

Post a Comment