Top IBM DataStage Interview Questions and Answers

Last updated on Feb 18 2022
Anudhati Reddy


What is DataStage?

  1. DataStage is an ETL tool and part of the IBM Information Platforms Solutions suite and IBM InfoSphere. It uses a graphical notation to construct data integration solutions and is available in various versions such as the Server Edition, the Enterprise Edition, and the MVS Edition.

What is a conductor node in DataStage?

  1. Every parallel job run has a conductor process, where execution starts; a section leader process for each processing node; a player process for each set of combined operators; and an individual player process for each uncombined operator.

Whenever we want to kill a job, we have to destroy the player processes first, then the section leader processes, and finally the conductor process.

Can you explain the difference between a sequential file, a dataset, and a fileset?

Sequential File:

  1. Extracts data from or loads data to a flat file, with a traditional size limit of 2 GB.
  2. When used as a source, the file is converted from ASCII into the native format at compile time.
  3. Does not support null values.
  4. A sequential file can only be accessed on one node.

Dataset:

  1. It preserves partitioning: data is stored on the nodes, so when you read from a dataset you do not have to repartition the data.
  2. It stores data in binary in the internal format of DataStage, so it takes less time to read from or write to a dataset than to any other source/target.
  3. You cannot view the data without DataStage.
  4. It creates two types of files to store the data:
    • Descriptor file: created in the defined folder/path.
    • Data file: created in the dataset folder mentioned in the configuration file.
  5. A dataset (.ds) file cannot be opened directly; instead, use the Data Set Management utility in the client tools (such as Designer and Manager) or the command-line utility orchadmin (see the commands sketched after this list).
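
For example, here is a hedged sketch of how a dataset might be inspected or removed with orchadmin (exact option names can vary by release, and the dataset name is hypothetical):

$> orchadmin describe mydata.ds   # show the schema and partition layout of the dataset
$> orchadmin dump mydata.ds       # print the records held in the dataset's data files
$> orchadmin rm mydata.ds         # remove the descriptor file together with its data files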

Fileset:

  1. It stores data in a format similar to that of a sequential file. The only advantage of using a fileset over a sequential file is that it preserves the partitioning scheme.
  2. You can view the data, but only in the order defined by the partitioning scheme.
  3. A fileset creates a .fs file, which is stored in ASCII format, so you can open it directly to see the paths of the data files and their schema.

What are the features of DataStage Flow Designer?

  1. DataStage Flow Designer features:
  • IBM DataStage Flow Designer has many features to enhance your job-building experience.
  • We can use the palette to drag and drop connectors and operators onto the designer canvas.
  • We can link nodes by selecting the previous node and dropping the next node, or by drawing the link between the two nodes.
  • We can edit stage properties in the sidebar and make changes to the schema in the Column Properties tab.
  • We can zoom in and out using the mouse, and leverage the mini-map on the lower right of the window to focus on a particular part of the DataStage job.
  • This is very useful when you have a very large job with tens or hundreds of stages.

What is HBase connector?

HBase connector is used to connect to tables stored in the HBase database and perform the following operations:

  • Read data from or write data to HBase database.
  • Read data in parallel mode.
  • Use HBase table as a lookup table in sparse or normal mode.

What is Hive connector?

Hive connector supports modulus partition mode and minimum maximum partition mode during the read operation.

What is File connector?

  1. File connector has been enhanced with the following new capabilities:
  • Native HDFS FileSystem mode is supported.
  • You can import metadata from the ORC files.
  • New data types are supported for reading and writing Parquet-formatted files: Date/Time and Timestamp.

What are the different Tiers available in InfoSphere Information Server?

In InfoSphere Information Server, four tiers are available:

  1. Client Tier
  2. Engine Tier
  3. Services Tier
  4. Metadata Repository Tier

What is Engine tier in Information server?

The engine tier includes the logical group of components (the InfoSphere Information Server engine components, service agents, and so on) and the computer where those components are installed. The engine runs jobs and other tasks for product modules.

What is Services tier in Information server?

  1. The services tier includes the application server, common services, and product services for the suite and product modules, and the computer where those components are installed. The services tier provides common services (such as metadata and logging) and services that are specific to certain product modules. On the services tier, WebSphere® Application Server hosts the services. The services tier also hosts InfoSphere Information Server applications that are web-based.

What are the key elements of DataStage?

  1. DataStage provides the elements that are necessary to build data integration and transformation flows.

These elements include

  • Stages
  • Links
  • Jobs
  • Table definitions
  • Containers
  • Sequence jobs
  • Projects

What are Links in DataStage?

  1. A link is a representation of a data flow that joins the stages in a job. A link connects data sources to processing stages, connects processing stages to each other, and also connects those processing stages to target systems. Links are like pipes through which the data flows from one stage to the next.

What are Jobs in DataStage?

  1. Jobs include the design objects and compiled programmatic elements that can connect to data sources, extract and transform that data, and then load that data into a target system. Jobs are created within a visual paradigm that enables instant understanding of the goal of the job.

What are Containers in DataStage?

  1. Containers are reusable objects that hold user-defined groupings of stages and links. Containers create a level of reuse that allows you to use the same set of logic several times while reducing the maintenance. Containers make it easy to share a workflow, because you can simplify and modularize your job designs by replacing complex areas of the diagram with a single container.

What are Projects in DataStage?

  1. A project is a container that organizes and provides security for objects that are supplied, created, or maintained for data integration, data profiling, quality monitoring, and so on.

What are the types of parallel processing?

  1. InfoSphere DataStage jobs use two types of parallel processing:
  • Data pipelining
  • Data partitioning

What is Data pipelining?

  1. Data pipelining is the process of extracting records from the data source system and moving them through the sequence of processing functions that are defined in the data flow that is defined by the job. Because records are flowing through the pipeline, they can be processed without writing the records to disk.

What are Operators in DataStage?

  1. A single stage might correspond to a single operator, or a number of operators, depending on the properties you have set, and whether you have chosen to partition or collect or sort data on the input link to a stage. At compilation, InfoSphere DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.

What is OSH in DataStage?

  1. OSH is the scripting language used internally by the parallel engine.

What are Players in DataStage?

  1. Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).

What are the two major ways of combining data in an Infosphere DataStage Job? How do you decide which one to use?

  1. The two major ways of combining data in an InfoSphere DataStage job are via a Lookup stage or a Join stage.
  2. Lookup and Join stages perform equivalent operations: combining two or more input data sets based on one or more specified keys. When one unsorted input is very large or sorting is not feasible, Lookup is preferred. When all inputs are of manageable size or are pre-sorted, Join is the preferred solution.
  3. The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory. The Lookup stage requires all but the first input (the primary input) to fit into physical memory.

How do you import and export data into Datastage?

  1. Here is how data is imported and exported in DataStage:
  • The import/export utility consists of these operators:
  • The import operator: imports one or more data files into a single data set.
  • The export operator: exports a data set to one or more data files.

What are the collectors available in collection library?

  1. The collection library contains three collectors:
  • The ordered collector
  • The round robin collector
  • The sort merge collector

What is the round robin collector?

  1. The round robin collector reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the collector starts over. After reaching the final record in any partition, the collector skips that partition.

What is the sort merge collector?

  1. The sort merge collector reads records in an order based on one or more fields of the record. The fields used to define record order are called collecting keys.

What is aggtorec restructure operator and what it does?

The aggtorec restructure operator groups records that have the same key-field values into an output record.

What is field export restructure operator and what it does?

The field_export restructure operator combines the input fields specified in your output schema into a string- or raw-valued field.

What is promotesubrec restructure operator and what it does?

The promotesubrec restructure operator converts input subrecord fields to output top-level fields.

What is splitsubrec restructure operator and what it does?

The splitsubrec restructure operator separates input subrecords into sets of output top-level vector fields.

What is tagbatch restructure operator and what it does?

  1. The tagbatch restructure operator converts tagged fields into output records whose schema supports all the possible fields of the tag cases.

What is tagswitch restructure operator and what it does?

  1. The contents of tagged aggregates are converted to InfoSphere DataStage-compatible records.

How do you print/display the first line of a file?

The easiest way to display the first line of a file is using the [head] command.

$> head -1 file.txt

If you specify [head -2], it will print the first 2 lines of the file.

Another way is by using the [sed] command. [sed] is a very powerful stream editor that can be used for various text-manipulation purposes like this.

$> sed '2,$ d' file.txt

How do you print/display the last line of a file?

The easiest way is to use the [tail] command.

$> tail -1 file.txt

If you want to do it using [sed] command, here is what you should write:

$> sed -n '$ p' test

How to remove the first line / header from a file?

We already know how [sed] can be used to delete a certain line from the output by using the 'd' switch. So, if we want to delete the first line, the command should be:

$> sed '1 d' file.txt

But the issue with the above command is that it just prints all the lines except the first line of the file to standard output. It does not actually change the file in place. So if you want to delete the first line from the file itself, you have two options.

Either you can redirect the output to some other file and then rename it back to the original file, like below:

$> sed '1 d' file.txt > new_file.txt
$> mv new_file.txt file.txt

Or, you can use the inbuilt [sed] switch '-i', which changes the file in place. See below:

$> sed -i '1 d' file.txt

Mention DataStage characteristics.

  • Support for Big Data Hadoop: Access Big Data on a distributed file system, JSON support, and a JDBC integrator
  • Ease of use: Improved speed, flexibility, and efficacy for data integration
  • Deployment: On-premise or cloud, as the need dictates

What is IBM DataStage?

DataStage is an extract, transform, and load tool that is part of the IBM Infosphere suite. It is a tool that is used for working with large data warehouses and data marts for creating and maintaining a data repository.

How is merging done in DataStage?

In DataStage, merging is done when two or more tables are expected to be combined based on their primary key column.

What are data and descriptor files?

These two files serve different purposes in DataStage. A descriptor file contains all the information or description, while a data file is the one that just contains data.

What is a routine in DataStage?

DataStage Manager defines a collection of functions within a routine. There are basically three types of routines in DataStage, namely, job control routine, before/after subroutine, and transform function.

What is job control in DataStage?

This tool is used for controlling a job or executing multiple jobs in a parallel manner. It is deployed using the Job Control Language within the IBM DataStage tool.
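
As a purely illustrative sketch (not taken from the article), a small job control routine written in DataStage BASIC could attach another job, run it, and check its status; the job name used here is hypothetical:

* Attach the job we want to control, treating attach errors as fatal
hJob = DSAttachJob("LoadCustomerDim", DSJ.ERRFATAL)
* Start the job with normal run options
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
* Wait for the job to finish, then read back its final status
ErrCode = DSWaitForJob(hJob)
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If Status = DSJS.RUNFAILED Then
   Call DSLogWarn("LoadCustomerDim failed", "JobControlExample")
End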

How to do DataStage jobs performance tuning?

First, we have to select the right configuration file. Then, we need to select the right partitioning and buffer memory. We have to handle the sorting of data and null values carefully. We should try to use Modify, Copy, or Filter stages instead of the Transformer, and reduce the propagation of unnecessary metadata between stages.

What is the process for removing duplicates in DataStage?

Duplicates in DataStage can be removed using the sort function. While running the sort function, we need to specify the option which allows for duplicates by setting it to false.

Compare massive parallel processing with symmetric multiprocessing.

In massively parallel processing (MPP), many computers are present in the same chassis, while in symmetric multiprocessing (SMP), many processors share the same hardware resources. Massively parallel processing is called 'shared nothing' because nothing is shared between the computers, and it is faster than symmetric multiprocessing.

How can we kill a DataStage job?

To kill a DataStage job, we first need to kill the individual processing ID, which ensures that the DataStage job is killed.

Explain the feature of data type conversion in DataStage.

If we want to do data conversion in DataStage, then we can use the data conversion function. For this to be successfully executed, we need to ensure that the input or the output to and from the operator is the same, and the record schema needs to be compatible with the operator.

What are the various types of lookups in DataStage?

There are different types of lookups in DataStage. These include normal, sparse, range, and caseless lookups.

When do we use a parallel job and a server job?

Using the parallel job or a server job depends on the processing need, functionality, time to implement, and cost. The server job usually runs on a single node, it executes on a DataStage Server Engine and handles small volumes of data. The parallel job runs on multiple nodes; it executes on a DataStage Parallel Engine and handles large volumes of data.

What is Usage Analysis in DataStage?

If we want to check whether a certain job is part of a sequence, we need to right-click the job in DataStage Manager and then choose Usage Analysis.

How to find the number of rows in a sequential file?

For counting the number of rows in a sequential file, we should use the @INROWNUM variable.

What is the difference between a sequential file and a hash file?

The hash file is based on a hash algorithm, and it can be used with a key value. The sequential file, on the other hand, does not have any key-value column. The hash file can be used as a reference for a lookup, while a sequential file cannot be used for a lookup. Due to the presence of the hash key, the hash file is easier to search than a sequential file.

How do we clean a DataStage repository?

For cleaning a DataStage repository, we have to go to DataStage Manager > Job in the menu bar > Clean Up Resources.

If we want to further remove the logs, then we need to go to the respective jobs and clean up the log files.

What is the difference between an Operational Data Store and a Data Warehouse?

An Operational Data Store (ODS) can be considered a staging area for real-time analysis and user processing; thus, it is a temporary repository. The data warehouse, by contrast, is used for long-term data storage needs and holds the complete data of the entire business.

What does NLS mean in DataStage?

NLS means National Language Support. This means we can use the IBM DataStage tool in various languages, including multi-byte character languages (Chinese or Japanese). We can read and write in any language and process it as per the requirement.

What are the components of DataStage?

DataStage has a number of client and server components. Its four main components are:

  • DataStage Designer
  • DataStage Director
  • DataStage Manager
  • DataStage Administrator

What are the jobs available in DataStage?

  • Server job
  • Parallel job
  • Sequencer job
  • Container job

What command line functions are used to import and export the DataStage jobs?

The dsimport.exe command-line function is used to import DataStage components.

The dsexport.exe command-line function is used to export DataStage components.

Did you use the conditional scheduling in your project? What is the use of it?

Yes, I used conditional scheduling once during my project. Conditional scheduling can be created using a Sequencer job.

What is meant by NLS in DataStage?

NLS stands for National Language Support.

Explain integrity stage?

The Integrity stage, also known as the Quality stage, assists in integrating various types of data from different sources.

What is the need of a link partitioner and link collector in DataStage?

In DataStage, the Link Partitioner is used to split data into various parts by certain partitioning methods. The Link Collector is used to collect data from many partitions into a single stream and save it in the target table.

How will you declare a constraint in DataStage?

Constraints are expressions, and you can specify a constraint for each output link from a Transformer stage (an example expression is sketched after the steps below). You can also specify that a particular link is to act as an otherwise link and catch those rows that have failed to satisfy the constraints on all other output links.

  • Select an output link and click the constraints button
  • Double-click the output link’s constraint entry field
  • Choose Constraints from the background or header shortcut menus
  • A dialog box will appear which allows you either to define constraints for any of the transformer output links or to define a link as an otherwise link
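
For instance, a purely illustrative constraint expression for an output link (the link and column names are hypothetical) might be:

IsNotNull(DSLink3.CUSTOMER_ID) And DSLink3.ORDER_AMOUNT > 0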

Explain Types of Fact Tables?

Factless Fact: It contains only foreign keys to the dimension tables.

Additive Fact: Measures can be added across any dimensions.

Semi-Additive: Measures can be added across some dimensions, e.g., percentage, discount.

Non-Additive: Measures cannot be added across any dimensions, e.g., average.

Conformed Fact: Measures that are defined identically across two fact tables, so the facts can be compared across those tables using the same set of dimensions.

What are Macros?

They are built from DataStage functions and do not require arguments. A number of macros are provided in the JOBCONTROL.H file to facilitate getting information about the current job, and about the links and stages belonging to the current job. These can be used in expressions (for example, in Transformer stages), job control routines, filenames and table names, and before/after subroutines.

What is Conformed Dimension?

In DataStage, a conformed dimension is a dimension that has the same meaning to every fact with what it relates. It can exist as a single dimension table that relates to multiple fact tables within the same data warehouse, or as identical dimension tables in separate data marts.

Difference between Validated Ok and Compiled in DataStage

Validating a job is all about running the job in 'Check only' mode.

The checks that are performed are:

  • The connections are established to data sources or the data warehouse
  • The SELECT statements are prepared
  • Intermediate files are opened in Hashed File, UniVerse, or ODBC stages

The compilation is done by:

  • Checking the DataStage engine to determine whether all needed properties are given
  • Upon selecting Validate Job, DataStage checks the validity of all the given properties

What is DataStage?

DataStage is basically a tool for designing, developing, and executing multiple applications that fill tables in a data mart or a data warehouse. It is a program, designed primarily for Windows servers, that extracts data from databases and converts it for data warehouses. Today, it is considered an essential part of the IBM WebSphere Data Integration suite.

Name the command line function that is used to import DS jobs.

To import DS jobs, the dsimport.exe command is used.

Explain the process for populating a source file in DataStage.

You may utilize two techniques for populating a source file in DataStage:

  • The source file can be populated by creating a SQL query in Oracle.
  • The source file can be populated using a row generator extract tool.

How DataStage versions 7.0 and 7.5 are different?

DataStage 7.5 is an advanced version of DataStage 7.0, in which multiple stages were added for smoother and more robust performance, such as the Command stage, Procedure stage, and Report Generation stage.

What is the process for fixing truncated data errors in DataStage tool?

You should use an environment variable to fix truncated data errors in the DataStage tool:

IMPORT_REJECT_STRING_FIELD_OVERRUN
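
As a hypothetical illustration only (the project and job names are made up, and this assumes the variable has been added to the job as an environment-variable parameter), it could be switched on for a single run from the command line:

$> dsjob -run -param $IMPORT_REJECT_STRING_FIELD_OVERRUN=TRUE -jobstatus myproject myjob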

What is the mechanism for writing parallel routines in DataStage?

Parallel routines can be written in DataStage with the help of a C or C++ compiler. You can also use the DS Manager for creating parallel routines in DataStage; they can then be called from the Transformer stage.

Do you know the process for improving DataStage jobs?

To improve the performance of DataStage jobs, first define the baselines. Second, use multiple flows for performance testing. Then start working in increments, evaluate the data skews, and isolate and solve the problems one by one. Next, distribute file systems to remove bottlenecks. RDBMS staging should not be used in the beginning. In the end, you should understand and assess the available tuning knobs.

How are these three terms different – Merge, Join, and Lookup stage?

All three stages differ in memory usage, input requirements, and how they treat the records. The Lookup stage needs more memory compared to the Merge and Join stages.

What is the process of killing a job in DataStage?

To kill a job, you should kill the respective processing ID.

Do you know the format for Date conversion in DataStage?

We can use the date conversion function for this purpose, i.e., Oconv(Iconv(Fieldname, "Existing Date Format"), "Another Date Format").
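
As a concrete but purely illustrative example (the field name and format strings are assumptions), converting a date held as YYYY-MM-DD into DD-MON-YYYY could be written as:

Oconv(Iconv(InDate, "D-YMD[4,2,2]"), "D-DMY[2,A3,4]")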

What is the meaning of APT_CONFIG command in DataStage?

APT_CONFIG is an environment variable used to identify the *.apt configuration file in DataStage. The configuration file stores node information, disk information, scratch disk information, and so on.
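
A minimal, purely illustrative single-node configuration file (the server name and paths are assumptions) generally looks like this:

{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/ibm/ds/data" {pools ""}
    resource scratchdisk "/ibm/ds/scratch" {pools ""}
  }
}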

Can you convert a server job to a parallel job?

With the help of the IPC stage and the Link Collector stage in DataStage, a server job can be converted to a parallel job.

How is an Operational Data Store (ODS) different from a Data Warehouse?

An ODS can be defined as a mini data warehouse that doesn't store information for more than a year. A Data Warehouse, on the other hand, contains detailed information about the entire business.

Do you know how to drop the index before data is loaded to the target DataStage tool?

We can use the Direct Load functionality to drop the index before data is loaded into the target in the DataStage tool.

How does DataStage support slowly changing dimensions?

This functionality is available from DataStage version 8.5 onward.

What is the process for finding bugs in a job sequence?

It can be done through the DataStage director.

What is a project in DataStage?

To launch a DataStage client, you should connect to the DataStage project first. A DataStage project contains DataStage jobs, built-in DataStage components, and user-defined functions.

Name different types of hash files in DataStage.

There are two types of hash files in DataStage: static hash files and dynamic hash files. A static hash file is used when a limited amount of data is to be loaded into the target database. A dynamic hash file is used when we don't know the exact amount of data to be transferred to the target database.

How are DataStage and DataStage TX different?

DataStage is an ETL tool, while DataStage TX is an EAI tool.

How will you define a surrogate key in DataStage?

A surrogate key is used to retrieve data faster. It uses indexes for retrieval operations.

What is the difference between DataStage 7.5 and 7.0?

In DataStage 7.5, many new stages were added for more robustness and smoother performance, such as the Procedure stage, Command stage, and Generate Report stage.

In DataStage, how you can fix the truncated data error?

The truncated data error can be fixed by using the environment variable IMPORT_REJECT_STRING_FIELD_OVERRUN.

Define Merge?

Merge means to join two or more tables. The two tables are joined on the basis of Primary key columns in both the tables.

Define Routines and their types?

Routines are basically collections of functions defined in the DS Manager. They can be called via the Transformer stage. There are three types of routines: parallel routines, mainframe routines, and server routines.

How can you write parallel routines in DataStage PX?

We can write parallel routines with a C or C++ compiler. Such routines are also created in the DS Manager and can be called from the Transformer stage.

Explain quality stage?

Quality stage is also known as Integrity stage. It assists in integrating different types of data from various sources.

Define Job control?

Job control can best be performed by using the Job Control Language (JCL). This tool is used to execute multiple jobs simultaneously, without using any kind of loop.

Differentiate between Symmetric Multiprocessing and Massive Parallel Processing?

In symmetric multiprocessing, the hardware resources are shared by the processors. The processors run one operating system and communicate through shared memory. In massively parallel processing, each processor accesses the hardware resources exclusively. This type of processing is also known as 'shared nothing', since nothing is shared, and it is faster than symmetric multiprocessing.

Differentiate between validated and Compiled in the DataStage?

In DataStage, validating a job means executing it in check mode. While validating, the DataStage engine verifies whether all the required properties are provided. While compiling a job, the DataStage engine verifies whether all the given properties are valid.

How to manage date conversion in DataStage?

We can use the date conversion function for this purpose, i.e., Oconv(Iconv(Fieldname, "Existing Date Format"), "Another Date Format").

Why do we use exception activity in DataStage?

All the stages after the exception activity in DataStage are executed if any unknown error occurs while executing the job sequencer.

Define the OConv() and IConv() functions in DataStage?

In DataStage, the OConv() and IConv() functions are used to convert formats from one format to another, i.e., conversions of Roman numerals, time, date, radix, numeral ASCII, etc. IConv() is basically used to convert formats so the system can understand them, while OConv() is used to convert formats so users can understand them.

How do you find the number of rows in a sequential file?

To find the number of rows in a sequential file, we can use the system variable @INROWNUM.

Differentiate between Hash file and Sequential file?

The difference between a hash file and a sequential file is that the hash file stores data based on a hash algorithm and a hash key value, while a sequential file doesn't have any key value for saving the data. Because of this hash key, searching in a hash file is faster than in a sequential file.

NLS stands for what in DataStage?

NLS means National Language Support. It can be used to incorporate other languages, such as French, German, and Spanish, into the data required for processing by the data warehouse. These languages have the same scripts as the English language.

Can you explain how could anyone drop the index before loading the data in target in DataStage?

In DataStage, we can drop the index before loading the data into the target by using the Direct Load functionality of the SQL Loader utility.

Define Meta Stage

In Datastage, MetaStage is used to save metadata that is helpful for data lineage and data analysis.

Have you have ever worked in UNIX environment and why it is useful in Datastage?

Yes, I have worked in a UNIX environment. This knowledge is useful in DataStage because sometimes one has to write UNIX programs, such as batch programs, to invoke batch processing.

How many types of views are there in a Datastage Director?

There are three types of views in the DataStage Director: Job View, Log View, and Status View.

Why we use surrogate key?

In Datastage, we use Surrogate Key instead of unique key. Surrogate key is mostly used for retrieving data faster. It uses Index to perform the retrieval operation.

Differentiate between ODBC and DRS stage?

DRS stage is faster than the ODBC stage because it uses native databases for connectivity.

Define Orabulk and BCP stages?

Orabulk stage is used to load large amount of data in one target table of Oracle database. The BCP stage is used to load large amount of data in one target table of Microsoft SQL Server.

Define DS Designer?

The DS Designer is used to design the work area and add various links to it.

Why do we use Link Partitioner and Link Collector in DataStage?

In DataStage, the Link Partitioner is used to divide data into different parts through certain partitioning methods. The Link Collector is used to gather data from various partitions/segments into a single stream and save it in the target table.

Highlight the main features of Datastage?

The main features of Datastage are highlighted below:

  • It is the data integration component of the IBM InfoSphere Information Server.
  • It is a GUI-based tool. We just need to drag and drop the DataStage objects, and we can convert them to DataStage code.
  • It is used to perform the ETL operations (Extract, Transform, Load).
  • It provides connectivity to multiple sources and multiple targets at the same time.
  • It provides partitioning and parallel processing techniques that enable DataStage jobs to process huge volumes of data quite fast.
  • It has enterprise-level connectivity.

What are the main differences you have observed between 7.x and 8.x version of DataStage?

Here are the main differences between both the versions

  • 7.x was platform dependent, whereas 8.x is platform independent.
  • 7.x had a 2-tier architecture, with DataStage built on top of a UNIX server. 8.x has a 3-tier architecture, with the UNIX server database at the bottom, the XMETA database acting as the repository, and DataStage on top.
  • 7.x had no concept of parameter sets. In 8.x, parameter sets can be used anywhere in the project.
  • In 7.x, Designer and Manager were two separate clients. In 8.x, the Manager client was merged into the Designer client.
  • In 7.x, we had to manually search for jobs. In 8.x, the quick find option in the repository makes it easy to search for jobs.

What are the different layers in the information server architecture?

Below are the different layers of information server architecture

  • Unified user interface
  • Common services
  • Unified parallel processing
  • Unified Metadata
  • Common connectivity

What could be a data source system?

It could be a database table, a flat file, or even an external application like PeopleSoft.

On which interface you will be working as a developer?

As a Datastage developer, we work on the Datastage client interface which is known as a Datastage designer that needs to be installed on the local system. In the backend, it is connected to the Datastage server.

How do you start developing a DataStage project?

The very first step is to create a Datastage job on the Datastage server. All the Datastage objects that we create are stored in the Datastage project. A Datastage project is a separate environment on the server for jobs, tables, definitions, and routines.


What is a DataStage job?

The Datastage job is simply a DataStage code that we create as a developer. It contains different stages linked together to define data and process flow.

Stages are nothing but the functionalities that get implemented.

For Example: Let’s assume that I want to do a sum of the sales amount. This can be a ‘group by’ operation that will be performed by one stage.

Now, I want to write the result to a target file. So, this operation will be performed by another stage. Once I have defined both stages, I need to define the data flow from my 'group by' stage to the target file stage. This data flow is defined by DataStage links.


What are the steps needed to create a simple basic DataStage job?

Click on File -> Click on New -> Select Parallel Job and hit OK. A parallel job window will open up. In this parallel job, we can put together different stages and define the data flow between them. The simplest DataStage job is an ETL job.

In this, we first need to extract the data from the source system for which we can use either a file stage or database stage because my source system can either be a database table or a file.

Suppose we are reading data from a text file. In this case, we will drag and drop the ‘Sequential File’ stage to the parallel job window. Now, we need to perform some transformation on top of this data. We will use the ‘Transformer’ stage which is available under the Processing category. We can write any logic under the Transformer stage.

Finally, we need to load the processed data to some target table. Let’s say my target database is DB2. So, for this, we will select the DB2 connector stage. Then we will be connecting these data states through sequential links.

After this, we need to configure the stages so that they point to the correct filesystem or database.

For example, for the Sequential File stage, we need to define the mandatory parameters like the file name, file location, and column metadata.

Then we need to compile the Datastage job. Compiling the job checks for the syntax of the job and creates an executable file for the Datastage job that can be executed at run time.

In a batch, if a job fails in between and you want to restart the batch from that particular job and not from scratch, what will you do?

In DataStage, there is an option in the job sequence: 'Add checkpoints so sequence is restartable on failure'. If this option is checked, then we can rerun the job sequence from the point where it failed.

What are routines in Datastage? Enlist various types of routines.

A routine is a set of functions defined in the DS Manager. It is run via the Transformer stage.

There are 3 kinds of routines:

  • Parallel routines
  • Mainframe routines
  • Server routines


How do you remove duplicate values in DataStage?

There are two ways to handle duplicate values:

  • We can use the Remove Duplicates stage to eliminate duplicates.
  • We can use the Sort stage to remove duplicates. The Sort stage has a property called 'Allow Duplicates'; when this property is set to false, we won't get duplicate values in the output of the sort.

What are the different kinds of views available in a DataStage director?

There are 3 kinds of views available in the DataStage Director. They are:

  • Log view
  • Status view
  • Job view

Distinguish between Informatica & DataStage. Which one would you choose and why?

Both Informatica and DataStage are powerful ETL tools.

The following points differentiate the two tools:

  • Parallel processing: Informatica does not support parallel processing. In contrast, DataStage provides a mechanism for parallel processing.
  • Implementing SCDs: It is quite simple to implement SCDs (slowly changing dimensions) in Informatica. However, it is complex to implement SCDs in DataStage, which supports them merely through custom scripts.
  • Version control: Informatica supports version control through check-in and check-out of objects. This functionality is not available in DataStage.
  • Available transformations: Fewer transformations are available in Informatica; DataStage offers a greater variety of transformations.
  • Power of lookup: Informatica provides a very powerful dynamic cache lookup. There is no similar feature in DataStage.

In my personal opinion, I would go with Informatica over Datastage. The reason being I have found Informatica more systematic and user-friendly than DataStage.

Another strong reason is that debugging and error handling are much better in Informatica compared to DataStage, so fixing issues becomes easier in Informatica. DataStage does not provide complete error-handling support.

What is the difference between passive stage and active stage?

Passive stages are utilized for extraction and loading whereas active stages are utilized for transformation.

What are the various kinds of containers available in Datastage?

There are 2 types of containers in DataStage:

  • Local container
  • Shared container

What are the different types of jobs in Datastage?

We have two types of jobs in Datastage:

  • Server jobs (they run in a sequential manner)
  • Parallel jobs (they get executed in a parallel way)

What is the use of Datastage director?

Through Datastage director, we can schedule a job, validate the job, execute the job and monitor the job.

What are the various kinds of the hash file?

We have 2 types of hash files:

  • Static hash file
  • Dynamic hash file

What is a quality stage?

The Quality stage (also called the Integrity stage) is a stage that aids in combining together data coming from different sources.

What are the characteristics of DataStage?

The characteristics of IBM DataStage are as follows:

  • It can be deployed on local servers as well as the cloud as per the need and requirement.
  • It is effortless to use and can increase the speed and flexibility of data integration efficiently.
  • It supports big data and can access it in many ways, such as through a JDBC integrator, JSON support, and distributed file systems.

Describe the DataStage architecture briefly.

IBM DataStage follows a client-server model as its architecture and has different architecture types for its various versions. The components of the client-server architecture are:

 

  1. Client components
  2. Servers
  3. Stages
  4. Table definitions
  5. Containers
  6. Projects
  7. Jobs

How can we run a job using the command line in DataStage?

The command is: dsjob -run -jobstatus <projectname> <jobname>

List a few functions that we can execute using the 'dsjob' command.

The different functions that we can perform using the dsjob command are:

    1. $dsjob -run: It is used to run the DataStage job
    2. $dsjob -stop: It is used to stop the job that is currently running
    3. $dsjob -jobid: It is used for providing the job information
    4. $dsjob -report: It is used for displaying the complete job report
    5. $dsjob -lprojects: It is used for listing all the projects that are present
    6. $dsjob -ljobs: It is used for listing all the jobs that are present in the project
    7. $dsjob -lstages: It is used for listing all the stages of the current job
    8. $dsjob -llinks: It is used for listing all the links
    9. $dsjob -lparams: It is used for listing all the parameters of the job
    10. $dsjob -projectinfo: It is used for retrieving information about the project
    11. $dsjob -jobinfo: It is used for retrieving information about the job
    12. $dsjob -stageinfo: It is used for retrieving information about a stage of the job
    13. $dsjob -linkinfo: It is used for getting information about a link
    14. $dsjob -paraminfo: It provides the information of all the parameters
    15. $dsjob -loginfo: It is used for getting information about the log
    16. $dsjob -log: It is used for adding a text message to the log
    17. $dsjob -logsum: It is used for displaying the log data
    18. $dsjob -logdetail: It is used for displaying all the details of the log
    19. $dsjob -lognewest: It is used for retrieving the id of the newest log
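
A short usage sketch combining a few of these options (the project and job names are hypothetical):

$> dsjob -lprojects                                # list the projects on the server
$> dsjob -ljobs dstage_proj                        # list the jobs in a project
$> dsjob -run -jobstatus dstage_proj load_sales    # run a job and wait for its final status
$> dsjob -logsum dstage_proj load_sales            # summarize the job's log afterwards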

What are the main features of the flow designer?

The main features of the flow designer are:

  1. It is very useful to perform jobs with a large number of stages.
  2. There is no need to migrate the jobs to use the flow designer.
  3. We can use the provided palette to add and remove connectors and operators on the designer canvas using the drag and drop feature.

How to convert a server job to a parallel job in DataStage?

A server job can be converted to a parallel job using the Link Collector and IPC stages.

What is an HBase connector?

An HBase connector in DataStage is a tool used to connect to tables stored in the HBase database. It is majorly used to perform the following tasks:

  1. Reading and writing data from and to HBase
  2. Reading data in parallel mode
  3. Using an HBase table as a lookup table

What is Infosphere in DataStage?

The InfoSphere Information Server is capable of managing the high-volume requirements of companies and delivers high-quality results faster. It provides companies with a single platform for managing data, where they can understand, clean, transform, and deliver enormous amounts of information.

List all the different tiers of InfoSphere Information Server?

The different tiers of the InfoSphere Information Server are:

  1. Client tier
  2. Services tier
  3. Engine tier
  4. Metadata Repository tier

Describe the Services tier of Infosphere Information Server briefly.

The services tier of the Infosphere Information Server is used for providing standard services like metadata and logging and some other module-specific services. It contains an application server, various product modules, and other product services.

Describe the Engine tier of Infosphere Information Server briefly.

The engine tier of the Infosphere Information Server is a set of logical components used to run the jobs and other tasks for the product modules.

What are the types of parallel processing in the DataStage?

There are two different types of parallel processing, which are:

  1. Data Partitioning
  2. Data Pipelining

What is Data Partitioning?

Data partitioning is a type of parallel approach for data processing. It involves the process of breaking down the records into partitions for the processing. It increases the efficiency of processing in a linear model.

What are Players?

Players in DataStage are the workhorse processes. They help us perform the parallel processing and are assigned to the operators on each node.

What is a collection library in the DataStage?

The collection library is a set of operators used to collect the partitioned data.

What are the types of collectors available in the collection library of DataStage?

The types of collectors available in the collection library are:

  1. Sort merge collector
  2. Round robin collector
  3. Ordered collector

State the difference between Server job and parallel job

Server jobs work in a sequential way, while parallel jobs work in a parallel fashion (the Parallel Extender works on the principle of pipelining and partitioning) for I/O processing.

At which location DataStage repository is stored?

DataStage stores its repository in IBM Universe Database.

Differentiate between DataStage and Informatica.

  • DataStage supports parallel processing which Informatica doesn’t.
  • Links are objects in the DataStage, but in Informatica, it’s a port to port connectivity.
  • In Informatica, it is easy to implement Slowly Changing Dimensions which is a little bit complex in DataStage.
  • DataStage doesn’t support complete error handling.

Define the Universe stage

A stage that extracts data from or loads data into a UniVerse database using SQL. It is used to represent a data source, an aggregation step, or a target data table.

What is a dynamic array in DataStage?

Dynamic arrays map the structure of DataStage file records to character string data. Any character string can be a dynamic array. A dynamic array is a character string containing elements that are substrings separated by delimiters.

State staging variable

Stage variables are the temporary variables created in the Transformer stage for calculations.

How can u convert the columns to rows in DataStage?

The conversion of columns into rows can be done in DataStage using the Pivot stage.

What is the merge stage?

The Merge stage combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record.

State a few advantages of DataStage.

A data warehousing strategy has the following advantages:

  • Capitalizes on the potential value of the organization’s information.
  • Improves the quality and accessibility of data.
  • Combines valuable archive data with the latest data in operational sources.
  • Increases the amount of information available to users.
  • Reduces the requirement of users to access operational data.
  • Reduces the strain on IT departments, as they can produce one database to serve all user groups.
  • Allows new reports and studies to be introduced without disrupting operational systems.
  • Promotes users to be self-sufficient.

What do you mean by data repository?

Metadata is data about the data. It contains:

  • query statistics
  • ETL statistics
  • Business subject area
  • Source information
  • Target information
  • Source to target mapping Information

Describe the Oconv() and Iconv() functions in DataStage?

In DataStage, the OConv() and IConv() functions are used to convert formats from one format to another, i.e., conversions of time, Roman numerals, radix, date, numeral ASCII, etc. IConv() is mostly used to change formats so the system can understand them, while OConv() is used to change formats so users can understand them.

Explain BCP stage?

The BCP stage is used to load a large amount of data into one target table of Microsoft SQL Server.

Explain the DataStage parallel Extender or Enterprise Edition (EE)?

  1. Parallel Extender in DataStage is the data extraction and transformation application for parallel processing.

There are two types of parallel processing available:

  1. Pipeline Parallelism
  2. Partition Parallelism

How do you run DataStage job from command line?

Using the "dsjob" command as follows:

dsjob -run -jobstatus projectname jobname

What are the different options associated with “dsjob” command?

For example: $dsjob -run, along with options like:

-stop -To stop the running job

-lprojects – To list the projects

-ljobs – To list the jobs in project

-lstages – To list the stages present in job.

-llinks – To list the links.

-projectinfo – returns the project information (hostname and project name)

-jobinfo – returns the job information (job status, job run time, end time, etc.)

-stageinfo – returns the stage name, stage type, input rows, etc.

-linkinfo – It returns the link information

-lparams – To list the parameters in a job

-paraminfo – returns the parameters info

-log – add a text message to log.

-logsum – To display the log

-logdetail – To display the log with details like event id, time, and message

-lognewest – To display the newest log id.

-report – displays a report containing the generated time, start time, elapsed time, status, etc.

-jobid – Job id information.

What is IBM DataStage Flow Designer?

  1. IBM DataStage Flow Designer is a web-based user interface for DataStage. You can use it to create, edit, load, and run DataStage jobs.

What are the benefits of Flow Designer?

  1. There are many benefits with Flow Designer; they are:
  • No need to migrate jobs: You do not need to migrate jobs to a new location in order to use the new web-based IBM DataStage Flow Designer user interface.
  • No need to upgrade servers and purchase virtualization technology licenses: Getting rid of a thick client means getting rid of keeping up with the latest version of software, upgrading servers, and purchasing Citrix licenses. IBM DataStage Flow Designer saves time AND money!
  • Easily work with your favorite jobs: You can mark your favorite jobs in the Jobs Dashboard and have them automatically show up on the welcome page. This gives you fast, one-click access to jobs that are typically used for reference, saving you navigation time.
  • Easily continue working where you left off: Your recent activity automatically shows up on the welcome page. This gives you fast, one-click access to jobs that you were working on before, so you can easily start where you left off in the last session.
  • Efficiently search any job: Many organizations have thousands of DataStage jobs. You can very easily find your job with the built-in type-ahead Search feature on the Jobs Dashboard.
  • Cloning a job: Instead of always starting job design from scratch, you can clone an existing job on the Jobs Dashboard and use that to jump-start your new job design.
  • Automatic metadata propagation: IBM DataStage Flow Designer comes with a powerful feature to automatically propagate metadata. Once you add a source connector to your job and link it to an operator, the operator automatically inherits the metadata. You do not have to specify the metadata in each stage of the job.
  • Storing your preferences: You can easily customize your viewing preferences and have IBM DataStage Flow Designer automatically save them across sessions.
  • Saving a job: IBM DataStage Flow Designer allows you to save a job in any folder. The job is saved as a DataStage job in the repository, alongside other jobs that might have been created using the DataStage Designer thick client.
  • Highlighting of all compilation errors: The DataStage thick client identifies compilation errors one at a time. Large jobs with many stages can take longer to troubleshoot in this situation. IBM DataStage Flow Designer highlights all errors and gives you a way to see the problem with a quick hover over each stage, so you can fix multiple problems at the same time before recompiling.
  • Running a job: IBM DataStage Flow Designer allows you to run a job. You can refresh the status of your job on the new user interface. You can also view the Job Log, or launch the Ops Console to see more details of job execution.

What is Kafka connector?

  1. Kafka connector has been enhanced with the following new capabilities:
  • Continuous mode, where incoming topic messages are consumed without stopping the connector.
  • Transactions, where a number of Kafka messages are fetched within a single transaction. After the record count is reached, an end-of-wave marker is sent to the output link.
  • TLS connection to Kafka.
  • Kerberos keytab locality is supported.

What is Amazon S3 connector?

  1. The Amazon S3 connector now supports connecting by using an HTTP proxy server.

What is InfoSphere Information Server?

  1. InfoSphere Information Server is capable of scaling to meet any information volume requirement so that companies can deliver business results faster and with higher quality results. InfoSphere Information Server provides a single unified platform that enables companies to understand, cleanse, transform, and deliver trustworthy and context-rich information.

What is Client tier in Information server?

The client tier includes the client programs and consoles that are used for development and administration, and the computers where they are installed.

What is the Metadata repository tier in Information server?

  1. The metadata repository tier includes the metadata repository, the InfoSphere Information Analyzer analysis database (if installed), and the computer where these components are installed. The metadata repository contains the shared metadata, data, and configuration information for InfoSphere Information Server product modules. The analysis database stores extended analysis data for InfoSphere Information Analyzer.

What are Stages in DataStage?

  1. Stages are the basic building blocks in InfoSphere DataStage, providing a rich, unique set of functionality that performs either a simple or advanced data integration task. Stages represent the processing steps that will be performed on the data.

What are Sequence jobs in DataStage?

  1. A sequence job is a special type of job that you can use to create a workflow by running other jobs in a specified order. This type of job was previously called a job sequence.

What are Table definitions?

  1. Table definitions specify the format of the data that you want to use at each stage of a job. They can be shared by all the jobs in a project and between all projects in Infosphere DataStage. Typically, table definitions are loaded into source stages. They are sometimes loaded into target stages and other stages.

What is Parallel processing design?

  1. InfoSphere DataStage brings the power of parallel processing to the data extraction and transformation process. InfoSphere DataStage jobs automatically inherit the capabilities of data pipelining and data partitioning, allowing you to design an integration process without concern for data volumes or time constraints, and without any requirements for hand coding.

What is Data partitioning?

  1. Data partitioning is an approach to parallelism that involves breaking the records into partitions, or subsets of records. Data partitioning generally provides linear increases in application performance.

When you design a job, you select the type of data partitioning algorithm that you want to use (hash, range, modulus, and so on). Then, at run time, InfoSphere DataStage uses that selection for the number of degrees of parallelism that are specified dynamically through the configuration file.

What is the advantage of using Modular development in data stage?

  1. We should aim to use modular development techniques in our job designs in order to maximize the reuse of parallel jobs and components and save time.

What is Link buffering?

  1. InfoSphere DataStage automatically performs buffering on the links of certain stages. This is primarily intended to prevent deadlock situations from arising (where one stage is unable to read its input because a previous stage in the job is blocked from writing to its output).

What is the collection library in Datastage?

  1. The collection library is a set of related operators that are concerned with collecting partitioned data.

What is the ordered collector?

  1. The Ordered collector reads all records from the first partition, then all records from the second partition, and so on. This collection method preserves the sorted order of an input data set that has been totally sorted. In a totally sorted data set, the records in each partition of the data set, as well as the partitions themselves, are ordered.

What is field_import restructure operator and what it does?

field_import restructure operator exports an input string or raw field to the output fields specified in your import schema.

What is makesubrec restructure operator and what it does?

The makesubrec restructure operator combines specified vector fields into a vector of subrecords.

What is makevect restructure operator and what it does?

The makevect restructure operator combines specified fields into a vector of fields of the same type.

What is splitvect restructure operator and what it does?

The splitvect restructure operator promotes the elements of a fixed-length vector to a set of similarly named top-level fields.

How to display n-th line of a file?

The easiest way is to use the [sed] command:

$> sed -n '<n> p' file.txt

You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be:

$> sed -n '4 p' file.txt

Of course, you can do it by using the [head] and [tail] commands as well, like below:

$> head -<n> file.txt | tail -1

Again, replace <n> with the actual line number. So if you want to print the 4th line, the command will be:

$> head -4 file.txt | tail -1

How to remove the last line/ trailer from a file in Unix script?

  1. Always remember that [sed] switch ‘$’ refers to the last line. So using this knowledge we can deduce the below command:
$> sed -i '$ d' file.txt

How to remove certain lines from a file in Unix?

  1. If you want to remove line <m> to line <n> from a given file, you can accomplish the task in a similar way. Here is an example:

$> sed -i '5,7 d' file.txt

The above command will delete lines 5 to 7 from the file file.txt.

How is a DataStage source file filled?

We can fill the source file in DataStage either by developing a SQL query or by using a row generator extract tool.

How is DataStage different from Informatica?

DataStage and Informatica are both powerful ETL tools, but there are a few differences between the two. DataStage has parallelism and partition concepts for node configuration; whereas in Informatica, there is no support for parallelism in node configuration. Also, DataStage is simpler to use as compared to Informatica.

What is the difference between join, merge, and lookup stages?

The fundamental difference between these three stages is the amount of memory they take. Other than that, how they treat the input requirements and the various records are also factors that differentiate them. Based on memory usage, the Join and Merge stages use relatively little memory, whereas the Lookup stage uses a large amount of memory.

What is QualityStage in DataStage?

QualityStage is used for cleansing the data with the DataStage tool. It is a client-server software tool that is provided as part of IBM Information Server.

What is a repository table in DataStage?

The term ‘repository’ is another name for a data warehouse. It can be centralized or distributed. The repository table is used for answering ad-hoc, historical, analytical, or complex queries.

How do we compare the Validated OK with the Compiled Process in DataStage?

Compiling ensures that the important stage parameters are mapped and correct so that an executable job is created, whereas Validated OK ensures that the connections are valid.

What is the significance of the exception activity in DataStage?

Whenever an unexpected error occurs while the job sequencer is executing, all the stages after the exception activity are run. This is what makes the exception activity so important in DataStage.

How do we call a routine in DataStage?

Routines are stored in the Routine branch of the DataStage repository. This is where we can create, view, or edit all the Routines. The Routines in DataStage could be the following: Job Control Routine, Before-after Subroutine, and Transform function.

Explain a few features of DataStage.

  • Extracts data from any number or type of database.
  • Handles all the metadata definitions required to define your data warehouse.
  • You can view and modify the table definitions at any point during the design of your application.
  • Aggregates data.
  • You can modify the SQL SELECT statements used to extract data.
  • DataStage transforms data easily. It has a set of predefined transforms and functions.
  • You can use it to convert your data, and you can easily extend the functionality by defining your own transforms.
  • Loads the data warehouse.

How can we eliminate duplicate rows?

DataStage provides the capability to eliminate duplicate rows in the Enterprise Edition (EE) only. Using the EE of DataStage, the elimination of duplicate rows can be done based on a key column.

What can we do with the DataStage Director?

Validating, Scheduling, Executing, and Monitoring Jobs (server jobs).

What is a Hash file and a Sequential file?

A hash file saves data using a hash algorithm and a hash key value, while a sequential file doesn't have any key value to save the data. Because of this hash key feature, searching in a hash file is faster than in a sequential file.

State the difference between DataStage and DataStage TX?

DataStage is an ETL (Extract, Transform and Load) tool, whereas DataStage TX is an EAI (Enterprise Application Integration) tool.

What is the use of Surrogate key?

In DataStage, a surrogate key is used instead of a unique key. A surrogate key is mostly used for retrieving data faster. It uses an index to perform the retrieval operation. A surrogate key is a 4-byte integer that replaces the transaction / business / OLTP key in the dimension table, so we can store up to 2 billion records.

A surrogate key is needed because it integrates the data better than the natural primary key does. Additionally, it helps with index maintenance, joins, table size, key updates, disconnected inserts and partitioning.

How to execute a DataStage job from the command line?

Using “dsjob”. Following is the command:

dsjob -run -jobstatus projectname jobname
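
A hedged example, assuming a hypothetical project dstage1 and job LoadCustomers, and passing a job parameter with the -param option:

$> dsjob -run -jobstatus -param RunDate=2022-02-18 dstage1 LoadCustomers
$> echo $?

The -jobstatus option makes dsjob wait for the job to finish, and the exit status then reflects how the job ended, which is handy when calling DataStage jobs from UNIX scripts.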

What components are there in DataStage architecture?

DataStage has two main groups of components:

  1. Client components
    • DataStage Administrator
    • DataStage Manager
    • DataStage Designer
    • DataStage Director
  2. Server components
    • DataStage Engine
    • Metadata Repository
    • Package Installer

The DataStage Administrator is used to create projects and contains a set of project properties.

What are the types of dimension tables?

Conformed Dimension: If a dimension table is connected to more than one fact table, the granularity defined in the dimension table is common across those fact tables.

Junk Dimension: A dimension table that contains only flags.

Monster Dimension: If a dimension changes rapidly, it is known as a Monster Dimension.

Degenerate Dimension: A dimension used in a line item-oriented fact table design.

Name the command line function that is used to export DS jobs.

To export DS jobs, the dsexport.exe command is used.

How data file and descriptor file are different?

A data file contains the data, while a descriptor file contains the complete information (metadata) about the data stored in the data files.

How can DataStage and Informatica be compared?

In DataStage, there is a concept of data partitioning and data parallelism when it comes to node configuration, whereas in Informatica there is no such concept for node configuration. When it comes to benefits, Informatica is more scalable, and DataStage is more user-friendly.

What are Routines and its different types in DataStage?

Routines are basically collections of functions that can be defined with the help of the DS Manager. They can then be called from the Transformer stage. They are divided into three major categories:

  • Parallel Routines
  • Server routines
  • Main Frame Routines

How to remove duplicate values in DataStage?

The Sort stage can do it. Here, you should set the option Allow Duplicates = False.

What is a Merger?

Merging means joining two tables together. It can be done with the help of a primary key that is present in both tables.

What is a quality Stage in DataStage tool?

The Quality Stage is also termed the Integrity Stage. It helps in integrating different types of data from multiple sources.

What is the meaning of term Job Control in DataStage tool?

Job control tasks are completed with the Job Control Language (JCL). It is used to execute multiple jobs simultaneously without using any loop.

How to differentiate massive parallel processing and symmetric processing?

Symmetric Processing (SMP):

  • In this case, the hardware resources are shared among the processors.
  • The processors run a single operating system and communicate through shared memory.

Massive Parallel Processing (MPP):

  • In this case, each processor has exclusive access to its own hardware resources.
  • Nothing is shared between processors, which makes it much faster than symmetric processing.

How to validate and compile a job in DataStage?

Validation is also a form of job execution in DataStage. When validating a job, the DataStage engine verifies whether all the required properties are provided. During compilation, the engine checks whether all the defined properties are valid.

What is the meaning of the exception activity in DataStage?

With the help of the exception activity in a job sequencer, the job flow continues to execute even if an unexpected error occurs.

How many types of Lookups exists in DataStage?

There are two types of lookups in DataStage, Normal Lookup and Sparse Lookup. In the case of Normal Lookup, the reference data is first loaded into memory, and the lookup is performed there. In the case of Sparse Lookup, the lookup is fired directly against the database. It is faster than the Normal Lookup.

What is a repository table in DataStage?

A repository table works as a data warehouse that can either be centralized or distributed.

How will you calculate the total number of rows in a sequential file?

To calculate the total number of rows in a sequential file, we can use the system variable @INROWNUM.
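
Outside the job, a quick sanity check from the UNIX command line is to count the lines of the sequential file directly; a minimal sketch, where the path /data/input/customers.txt is a hypothetical example:

$> wc -l /data/input/customers.txt
$> tail -n +2 /data/input/customers.txt | wc -l

The second command skips the first line, which is useful when the file has a column header row.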

Define the usage analysis in DataStage.

The usage analysis can be performed in DataStage with only a few clicks. Launch the DataStage Manager, right-click the job, select the Usage Analysis option, and you are done.

What is the meaning of the OConv() and IConv() functions in DataStage?

They are used to convert data from one format to another. The OConv() function converts data into a format the user understands, while the IConv() function converts data into a format the system understands. They are used to convert data in different formats, whether ASCII code, numeric data, date, or time formats, etc.

How will you differentiate a Hash File and a Sequential File?

A hash file stores data using a hash algorithm on hash key values, whereas a sequential file does not have any key value for storing data. Because of the hash key feature in DataStage, searching a hash file is faster than searching a sequential file.

How to clean a DataStage repository?

To clean a DataStage repository, you should use the Clean Up Resources functionality in the DataStage Manager.

What is the meaning of NLS in DataStage?

NLS stands for National Language Support, which provides support for international languages such as Spanish, German, Japanese, and more.

Do you know any third-party tool that can be integrated with DataStage? Have you ever used it during your last job experience?

TNG, Autosys, and Event Coordinator are a few third-party tools that can be used with DataStage. Yes, I have worked on all three of these tools and have hands-on expertise with them.

How can complex jobs be implemented in DataStage to improve the performance?

You should not use more than 20 stages in each job if you want to keep performance up. If the number goes higher than 20, it is better to use another job for the remaining stages.

What do you understand by the term Meta Stage?

A Meta Stage is used to save the metadata that is further helpful for data analysis and data lineage.

Have you ever worked in a UNIX environment and why it is significant in DataStage?

Yes, I have knowledge of the UNIX environment, and it is significant in DataStage because one has to write UNIX programs when required.

What is transaction size and array size mean in DataStage?

Transaction size signifies the total number of rows written before records are committed to a table. Array size means the total number of rows written to and from a table.

Name a different type of views in a DataStage director?

Job View, Status view, and Log View.

How can we manage rejected rows in DataStage?

To manage rejected rows, constraints should be defined in DataStage. It is also good practice to define a temporary storage area for rejected rows.

What is a DS Designer?

A DS Designer is used to design the work area and add various links to it.

What do you understand by the terms Link collector and Link Partitioner in DataStage?

The Link Partitioner is used to divide data into multiple logical units based on partitioning methods. The Link Collector is used for collecting data from different partitions into a single data stream and saving it to the target database.

How to compare the ODBC and DRS stage?

DRS is much faster than ODBC because it uses native databases for connectivity.

How to compare the BCP and Orabulk stages?

Orabulk is used to store voluminous data in one target table of Oracle database. A BCP stage is used to store voluminous data in one target table of Microsoft SQL Server.

Define Data Stage?

DataStage is basically a tool that is used to design, develop, and execute various applications that fill multiple tables in a data warehouse or data marts. It is a program for Windows servers that extracts data from databases and changes it into data warehouses. It has become an essential part of the IBM WebSphere Data Integration suite.

Explain how a source file is populated?

We can populate a source file in many ways, such as by creating a SQL query in Oracle or by using the row generator extract tool.

Name the command line functions to import and export the DS jobs?

To import the DS jobs, dsimport.exe is used and to export the DS jobs, dsexport.exe is used.
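
A hedged sketch of typical invocations from a Windows client machine; the host names, credentials, project, job name, and file path are placeholders, and the switches shown (/D, /H, /U, /P, /JOB) are quoted from memory, so confirm them against the dsexport/dsimport documentation for your version:

dsexport.exe /D=services-host:9080 /H=engine-host /U=dsadm /P=secret /JOB=LoadCustomers dstage1 C:\exports\LoadCustomers.dsx
dsimport.exe /D=services-host:9080 /H=engine-host /U=dsadm /P=secret dstage1 C:\exports\LoadCustomers.dsx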

Differentiate between data file and descriptor file?

As the name implies, data files contain the data and the descriptor file contains the description/information about the data in the data files.

Differentiate between DataStage and informatica?

In DataStage, there is a concept of partitioning and parallelism for node configuration, whereas in Informatica there is no concept of partitioning and parallelism for node configuration. Also, Informatica is more scalable than DataStage, while DataStage is more user-friendly than Informatica.

What is the method of removing duplicates, without the remove duplicate stage?

Duplicates can be removed by using the Sort stage, with the option Allow Duplicates = False.

What steps should be taken to improve DataStage jobs?

In order to improve the performance of DataStage jobs, we have to first establish the baselines. Secondly, we should not use only one flow for performance testing. Thirdly, we should work in increments. Then, we should evaluate data skews, isolate and solve the problems one by one, and distribute the file systems to remove bottlenecks, if any. Also, we should not include the RDBMS at the start of the testing phase. Last but not least, we should understand and assess the available tuning knobs.

Differentiate between Join, Merge and Lookup stage?

All three stages differ in the way they use memory, in their input requirements, and in how they treat various records. Join and Merge need less memory compared to the Lookup stage.

What are the steps required to kill the job in DataStage?

To kill the job in DataStage, we have to kill the respective processing ID.

Define APT_CONFIG in DataStage?

It is the environment variable that is used to identify the *.apt configuration file in DataStage. The configuration file stores the node information, disk storage information and scratch disk information.
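
A minimal illustration of what such a configuration file might contain; the node names, host name, and directory paths below are hypothetical, and in current versions the variable is usually referenced as APT_CONFIG_FILE:

{
  node "node1"
  {
    fastname "etl-server"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etl-server"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
}

A job run with this two-node file gets a degree of parallelism of two without any change to the job design.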

Name the different types of Lookups in DataStage?

There are two types of Lookups in DataStage i.e. Normal lkp and Sparse lkp. In Normal lkp, the data is saved in the memory first and then the lookup is performed. In Sparse lkp, the data is directly saved in the database. Therefore, the Sparse lkp is faster than the Normal lkp.

How a server job can be converted to a parallel job?

We can convert a server job into a parallel job by using the IPC stage and the Link Collector.

Define Repository tables in DataStage?

In Datastage, the Repository is another name for a data warehouse. It can be centralized as well as distributed.

Explain Usage Analysis in Datastage?

In DataStage, Usage Analysis is performed within a few clicks. Launch the DataStage Manager and right-click the job, then select Usage Analysis and that's it.

How to clean the DataStage repository?

We can clean the Datastage repository by using the Clean Up Resources functionality in the Datastage Manager.

How a routine is called in DataStage job?

In Datastage, routines are of two types i.e. Before Sub Routines and After Sub Routines. We can call a routine from the transformer stage in Datastage.

Differentiate between an Operational Data Store (ODS) and a Data warehouse?

We can say, ODS is a mini data warehouse. An ODS doesn’t contain information for more than 1 year while a data warehouse contains detailed information regarding the entire business.

Does DataStage support slowly changing dimensions?

Yes. Version 8.5 and later supports this feature.

How can one find bugs in job sequence?

We can find bugs in job sequence by using DataStage Director.

How complex jobs are implemented in DataStage to improve performance?

In order to improve performance in DataStage, it is recommended not to use more than 20 stages in each job. If you need to use more than 20 stages, it is better to use another job for those stages.

Name the third-party tools that can be used in DataStage?

The third-party tools that can be used in DataStage are Autosys, TNG and Event Coordinator. I have worked with these tools and possess hands-on experience with these third-party tools.

Define Project in DataStage?

Whenever we launch the Datastage client, we are asked to connect to a Datastage project. A Datastage project contains Datastage jobs, built-in components and Datastage Designer or User-Defined components.

How many types of hash files are there?

There are two types of hash files in DataStage i.e. Static Hash File and Dynamic Hash File. The static hash file is used when limited amount of data is to be loaded in the target database. The dynamic hash file is used when we don’t know the amount of data from the source file.

Differentiate between Datastage and Datastage TX?

DataStage is an ETL (Extract, Transform and Load) tool and DataStage TX is an EAI (Enterprise Application Integration) tool.

What is size of a transaction and an array means in a Datastage?

Transaction size means the number of rows written before the records are committed to a table. Array size means the number of rows written to or read from the table in one operation.

How rejected rows are managed in Datastage?

In DataStage, the rejected rows are managed through constraints in the Transformer stage. We can either place the rejected rows in the properties of a transformer or create a temporary storage area for rejected rows with the help of the REJECTED command.

What is DataStage?

Datastage is an ETL tool given by IBM which utilizes a GUI to design data integration solutions. This was the first ETL tool that gave a parallelism concept.

It is available in the following 3 different editions:

  • Server Edition
  • Enterprise Edition
  • MVS Edition

What are the primary usages of the DataStage tool?

Datastage is an ETL tool that is primarily used for extracting data from source systems, transforming that data and finally loading it to target systems.

Can you highlight the main features of the IBM Infosphere information server?

The main features of IBM Infosphere information server suite are:

  • It provides a single platform for data integration. It has the capability to connect to multiple source systems as well as write to multiple target systems.
  • It is based on centralized layers. All the components of the suite are able to share the baseline architecture of the suite.
  • It has layers for the unified repository, for integrated metadata services and a common parallel engine.
  • It provides tools for analysis, cleansing, monitoring, transforming and delivering data.
  • It has massively parallel processing capabilities, which makes the processing very fast.

What are the different common services in Datastage?

Below is the list of common services in Datastage:

  • Metadata services
  • Unified service deployment
  • Security services
  • Logging and reporting services

 

What are DataStage sequences?

Datastage sequence connects the DataStage jobs in a logical flow.

If you want to use the same piece of code in different jobs, how will you achieve this?

This can be done by using shared containers. We have shared containers for reusability. A shared container is a reusable job element consisting of stages and links.  We can call a shared container in different Datastage jobs.

Where do the Datastage jobs get stored?

The Datastage jobs get stored in the repository. We have various folders in which we can store the Datastage jobs.

Where do you see different stages in the designer?

All the stages are available within a window called Palette. It has various categories depending upon the kind of function that the stage provides.

The various categories of stages in the Palette are – General, Data quality, Database, Development, File, Processing, etc.

What are the Processing stages?

The processing stages allow us to apply the actual data transformation.

For example, the ‘aggregator’ stage under the Processing category allows us to apply all the ‘group by’ operations. Similarly, we have other stages in Processing like the ‘Join’ stage that allows us to join together the data coming from two different input streams.

Name the different sorting methods in Datastage.

There are two methods available:

  • Link sort
  • Inbuilt Datastage Sort

How do you import and export the Datastage jobs?

For this, we use the below command-line functions:

  • Import: dsimport.exe
  • Export: dsexport.exe

Give an idea of system variables.

System variables are read-only variables beginning with '@' that can be read by either the Transformer stage or a routine. They are used to get system information.

Is the value of staging variable stored temporarily or permanently?

Temporarily. It is a temporary variable.

What is IBM DataStage, and why is it used?

DataStage is a tool provided by IBM that is used to design, develop, and execute applications that fill data into data warehouses by extracting data from databases on Windows servers. It offers graphic visualizations for data integration and can also extract data from multiple sources. It is therefore considered one of the most powerful ETL tools. DataStage has various versions that companies can use based on their requirements. The versions are Server Edition, MVS Edition, and Enterprise Edition.

What is a flow designer in IBM DataStage?

Flow designer is the web-based user interface of DataStage and is used to create, edit, load, and run the jobs in DataStage.


What is a Hive connector?

Hive connector is a tool that is used to support partition modes while reading the data. It can be done in two ways:

  1. modulus partition mode
  2. minimum-maximum partition mode

Describe the Client tier of the Infosphere Information Server briefly.

The client tier of the Infosphere Information Server is used for the development and the complete administration of the computers using the client programs and consoles.

Describe the Metadata Repository tier of Infosphere Information Server briefly.

The metadata repository tier of the Infosphere Information Server includes the metadata repository, the analysis database, and the computer. It is used to share the metadata, shared data, and configuration information.

What is Data Pipelining?

Data Pipelining is a type of parallel approach to data processing where we extract data from the source and then pass it through a sequence of processing functions to get the required output.

What is OSH in DataStage?

OSH is an abbreviation of Orchestrate Shell and is a scripting language used in DataStage internally by the parallel engine.

How is the source file populated in DataStage?

The source file can be populated using SQL queries and also by using the row generator extraction tool.

Who are the DataStage clients or users?

DataStage tool can be used by-

  • DataStage Administrator
  • DataStage Designer
  • DataStage Manager
  • DataStage Director

Explain Metadata.

Data about data is known as metadata. A table definition describing the structure of the table is known as metadata.

State maps and locales

Maps: Maps define the character sets that can be utilized in a project.

Locales: Locales define the local formats for date, time, sorting order, etc. that can be utilized in a project.

State the types of server components.

  • Repository
  • DataStage Server
  • DataStage Package Installer

What are the components of Ascential Datastage?

Client Components: Administrator, Director, Manager, and Designer.

Server Components: Repository, Server, and Plugins.

What are the two types of containers used in DataStage?

DataStage has two types of containers – Local Container and Shared Container

What is the hash file stage in DataStage?

The Hash File stage is a binary file used for lookups, in order to achieve better performance.

State the term repository.

A repository is a central store that contains all the information required to build a data mart or data warehouse.

What is the work of a DataStage server?

A DataStage server runs executable jobs that extract, transform, and load data into a data warehouse.

We have two containers in DataStage – local and shared containers, what does it mean?

Local Container is local to the particular job in which we have developed the container and Shared Container can be used in any other jobs.

I have stored my data at orabulk stage, what does it mean?

The OraBulk stage is used to bulk load an Oracle target database. Storing data at the OraBulk stage of DataStage means your data is loaded into the Oracle target database.

Define Active and Passive stages.

Passive Stage is used for data extraction and loading

Active Stage is used to implement and process the business rules

What is meant by join stage and merge stage?

Join stage is used to perform join operations on one or more data sets input to the stage and then outputs the resulting dataset.

Merge stage is used to combine a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record that are required.

What is the default cache size in DataStage?

The default cache size in DataStage is 256 MB.

Can you reuse the components? If yes, then how?

The components in DataStage can be reused by using shared and local containers.

 

So, this brings us to the end of the IBM DataStage Interview Questions blog. This Tecklearn ‘Top IBM DataStage Interview Questions and Answers’ helps you with commonly asked questions if you are looking out for a job in DataStage or the Business Intelligence domain. If you wish to learn IBM DataStage and build a career in the Business Intelligence domain, then check out our interactive IBM DataStage Training, which comes with 24*7 support to guide you throughout your learning period.

https://www.tecklearn.com/course/ibm-datastage-training/

IBM DataStage Training

About the course

DataStage is an ETL tool which extracts data, transforms it and loads it from source to target. The data sources might include sequential files, indexed files, relational databases, external data sources, archives, enterprise applications, etc. DataStage facilitates business analysis by providing quality data to help in gaining business intelligence. As part of this program, you will learn DataStage data integration, ETL, data warehousing and working with data at rest or in motion. As part of the training, you will work on real-life projects.

 

Why should you take IBM DataStage Training?

  • The average salary for a DataStage developer ranges from approximately $93,344 per year for a Developer to $102,090 per year for a Senior Developer (Indeed.com).
  • IBM DataStage has a market share of around 3% globally.
  • With DataStage Enterprise Edition, users can use the parallel processing engine, which provides unlimited performance and scalability. It helps get the most out of hardware investments and resources.

Curriculum

Introduction to Data Warehouse

  • Data Warehouse Properties
  • Data Warehouse Architecture
  • ODS, Staging and Data Marts
  • Schema and Types
  • Dimensions and Facts

DataStage Features

  • DataStage jobs
  • Creating massively parallel framework
  • Scalable ETL features

Concepts of Parallelism and Partitioning

  • Data parallelism: Pipeline parallelism and Partitioning parallelism
  • Data partitioning: Key-based partitioning and Keyless partitioning
  • Partitioning techniques like round robin, entire, hash key, range, DB2 partitioning

Data Collecting Types and Techniques

  • Data collecting techniques and types like round robin, order, sorted merge and same collecting methods

DataStage Job

  • DataStage Job
  • Creating a Job that can effectively extract, transform and load data, cleansing and formatting data to improve its quality

Stage Editor

  • Parallel job stage editors
  • Types of stage editors in DataStage
  • Parallel Stages – Joins

Sequential File

  • Working with the Sequential file stages

Lookup File Set Stage

  • Parallel Stages – Lookup, Merge, Differences between join Lookup and merge, Filter, Sort

Processing Stages – Change Capture

Aggregator Stage

  • Aggregator Stage in DataStage
  • Types of aggregation – Hash mode and Sort mode

Transformer Stage in DataStage

  • Transformer Stage in DataStage
  • Characteristics – single input, any number of outputs and reject link
  • Transformer Editor
  • Evaluation sequence in this stage

Transformer Stage Functions & Features

  • Deep dive into Transformer functions – String, type conversion, null handling, mathematical, utility functions
  • Various features like constraint, system variables, conditional job aborting, Operators and Trigger Tab

DataStage – Platform Architecture

  • Describing configuration files
  • Defining Environmental variables
  • Defining Job level parameters
  • Parameter sets

Parameters and Value File

  • Parameter Set
  • Storing DataStage and Quality Stage job parameters
  • Procedure to deploy Parameter Sets function and its advantages
  • Learning about the various properties of Sequential File Stage and Dataset stage

Different Stages of Processing

  • Parallel Stages – Funnel, Remove Duplicates
  • Develop and Debug Stages – Peek, Row generator, Column generator, Sample
  • Parallel Stages – Surrogate Key Generator and Pivot Enterprise Stage

Run time Column Propagation

  • Local Containers
  • Shared Containers

 

Got a question for us? Please mention it in the comments section and we will get back to you.

 

 
