35+ ESSENTIAL Datastage Interview Questions & Answers
Last updated on 03rd Jul 2020, Blog, Interview Questions
Datastage is an Extract, Transform and Load (ETL) tool from IBM and part of the IBM InfoSphere suite. DataStage is an ETL tool that extracts data, transforms it, applies business rules, and then loads it to any target. According to Wikipedia, “IBM InfoSphere DataStage is an ETL tool and part of the IBM Information Platforms Solutions suite and IBM InfoSphere. It uses a graphical notation to construct data integration solutions and is available in various versions such as the Server Edition, the Enterprise Edition, and the MVS Edition.”
1) What is Datastage?
Datastage is an ETL tool from IBM that uses a GUI to design data integration solutions. It was the first ETL tool to introduce the concept of parallelism.
It is available in the following three editions:
- Server Edition
- Enterprise Edition
- MVS Edition
2) Highlight the main features of Datastage?
The main features of Datastage are highlighted below:
- It is the data integration component of the IBM Infosphere information server.
- It is a GUI-based tool. We just need to drag and drop the Datastage objects, and Datastage converts them into Datastage code.
- It is used to perform the ETL operations (Extract, Transform, Load)
- It provides connectivity to multiple sources & multiple targets at the same time
- It provides partitioning and parallel processing techniques that enable Datastage jobs to process huge volumes of data much faster.
- It has enterprise-level connectivity.
3) What are the primary usages of the Datastage tool?
Datastage is an ETL tool that is primarily used for extracting data from source systems, transforming that data and finally loading it to target systems.
4) What are the main differences you have observed between 7.x and 8.x version of DataStage?
Here are the main differences between both the versions
| 7.x | 8.x |
| --- | --- |
| The 7.x version was platform dependent | This version is platform independent |
| It has a 2-tier architecture, where Datastage is built on top of a UNIX server | It has a 3-tier architecture: a UNIX server database at the bottom, then the XMETA database which acts as a repository, and then Datastage on top |
| There is no concept of parameter sets | We have parameter sets, which can be used anywhere in the project |
| Designer and Manager were two separate clients | In this version, the Manager client was merged into the Designer client |
| We had to manually search for jobs in this version | Here we have a quick-find option in the repository, where we can search easily for jobs |
5) Can you highlight the main features of the IBM Infosphere information server?
The main features of IBM Infosphere information server suite are:
- It provides a single platform for data integration. It has the capability to connect to multiple source systems as well as write to multiple target systems.
- It is based on centralized layers. All the components of the suite are able to share the baseline architecture of the suite.
- It has layers for the unified repository, for integrated metadata services and a common parallel engine.
- It provides tools for analysis, cleansing, monitoring, transforming and delivering data.
- It has massively parallel processing capabilities, which make the processing very fast.
6) What are the different layers in the information server architecture?
Below are the different layers of information server architecture
- Unified user interface
- Common services
- Unified parallel processing
- Unified Metadata
- Common connectivity
7) What could be a data source system?
It could be a database table, a flat file, or even an external application like PeopleSoft.
8) On which interface will you be working as a developer?
As a Datastage developer, we work on the Datastage client interface which is known as a Datastage designer that needs to be installed on the local system. In the backend, it is connected to the Datastage server.
9) What are the different common services in Datastage?
Below is the list of common services in Datastage:
- Metadata services
- Unified service deployment
- Security services
- Logging and reporting services.
10) How do you start developing a Datastage project?
- The very first step is to create a Datastage project on the Datastage server. All the Datastage objects that we create are stored in the Datastage project. A Datastage project is a separate environment on the server for jobs, tables, definitions, and routines.
11) What are the main features of datastage?
DataStage has the following features to aid the design and processing required to build a data warehouse :
- Uses graphical design tools. With simple point and click techniques you can draw a scheme to represent your processing requirements.
- Extracts data from any number and type of databases.
- Handles all the metadata definitions required to define your data warehouse.
- You can view and modify the table definitions at any point during the design of your application.
- Aggregates data.
- You can modify SQL SELECT statements used to extract data.
- Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data. You can easily extend the functionality by defining your own transforms.
- Loads the data warehouse.
12) What are Stage Variables, Derivations and Constants?
- Stage Variable: An intermediate processing variable that retains its value across rows during a read and is not passed into a target column.
- Derivation: An expression that specifies the value to be passed on to the target column.
- Constraint: A condition that evaluates to true or false and controls the flow of data through a link.
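A minimal sketch of how these three interact inside a transformer, written in Python purely as an analogy (the function and field names are invented, not DataStage APIs): a stage variable carries state from row to row, a constraint gates which rows flow down the link, and a derivation computes the target column.

```python
def transform(rows):
    running_total = 0  # stage variable: retained across rows, never output directly
    out = []
    for row in rows:
        running_total += row["amount"]                  # stage variable update
        if row["amount"] > 0:                           # constraint: true/false gate on the link
            out.append({"sales": row["amount"] * 1.1,   # derivation -> target column
                        "total_so_far": running_total})
    return out

print(transform([{"amount": 100}, {"amount": -5}, {"amount": 50}]))
```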
13) Types of views in Datastage Director?
There are 3 types of views in Datastage Director
- Job View – dates when jobs were compiled.
- Log View – status of the job's last run.
- Status View – warning messages, event messages, and program-generated messages.
14) How do you execute datastage job from command line prompt?
- Use the “dsjob” command as follows:
- dsjob -run -jobstatus projectname jobname
15) Functionality of Link Partitioner and Link Collector?
- Link Partitioner:
It actually splits data into various partitions or data flows using various partition methods.
- Link Collector:
It collects the data coming from partitions, merges it into a single data flow and loads to target.
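The partition/collect round trip can be sketched in plain Python (an illustration only; the partitioning method here is a simple key hash, one of several methods DataStage offers):

```python
N_PARTITIONS = 3

def partition(rows, key):
    # Link Partitioner: split one stream into N partitions by a key hash
    parts = [[] for _ in range(N_PARTITIONS)]
    for row in rows:
        parts[hash(row[key]) % N_PARTITIONS].append(row)
    return parts

def collect(parts):
    # Link Collector: merge the partitions back into a single data flow
    merged = []
    for part in parts:
        merged.extend(part)
    return merged

rows = [{"id": i} for i in range(10)]
parts = partition(rows, "id")
assert sorted(r["id"] for r in collect(parts)) == list(range(10))
```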
16) What are the types of jobs available in datastage?
- Server Job
- Parallel Job
- Sequencer Job
- Container Job
17) What is the difference between Server Job and Parallel Jobs?
- Server jobs do not support partitioning techniques, but parallel jobs do.
- Server jobs do not support SMP and MPP, but parallel jobs do.
- Server jobs run on a single node, while parallel jobs run on multiple nodes.
- Server jobs are preferred when the volume of source data is low; when the data volume is huge, parallel jobs are preferred.
18) What is a project?
Datastage Projects – A complete project may contain several jobs and user-defined components.
- A project contains Datastage jobs.
- Built-in components. These are predefined components used in a job.
- User-defined components. These are customized components created using the Datastage Manager. Each user-defined component performs a specific task in a job.
- All work is done in a project. Projects are created during and after the installation process. You can add a project on the Projects tab of the Administrator.
- A project is associated with a directory. The project directory is used by DataStage to store jobs and other Datastage objects and metadata.
19) What is sequencer?
A sequencer lets you graphically create a controlling job, without using the job control function.
20)What is a container?
- A group of stages and links in a job design is called a container.
- There are two kinds of containers: local containers and shared containers.
- Local Container:
Local containers only exist within the single job in which they are used. Use them to simplify complex job designs.
- Shared Container:
Shared containers exist outside of any specific job. They are listed in the Shared Containers branch in the Manager and can be added to any job. Shared containers are frequently used to share a commonly used set of job components.
- A container holds two unique stages. The Container Input stage is used to pass data into the container, and the Container Output stage is used to pass data out of the container.
21) What are mainframe jobs?
A mainframe job is compiled and run on the mainframe. Data extracted by such jobs is then loaded into the data warehouse.
22) What are parallel jobs?
These are compiled and run on the DataStage server in a similar way to server jobs, but they support parallel processing on SMP, MPP, and cluster systems.
23) How do you use procedure in datastage job?
Use the ODBC plug-in, pass one dummy column, and give the procedure name in the SQL tab.
24) What is odbc stage?
A stage that extracts data from, or loads data into, a database that implements the industry-standard Open Database Connectivity API. It is used to represent a data source, an aggregation step, or a target data table (server jobs only).
25) Explain how a source file is populated?
We can populate a source file in many ways, such as by creating a SQL query in Oracle or by using a row generator extract tool.
26) Name the command line functions to import and export the DS jobs?
To import the DS jobs, dsimport.exe is used and to export the DS jobs, dsexport.exe is used.
27) What is the difference between Datastage 7.5 and 7.0?
In Datastage 7.5, many new stages were added for more robustness and smoother performance, such as the Procedure stage, Command stage, and Generate Report.
28) In Datastage, how you can fix the truncated data error?
The truncated data error can be fixed by using the environment variable IMPORT_REJECT_STRING_FIELD_OVERRUN.
29) What does Merge do?
Merge means to join two or more tables. The tables are joined on the basis of the primary key columns in both tables.
30)Differentiate between data file and descriptor file?
As the names imply, data files contain the data, and the descriptor file contains the description/information about the data in the data files.
31) Differentiate between datastage and informatica?
In Datastage, there is a concept of partitioning and parallelism for node configuration, while Informatica has no such concept. Informatica is more scalable than Datastage, but Datastage is more user-friendly than Informatica.
32)Define Routines and their types?
Routines are basically collections of functions defined in the DS Manager and called via the Transformer stage. There are three types of routines: parallel routines, mainframe routines, and server routines.
33)How can you write parallel routines in datastage PX?
We can write parallel routines in C or C++. Such routines are also created in the DS Manager and can be called from the Transformer stage.
34) What is the method of removing duplicates, without the remove duplicate stage?
Duplicates can be removed by using the Sort stage, with the option Allow Duplicates set to False.
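As a sketch of the idea (Python, illustrative only, not DataStage code): sort on the key, then keep just the first row of each key group, which is the effect of setting Allow Duplicates = False on the Sort stage.

```python
def dedup_by_sort(rows, key):
    rows = sorted(rows, key=lambda r: r[key])
    out, last_key = [], object()   # sentinel guarantees the first row is kept
    for row in rows:
        if row[key] != last_key:   # first row of a new key group survives
            out.append(row)
            last_key = row[key]
    return out

data = [{"id": 2}, {"id": 1}, {"id": 2}, {"id": 3}, {"id": 1}]
print(dedup_by_sort(data, "id"))
```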
35) What steps should be taken to improve Datastage jobs?
To improve the performance of Datastage jobs:
- First, establish baselines.
- Do not use only one flow for performance testing.
- Work in increments.
- Evaluate data skews.
- Isolate and solve the problems one by one.
- Distribute the file systems to remove bottlenecks, if any.
- Do not include the RDBMS at the start of the testing phase.
- Last but not least, understand and assess the available tuning knobs.
36) Differentiate between Join, Merge and Lookup stage?
All three stages differ in how they use memory, the input requirements they impose, and how they treat various records. Join and Merge need less memory as compared to the Lookup stage.
37) Explain Quality stage?
Quality stage is also known as Integrity stage. It assists in integrating different types of data from various sources.
38) Define Job control?
Job control can be best performed by using Job Control Language (JCL). This tool is used to execute multiple jobs simultaneously, without using any kind of loop.
39) Differentiate between Symmetric Multiprocessing and Massive Parallel Processing?
In Symmetric Multiprocessing, the hardware resources are shared by the processors. There is a single operating system, and the processors communicate through shared memory. In Massive Parallel Processing, each processor accesses the hardware resources exclusively. This type of processing is also known as Shared Nothing, since nothing is shared. It is faster than Symmetric Multiprocessing.
40)What are the steps required to kill the job in Datastage?
To kill a job in Datastage, we have to kill the respective processing ID.
41) Differentiate between validated and Compiled in the Datastage?
In Datastage, validating a job means executing it in check mode: while validating, the Datastage engine verifies whether all the required properties are provided. While compiling a job, on the other hand, the Datastage engine verifies whether all the given properties are valid.
42) How to manage date conversion in Datastage?
We can use the date conversion function for this purpose, i.e. Oconv(Iconv(FieldName, “Existing Date Format”), “Another Date Format”).
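Python's strptime/strftime pair is a close analogy to this Iconv/Oconv pattern: parse the date from its existing format into an internal value, then emit it in the new format. A hedged sketch (the format strings below are Python's, not DataStage's):

```python
from datetime import datetime

def convert_date(value, existing_fmt, new_fmt):
    internal = datetime.strptime(value, existing_fmt)  # Iconv step: external -> internal
    return internal.strftime(new_fmt)                  # Oconv step: internal -> external

print(convert_date("31/12/2020", "%d/%m/%Y", "%Y-%m-%d"))  # 2020-12-31
```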
43) Why do we use exception activity in Datastage?
All the stages after the exception activity in Datastage are executed when any unknown error occurs while executing the job sequencer.
44) Define APT_CONFIG in Datastage?
It is the environment variable that is used to identify the *.apt file in Datastage. It is also used to store the node information, disk storage information and scratch information.
45) Name the different types of Lookups in Datastage?
There are two types of lookups in Datastage: the Normal lookup and the Sparse lookup. In a Normal lookup, the reference data is loaded into memory first and the lookup is then performed in memory. In a Sparse lookup, each lookup fires directly against the database. The Sparse lookup can therefore be faster than the Normal lookup when the reference data is very large.
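The contrast can be sketched in Python (illustrative only, with a dict standing in for the database table): a normal lookup loads the reference data into memory once, while a sparse lookup issues one query per input row.

```python
reference_table = {1: "east", 2: "west"}  # stands in for a database table

def normal_lookup(stream):
    in_memory = dict(reference_table)          # load reference into memory once
    return [(r, in_memory.get(r)) for r in stream]

def query_db(key):                             # stands in for a per-row SQL query
    return reference_table.get(key)

def sparse_lookup(stream):
    return [(r, query_db(r)) for r in stream]  # one "query" per input row

assert normal_lookup([1, 2, 3]) == sparse_lookup([1, 2, 3])
```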
46) How a server job can be converted to a parallel job?
We can convert a server job in to a parallel job by using IPC stage and Link Collector.
47) Define Repository tables in Datastage?
In Datastage, the Repository is another name for a data warehouse. It can be centralized as well as distributed.
48)What is hash file ? what are its types?
A hash file is just like an indexed sequential file; the file is internally indexed on a particular key value. There are two types of hash files: the static hash file and the dynamic hash file.
49)What type of hash file is to be used in general in datastage jobs?
Static Hash File.
50) What is a stage variable?
In a Datastage transformer, we can define variables and derive their values from the source.
51)What are constraints and derivations?
We can create constraints and derivations using Datastage variables.
52) How do you reject records in a transformer?
Records can be rejected through a Datastage constraint.
53) Why do you need stage variables?
That depends upon the job requirement; through stage variables we can filter data.
54) What is the precedence of stage variables,derivations, and constraints?
stage variables => constraints => derivations
55) What are data elements?
A specification that describes the type of data in a column and how the data is converted.
56)What are routines ?
In Datastage, a routine is just like a function that we call in a Datastage job. There are in-built routines, and we can also create our own.
57) What are transforms and what is the differenece between routines and transforms?
Transforms are used to manipulate data within a Datastage job. The difference is that a routine is a general-purpose function, while a transform is a predefined (or user-defined) conversion applied to column data.
58) What is a Datastage macro?
In Datastage, macros can be used in expressions, job control routines, and before/after subroutines. The available macros are concerned with ascertaining job status.
59) What is job control?
A job control routine provides the means of controlling other jobs from the current job. A set of one or more jobs can be validated, run, reset, stopped, and scheduled in much the same way as the current job.
60) How many types of stage?
There are three basic types of stages:
- Built-in stages – Supplied with DataStage and used for extracting, aggregating, transforming, or writing data. All types of jobs have these stages.
- Plug-in stages – Additional stages that can be installed in DataStage to perform specialized tasks that the built-in stages do not support. Server jobs and parallel jobs can make use of these.
- Job Sequence stages – Special built-in stages which allow you to define sequences of activities to run. Only job sequences have these.
61) Define the difference between active and Passive Stage?
There are two kinds of stages:
- Passive stages define read and write access to data sources and repositories.
- Active stages define how data is filtered and transformed.
62) What are the plugin stages used in your projects?
- Plug-in Stage – A stage that performs specific processing not supported by the standard server job stages.
- Plug-ins used – ORAOCI8, Orabulk.
63) Difference Between ORAOCI8 and Orabulk?
- ORAOCI8 – This stage allows connecting to an Oracle database.
- OraBulk – The Orabulk plug-in generates control and data files for bulk loading into a single table on an Oracle target database. The files are suitable for loading into the target database using the Oracle command sqlldr.
64) What is Sort plugin?
A mainframe processing stage that sorts input columns
65)What is Aggregate stage?
A stage type that computes totals or other functions of sets of data.
66) What is the hash file stage and Sequential file stage?
The Hashed File stage extracts data from, or loads data into, a database that contains hashed files. The Sequential File stage reads data from, or writes data to, a text file.
67) What types of flat files have you used? Have you used tab-delimited files?
Sequential flat files with comma-separated values.
68) What is the Job control code?
Job control code is used in a job control routine to create a controlling job, which invokes and runs other jobs.
69) Define OConv () and IConv () functions in Datastage?
In Datastage, the OConv() and IConv() functions are used to convert formats from one format to another, e.g. conversions of roman numerals, time, date, radix, numeral ASCII, etc. IConv() is basically used to convert formats into a form the system understands, while OConv() is used to convert formats into a form users understand.
70) Explain Usage Analysis in Datastage?
In Datastage, Usage Analysis is performed within a few clicks: launch the Datastage Manager, right-click the job, and select Usage Analysis.
71) How do you find the number of rows in a sequential file?
To find the number of rows in a sequential file, we can use the system variable @INROWNUM.
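The same idea in a hedged Python sketch: a counter that increments once per row read from a sequential file, just as @INROWNUM does (the function name is invented for illustration).

```python
import os
import tempfile

def count_rows(path):
    inrownum = 0
    with open(path) as f:
        for _ in f:          # one increment per row read, like @INROWNUM
            inrownum += 1
    return inrownum

# demo on a throwaway sequential file
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("a,1\nb,2\nc,3\n")
print(count_rows(path))  # 3
os.remove(path)
```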
72) Differentiate between Hash file and Sequential file?
The difference between a hash file and a sequential file is that the hash file saves data based on a hash algorithm and a hash key value, while the sequential file has no key value for saving data. Because of this hash key, searching in a hash file is faster than in a sequential file.
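A quick Python illustration of why the hash key matters (a dict stands in for the hashed file, a list scan for the sequential file; both are analogies, not DataStage structures):

```python
records = [("k%d" % i, i) for i in range(1000)]

hashed_file = dict(records)            # hash file: keyed access via a hash key value

def sequential_search(key):            # sequential file: no key, must scan row by row
    for k, v in records:
        if k == key:
            return v
    return None

# both find the same record; the dict does it without scanning
assert hashed_file["k999"] == sequential_search("k999") == 999
```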
73) How to clean the Datastage repository?
We can clean the Datastage repository by using the Clean Up Resources functionality in the Datastage Manager.
74)How a routine is called in Datastage job?
In Datastage, routines are of two types i.e. Before Sub Routines and After Sub Routines. We can call a routine from the transformer stage in Datastage.
75) Differentiate between Operational Datastage (ODS) and Data warehouse?
We can say, ODS is a mini data warehouse. An ODS doesn’t contain information for more than 1 year while a data warehouse contains detailed information regarding the entire business.
76) NLS stands for what in Datastage?
NLS stands for National Language Support. It can be used to incorporate other languages, such as French, German, and Spanish, into the data required for processing by the data warehouse. These languages use the same script as English.
77) Can you explain how could anyone drop the index before loading the data in target in Datastage?
In Datastage, we can drop the index before loading the data in target by using the Direct Load functionality of SQL Loaded Utility.
78) Does Datastage support slowly changing dimensions ?
Yes. Version 8.5 and later support this feature.
79) How can one find bugs in job sequence?
We can find bugs in job sequence by using DataStage Director.
80) How are complex jobs implemented in Datastage to improve performance?
To improve performance in Datastage, it is recommended not to use more than 20 stages in a job. If you need more than 20 stages, it is better to move some of them into another job.
81) Name the third party tools that can be used in Datastage?
The third-party tools that can be used with Datastage are Autosys, TNG, and Event Co-ordinator. I have worked with these tools and have hands-on experience with them.
82) Define Project in Datastage?
Whenever we launch the Datastage client, we are asked to connect to a Datastage project. A Datastage project contains Datastage jobs, built-in components, and user-defined components.
83) How many types of hash files are there?
There are two types of hash files in DataStage: the static hash file and the dynamic hash file. The static hash file is used when a limited amount of data is to be loaded into the target database. The dynamic hash file is used when we don't know the amount of data coming from the source file.
84) Define Meta Stage?
In Datastage, MetaStage is used to save metadata that is helpful for data lineage and data analysis.
85) Have you ever worked in a UNIX environment, and why is it useful in Datastage?
Yes, I have worked in a UNIX environment. This knowledge is useful in Datastage because sometimes one has to write UNIX programs, such as batch programs, to invoke batch processing.
86) Differentiate between Datastage and Datastage TX?
Datastage is a tool from ETL (Extract, Transform and Load) and Datastage TX is a tool from EAI (Enterprise Application Integration).
87) What is size of a transaction and an array means in a Datastage?
Transaction size means the number of rows written before committing the records to a table. Array size means the number of rows written to or read from the table in a single operation.
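A hedged sketch of how the two settings interact (the names and callbacks are invented for illustration): rows are sent to the database array_size at a time, and a commit is issued once transaction_size rows have been written.

```python
def load(rows, array_size, transaction_size, execute, commit):
    since_commit = 0
    for i in range(0, len(rows), array_size):
        batch = rows[i:i + array_size]       # one array-sized write to the table
        execute(batch)
        since_commit += len(batch)
        if since_commit >= transaction_size: # commit boundary reached
            commit()
            since_commit = 0
    if since_commit:
        commit()                             # commit the final partial transaction

writes, commits = [], []
load(list(range(10)), array_size=3, transaction_size=6,
     execute=writes.append, commit=lambda: commits.append(1))
print(len(writes), len(commits))
```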
88) How many types of views are there in a Datastage Director?
There are three types of views in a Datastage Director i.e. Job View, Log View and Status View.
89) Why we use surrogate key?
In Datastage, we use Surrogate Key instead of unique key. Surrogate key is mostly used for retrieving data faster. It uses Index to perform the retrieval operation.
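A minimal sketch of a surrogate key generator (illustrative Python, not the DataStage Surrogate Key stage): each new natural key receives the next system-generated integer, and repeats get the same key back.

```python
import itertools

class SurrogateKeyGenerator:
    def __init__(self):
        self._next = itertools.count(1)   # monotonically increasing surrogate keys
        self._by_natural = {}             # natural/business key -> surrogate key

    def key_for(self, natural_key):
        if natural_key not in self._by_natural:
            self._by_natural[natural_key] = next(self._next)
        return self._by_natural[natural_key]

gen = SurrogateKeyGenerator()
print(gen.key_for("CUST-42"), gen.key_for("CUST-7"), gen.key_for("CUST-42"))  # 1 2 1
```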
90) How rejected rows are managed in Datastage?
In Datastage, rejected rows are managed through constraints in the transformer. We can either capture the rejected rows via the properties of the transformer or create temporary storage for rejected rows with the help of the REJECTED variable.
91) Differentiate between ODBC and DRS stage?
DRS stage is faster than the ODBC stage because it uses native databases for connectivity.
92) Define Orabulk and BCP stages?
Orabulk stage is used to load large amount of data in one target table of Oracle database. The BCP stage is used to load large amount of data in one target table of Microsoft SQL Server.
93) Define DS Designer?
The DS Designer is used to design jobs in the work area and add various stages and links to them.
94) Why do we use Link Partitioner and Link Collector in Datastage?
In Datastage, Link Partitioner is used to divide data into different parts through certain partitioning methods. Link Collector is used to gather data from various partitions/segments to a single data and save it in the target table.
95) What is a DataStage job?
- The Datastage job is simply DataStage code that we create as developers. It contains different stages linked together to define data and process flow.
- Stages are nothing but the functionalities that get implemented.
- For example, let's assume that I want to compute the sum of the sales amount. This can be a ‘group by' operation that will be performed by one stage.
- Now, I want to write the result to a target file, so this operation will be performed by another stage. Once I have defined both stages, I need to define the data flow from my ‘group by' stage to the target file stage. This data flow is defined by DataStage links.
96) What are DataStage sequences?
A Datastage sequence connects the DataStage jobs in a logical flow.
97) If you want to use the same piece of code in different jobs, how will you achieve this?
This can be done by using shared containers. We have shared containers for reusability. A shared container is a reusable job element consisting of stages and links. We can call a shared container in different Datastage jobs.