Learn ETL Testing Interview Questions & Answers [SURE SHOT]
Last updated on 24th Jun 2020, Blog, Interview Questions
In computing, extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system that represents the data differently from the source, or in a different context than the source. The most common use of ETL is in data warehousing: users need to fetch historical as well as current data to develop the data warehouse, since data warehouse data is a combination of historical and transactional data.
1. What is Talend?
Talend is open-source software implemented in Java. Whatever you create in Talend is generated as Java code in the background.
2. Name the services offered by Talend?
Data Integration, Data Management, Cloud, and Big Data.
3. What is Talend job?
A Talend job is a design that connects components to perform different integration tasks.
4. How to create a job?
Go to the Repository view, right-click on Job Designs, and then create a job.
5. How can you add components in your talend job?
- Components can be selected from the Palette view and then dragged and dropped onto the designer.
- In the designer view you can also type the component name and then select it.
6. Where are you going to implement the logics for integration?
In the designer view of the Talend job.
7. Can we modify the code of talend job in code view?
No, we can’t modify the code of a Talend job in the code view; it is read-only.
8. Can we write customized java code in Talend?
Yes, we can write customized java code in Talend.
9. How can we write the customized java code in Talend?
- First, by using routines. Routines can be reused across multiple jobs.
- Second, by using components like tJava, tJavaRow and tJavaFlex in a Talend job. Code written in these components applies only to that specific job.
10. Difference between the tJava, tJavaRow and tJavaFlex components in Talend?
- tJava – executes only once,
- tJavaRow – executes for each row,
- tJavaFlex – used when we want to perform a specific task at a certain point in the row lifecycle. tJavaFlex contains three sections: Start (code written here executes before processing any row), Main (code executes for each row), and End (code executes after processing every row).
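The three sections map onto the generated row loop roughly as follows. This is an illustrative sketch only; the class and variable names are my own, not Talend's generated identifiers:

```java
// Sketch of how tJavaFlex's Start/Main/End sections fit into the row loop.
// Names are illustrative, not Talend's generated code.
public class TJavaFlexSketch {
    public static int process(String[] rows) {
        // Start section: runs once, before any row is processed
        int count = 0;

        for (String row : rows) {
            // Main section: runs once per incoming row
            count++;
            System.out.println("processing: " + row);
        }

        // End section: runs once, after the last row
        System.out.println("total rows: " + count);
        return count;
    }
}
```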
11. Which component of talend can perform various tasks?
tMap is the Talend component that can perform various tasks.
12. List the usage of tmap components?
- Map the data from one or more sources to one or more destinations.
- Data transformation
- Filtering of data
- Rejection of data
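tMap's transformation and filter expressions are plain Java expressions evaluated per row. As a rough plain-Java equivalent (the field names and methods below are illustrative, not Talend API):

```java
// Plain-Java equivalents of typical tMap expressions (names are illustrative).
public class TMapExpressions {
    // Equivalent of an output-column expression: row1.firstName + " " + row1.lastName
    public static String fullName(String first, String last) {
        return first + " " + last;
    }

    // Equivalent of an output filter condition: row1.age >= 18 && row1.country != null
    public static boolean keepRow(int age, String country) {
        return age >= 18 && country != null;
    }
}
```

Rows for which the filter condition is false can be routed to a reject output instead of the main output.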
13. Can we call a job within another job?
Yes, using the tRunJob component. The child job name can be selected in the tRunJob component’s properties.
14. How can you export a job?
Select the job in the Repository view, right-click, and then choose to export the job. There you can specify whether you want to export this job only, or its dependencies as well.
15. What do the exported files from Talend contain?
They include .properties, .screenshot and .item files.
16. Which component can be used to print the data on the console?
The tLogRow component can be used to print the data on the console.
17. How can we execute jobs on remote servers in local Talend Studio?
- First configure the remote server: go to Window -> Preferences -> Talend -> Run/Debug -> Remote, and add the remote server details.
- Then, to execute the job on the remote server, go to the Run job view -> Target Exec tab -> select the remote server, and run the job.
18. How can we create contexts which can be used in multiple job?
Create the contexts in the Repository view. Those contexts can then be imported into multiple jobs wherever they are required.
19. Is it possible to call web services from talend job?
Yes, it is possible to call web services from a Talend job.
20. How can we execute a shell script from a Talend job?
By using the tSystem component we can execute shell scripts from a Talend job.
21. Why does the ‘GC overhead limit exceeded’ error occur?
This error occurs when the garbage collector’s memory is fully utilized and no memory is left for further processing. In Talend we generally see this when a job handles huge lookup data. Increasing the -Xmx JVM argument helps solve the issue.
22. Which command is used to check running jobs on the server?
The jps -m command is used to check running jobs on the server.
23. Types of scheduling triggers in TAC
There are 3 types: Simple trigger, CronUI trigger, and File trigger.
24. How can you capture logs using the tlogCatcher component?
tLogCatcher catches events raised by the tDie and tWarn components, so we can log those events, including customized messages and event codes.
25. How can you store job logs using AMC?
We can activate AMC (Activity Monitoring Console) by enabling user errors, warnings and runtime errors in the Stats & Logs tab of the job window. The entries can be logged to a database, files, or the console.
26. Name some cache memory components.
tHashInput and tHashOutput.
27. Null handling in Talend?
Null handling exceptions are among the most common causes of job failure. To avoid them, it is advisable to check a column for null values before applying transformations to that column.
Example : (row1.Name != null) ? StringHandling.Right(row1.Name,3) : null
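A plain-Java equivalent of the expression above; the class and method names are my own, and the method re-implements the behavior of StringHandling.Right purely for illustration:

```java
// Null-safe "rightmost n characters", mirroring
// (row1.Name != null) ? StringHandling.Right(row1.Name, 3) : null
// Class/method names are illustrative, not part of Talend.
public class NullSafe {
    public static String right(String s, int n) {
        if (s == null) {
            return null; // guard first, so no NullPointerException is thrown
        }
        return s.length() <= n ? s : s.substring(s.length() - n);
    }
}
```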
28.How data can be sent from the parent job to the child job?
We can use context variables to pass values from the parent job to the child job by assigning values in the tRunJob component.
29.What is a Dynamic schema?
Using a dynamic schema we can extract all the columns, irrespective of their data types, into a single column.
30. What is Shared DB Connection in Talend?
We can reuse the same DB connection multiple times within a job by creating a shared DB connection in the basic settings of the first DB connection component.
31. Handling Hierarchical XML file and extracting data?
We can use the tFileInputMSXML and tFileOutputMSXML components to extract data from, or create, a hierarchical XML structure having multiple loops at different levels.
32. Techniques to increase performance of a job having huge lookup data?
Unmap unwanted columns, apply filters at the row level, store lookup data on disk (in the tMap advanced settings), and enable parallelism.
33. How to execute a Talend Job remotely?
By building the job including its dependencies and running the executable files (.sh or .bat) on the remote server.
34. What are Routines and how they can be accessed?
Routines are reusable java code that contains functions that can be accessed throughout our project by calling routine names in the expression editor.
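A routine is simply a public class with static methods. A minimal sketch of one (the routine and method names are assumptions for illustration), which could then be called from any expression editor as MyStringUtil.cleanUpper(row1.name):

```java
// A minimal user routine as it might appear under Code > Routines.
// The class and method names are illustrative.
public class MyStringUtil {
    // Trims the input and upper-cases it; returns "" for null input
    // so downstream expressions stay null-safe.
    public static String cleanUpper(String s) {
        if (s == null) {
            return "";
        }
        return s.trim().toUpperCase();
    }
}
```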
35. List some of the components used to connect to IBM WebSphere MQ
The MOM components, such as tMomConnection, tMomInput and tMomOutput, can be used to connect to WebSphere MQ.
36. Why Talend?
- Talend provides a simple and unified user interface with a Repository window, Palette window, job design window and configuration/run window.
- It also has 800+ components available, which can be dragged from the Palette window and used easily.
37. How can we perform different types of joins using Talend components?
We can use tJoin and tMap. tJoin supports only inner joins; we can perform other joins using tMap.
38. Benefits of using tMap over tJoin:
We can perform inner and outer joins using tMap; we can also filter data and create multiple outputs using tMap.
39. What are context variables?
Context variables are variables whose values differ per environment. We define values for the variables used in a job for DEV or PROD as contexts, so that we can choose the values at run time without changing the job.
40. What is dynamic schema and how can we define dynamic schema?
Dynamic schema is a type of schema whose structure changes as per the incoming set of data. It can be defined while defining schema by choosing data type as dynamic.
41. What is the use of tJavaFlex?
We can run user-defined Java code split into multiple parts (start, main, end) using tJavaFlex.
42. How can we trigger a component after a successful run of other components?
We can use OnSubjobOk/OnComponentOk trigger links; to wait for several parallel subjobs, we can use the tParallelize component configured to wait for all to finish.
43. How can we schedule a job in Talend?
We can schedule a job using TAC. First we publish the job to Nexus; the job is then passed from Nexus to TAC as a zip file. After that we generate the job for the PROD environment and deploy it there. We can then use triggers to schedule the job on the server.
44. How many types of triggers are present in TAC/Talend?
3 types of triggers we can use to schedule the job:
- Simple trigger
- Cron trigger
- File trigger
45. How to create a Hadoop cluster in Talend?
Go to Metadata -> right-click on Hadoop Cluster -> create a new cluster -> provide the distribution details (Cloudera or Hortonworks) -> provide the URI and credentials.
46. How to read a JSON file using Talend?
We can use tFileInputJSON to read a JSON file. In the component TAB we can configure this component and define the type to read: JSON query/JSON PATH without loop.
47. How to create a Map Reduce job through Talend?
Go to job design -> right-click -> choose Map/Reduce job -> use the tHDFSInput component -> perform operations using normal Talend components like tMap/tSortRow etc. -> load the data using tHDFSOutput.
48. How can we call a child job from a master job? How can we trigger particular steps before and after a job?
We can use the tRunJob component to call the child job and connect it using connectors. For performing specific steps before a job we use tPrejob, and for steps after the completion of the job we use tPostjob.
49. How many run modes are there in Talend?
There are two types of run mode:
- Normal mode
- Debug mode – we can define breakpoints in a job and run the job to debug it.
49. What is the use of the tContextLoad component?
tContextLoad is used to change the values of the existing context during execution. We can provide run-time values to context variables using this component.
51. Difference between tAggregateRow and tAggregateSortedRow
- tAggregateRow will accept data in any order
- tAggregateSortedRow expects its input data already in sorted order
52. Difference between hash and buffer components
- Hash: tHashInput/tHashOutput are used within a single job
- Buffer: tBufferInput/tBufferOutput can be used across different jobs (for example, to pass data from a child job to its parent)
53. Difference between partitioning and departitioning
- Partitioning: splits one data flow into a number of parallel flows
- Departitioning: merges a number of parallel flows back into a single flow
54. How can we improve the performance of a job
We can improve the performance of a job by:
- Reducing the number of tMap components
- Removing unnecessary columns
- Optimizing the query, if one is included
55. Design a job where the job should start if there are records in the DB.
Run a query that returns the record count first (for example with a DB input component), store the count in a variable, and use a Run If trigger so that the job starts only when the count is greater than 0.
56. Design a job where it can read files when the file is in the folder.
We can use the tFileExist component (or tWaitForFile) with a Run If trigger so that the job reads the file only when it is present in the folder.
57. Design a job where updated records or newly inserted records must be loaded into the target
Track the last loaded record, for example the maximum key or timestamp already in the target, and pick up only records after it, so that updated or newly inserted records are loaded into the target.
58. Explain parallelism
Parallelism allows subjobs or data flows to run in parallel, which can increase performance.
59. How to improve map performance
In tMap’s advanced settings we can store temporary lookup data on disk and clear the cached data; using these options improves map performance for large lookups.
60. What is the use of a unique match?
In a tMap lookup, the Unique match option picks a single matching record (the last match) from a set of matching records.
61. What operations does ETL testing include?
ETL testing includes:
- Verifying whether the data is transformed correctly according to business requirements
- Verifying that the projected data is loaded into the data warehouse without any truncation or data loss
- Making sure that the ETL application reports invalid data and replaces it with default values
- Making sure that data loads within the expected time frame, to improve scalability and performance.
62. What are the types of data warehouse applications, and what is the difference between data mining and data warehousing?
The types of data warehouse applications are
- Info Processing
- Analytical Processing
- Data Mining
Data mining can be defined as the process of extracting hidden predictive information from large databases and interpreting the data, while data warehousing may make use of a data mine for analytical processing of the data in a faster way. Data warehousing is the process of aggregating data from multiple sources into one common repository.
63. What are the various tools used in ETL?
- Cognos Decision Stream
- Oracle Warehouse Builder
- Business Objects XI
- SAS business warehouse
- SAS Enterprise ETL
64. What is fact? What are the types of facts?
It is a central component of a multi-dimensional model which contains the measures to be analysed. Facts are related to dimensions.
Types of facts are
- Additive Facts
- Semi-additive Facts
- Non-additive Facts
65.Explain what are Cubes and OLAP Cubes?
- Cubes are data processing units composed of fact tables and dimensions from the data warehouse. It provides multidimensional analysis.
- OLAP stands for Online Analytical Processing, and an OLAP cube stores large data in multidimensional form for reporting purposes. It consists of facts, called measures, categorized by dimensions.
66. Explain what is tracing level and what are the types?
Tracing level is the amount of data stored in the log files. Tracing levels can be classified into two types: Normal and Verbose. The Normal level logs information in a summarized manner, while Verbose logs information at each and every row.
67. Explain what is Grain of Fact?
Grain of fact can be defined as the level at which the fact information is stored. It is also known as fact granularity.
68. Explain what factless fact schema is and what is Measures?
- A fact table without measures is known as a factless fact table. It can be used to count occurring events. For example, it can record an event such as an employee head count in a company.
- The numeric data based on columns in a fact table is known as Measures.
69. Explain what is transformation?
A transformation is a repository object which generates, modifies or passes data. Transformations are of two types: Active and Passive.
70. Explain the use of Lookup Transformation?
The Lookup Transformation is useful for
- Getting a related value from a table using a column value
- Updating a slowly changing dimension table
- Verifying whether records already exist in the table
71. Explain what is partitioning, hash partitioning and round robin partitioning?
To improve performance, transactions are subdivided; this is called partitioning. Partitioning enables the Informatica Server to create multiple connections to various sources. The types of partitions are:
- Round-Robin Partitioning: Informatica distributes the data evenly among all partitions. It is applicable when the number of rows to process in each partition is approximately the same.
- Hash Partitioning: The Informatica server applies a hash function to the partitioning keys to group data among partitions. It is used when we need to ensure that groups of rows with the same partitioning key are processed in the same partition.
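The routing logic of the two strategies can be sketched in plain Java; this is an illustration of the idea, not Informatica's implementation:

```java
// Illustrative routing logic for the two partitioning strategies.
public class PartitionSketch {
    // Round-robin: deal rows out evenly, one partition after another.
    public static int roundRobin(long rowIndex, int partitions) {
        return (int) (rowIndex % partitions);
    }

    // Hash: apply a hash function to the partitioning key, so all rows
    // sharing the same key land in the same partition.
    public static int hashPartition(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }
}
```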
72.Mention What is the advantage of using a DataReader Destination Adapter?
The advantage of using the DataReader Destination Adapter is that it populates an ADO recordset (consisting of records and columns) in memory and exposes the data from the DataFlow task by implementing the DataReader interface, so that other applications can consume the data.
73.Using SSIS ( SQL Server Integration Service) what are the possible ways to update table?
To update a table using SSIS, the possible ways are:
- Use a SQL command
- Use a staging table
- Use a cache
- Use the Script Task
- Use the full database name for updating if MSSQL is used
74. In case you have a non-OLEDB (Object Linking and Embedding Database) source for the lookup what would you do?
If you have a non-OLEDB source for the lookup, then you have to use a cache to load the data and use it as the source.
75. In what case do you use dynamic cache and static cache in connected and unconnected transformations?
Dynamic cache is used when you have to update the master table and for slowly changing dimensions (SCD) Type 1. For flat files, static cache is used.
76. Explain what a data source view is?
A data source view allows you to define the relational schema which will be used in the Analysis Services databases. Dimensions and cubes are created from data source views rather than directly from data source objects.
77. Explain what is the difference between OLAP tools and ETL tools ?
The difference is that an ETL tool is meant for extracting data from legacy systems and loading it into a specified database, with some process of cleansing the data.
Example: DataStage, Informatica, etc.
OLAP tools, meanwhile, are meant for reporting purposes on data available in multidimensional models.
Example: Business Objects, Cognos, etc.
78.How you can extract SAP data using Informatica?
With the PowerConnect option you can extract SAP data using Informatica:
- Install and configure the PowerConnect tool; PowerConnect acts as a gateway between Informatica and SAP and is used to connect to and import sources from external systems.
- Import the source into the Source Analyzer.
- Generate the ABAP code for the mapping; only then can Informatica pull data from SAP.
79.Explain what staging area is and what is the purpose of a staging area?
Data staging is an area where you hold data temporarily on the data warehouse server. Data staging includes the following steps:
- Source data extraction and data transformation (restructuring)
- Data transformation (data cleansing, value transformation)
- Surrogate key assignments
80. What is Bus Schema?
The BUS schema is used to identify the dimensions common to the various business processes. It comes with conformed dimensions along with a standardized definition of information.
81. Explain what is data purging?
Data purging is a process of deleting data from a data warehouse. It deletes junk data like rows with null values or extra spaces.
82. Explain what Schema Objects are?
Schema objects are the logical structures that directly refer to the database’s data. Schema objects include tables, views, sequences, synonyms, indexes, clusters, functions, packages and database links.
83. Explain these terms Session, Worklet, Mapplet and Workflow ?
- Mapplet: It arranges or creates sets of transformations.
- Worklet: It represents a specific set of given tasks.
- Workflow: It is a set of instructions that tell the server how to execute tasks.
- Session: It is a set of parameters that tells the server how to move data from sources to targets.
84. What is an ETL process?
ETL is the process of Extraction, Transformation, and Loading.
85. How many steps are there in an ETL process?
An ETL process has three broad steps: data is first extracted from a source, such as database servers, then transformed according to business rules, and finally loaded into the target.
86. What are the steps involved in an ETL process?
The steps involved are defining the source and the target, creating the mapping, creating the session, and creating the workflow.
87. Can there be sub-steps for each of the ETL steps?
Each of the steps involved in ETL has several sub-steps. The transform step has more sub-steps.
88. What are initial load and full load?
In ETL, the initial load is the process of populating all data warehouse tables for the very first time. In a full load, all records are loaded at a stretch each time the job runs, depending on volume; the load erases all contents from the table and reloads the fresh data.
89. What is meant by incremental load?
Incremental load refers to applying dynamic changes as and when required in a specific period and predefined schedules.
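A common way to implement an incremental load is a watermark column. A hedged sketch of the query-building step follows; the table and column names (orders, updated_at) are assumptions for illustration:

```java
// Sketch of incremental-load query building with a watermark column.
// Table/column names are assumptions, not from any specific system.
public class IncrementalLoad {
    // Only rows changed since the last successful run are selected;
    // a full load would omit the WHERE clause entirely.
    public static String buildQuery(String lastWatermark) {
        return "SELECT * FROM orders WHERE updated_at > '" + lastWatermark + "'";
    }
}
```

In production the watermark would come from a job-run audit table, and the value would be bound as a query parameter rather than concatenated into the SQL string.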
90. What is a 3-tier system in ETL?
The data warehouse is considered to be the 3-tier system in ETL.
91. What are the names of the layers in ETL?
The first layer in ETL is the source layer, and it is the layer where data lands. The second layer is the integration layer where the data is stored after transformation. The third layer is the dimension layer where the actual presentation layer stands.
92. What is meant by snapshots?
Snapshots are read-only copies of the data stored in the master table.
93. What are the characteristics of snapshots?
Snapshots are located on remote nodes and refreshed periodically so that the changes in the master table can be recorded. They are also replicas of the master tables.
94. What are views?
Views are built using the attributes of one or more tables. Views with a single table can be updated, but those with multiple tables cannot be updated.
95. What is meant by a materialized view log?
A materialized view log is a table associated with a master table that records changes made to the master table’s data; it allows a materialized view defined on the master table to be refreshed incrementally (fast refresh) instead of being completely rebuilt.
96. What is the difference between PowerCenter and PowerMart?
PowerCenter processes large volumes of data, whereas PowerMart processes small volumes of data.
97. With which apps can PowerCenter be connected?
PowerCenter can be connected with ERP sources such as SAP, Oracle Apps, PeopleSoft, etc.
98. Which partition is used to improve the performances of ETL transactions?
To improve the performances of ETL transactions, the session partition is used.
99. Does PowerMart provide connections to ERP sources?
No! PowerMart does not provide connections to any of the ERP sources. It also does not allow session partitioning.
100. What is meant by an operational data store?
The operational data store (ODS) is the repository that exists between the staging area and the data warehouse. The data stored in ODS has low granularity.