Top 35+ Sqoop Interview Question & Answer [MOST POPULAR]
Last updated on 03rd Jul 2020, Blog, Interview Questions
These Apache Sqoop Interview Questions have been designed specially to get you acquainted with the nature of questions you may encounter during your interview for the subject of Apache Sqoop . As per my experience good interviewers hardly plan to ask any particular question during your interview, normally questions start with some basic concept of the subject and later they continue based on further discussion and what you answer.we are going to cover top Apache Sqoop Interview questions along with their detailed answers. We will be covering Apache Sqoop scenario based interview questions, Apache Sqoop interview questions for freshers as well as Apache Sqoop interview questions and answers for experienced.
1) Mention the best features of Apache Sqoop.
Apache Sqoop is a tool in the Hadoop ecosystem that has several advantages. Like
- Parallel import/export
- Connectors for all major RDBMS Databases
- Import results of SQL query
- Incremental Load
- Full Load
- Kerberos Security Integration
- Load data directly into Hive / HBase
- Support for Accumulo
2) What is Sqoop Import? Explain its purpose.
While it comes to import tables from RDBMS to HDFS we use the Sqoop Import tool. Generally, we can consider that each row in a table is a record in HDFS. Also, when we talk about text files all records are there as text data. However, when we talk about Avro and sequence files all records are there as binary data here. To be more specific, it imports individual tables from RDBMS to HDFS.
3)What is the default file format to import data using Apache Sqoop?
By using two file formats Sqoop allows data import. Such as:
i) Delimited Text File Format
Basically, to import data using Sqoop this is the default file format. Moreover, to the import command in Sqoop, this file format can be explicitly specified using the –as-text file argument. Likewise, passing this argument will produce the string-based representation of all the records to the output files with the delimiter characters between rows and columns.
ii) Sequence File Format
We can say, Sequence file format is a binary file format. Their records are stored in custom record-specific data types which are shown as Java classes. In addition, Sqoop automatically creates these data types and manifests them as java classes.
4) How can I import large objects (BLOB and CLOB objects) in Apache Sqoop?
However, direct import of BLOB and CLOB large objects is not supported by Apache Sqoop import command. So, in order to import large objects like I Sqoop, JDBC based imports have to be used without the direct argument to the import utility.
5) How can you execute a free-form SQL query in Sqoop to import the rows in a sequential manner?
By using the –m 1 option in the Sqoop import command we can accomplish it. Basically, it will create only one MapReduce task which will then import rows serially.
6) Does Apache Sqoop have a default database?
Yes, MySQL is the default database.
7)How will you list all the columns of a table using Apache Sqoop?
- Since to list all the columns we do not have any direct command like sqoop-list-columns. So, indirectly we can achieve this is to retrieve the columns of the desired tables and redirect them to a file that can be viewed manually containing the column names of a particular table.
- Sqoop import –m 1 –connect ‘jdbc: sql server: //name of server; database=name database; username=DeZyre; password=mypassword’ –query “SELECT column_name, DATA_TYPE FROM INFORMATION_SCHEMA.Columns WHERE table_name=’table interest’ AND \$CONDITIONS” –target-dir ‘mytableofinterest_column_name’
8) If the source data gets updated every now and then, how will you synchronize the data in HDFS that is imported by Sqoop?
By using incremental parameter with data import we can synchronize the data–
–However, with one of the two options, we can use incremental parameter-
Basically, we should use incremental import with the append option. Even if the table is getting updated continuously with new rows and increasing row id values then. Especially, where values of some of the columns are checked (columns to be checked are specified using –check-column) and if it discovers any modified value for those columns then only a new row will be inserted.
However, in this kind of incremental import, the source has a date column which is checked for. Any records that have been updated after the last import based on the last modified column in the source, the values would be updated.
9) Name a few import control commands. How can Sqoop handle large objects?
To import RDBMS data, we use import control commands
- Append: Append data to an existing dataset in HDFS.
- Columns: columns to import from the table
- Where: where clause to use during import. —
Where the common large objects are Blog and Clob. Suppose the object is less than 16 MB, it is stored inline with the rest of the data. If there are big objects, they are temporarily stored in a subdirectory with the name _lob. Those data are then materialized in memory for processing. If we set the lob limit as ZERO (0) then it is stored in external memory.
10) How can we import data from a particular row or column? What are the destination types allowed in the Sqoop import command?
Basically, on the basis of where clause, Sqoop allows to Export and Import the data from the data table. So, the syntax is
- For Example:
sqoop import –connect jdbc:mysql://db.one.com/corp –table INTELLIPAAT_EMP –where “start_date> ’2016-07-20’ ”
sqoop eval –connect jdbc:mysql://db.test.com/corp –query “SELECT * FROM intellipaat_emp LIMIT 20”
sqoop import –connect jdbc:mysql://localhost/database –username root –password aaaaa –columns “name,emp_id,jobtitle”
- However, into following services Sqoop supports data imported:
11) When to use –target-dir and when to use –warehouse-dir while importing data?
Basically, we use –target-dir to specify a particular directory in HDFS. Whereas we use –warehouse-dir to specify the parent directory of all the sqoop jobs. So, in this case under the parent directory sqoop will create a directory with the same name as the table.
12)What is the process to perform an incremental data load in Sqoop?
- In Sqoop, the process to perform incremental data load is to synchronize the modified or updated data (often referred as delta data) from RDBMS to Hadoop. Moreover, in Sqoop the delta data can be facilitated through the incremental load command.
- In addition, by using Sqoop import command we can perform incremental load. Also, by loading the data into the hive without overwriting it. However, in Sqoop the different attributes that need to be specified during incremental load are
1) Mode (incremental)
It shows how Sqoop will determine what the new rows are. Also, it has value as Append or Last Modified.
2) Col (Check-column)
Basically, it specifies the column that should be examined to find out the rows to be imported.
3) Value (last-value)
It denotes the maximum value of the check column from the previous import operation.
13). What is the significance of using –compress-codec parameter?
However, we use the –compress -code parameter to get the out file of a sqoop import in formats other than .gz like .bz2.
14) Can free-form SQL queries be used with the Sqoop import command? If yes, then how can they be used?
In Sqoop, we can use SQL queries with the import command. Basically, we should use the import command with the –e and – query options to execute free-form SQL queries. But note that the –target dir value must be specified While using the –e and –query options with the import command.
15) What is the importance of eval tools?
Basically, Sqoop Eval helps to run sample SQL queries against Database as well as preview the results on the console. Moreover, it helps to know what data we can import or if the desired data is imported or not.
16) How can you import only a subset of rows from a table?
In the sqoop import statement, by using the WHERE clause we can import only a subset of rows.
17) What are the limitations of importing RDBMS tables into Hcatalog directly?
By making use of –hcatalog –database option with the –hcatalog –table, we can import RDBMS tables into Hcatalog directly. However, there is one limitation to it is that it does not support several arguments like –as-Avro file, -direct, -as-sequencefile, -target-dir , -export-dir.
18) What is the advantage of using –password-file rather than -P option while preventing the display of password in the sqoop import statement?
Inside a sqoop script, we can use The –password-file option. Whereas the -P option reads from standard input, preventing automation.
19) What do you mean by Free Form Import in Sqoop?
By using any SQL Sqoop can import data from a relational database query rather than only using table and column name parameters.
20) What is the role of JDBC drivers in Sqoop?
Basically, sqoop needs a connector to connect to different relational databases. Since, as a JDBC driver, every DB vendor makes this connector available which is specific to that DB. Hence, to interact with Sqoop needs the JDBC driver of each of the databases it needs.
21) Is JDBC driver enough to connect sqoop to the databases?
No. to connect to a database Sqoop needs both JDBC and connector.
22) What is InputSplit in Hadoop?
Input Split is defined as while a Hadoop job runs, it splits input files into chunks and also assigns each split to a mapper to process.
23) What is the work of Export in Hadoop sqoop?
Export tool transfer the data from HDFS to RDBMS
24) Use of Codegen command in Hadoop sqoop?
Basically, Codegen command generates code to interact with database records
25) Use of Help command in Hadoop sqoop?
Help command in Hadoop sqoop generally list available commands
26) How can you schedule a job using Oozie?
However, Oozie has in-built sqoop actions inside which we can mention the sqoop commands to be executed.
27) What is the importance of — the split-by clause in running parallel import tasks in sqoop?
In Sqoop, it mentions the column name based on whose value the data will be divided into groups of records. Further, by the MapReduce tasks, these groups of records will be read in parallel.
28) What is a sqoop metastore?
- A tool that Sqoop hosts a shared metadata repository is what we call sqoop metastore. Moreover, multiple users and/or remote users can define and execute saved jobs (created with the sqoop job) defined in this metastore.
- In addition, with the –meta-connect argument Clients must be configured to connect to the metastore in sqoop-site.xml.
29) What is the purpose of sqoop-merge?
The merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset preserving only the newest version of the records between both the data sets.
30) How can you see the list of stored jobs in sqoop metastore?
sqoop job –list
Best Sqoop Training with Advanced Topics By Expert Trainers
- Instructor-led Sessions
- Real-life Case Studies
31) Which database does the sqoop metastore run on?
Basically, on the current machine running sqoop-metastore launches, a shared HSQLDB database instance.
32) Where can the metastore database be hosted?
Anywhere, it means we can host a metastore database within or outside of the Hadoop cluster.
33) Give the sqoop command to see the content of the job named myjob?
Sqoop job –show myjob
34) How can you control the mapping between SQL data types and Java types?
we can configure the mapping between by using the –map-column-java property.
$ sqoop import … –map-column-java id = String, value = Integer
35) Is it possible to add a parameter while running a saved job?
Yes, by using the –exec option we can add an argument to a saved job at runtime.
sqoop job –exec job name — — newparameter
36) What is the usefulness of the options file in sqoop.
To specify the command line values in a file and use it in the sqoop commands we use the options file in sqoop.
- For example
The –connect parameter’s value and –user name value can be stored in a file and used again and again with different sqoop commands.
37) How can you avoid importing tables one-by-one when importing a large number of tables from a database?
Using the command
–exclude-tables table1,table2 ..
Basically, this will import all the tables except the ones mentioned in the exclude-tables clause.
38) How can you control the number of mappers used by the sqoop command?
To control the number of mappers executed by a sqoop command we use the parameter –num-mappers. Moreover, we should start with choosing a small number of map tasks and then gradually scale up as choosing a high number of mappers initially may slow down the performance on the database side.
39) What is the default extension of the files produced from a sqoop import using the –compress parameter?
40) What is the disadvantage of using –direct parameter for faster data load by sqoop?
The native utilities used by databases to support faster load do not work for binary data formats like SequenceFile.
41) How will you update the rows that are already exported?
Basically, to update existing rows we can use the parameter –update-key. Moreover, in it, a comma-separated list of columns is used which uniquely identifies a row. All of these columns are used in the WHERE clause of the generated UPDATE query. All other table columns will be used in the SET part of the query.
42). What are the basic commands in Apache Sqoop and its uses?
The basic commands of Apache Sqoop are:
- Codegen, Create-hive-table, Eval, Export, Help, Import, Import-all-tables, List-databases, List-tables, Versions.
Moreover, uses of Apache Sqoop basic commands are:
- Codegen- It helps to generate code to interact with database records.
- Create- hive-table- It helps to Import a table definition into a hive
- Eval- It helps to evaluate SQL statement and display the results
- Export- It helps to export an HDFS directory into a database table
- Help- It helps to list the available commands
- Import- It helps to import a table from a database to HDFS
- Import-all-tables- It helps to import tables from a database to HDFS
- List-databases- It helps to list available databases on a server
- List-tables- It helps to list tables in a database
- Version- It helps to display the version information
43) How did the Sqoop word come about? Sqoop is which type of tool and the main use of sqoop?
Sqoop word came from SQL+HADOOP=SQOOP.
Basically, it is a data transfer tool. We use Sqoop to import and export a large amount of data from RDBMS to HDFS and vice versa.
44)What is Sqoop Validation?
It means to validate the data copied. Either import or export by comparing the row counts from the source as well as the target post copy. Likewise, we use this option to compare the row counts between source as well as the target just after data imported into HDFS. Moreover, While during the imports, all the rows are deleted or added, Sqoop tracks this change. Also updates the log file.
45) What is the Purpose to Validate in Sqoop?
In Sqoop, validating the data copied is Validation’s main purpose. Basically, either Sqoop import or Export by comparing the row counts from the source as well as the target post copy.
46) What is Sqoop Job?
\ To perform an incremental import if a saved job is configured, then state regarding the most recently imported rows is updated in the saved job. Basically, that allows the job to continually import only the newest rows.
47). What is Sqoop Import Mainframe Tool and its Purpose?
Basically, a tool which we use to import all sequential datasets in a partitioned dataset (PDS) on a mainframe to HDFS is Sqoop Import Mainframe. That tool is what we call an import mainframe tool. Also, A PDS is akin to a directory on the open systems. Likewise, in a dataset, the records can only contain character data. Moreover here, records will be stored as a single text field with the entire record.
48) What is the purpose of Sqoop List Tables?
- Basically, the main purpose of sqoop-list-tables is list tables present in a database.
- Learn all insights of Sqoop List Tables, follow the link: Sqoop List Tables – Arguments and Examples
49) Difference Between Apache Sqoop vs Flume.
So, let’s discuss all the differences on the basis of features.
a. Data Flow
Apache Sqoop – Basically, Sqoop works with any type of relational database system (RDBMS) that has the basic JDBC connectivity. Also, Sqoop can import data from NoSQL databases like MongoDB, Cassandra and along with it. Moreover, it allows data transfer to Apache Hive or HDFS.
Apache Flume– Likewise, Flume works with streaming data sources that are generated continuously in Hadoop environments. Like log files.
b. Type of Loading
Apache Sqoop – Basically, Sqoop load is not driven by events.
Apache Flume – Here, data loading is completely event-driven.
c. When to use
Apache Sqoop – However, if the data is being available in Teradata, Oracle, MySQL, PostgreSQL or any other JDBC compatible database it is considered an ideal fit.
- Apache Flume – While we move bulk of streaming data from sources like JMS or spooling directories, it is the best choice.
d. Link to HDFS
Apache Sqoop – Basically, for importing data in Apache Sqoop, HDFS is the destination
Apache Flume – In Apache Flume, data generally flow to HDFS through channels
Apache Sqoop – Basically, it has connector based architecture. However, that means the connectors know a great deal in connecting with the various data sources. Also to fetch data correspondingly.
Apache Flume – However, it has agent-based architecture. Basically, it means code written in Flume is a call agent that may be responsible for fetching the data.