Apache Hive Interview Questions & Answers [GUIDE TO CRACK]

Last updated on 03rd Jul 2020


These Hive interview questions have been designed specially to get you acquainted with the nature of questions you may encounter during your interview on the subject of Hive. In my experience, good interviewers hardly plan to ask any particular question during an interview; normally, questions start with some basic concept of the subject and then continue based on further discussion and what you answer. We are going to cover the top Hive interview questions along with their detailed answers, including Hive scenario-based interview questions, Hive interview questions for freshers, and Hive interview questions and answers for experienced candidates.

1. What is a partition in Hive?

Ans:

Hive organizes tables into partitions to group similar types of data together based on a column or partition key. Each table can have one or more partition keys to identify a particular partition. Physically, a partition is nothing but a sub-directory inside the table directory.
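
For example, a minimal sketch of a partitioned table (the table and column names are illustrative):

CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (country STRING, year INT);
-- each (country, year) pair becomes a sub-directory,
-- e.g. .../warehouse/sales/country=US/year=2020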

2. What kinds of applications are supported by Apache Hive?

Ans:

Hive supports all those client applications that are written in:

  • Java
  • PHP
  • Python
  • C++
  • Ruby

by exposing its Thrift server.

3. Where does the data of a Hive table get stored?

Ans:

By default, Hive table data is stored in an HDFS directory – /user/hive/warehouse. One can change it by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.

4. What is a metastore in Hive?

Ans:

The metastore in Hive stores the metadata information using an RDBMS and an open-source ORM (Object Relational Model) layer called DataNucleus, which converts the object representation into a relational schema and vice versa.

5. Why does Hive not store metadata information in HDFS?

Ans:

Hive stores metadata information in the metastore using an RDBMS instead of HDFS. The reason for choosing an RDBMS is to achieve low latency, as HDFS read/write operations are time-consuming processes.

6. What is the difference between local and remote metastore?

Ans:

Local Metastore:

In local metastore configuration, the metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.

Remote Metastore:

In the remote metastore configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM. Other processes communicate with the metastore server using Thrift Network APIs. You can have one or more metastore servers in this case to provide more availability.

7. What is the default database provided by Apache Hive for metastore?

Ans:

By default, Hive provides an embedded Derby database instance backed by the local disk for the metastore. This is called the embedded metastore configuration.

8. Scenario: Suppose I have installed Apache Hive on top of my Hadoop cluster using the default metastore configuration. What will happen if multiple clients try to access Hive at the same time?

Ans:

The default metastore configuration allows only one Hive session to be opened at a time for accessing the metastore. Therefore, if multiple clients try to access the metastore at the same time, they will get an error. One has to use a standalone metastore, i.e. a local or remote metastore configuration in Apache Hive, to allow multiple clients concurrent access.

Following are the steps to configure MySQL database as the local metastore in Apache Hive:

One should make the following changes in hive-site.xml:

  • javax.jdo.option.ConnectionURL should be set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true.
  • javax.jdo.option.ConnectionDriverName should be set to com.mysql.jdbc.Driver.
  • javax.jdo.option.ConnectionUserName should be set to the desired username.
  • javax.jdo.option.ConnectionPassword should be set to the desired password.
  • The JDBC driver JAR file for MySQL must be on Hive's classpath, i.e. the JAR file should be copied into Hive's lib directory.

Now, after restarting the Hive shell, it will automatically connect to the MySQL database which is running as a standalone metastore.

9. What is the difference between external table and managed table?

Ans:

Here is the key difference between an external table and a managed table:

  • In the case of a managed table, if one drops the table, the metadata information along with the table data is deleted from the Hive warehouse directory.
  • On the contrary, in the case of an external table, Hive just deletes the metadata information regarding the table and leaves the table data present in HDFS untouched, as the sketch below shows.
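
For example, a minimal sketch of this behavior (the path and table name are illustrative):

CREATE EXTERNAL TABLE web_logs (line STRING)
LOCATION '/data/raw/web_logs';

-- dropping it removes only the metadata; the files stay in HDFS
DROP TABLE web_logs;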

Note: I would suggest going through the blog on Hive Tutorial to learn more about managed tables and external tables in Hive.

10. What is Apache Hive?

Ans:

Basically, Hive is what we call a data warehousing tool. Hive provides SQL queries to perform analysis, as well as an abstraction. Although Hive is not a database, it gives you a logical abstraction over databases and tables.

11. Is Hive suitable to be used for OLTP systems? Why?

Ans:

No, it is not suitable for OLTP systems, since it does not offer inserts and updates at the row level.

12. Is it possible to change the default location of a managed table?

Ans:

Yes, by using the clause LOCATION '<hdfs_path>' we can change the default location of a managed table, as in the sketch below.
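
For example, a minimal sketch (the path and names are illustrative):

CREATE TABLE page_views (user_id INT, url STRING)
LOCATION '/user/analytics/page_views';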

13. What are the different types of tables available in Hive?

Ans:

There are two types: managed tables and external tables. In a managed table, both the data and the schema are under the control of Hive, but in an external table, only the schema is under the control of Hive.

14. Can a table be renamed in Hive?

Ans:

Yes, using the ALTER TABLE statement:

ALTER TABLE table_name RENAME TO new_name;

15. Can we change the data type of a column in a Hive table?

Ans:

Yes, using the REPLACE COLUMNS option:

ALTER TABLE table_name REPLACE COLUMNS (col_name new_data_type, …);

16. What is the need for a custom SerDe?

Ans:

Depending on the nature of the data the user has, the built-in SerDes may not satisfy the format of the data. So users need to write their own Java code to satisfy their data format requirements.
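
For example, a hedged sketch of attaching a custom SerDe to a table (the class name com.example.MyCustomSerDe is a hypothetical placeholder for the user's own Java SerDe):

CREATE TABLE my_logs (host STRING, request STRING)
ROW FORMAT SERDE 'com.example.MyCustomSerDe'
STORED AS TEXTFILE;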

17. Why do we need Hive?

Ans:

Hive is a tool in the Hadoop ecosystem which provides an interface to organize and query data in a database-like fashion and to write SQL-like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.

18. What is the default location where Hive stores table data?

Ans:

hdfs://namenode_server/user/hive/warehouse

19. What are the three different modes in which Hive can be run?

Ans:

  • Local mode
  • Distributed mode
  • Pseudo-distributed mode

20. Is there a date data type in Hive?

Ans:

Yes. The TIMESTAMP data type stores dates in the java.sql.Timestamp format.


    21. What are collection data types in Hive?

    Ans:

    There are three collection data types in Hive.

    • ARRAY
    • MAP
    • STRUCT
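
    For example, a minimal sketch of a table using all three collection data types (the names and delimiters are illustrative):

    CREATE TABLE employees (
      name STRING,
      skills ARRAY<STRING>,
      salary_by_year MAP<STRING, FLOAT>,
      address STRUCT<street:STRING, city:STRING>
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '|'
    MAP KEYS TERMINATED BY ':';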

    22. When to use Hive?

    Ans:

    Hive is useful when building data warehouse applications:

    • When you are dealing with static data instead of dynamic data
    • When the application can tolerate high latency (high response time)
    • When a large data set is maintained
    • When we are using queries instead of scripting

    23. When should you use MapReduce mode?

    Ans:

    MapReduce mode is used when:

    • Queries have to run over large data sets and execute in a parallel way
    • Hadoop has multiple data nodes and the data is distributed across the different nodes
    • Large data sets have to be processed with better performance

    24. What are the key components of the Hive architecture?

    Ans:

    Key components of the Hive architecture include:

    • User Interface
    • Compiler
    • Metastore
    • Driver
    • Execution Engine

    25. What is Hive composed of?

    Ans:

    Hive consists of 3 main parts:

    • Hive Clients
    • Hive Services
    • Hive Storage and Computing

    26. What types of databases does Hive support for the metastore?

    Ans:

    For single-user metadata storage, Hive uses the Derby database, and for multi-user metadata or shared-metadata cases, Hive uses MySQL.

    27. What are Hive's default read and write classes?

    Ans:

    Hive's default read and write classes are:

    • TextInputFormat/HiveIgnoreKeyTextOutputFormat
    • SequenceFileInputFormat/SequenceFileOutputFormat

    28. What is Hive Bucketing?

    Ans:

    When performing queries on large datasets in Hive, bucketing can offer a better structure for Hive tables. For example, bucketing can give programmers more flexibility when it comes to record-keeping and can make it easier to debug large datasets when needed.
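
    For example, a minimal sketch of a bucketed table (the names and bucket count are illustrative):

    CREATE TABLE user_actions (user_id INT, action STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS;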

    29. What Options Are Available When It Comes to Attaching Applications to the Hive Server?

    Ans:

    There are three different ways:

    • Thrift Client
    • JDBC Driver
    • ODBC Driver

    For example, the JDBC driver supports the JDBC protocol.

    30. What Variations of Tables Are Available in Hive?

    Ans:

    The two types of tables are managed tables and external tables.


    31. What File Formats and Applications Does Hive Support?

    Ans:

    Applications written in C++, Python, Java, PHP, and Ruby are generally supported in Hive. When it comes to file formats, Hive supports text file formats by default but also supports binary file formats, such as Avro, ORC, Sequence, and Parquet files.

    32. What is the use of HCatalog?

    Ans:

    HCatalog can be used to share data structures with external systems. HCatalog provides access to the Hive metastore for users of other tools on Hadoop, so that they can read and write data to Hive's data warehouse.

    33. Explain the difference between partitioning and bucketing.

    Ans:

    • Partitioning and bucketing of tables are done to improve query performance. Partitioning helps execute queries faster, but only if the partitioning scheme matches some common range filtering, i.e. by timestamp ranges, by location, etc. Bucketing does not take effect by default; it has to be enforced explicitly (see the hive.enforce.bucketing question below).
    • Partitioning helps eliminate data when used in a WHERE clause. Bucketing helps organize the data inside a partition into multiple files, so that the same set of data is always written to the same bucket. Bucketing helps in joining on the bucketed columns.
    • In the partitioning technique, a partition is created for every unique value of the column, so a situation can arise where many tiny partitions have to be created. With bucketing, however, one can limit the number of buckets, and the data is then decomposed into those buckets.
    • Basically, a bucket is a file in Hive, whereas a partition is a directory.

    34. Explain about the different types of partitioning in Hive?

    Ans:

    Partitioning in Hive helps prune the data when executing the queries to speed up processing. Partitions are created when data is inserted into the table. In static partitions, the name of the partition is hardcoded into the insert statement whereas in a dynamic partition, Hive automatically identifies the partition based on the value of the partition field.

    Based on how data is loaded into the table, the requirements for the data, and the format in which data is produced at the source, a static or dynamic partition can be chosen. In dynamic partitioning, the complete data in the file is read and partitioned through a MapReduce job into the tables, based on a particular field in the file. Dynamic partitions are usually helpful during ETL flows in the data pipeline.

    When loading data from huge files, static partitions are preferred over dynamic partitions as they save time in loading data. The partition is added to the table and then the file is moved into the static partition. The partition column value can be obtained from the file name without having to read the complete file.
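
    For example, a minimal sketch of a dynamic-partition insert (the table and column names are illustrative):

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- Hive derives each row's country partition from the last SELECT column
    INSERT OVERWRITE TABLE sales_by_country PARTITION (country)
    SELECT id, amount, country FROM staging_sales;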

    35. When executing Hive queries in different directories, why is metastore_db created in all places from where Hive is launched?

    Ans:

    When running Hive in embedded mode, it creates a local metastore. When you run a query, it first checks whether a metastore already exists or not. The property javax.jdo.option.ConnectionURL defined in hive-site.xml has the default value jdbc:derby:;databaseName=metastore_db;create=true.

    This value implies that embedded Derby will be used as the Hive metastore and that the location of the metastore is metastore_db, which will be created only if it does not already exist. The location metastore_db is a relative path, so when you run queries from different directories it gets created in all the places from which you launch Hive. This property can be changed in hive-site.xml to an absolute path, so that one location is used instead of creating a metastore_db subdirectory multiple times.

    36. What are the components of a Hive query processor?

    Ans:

    The query processor in Apache Hive converts the SQL into a graph of MapReduce jobs, along with the execution-time framework, so that the jobs can be executed in the order of their dependencies. The various components of the query processor are:

    • Parser
    • Semantic Analyser
    • Type Checking
    • Logical Plan Generation
    • Optimizer
    • Physical Plan Generation
    • Execution Engine
    • Operators
    • UDFs and UDAFs

    37. Differentiate between describe and describe extended.

    Ans:

    • Describe database/schema- This query displays the name of the database, the root location on the file system and comments if any.
    • Describe extended database/schema- Gives the details of the database or schema in a detailed manner.

    38. Is it possible to overwrite the Hadoop MapReduce configuration in Hive?

    Ans:

    Yes, the Hadoop MapReduce configuration can be overwritten by changing the Hive conf settings, either in the configuration file or per session with the SET command.
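
    For example, a minimal sketch of a per-session override (the value is illustrative):

    SET mapred.reduce.tasks=32;
    -- applies only to the current Hive session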

    39. I want to see the present working directory in UNIX from hive. Is it possible to run this command from hive?

    Ans:

    Hive allows execution of UNIX commands with the use of the exclamation (!) symbol. Just use the ! symbol before the command to be executed at the hive prompt. To see the present working directory in UNIX from hive, run !pwd at the hive prompt.

    40. What is the use of explode in Hive?

    Ans:

    explode in Hive is used to convert complex data types into the desired table formats. The explode UDTF basically emits all the elements in an array as separate rows.
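
    For example, a minimal sketch (the table and column names are illustrative; skills is an ARRAY<STRING> column):

    SELECT name, skill
    FROM employees
    LATERAL VIEW explode(skills) s AS skill;
    -- emits one output row per element of the skills array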

    41. Explain SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY in Hive.

    Ans:

    SORT BY – Data is ordered at each of ‘N’ reducers where the reducers can have overlapping range of data.

    ORDER BY- This is similar to the ORDER BY in SQL where total ordering of data takes place by passing it to a single reducer.

    DISTRIBUTE BY – It is used to distribute the rows among the reducers. Rows that have the same DISTRIBUTE BY columns will go to the same reducer.

    CLUSTER BY- It is a combination of DISTRIBUTE BY and SORT BY where each of the N reducers gets non overlapping range of data which is then sorted by those ranges at the respective reducers.
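
    For example, minimal sketches of each clause (the table and column names are illustrative):

    SELECT * FROM sales ORDER BY amount;                        -- total ordering, single reducer
    SELECT * FROM sales DISTRIBUTE BY country SORT BY amount;   -- per-reducer ordering
    SELECT * FROM sales CLUSTER BY country;                     -- DISTRIBUTE BY + SORT BY on the same column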

    42. Write a Hive query to view all the databases whose name begins with "db".

    Ans:

    SHOW DATABASES LIKE 'db.*';

    43. How can you prevent a large job from running for a long time?

    Ans:

    This can be achieved by setting the MapReduce jobs to execute in strict mode:

    set hive.mapred.mode=strict;

    The strict mode ensures that the queries on partitioned tables cannot execute without defining a WHERE clause.

    44. Are multiline comments supported in Hive?

    Ans:

    No

    45. What is ObjectInspector functionality?

    Ans:

    ObjectInspector is used to analyse the structure of individual columns and the internal structure of the row objects. ObjectInspector in Hive provides access to complex objects which can be stored in multiple formats.

    46. Explain the different types of join in Hive.

    Ans:

    HiveQL has 4 different types of joins (see the sketch after this list):

    • JOIN – similar to an inner join in SQL; it returns the records that fulfil the join condition in both tables.
    • FULL OUTER JOIN – combines the records of both the left and the right table, returning all rows and filling in NULLs where the join condition is not fulfilled.
    • LEFT OUTER JOIN – all the rows from the left table are returned even if there are no matches in the right table.
    • RIGHT OUTER JOIN – all the rows from the right table are returned even if there are no matches in the left table.
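
    For example, a minimal sketch of one of these joins (the table and column names are illustrative):

    SELECT c.name, o.order_id
    FROM customers c
    LEFT OUTER JOIN orders o ON (c.id = o.customer_id);
    -- customers with no orders still appear, with a NULL order_id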

    47. What is dynamic partitioning, and when would you use it?

    Ans:

    This two-part question tests your knowledge about a specific Hive feature and requests that you apply it to a previous professional experience. When rehearsing your answer, be sure to include a clear definition and the ways you have used dynamic partitioning in your previous work.

    Example: "Dynamic partitioning means that Hive determines the partition into which each row goes automatically at load time, based on the value of the partition column, instead of the partition name being hardcoded into the insert statement. It is often used when the partition values are not known in advance, for example during ETL loads. I often used dynamic partitioning in my last position when loading data from a staging table into partitioned tables. This process reduced the number of errors I received and allowed me to move data efficiently without having to create each partition by hand."

    48. How and when would you change settings with Hive commands?

    Ans:

    This question is also testing your knowledge and requesting that you apply it to a specific situation. The first part of your response should answer the question about Hive commands. The second part can include a scenario from previous work experience or a hypothetical situation in which you would use Hive commands to change the settings.

    Example: "You can change settings within Hive by using the SET command, which changes configuration properties for the current session, including those that control how scripts are executed and how data is written. During my internship, I was tasked with creating a new database and wanted to control how its tables were populated, so I changed the relevant Hive settings using the SET command."
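
    For example, a minimal sketch of inspecting and changing a setting (the property chosen is illustrative):

    SET hive.exec.parallel;        -- show the current value
    SET hive.exec.parallel=true;   -- change it for the current session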

    49. Where does Hive data get stored?

    Ans:

    When answering this question, consider how and where Hive data is stored and why knowing the location is important. Your answer will include an explanation and it can be useful to connect it to previous professional experience.

    Example: "Hive data is stored by default in a Hadoop Distributed File System (HDFS) directory. However, as I often did in my previous position, you can better organize your data by designating a specific storage area through the hive.metastore.warehouse.dir configuration parameter. Using this feature, my team and I were able to organize our data in a way that allowed non-technical employees to access data."

    50. Is metadata also stored in the database?

    Ans:

    This is a follow-up question to the previous database one. It is testing your understanding of the database storage system of Hive. Structure your answer in a similar method with the first part answering the question and the second part offering an example of when you have used metadata storage previously.

    Example: "Metadata is not stored in HDFS, because HDFS read/write operations are time-consuming and metadata needs low-latency access. Instead, the metadata is stored in an RDBMS, the metastore, where it can later be retrieved quickly if needed. In order to keep our data well-organized, my team and I ensured the default settings sent the metadata to the RDBMS-backed metastore in order to maintain storage and organization."


    51. Explain the process of data transfer with Hive.

    Ans:

    This question requires that you explain the technical process behind the data transfers that occur with Hive. Similar to the other questions, provide an explanation of the process and then further demonstrate your understanding by identifying a time when you used this process in your professional career.

    Example: “Transferring data with Hive takes stored data from the HDFS and moves it to Hive with a single command. I often complete this task by using an external table. I define the table and then change its location, making it easy to transfer data across different programs, which increases efficiency and cuts down on data entry timelines.”

    52. What does the metastore mean in Hive?

    Ans:

    The metastore in Hive stores the meta information using an RDBMS and an open-source ORM (Object Relational Model) layer called DataNucleus, which converts the object representation into a relational schema.

    The Hive metastore comprises two major units:

    • A service that provides metastore access to other Apache Hive services.
    • Disk storage for the Hive metadata, which is separate from HDFS storage.

    53. What is Hive made up of?

    Ans:

    Hive is made up of:

    • Clients
    • Services
    • Storage and Computing

    54. What is the ObjectInspector functionality in Hive?

    Ans:

    In Hive, the analysis of the internal structure of columns, rows, and complex objects is done using the ObjectInspector functionality. The ObjectInspector functionality provides access to the internal fields present inside the objects.

    55. What is HiveServer2 (HS2)?

    Ans:

    HiveServer2 is a server interface. The main functions performed by HiveServer2 are as follows:

    • It enables remote clients to execute queries against Hive.
    • It retrieves the results of the specified queries.

    56. List the components of a Hive query processor.

    Ans:

    The components of a Hive query processor are as follows:

    • Logical Plan Generation
    • Physical Plan Generation
    • UDF’s and UDAF’s
    • Execution Engine
    • Operators
    • Semantic Analyzer
    • Optimizer
    • Type Checking
    • Parser

    57. Explain how Hive de-serializes and serializes data.

    Ans:

    Serialization and de-serialization formats are popularly known as SerDes. Hive allows the framework to read or write data in a particular format. These formats parse the structured or unstructured data bytes stored in HDFS according to the definitions of the Hive tables.

    58. What are views in Hive?

    Ans:

    Views in Hive are similar to tables; they are generated based on various requirements:

    • Any result-set data can be saved as a view in Hive.
    • Their usage is similar to that of views in SQL.
    • All kinds of read queries can be performed on a view.

    59. When should you pick an "Internal Table" and when an "External Table" in Hive?

    Ans:

    In Hive, you can pick an internal table:

    • If the data to be processed is available in the local file system.
    • If we want Hive to manage the entire lifecycle of the data, including its deletion.

    You can pick an external table:

    • If the data to be processed is available in HDFS.
    • When the files are being used outside of Hive as well.

    60. What is the main difference between HBase and Hive?

    Ans:

    Both Hive and HBase are technologies that depend on Hadoop. Hive is a data warehouse infrastructure that runs on top of Hadoop, while HBase is a NoSQL key-value store that runs on top of HDFS. Hive enables people who are familiar with SQL to run MapReduce jobs, whereas HBase supports four primary operations: put, get, scan and delete. HBase is useful for querying individual records, while Hive is useful for analytical queries over data gathered over a period of time.

    61. Explain the SMB join in Hive?

    Ans:

    In an SMB join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge-sort join is performed. Sort Merge Bucket (SMB) join is mostly used in Hive as there are no restrictions on file, partition or table size for this join. SMB join is best used when the tables are large. In an SMB join, the columns are bucketed and sorted using the join columns. All tables should have the same number of buckets in an SMB join.

    62. What does a Hive variable mean? How can we use it?

    Ans:

    A Hive variable is created in the Hive environment and can be referenced by Hive scripts. It is used to pass some values to Hive queries when the queries start executing, as in the sketch below.
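
    For example, a minimal sketch (the variable name and value are illustrative):

    SET hivevar:target_db=sales_db;
    USE ${hivevar:target_db};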

    63. What will happen if you have not issued the command 'SET hive.enforce.bucketing=true;' before bucketing a table in Apache Hive 0.x or 1.x?

    Ans:

    The command 'SET hive.enforce.bucketing=true;' allows one to have the correct number of reducers while using the 'CLUSTER BY' clause for bucketing a column. If it is not issued, one may find that the number of files generated in the table directory is not equal to the number of buckets. As an alternative, one may also set the number of reducers equal to the number of buckets by using set mapred.reduce.tasks = num_buckets.

    64. What is indexing and why do we need it?

    Ans:

    One of the Hive query optimization methods is the Hive index. A Hive index is used to speed up access to a column or set of columns in a Hive database, because with an index the database system does not need to read all the rows in the table to find the selected data.
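
    For example, a hedged sketch of creating a compact index (the index and table names are illustrative; note that Hive indexes were removed in Hive 3.0):

    CREATE INDEX idx_sales_country ON TABLE sales (country)
    AS 'COMPACT' WITH DEFERRED REBUILD;

    ALTER INDEX idx_sales_country ON sales REBUILD;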

    65. What does /*+ STREAMTABLE(table_name) */ do?

    Ans:

    It is a query hint naming the table to stream through a join, rather than buffering it in memory, when running the query. It is a query optimization technique.
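
    For example, a minimal sketch (the table aliases are illustrative):

    SELECT /*+ STREAMTABLE(a) */ a.key, a.val, b.val
    FROM a JOIN b ON (a.key = b.key);
    -- table a is streamed through the join rather than buffered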

    66. Can a partition be archived? What are the advantages and disadvantages?

    Ans:

    Yes, a partition can be archived. The advantage is that it decreases the number of files stored in the namenode, and an archived file can still be queried using Hive. The disadvantages are that queries become less efficient and that archiving offers no space savings.
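
    For example, a hedged sketch (the partition spec is illustrative; archiving must be enabled first):

    SET hive.archive.enabled=true;
    ALTER TABLE sales ARCHIVE PARTITION (year=2019);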

    67. What is a generic UDF in Hive?

    Ans:

    It is a UDF created using a Java program to serve some specific need not covered by the existing functions in Hive. It can detect the type of its input arguments programmatically and provide an appropriate response.

    68. The following statement failed to execute. What can be the cause?

    Ans:

    LOAD DATA LOCAL INPATH '${env:HOME}/country/state/'
    OVERWRITE INTO TABLE address;

    The LOCAL INPATH should point to a file and not a directory. The ${env:HOME} part is a valid variable available in the Hive environment.

    69. How do you specify the table creator name when creating a table in Hive?

    Ans:

    The TBLPROPERTIES clause is used to add the creator name while creating a table.

    The TBLPROPERTIES clause is added like this:

    TBLPROPERTIES('creator' = 'Joan')

    70. Explain how you can change a column data type in Hive.

    Ans:

    You can change a column data type in Hive by using the command:

    ALTER TABLE table_name CHANGE column_name column_name new_datatype;


    71. How can you stop a partition from being queried?

    Ans:

    You can stop a partition from being queried by using the ENABLE OFFLINE clause with the ALTER TABLE statement.
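
    For example, a hedged sketch (the partition spec is illustrative; this clause exists in older Hive releases):

    ALTER TABLE sales PARTITION (year=2019) ENABLE OFFLINE;
    -- queries against this partition now fail until DISABLE OFFLINE is issued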
