35+ BEST Big Data [ Hadoop ] Interview Questions & Answers
Big Data Hadoop Interview Questions and Answers

Last updated on 03rd Jul 2020, Blog, Interview Questions

About author

Anilkumar (Sr Technical Director )


These Hadoop interview questions have been designed to acquaint you with the nature of questions you may encounter during an interview on the subject of Hadoop. Interviews normally start with some basic concepts of the subject, and later questions continue based on further discussion and your answers. We are going to cover the top Hadoop interview questions along with their detailed answers, including Hadoop scenario-based interview questions, Hadoop interview questions for freshers, and Hadoop interview questions and answers for experienced professionals.

Q1. What do the four Vs of Big Data denote? 


IBM has a nice, simple explanation for the four critical features of big data:

  • Volume – Scale of data
  • Velocity – Analysis of streaming data
  • Variety – Different forms of data
  • Veracity – Uncertainty of data

Q2. How does big data analysis help businesses increase their revenue?


Big data analysis is helping businesses differentiate themselves. For example, Walmart, the world's largest retailer in 2014 in terms of revenue, uses big data analytics to increase its sales through better predictive analytics, customized recommendations, and new products launched based on customer preferences and needs.

Q3. Name some companies that use Hadoop?


  • Yahoo (one of the biggest users & more than 80% code contributor to Hadoop)
  • Facebook
  • Netflix
  • Amazon
  • Adobe
  • eBay
  • Hulu
  • Spotify
  • Rubikloud
  • Twitter

Q4. Differentiate between structured and unstructured data.


Data that can be stored in traditional database systems in the form of rows and columns, for example online purchase transactions, is referred to as structured data. Data that can be stored only partially in traditional database systems, for example data in XML records, is referred to as semi-structured data. Unorganized and raw data that cannot be categorized as semi-structured or structured is referred to as unstructured data. Facebook updates, tweets on Twitter, reviews, web logs, etc. are all examples of unstructured data.

Q5. On what concept does the Hadoop framework work? 


The Hadoop framework works on the following two core components:

  • HDFS – Hadoop Distributed File System is the Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks, and it operates on a master-slave architecture.
  • Hadoop MapReduce – This is the Java-based programming paradigm of the Hadoop framework that provides scalability across various Hadoop clusters. MapReduce distributes the workload into various tasks that can run in parallel. Hadoop jobs perform two separate tasks. The map job breaks down the data sets into key-value pairs or tuples. The reduce job then takes the output of the map job and combines the data tuples into a smaller set of tuples. The reduce job is always performed after the map job.
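The map and reduce phases described above can be illustrated with a small, pure-Python simulation of the classic word-count job (a conceptual sketch, not actual Hadoop API code):

```python
from collections import defaultdict

def map_phase(lines):
    """Map job: break the input down into (key, value) tuples -- here (word, 1)."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce job: combine the tuples for each key into a smaller set."""
    grouped = defaultdict(list)
    for key, value in pairs:          # shuffle/sort step: group values by key
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(map_phase(["big data big cluster", "big data"]))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real cluster, the map tasks run in parallel on different nodes, and the framework performs the shuffle/sort grouping between the two phases.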

Q6. What are the main components of a Hadoop Application? 


Hadoop applications use a wide range of technologies that provide great advantages in solving complex business problems.

Core components of a Hadoop application are-

  • Hadoop Common
  • HDFS
  • Hadoop MapReduce
  • YARN
  • Oozie and Zookeeper.

Q7. What is Hadoop streaming? 


The Hadoop distribution has a generic application programming interface for writing Map and Reduce jobs in any desired programming language like Python, Perl, Ruby, etc. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the mapper or reducer.
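A Hadoop Streaming mapper or reducer is just a script that reads lines from stdin and writes tab-separated key-value pairs to stdout. A minimal word-count mapper might look like this (a sketch; the function wrapper exists only to make the script testable):

```python
#!/usr/bin/env python3
# mapper.py -- a Hadoop Streaming mapper: reads raw text lines from stdin
# and emits one tab-separated "word<TAB>1" pair per word on stdout.
import sys

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

if __name__ == "__main__":
    run_mapper()
```

It would be launched with something like hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input /in -output /out (the jar name and paths here are illustrative).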

Q8. What is the best hardware configuration to run Hadoop? 


The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4 GB or 8 GB RAM that use ECC memory. Hadoop benefits greatly from ECC memory, even though it is not low-end: ECC memory is recommended because many Hadoop users have experienced checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.

Q9.What are the most commonly defined input formats in Hadoop? 


The most common input formats defined in Hadoop are:

  • TextInputFormat – This is the default input format in Hadoop. It breaks files into lines, with the byte offset of each line as the key and the line's content as the value.
  • KeyValueTextInputFormat – This input format is used for plain text files wherein each line is split into a key and a value by a separator (a tab by default).
  • SequenceFileInputFormat – This input format is used for reading files stored in Hadoop's binary SequenceFile format.

Q10. What are the steps involved in deploying a big data solution?


  • Data Ingestion – The first step in deploying a big data solution is to extract data from different sources, which could be an Enterprise Resource Planning system like SAP, a CRM like Salesforce or Siebel, an RDBMS like MySQL or Oracle, or log files, flat files, documents, images, or social media feeds. This data needs to be stored in HDFS. Data can either be ingested through batch jobs that run, say, every 15 minutes or once every night, or through real-time streaming with latencies from 100 ms to 120 seconds.
  • Data Storage – The subsequent step after ingesting data is to store it either in HDFS or in a NoSQL database like HBase. HBase storage works well for random read/write access, whereas HDFS is optimized for sequential access.
  • Data Processing – The final step is to process the data using one of the processing frameworks like MapReduce, Spark, Pig, Hive, etc.

Q11. How will you choose various file formats for storing and processing data using Apache Hadoop?


The decision to choose a particular file format is based on the following factors-

  • Schema evolution to add, alter and rename fields.
  • Usage pattern like accessing 5 columns out of 50 columns vs accessing most of the columns.
  • Splittability to be processed in parallel.
  • Read/Write/Transfer performance vs block compression saving storage space

Q12. Explain Big Data. What are the five Vs of Big Data?


Big data is the term for a collection of large and complex data sets that are difficult to process using relational database management tools or traditional data processing applications. Big data is difficult to capture, curate, store, search, share, transfer, analyze, and visualize. It has emerged as an opportunity for companies: they can now successfully derive value from their data and gain a distinct advantage over their competitors through enhanced business decision-making capabilities.

  • Volume: The volume represents the amount of data, which is growing at an exponential rate, i.e. in petabytes and exabytes. 
  • Velocity: Velocity refers to the rate at which data grows, which is very fast. Today, yesterday's data is considered old data. Nowadays, social media is a major contributor to the velocity of growing data.
  • Variety: Variety refers to the heterogeneity of data types. In other words, the gathered data comes in a variety of formats like videos, audio, CSV, etc. These various formats represent the variety of data.
  • Veracity: Veracity refers to data in doubt, or the uncertainty of available data due to data inconsistency and incompleteness. Data can sometimes get messy and may be difficult to trust. With many forms of big data, quality and accuracy are difficult to control. Volume is often the reason behind the lack of quality and accuracy in the data.
  • Value: It is all well and good to have access to big data, but unless we can turn it into value, it is useless. As big data grows at an accelerating rate, the factors associated with it also keep evolving.

Q13. What is Hadoop and what are its components?


When Big Data emerged as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a framework which provides us various services and tools to store and process Big Data. It helps in analyzing Big Data and making business decisions out of it, which can't be done efficiently and effectively using traditional systems. Its two main components are HDFS for storage and YARN for processing.

  • YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop; it manages resources and provides an execution environment to the processes.
  • ResourceManager: It receives the processing requests and then passes the parts of the requests to the corresponding NodeManagers, where the actual processing takes place. It allocates resources to applications based on their needs.
  • NodeManager: NodeManager is installed on every DataNode and is responsible for the execution of tasks on that DataNode.

Q14. Tell me about the various Hadoop daemons and their roles in a Hadoop cluster?


NameNode: It is the master node, responsible for storing the metadata of all the files and directories. It has information about the blocks that make up a file, and where those blocks are located in the cluster.

  • DataNode: It is the slave node that contains the actual data.
  • Secondary NameNode: It periodically merges the changes (edit log) with the FsImage (Filesystem Image) present in the NameNode. It stores the modified FsImage in persistent storage, which can be used in case of failure of the NameNode.
  • ResourceManager: It is the central authority that manages resources and schedules applications running on top of YARN.
  • NodeManager: It runs on slave machines and is responsible for launching the application's containers (where applications execute their parts), monitoring their resource usage (CPU, memory, disk, network), and reporting these to the ResourceManager.

Q15. Compare HDFS with Network Attached Storage (NAS).


In this question, first explain NAS and HDFS, and then compare their features as follows:

      Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network, providing data access to a heterogeneous group of clients. NAS can be either hardware or software that provides services for storing and accessing files. The Hadoop Distributed File System (HDFS), by contrast, is a distributed filesystem that stores data using commodity hardware.

In HDFS, data blocks are distributed across all the machines in a cluster, whereas in NAS, data is stored on dedicated hardware.

HDFS is designed to work with the MapReduce paradigm, where computation is moved to the data. NAS is not suitable for MapReduce, since data is stored separately from the computations.

Q16. List the difference between Hadoop 1 and Hadoop 2


This is an important question, and while answering it, we have to mainly focus on two points: the passive NameNode and the YARN architecture.

      In Hadoop 1.x, the NameNode is a single point of failure. In Hadoop 2.x, we have active and passive NameNodes. If the active NameNode fails, the passive NameNode takes charge. Because of this, high availability can be achieved in Hadoop 2.x.

Also, in Hadoop 2.x, YARN provides a central resource manager. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource. MRV2 is a particular type of distributed application that runs the MapReduce framework on top of YARN. Other tools can also perform data processing via YARN, which was a problem in Hadoop 1.x.

Hadoop 1.x vs. Hadoop 2.x

  • NameNode: In Hadoop 1.x, the NameNode is a single point of failure; Hadoop 2.x has active & passive NameNodes.
  • Processing: Hadoop 1.x uses MRV1 (JobTracker & TaskTracker); Hadoop 2.x uses MRV2/YARN (ResourceManager & NodeManager).

Q17.What are active and passive NameNodes? 


In HA (High Availability) architecture, we have two NameNodes – an active NameNode and a passive NameNode.

The active NameNode is the NameNode that works and runs in the cluster.

The passive NameNode is a standby NameNode, which has data similar to the active NameNode.

When the active NameNode fails, the passive NameNode replaces it in the cluster. Hence, the cluster is never without a NameNode, and so it never fails.

Q18. Why does one remove or add nodes in a Hadoop cluster frequently?


One of the most attractive features of the Hadoop framework is its utilization of commodity hardware. However, this leads to frequent DataNode crashes in a Hadoop cluster. Another striking feature of the Hadoop framework is its ease of scaling in accordance with the rapid growth in data volume. Because of these two reasons, one of the most common tasks of a Hadoop administrator is to commission (add) and decommission (remove) DataNodes in a Hadoop cluster.


Q19. What happens when two clients try to access the same file in the HDFS?


When the first client contacts the NameNode to open the file for writing, the NameNode grants a lease to the client to create this file. When the second client tries to open the same file for writing, the NameNode will notice that the lease for the file has already been granted to another client and will reject the open request for the second client.


Q20. How does NameNode tackle DataNode failures?


The NameNode periodically receives a heartbeat (signal) from each DataNode in the cluster, which implies that the DataNode is functioning properly.

A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send heartbeat messages, after a specific period of time it is marked dead.

The NameNode then replicates the blocks of the dead node to other DataNodes using the replicas created earlier.
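The dead-node detection step can be sketched as a timestamp comparison (a simplified model; the node names are illustrative, though the roughly ten-minute default timeout is real):

```python
HEARTBEAT_TIMEOUT = 10 * 60  # seconds; HDFS marks a DataNode dead after ~10 minutes

def find_dead_datanodes(last_heartbeat, now):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return [node for node, ts in last_heartbeat.items()
            if now - ts > HEARTBEAT_TIMEOUT]

# The NameNode's view: timestamp (seconds) of each DataNode's last heartbeat
last_heartbeat = {"dn1": 1000, "dn2": 1590, "dn3": 100}
dead = find_dead_datanodes(last_heartbeat, now=1600)
print(dead)  # ['dn3'] -- its blocks would then be re-replicated elsewhere
```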


    Q21. What will you do when NameNode is down?


    The NameNode recovery process involves the following steps to make the Hadoop cluster up and running:

    Use the file system metadata replica (FsImage) to start a new NameNode. 

    Then, configure the DataNodes and clients so that they can acknowledge the new NameNode that has been started.

    The new NameNode will start serving clients once it has finished loading the last checkpoint FsImage (for metadata information) and received enough block reports from the DataNodes. 

    Q22. What is a checkpoint?


    In brief, checkpointing is a process that takes the FsImage and edit log and compacts them into a new FsImage. Thus, instead of replaying the edit log, the NameNode can load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by the Secondary NameNode.
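    Conceptually, the merge replays each edit-log entry against the old FsImage to produce the new one. A toy sketch, modeling the namespace as a set of file paths (the operations and paths are illustrative):

```python
def checkpoint(fsimage, edit_log):
    """Apply edit-log entries to the old FsImage to build the new FsImage."""
    namespace = set(fsimage)
    for op, path in edit_log:
        if op == "create":
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
    return namespace  # the compacted, up-to-date FsImage

old_fsimage = {"/data/a", "/data/b"}
edits = [("create", "/data/c"), ("delete", "/data/a")]
print(sorted(checkpoint(old_fsimage, edits)))  # ['/data/b', '/data/c']
```

    After this merge, the NameNode no longer needs to replay those edits at startup; it simply loads the new FsImage.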

    Q23. How is HDFS fault tolerant? 


    When data is stored in HDFS, the NameNode replicates the data to several DataNodes. The default replication factor is 3, and you can change it as per your need. If a DataNode goes down, the NameNode will automatically copy the data to another node from the replicas and make the data available. This provides fault tolerance in HDFS.

    Q23.Can NameNode and DataNode be a commodity hardware? 


    The smart answer to this question would be: DataNodes are commodity hardware like personal computers and laptops, as they store data and are required in large numbers. But from your experience, you can tell that the NameNode is the master node, and it stores metadata about all the blocks stored in HDFS. It requires high memory (RAM) space, so the NameNode needs to be a high-end machine with good memory space.

    Q24. Why do we use HDFS for applications having large data sets and not when there are a lot of small files? 


    HDFS is more suitable for large amounts of data in a single file than for small amounts of data spread across multiple files. As you know, the NameNode stores the metadata information regarding the file system in RAM. Therefore, the amount of memory places a limit on the number of files in an HDFS file system. In other words, too many files will lead to the generation of too much metadata, and storing this metadata in RAM will become a challenge. As a rule of thumb, metadata for a file, block, or directory takes 150 bytes.

    Q25. How do you define a block in HDFS? What is the default block size in Hadoop 1 and in Hadoop 2? Can it be changed? 


    Blocks are nothing but the smallest continuous locations on your hard drive where data is stored. HDFS stores each file as blocks and distributes them across the Hadoop cluster. Files in HDFS are broken down into block-sized chunks, which are stored as independent units.

    Hadoop 1 default block size: 64 MB

    Hadoop 2 default block size: 128 MB

    Yes, the block size can be changed; the dfs.blocksize parameter in hdfs-site.xml sets the size of a block.
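    The block arithmetic itself is simple: a file is cut into fixed-size chunks, and the last block holds the remainder. A quick sketch (file and block sizes in MB, purely illustrative):

```python
def split_into_blocks(file_size_mb, block_size_mb):
    """Return the sizes of the HDFS blocks for a file of the given size."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([remainder] if remainder else [])

print(split_into_blocks(300, 64))   # Hadoop 1 -> [64, 64, 64, 64, 44]
print(split_into_blocks(300, 128))  # Hadoop 2 -> [128, 128, 44]
```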

    Q26. What does the jps command do? 


    The jps command helps us check whether the Hadoop daemons are running. It shows all the Hadoop daemons, i.e. NameNode, DataNode, ResourceManager, NodeManager, etc., that are running on the machine.

    Q27. How do you define Rack Awareness in Hadoop?


    Rack Awareness is the algorithm by which the NameNode decides how blocks and their replicas are placed, based on rack definitions, to minimize network traffic between DataNodes within the same rack. Let's say we consider a replication factor of 3 (the default); the policy is that for every block of data, two copies will exist in one rack and the third copy in a different rack. This rule is known as the Replica Placement Policy.


    Q28. What is speculative execution in Hadoop? 


    If a node appears to be executing a task slower, the master node can redundantly execute another instance of the same task on another node. Then, the task which finishes first will be accepted and the other one is killed. This process is called speculative execution.

    Q29. How can I restart NameNode or all the daemons in Hadoop? 


    This question can have two answers; we will discuss both. We can restart the NameNode by the following methods:

    You can stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop namenode command and then start it using the ./sbin/hadoop-daemon.sh start namenode command.

    To stop and start all the daemons, use ./sbin/stop-all.sh and then ./sbin/start-all.sh, which will stop all the daemons first and then start them again.

    These script files reside in the sbin directory inside the Hadoop directory.


    Q30. What is the difference between an HDFS Block and an Input Split?


    The HDFS Block is the physical division of the data, while the Input Split is the logical division of the data. HDFS divides data into blocks to store them, whereas for processing, MapReduce divides the data into input splits and assigns each one to a mapper function.

    Q31.Name the three modes in which Hadoop can run.


    The three modes in which Hadoop can run are as follows:

    • Standalone (local) mode: This is the default mode if we don't configure anything. In this mode, all the components of Hadoop, such as NameNode, DataNode, ResourceManager, and NodeManager, run as a single Java process. This uses the local filesystem.
    • Pseudo-distributed mode: A single-node Hadoop deployment is considered as running Hadoop in pseudo-distributed mode. In this mode, all the Hadoop services, including both the master and the slave services, are executed on a single compute node.
    • Fully distributed mode: A Hadoop deployment in which the Hadoop master and slave services run on separate nodes is said to be in fully distributed mode.

    Hadoop MapReduce Interview Questions

    Q32. What is MapReduce? What is the syntax to run a MapReduce program? 


    It is a framework/programming model that is used for processing large data sets over a cluster of computers using parallel programming. The syntax to run a MapReduce program is: hadoop jar <jar_file> <class_name> /input_path /output_path.

    Q33. What are the main configuration parameters in a MapReduce program?


    The main configuration parameters which users need to specify in the MapReduce framework are:

    • Job's input locations in the distributed file system
    • Job's output location in the distributed file system
    • Input format of data
    • Output format of data
    • Class containing the map function
    • Class containing the reduce function
    • JAR file containing the mapper, reducer, and driver classes

    Q34. What is the purpose of RecordReader in Hadoop?


    The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper task. The RecordReader instance is defined by the InputFormat.
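    For the default TextInputFormat, for example, the RecordReader turns each line of a split into a (byte offset, line contents) pair. A rough Python sketch of that behavior (not the actual Java class):

```python
def text_record_reader(split_data, split_start=0):
    """Yield (key, value) pairs the way TextInputFormat's RecordReader does:
    key = byte offset of the line, value = the line's content."""
    offset = split_start
    for line in split_data.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))
        offset += len(line)

records = list(text_record_reader("first line\nsecond line\n"))
print(records)  # [(0, 'first line'), (11, 'second line')]
```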

    Q35. Explain Distributed Cache in a MapReduce framework.


    Distributed Cache can be explained as a facility provided by the MapReduce framework to cache files needed by applications. Once you have cached a file for your job, the Hadoop framework will make it available on each and every DataNode where your map/reduce tasks are running. You can then access the cached file as a local file in your Mapper or Reducer job.

    Q36. How do reducers communicate with each other? 


    This is a tricky question. The MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.

    Q37. What does a MapReduce Partitioner do?


    A MapReduce Partitioner makes sure that all the values of a single key go to the same reducer, thus allowing even distribution of the map output over the reducers. It redirects the mapper output to the reducer by determining which reducer is responsible for a particular key.
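    The default HashPartitioner simply hashes the key modulo the number of reduce tasks, so every occurrence of a key lands on the same reducer. A Python sketch of the idea (crc32 stands in for Java's hashCode to keep the sketch deterministic):

```python
import zlib

def get_partition(key, num_reducers):
    """Route a key to a reducer index, mimicking Hadoop's HashPartitioner
    (hash(key) mod numReduceTasks)."""
    return zlib.crc32(key.encode()) % num_reducers

# Every occurrence of the same key maps to the same reducer index
print(get_partition("hadoop", 4), get_partition("hadoop", 4))
```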

    Q38.What are the different Hadoop configuration files?


    The different Hadoop configuration files include:

    • hadoop-env.sh
    • mapred-site.xml
    • core-site.xml
    • yarn-site.xml
    • hdfs-site.xml
    • masters and slaves files

     Q39.What are the three modes in which Hadoop can run?


    The three modes in which Hadoop can run are :

    • Standalone mode: This is the default mode. It uses the local FileSystem and a single Java process to run the Hadoop services.
    • Pseudo-distributed mode: This uses a single-node Hadoop deployment to execute all Hadoop services.
    • Fully-distributed mode: This uses separate nodes to run Hadoop master and slave services.

    Q40. What are the differences between regular FileSystem and HDFS?


          Regular FileSystem: In a regular FileSystem, data is maintained in a single system. If the machine crashes, data recovery is challenging due to low fault tolerance. Seek time is longer, and hence it takes more time to process the data.

    HDFS: Data is distributed and maintained on multiple systems. If a DataNode crashes, data can still be recovered from other nodes in the cluster. However, the time taken to read data can be comparatively longer, as data is read locally from disk and coordinated across multiple systems.

    Q41.Why is HDFS fault-tolerant?


    HDFS is fault-tolerant because it replicates data on different DataNodes. By default, a block of data is replicated on three DataNodes. The data blocks are stored in different DataNodes. If one node crashes, the data can still be retrieved from other DataNodes.

    Q42. Explain the architecture of HDFS. 


    For an HDFS service, we have a NameNode that has the master process running on one of the machines, and DataNodes, which are the slave nodes.

    • NameNode: NameNode is the master service that hosts metadata on disk and in RAM. It holds information about the various DataNodes, their location, the size of each block, etc. 
    • DataNode: DataNodes hold the actual data blocks and send block reports to the NameNode periodically. The DataNode stores and retrieves the blocks when the NameNode asks, serves clients' read and write requests, and performs block creation, deletion, and replication based on instructions from the NameNode.

    Q43.What are the two types of metadata that a NameNode server holds?


    The two types of metadata that a NameNode server holds are:

    Metadata in Disk – This contains the edit log and the FSImage

    Metadata in RAM – This contains the information about DataNodes

    Q44. What is the difference between federation and high availability?


    HDFS Federation:

    • There is no limit to the number of NameNodes, and the NameNodes are not related to each other.
    • All the NameNodes share a pool of metadata, in which each NameNode has its own dedicated pool.
    • It provides fault tolerance, i.e., if one NameNode goes down, it does not affect the data of the other NameNodes.

    HDFS High Availability:

    • There are two NameNodes that are related to each other: an active NameNode and a standby NameNode.

    Q45. A 350 MB file is stored on HDFS with the default block size of 128 MB. How many input splits are created, and what is the size of each split?


    By default, the size of an input split equals the block size, so the file is divided into three splits. The size of each split is 128 MB, 128 MB, and 94 MB.


    Q46.How does rack awareness work in HDFS?


    HDFS Rack Awareness refers to the knowledge of the different DataNodes and how they are distributed across the racks of a Hadoop cluster.

    By default, each block of data is replicated three times on various DataNodes present on different racks. Two identical blocks cannot be placed on the same DataNode. When a cluster is rack-aware, all the replicas of a block cannot be placed on the same rack. If a DataNode crashes, you can retrieve the data block from different DataNodes.
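    The placement rule can be sketched as a small function: pick two different nodes on one rack and a third node on another rack (a simplified model; the real policy also considers client locality and node load, and the rack/node names here are illustrative):

```python
def place_replicas(racks):
    """Choose DataNodes for one block's three replicas following the
    Replica Placement Policy: two copies on one rack, the third on another."""
    rack_names = list(racks)
    first, second = rack_names[0], rack_names[1]
    # two replicas on different nodes of the first rack, one on another rack
    return [racks[first][0], racks[first][1], racks[second][0]]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(racks))  # ['dn1', 'dn2', 'dn3']
```

    Note how no two replicas share a DataNode, and the replicas span two racks, so losing any single node or rack still leaves a live copy.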

    Q47.How can you restart NameNode and all the daemons in Hadoop?


    The following commands will help you restart the NameNode and all the daemons:

    You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it with the ./sbin/hadoop-daemon.sh start namenode command. To stop and start all the daemons, use ./sbin/stop-all.sh followed by ./sbin/start-all.sh.

    Q48.Which command will help you find the status of blocks and FileSystem health?


    To check the status of the blocks, use the command:

    • hdfs fsck <path> -files -blocks

    To check the health status of FileSystem, use the command:

    • hdfs fsck / -files -blocks -locations > dfs-fsck.log

    Q49.What would happen if you store too many small files in a cluster on HDFS?


    Storing several small files on HDFS generates a lot of metadata files. To store these metadata in the RAM is a challenge as each file, block, or directory takes 150 bytes for metadata. Thus, the cumulative size of all the metadata will be too large.

    Q50.How do you copy data from the local system onto HDFS? 


    The following command will copy data from the local file system onto HDFS:

    • hadoop fs -copyFromLocal [source] [destination]

    Example: hadoop fs -copyFromLocal /tmp/data.csv /user/test/data.csv

    In the above syntax, the source is the local path and the destination is the HDFS path. Use the -f option (force flag) to overwrite a file that already exists in HDFS. 


    Q51.When do you use the dfsadmin -refreshNodes and rmadmin -refreshNodes commands?


    These commands are used to refresh the node information while commissioning nodes, or when the decommissioning of nodes is completed.

    dfsadmin -refreshNodes is used with the HDFS client and refreshes the node configuration for the NameNode; rmadmin -refreshNodes does the same for the ResourceManager.

    Q52.Is there any way to change the replication of files on HDFS after they are already written to HDFS?


    Yes, the following are ways to change the replication of files on HDFS:

    We can change the dfs.replication value to a particular number in the $HADOOP_HOME/conf/hdfs-site.xml file, which will start replicating to that factor for any new content that comes in.

    If you want to change the replication factor for a particular file or directory, use:

    $HADOOP_HOME/bin/hadoop fs -setrep -w 4 /path/of/the/file

    Example: $HADOOP_HOME/bin/hadoop fs -setrep -w 4 /user/temp/test.csv

    Q53.Who takes care of replication consistency in a Hadoop cluster and what do under/over replicated blocks mean?


    • In a cluster, it is always the NameNode that takes care of replication consistency. The fsck command provides information regarding over- and under-replicated blocks. 
    • Under-replicated blocks:

    These are the blocks that do not meet their target replication for the files they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication.

    Consider a cluster with three nodes and replication set to three. At any point, if one of the DataNodes crashes, the blocks would be under-replicated. This means that a replication factor was set, but there are not enough replicas as per that factor. If the NameNode does not get information about the replicas, it will wait for a limited amount of time and then start the re-replication of missing blocks from the available nodes. 

    • Over-replicated blocks: These are the blocks that exceed their target replication for the files they belong to. Usually, over-replication is not a problem, and HDFS will automatically delete the excess replicas.

    Consider a case of three nodes running with a replication of three, and one of the nodes goes down due to a network failure. Within a few minutes, the NameNode re-replicates the data, and then the failed node comes back with its set of blocks. This is an over-replication situation, and the NameNode will delete a set of replicas from one of the nodes. 
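    Classifying blocks as under- or over-replicated is just a comparison of each block's live replica count against the target factor, roughly what fsck reports (the block and node names are illustrative):

```python
def classify_blocks(block_replicas, target=3):
    """Split blocks into under- and over-replicated lists, as fsck would report."""
    under = [b for b, replicas in block_replicas.items() if len(replicas) < target]
    over = [b for b, replicas in block_replicas.items() if len(replicas) > target]
    return under, over

replicas = {
    "blk_1": ["dn1", "dn2"],                # one replica lost -> under-replicated
    "blk_2": ["dn1", "dn2", "dn3"],         # meets the target
    "blk_3": ["dn1", "dn2", "dn3", "dn4"],  # failed node came back -> over-replicated
}
print(classify_blocks(replicas))  # (['blk_1'], ['blk_3'])
```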

    Q54.What is the distributed cache in MapReduce?


    A distributed cache is a mechanism wherein data coming from disk can be cached and made available to all worker nodes. When a MapReduce program is running, instead of reading the data from disk every time, it picks up the data from the distributed cache, which benefits the MapReduce processing.

    Q55.What role do RecordReader, Combiner, and Partitioner play in a MapReduce operation?



    RecordReader: This communicates with the InputSplit and converts the data into key-value pairs suitable for the mapper to read. 


    Combiner: This is an optional phase; it is like a mini reducer. The combiner receives data from the map tasks, works on it, and then passes its output to the reducer phase. 


    Partitioner: The partitioner decides how many reduce tasks will be used to summarize the data. It also confirms how outputs from combiners are sent to the reducer, and it controls the partitioning of keys of the intermediate map outputs.

    Q56.Why is MapReduce slower in processing data in comparison to other processing frameworks?


    This is quite a common question in Hadoop interviews; let us understand why MapReduce is slower in comparison to the other processing frameworks:

    MapReduce is slower because:

    It is batch-oriented when it comes to processing data. No matter what, you have to provide the mapper and reducer functions to work on the data. 

    During processing, whenever the mapper function delivers an output, it is written to HDFS and the underlying disks. This data is then shuffled and sorted before being picked up for the reducing phase. The entire process of writing data to HDFS and retrieving it from HDFS makes MapReduce a lengthier process.

    In addition to the above reasons, MapReduce also uses the Java language, which is verbose to program as it requires many lines of code.


    Q57.Is it possible to change the number of mappers to be created in a MapReduce job?


    By default, you cannot change the number of mappers directly, because it is equal to the number of input splits. For example, if you have a 1 GB file that is split into eight blocks (of 128 MB each), exactly eight mappers will run on the cluster. However, there are different ways in which you can either set a property or customize the code to change the number of mappers.

    Q58.Name some Hadoop-specific data types that are used in a MapReduce program.


    This is an important question, as you would need to know the different data types if you are getting into the field of big data.

    For every data type in Java, you have an equivalent in Hadoop. Therefore, the following are some Hadoop-specific data types that you could use in your MapReduce program:

    • IntWritable
    • FloatWritable 
    • LongWritable 
    • DoubleWritable 
    • BooleanWritable 
    • ArrayWritable 
    • MapWritable 
    • ObjectWritable 

    Q59.What is speculative execution in Hadoop?


    If a DataNode is executing any task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first will be accepted, and the other task would be killed. Therefore, speculative execution is useful if you are working in an intensive workload kind of environment.


    Q60.How is identity mapper different from chain mapper?

    • Identity Mapper: This is the default mapper, chosen when no mapper is specified in the MapReduce driver class. It implements the identity function, writing all of its input key-value pairs directly to the output. It is defined in the old MapReduce API (MR1) in the org.apache.hadoop.mapred.lib package.

    • Chain Mapper: This class is used to run multiple mappers in a single map task. The output of the first mapper becomes the input to the second mapper, the second to the third, and so on.

    Q61.What are the major configuration parameters required in a MapReduce program?


    We need to have the following configuration parameters:

    • Input location of the job in HDFS
    • Output location of the job in HDFS
    • Input and output formats
    • Classes containing the map and reduce functions
    • JAR file for mapper, reducer and driver classes 

    Q62.What do you mean by map-side join and reduce-side join in MapReduce?


    • Map-side join: Each input dataset must be divided into the same number of partitions, and the input to each map is a structured partition in sorted order.

    • Reduce-side join: The reducer performs the join. It is easier to implement than the map-side join, as the sorting and shuffling phase sends values with identical keys to the same reducer, and there is no need to have the dataset in a structured form (or partitioned).

    Q63.What is the role of the OutputCommitter class in a MapReduce job?


    As the name indicates, OutputCommitter describes the commit of task output for a MapReduce job.

    Example: the old-API class org.apache.hadoop.mapred.OutputCommitter extends the new-API one:

    public abstract class OutputCommitter extends org.apache.hadoop.mapreduce.OutputCommitter

    MapReduce relies on the OutputCommitter for the following:

    • Setting up the job during initialization
    • Cleaning up the job after completion
    • Setting up a task's temporary output
    • Checking whether a task needs a commit
    • Committing the task output
    • Discarding the task commit

    Q64.Explain the process of spilling in MapReduce.


    Spilling is a process of copying the data from memory buffer to disk when the buffer usage reaches a specific threshold size. This happens when there is not enough memory to fit all of the mapper output. By default, a background thread starts spilling the content from memory to disk after 80 percent of the buffer size is filled.

    For a 100 MB size buffer, the spilling will start after the content of the buffer reaches a size of 80 MB. 
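    The threshold behavior can be sketched with a toy simulation. This is not Hadoop's implementation, just an illustration of the 80 percent spill rule under simplified assumptions (a spill drains the whole buffer at once):

```python
# Sketch: simulate the map-side buffer. Each time usage crosses the
# spill threshold (80% of buffer size by default), a background spill
# writes the buffered content to disk.
def count_spills(record_sizes_mb, buffer_mb=100, spill_percent=0.80):
    threshold = buffer_mb * spill_percent  # 80 MB for a 100 MB buffer
    used, spills = 0.0, 0
    for size in record_sizes_mb:
        used += size
        if used >= threshold:
            spills += 1
            used = 0.0  # simplified: the spill empties the buffer
    return spills

print(count_spills([30, 30, 30, 30]))  # 1
```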

    Q65.How can you set the mappers and reducers for a MapReduce job?


    The number of mappers and reducers can be set in the command line using:

    -D mapred.map.tasks=5 -D mapred.reduce.tasks=2

    In the code, one can configure JobConf variables:

    job.setNumMapTasks(5); // 5 mappers

    job.setNumReduceTasks(2); // 2 reducers.

    Q66.What happens when a node running a map task fails before sending the output to the reducer?


    If this ever happens, map tasks will be assigned to a new node, and the entire task will be rerun to re-create the map output. In Hadoop v2, the YARN framework has a temporary daemon called application master, which takes care of the execution of the application. If a task on a particular node failed due to the unavailability of a node, it is the role of the application master to have this task scheduled on another node.

    Q67.Can we write the output of MapReduce in different formats?


    Yes. Hadoop supports various input and output formats, such as:

    • TextOutputFormat – This is the default output format, and it writes records as lines of text.

    • SequenceFileOutputFormat – This is used to write sequence files when the output files need to be fed into another MapReduce job as input files.

    • MapFileOutputFormat – This is used to write the output as map files.

    • SequenceFileAsBinaryOutputFormat – This is a variant of SequenceFileOutputFormat. It writes keys and values to a sequence file in binary format.

    • DBOutputFormat – This is used for writing to relational databases and HBase. This format also sends the reduce output to a SQL table.

    Q68.What benefits did YARN bring in Hadoop 2.0 and how did it solve the issues of MapReduce v1?


    In Hadoop v1, MapReduce performed both data processing and resource management; there was only one master process for the processing layer, known as the JobTracker, which was responsible for resource tracking and job scheduling.

    Managing jobs with a single JobTracker made the utilization of computational resources inefficient in MapReduce v1. The JobTracker was overburdened by handling both job scheduling and resource management, leading to issues with scalability, availability, and resource utilization. In addition, non-MapReduce jobs couldn't run in v1.

    In Hadoop v2, the following features are available:

    • Scalability – You can have a cluster size of more than 10,000 nodes and you can run more than 100,000 concurrent tasks. 
    • Compatibility – The applications developed for Hadoop v1 run on YARN without any disruption or availability issues.
    • Resource utilization – YARN allows the dynamic allocation of cluster resources to improve resource utilization.
    • Multitenancy – YARN can use open-source and proprietary data access engines, as well as perform real-time analysis and run ad-hoc queries.

    Q69.Explain how YARN allocates resources to an application with the help of its architecture. 


    There is a client/application/API that talks to the ResourceManager, which manages resource allocation in the cluster. It has two internal components: the Scheduler and the ApplicationsManager. The ResourceManager is aware of the resources available with every NodeManager. The Scheduler allocates resources to the various applications running in parallel, based on the requirements of each application, but it does not monitor or track the status of the applications.

    Q70.Which of the following has replaced JobTracker from MapReduce v1?


    • NodeManager
    • ApplicationManager
    • ResourceManager
    • Scheduler

    Answer: ResourceManager

    Q71.Write the YARN commands to check the status of an application and kill an application.


    The commands are as follows:

    • To check the status of an application:

    yarn application -status <Application ID>

    • To kill or terminate an application:

    yarn application -kill <Application ID>

    Q72.Can we have more than one ResourceManager in a YARN-based cluster?


    Yes, Hadoop v2 allows us to have more than one ResourceManager. You can have a high availability YARN cluster with an active ResourceManager and a standby ResourceManager, where ZooKeeper handles the coordination.

    There can only be one active ResourceManager at a time. If an active ResourceManager fails, then the standby ResourceManager comes to the rescue.

    Q73.What are the different schedulers available in YARN?


    The different schedulers available in YARN are:

    • FIFO scheduler – This places applications in a queue and runs them in the order of submission (first in, first out). It is not desirable, as a long-running application might block the small running applications.

    • Capacity scheduler – A separate dedicated queue allows a small job to start as soon as it is submitted. Large jobs finish later than they would under the FIFO scheduler.

    • Fair scheduler – There is no need to reserve a set amount of capacity, since it dynamically balances resources between all the running jobs.

    Q74.What happens if a ResourceManager fails while executing an application in a high availability cluster?


    In a high availability cluster, there are two ResourceManagers: one active and one standby. If the active ResourceManager fails, the standby is elected as active and instructs the ApplicationMasters to abort. The new active ResourceManager then recovers its running state by taking advantage of the container statuses sent from all the NodeManagers.

    Q75.In a cluster of 10 DataNodes, each having 16 GB RAM and 10 cores, what would be the total processing capacity of the cluster?


    Every node in a Hadoop cluster will have one or multiple processes running, which would need RAM. The machine itself, which has a Linux file system, would have its own processes that need a specific amount of RAM usage. Therefore, if you have 10 DataNodes, you need to allocate at least 20 to 30 percent towards the overheads, Cloudera-based services, etc. You could have 11 or 12 GB and six or seven cores available on every machine for processing. Multiply that by 10, and that’s your processing capacity.
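    The arithmetic above can be written out explicitly. This is a hypothetical sizing sketch, assuming a flat 25 percent overhead reservation per node (real clusters tune this per service):

```python
# Sketch: estimate usable cluster capacity after reserving a fraction
# of each node's RAM and cores for the OS and Hadoop daemon overheads.
def cluster_capacity(nodes=10, ram_gb=16, cores=10, overhead=0.25):
    usable_ram = ram_gb * (1 - overhead)        # ~12 GB per node
    usable_cores = int(cores * (1 - overhead))  # ~7 cores per node
    return nodes * usable_ram, nodes * usable_cores

ram, cores = cluster_capacity()
print(ram, cores)  # 120.0 GB of RAM and 70 cores for processing
```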

    Q76.What happens if requested memory or CPU cores go beyond the size of container allocation?


    If an application starts demanding more memory or more CPU cores that cannot fit into a container allocation, your application will fail. This happens because the requested memory is more than the maximum container size.

    Now that you have learned about HDFS, MapReduce, and YARN, let us move to the next section. We'll go over questions about Hive, Pig, HBase, and Sqoop.

    HIVE Interview Questions


    Q77. What is the difference between an external table and a managed table in Hive?


    • External Table: External tables in Hive refer to data at an existing location outside the warehouse directory. If one drops an external table, Hive deletes only the metadata information of the table and does not change the table data present in HDFS.

    • Managed Table: Also known as an internal table, this type of table manages the data and moves it into its warehouse directory by default. If one drops a managed table, the metadata information along with the table data is deleted from the Hive warehouse directory.

    Q78.What is a partition in Hive and why is partitioning required in Hive?


    Partition is a process for grouping similar types of data together based on columns or partition keys. Each table can have one or more partition keys to identify a particular partition. Partitioning provides granularity in a Hive table. It reduces the query latency by scanning only relevant partitioned data instead of the entire data set. We can partition the transaction data for a bank based on month — January, February, etc. Any operation regarding a particular month, say February, will only have to scan the February partition, rather than the entire table data.
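    The benefit of partition pruning can be sketched outside Hive with plain Python over hypothetical data (the records and keys below are illustrative):

```python
# Sketch of partition pruning: data is grouped by a partition key
# (month), and a query for one month scans only that partition's
# records instead of the entire dataset.
from collections import defaultdict

def partition(records, key):
    parts = defaultdict(list)
    for rec in records:
        parts[rec[key]].append(rec)
    return parts

txns = [{"month": "Jan", "amt": 10}, {"month": "Feb", "amt": 20},
        {"month": "Feb", "amt": 5}]
parts = partition(txns, "month")
# Only the February partition is scanned, not the whole table:
feb_total = sum(r["amt"] for r in parts["Feb"])
print(feb_total)  # 25
```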

    Q79.Why does Hive not store metadata information in HDFS?


    We know that Hive's data is stored in HDFS. However, the metadata is stored either locally or in an RDBMS. The metadata is not stored in HDFS, because HDFS read/write operations are time-consuming. Hive therefore stores metadata information in the metastore using an RDBMS instead of HDFS, which achieves lower latency and is faster.

    Q80.What are the components used in Hive query processors?


    The components used in Hive query processors are:

    • Parser
    • Semantic Analyzer
    • Execution Engine
    • User-Defined Functions
    • Logical Plan Generation
    • Physical Plan Generation
    • Optimizer
    • Operators
    • Type checking

    Q81.Write a query to insert a new column(new_col INT) into a hive table (h_table) at a position before an existing column (x_col).


    Hive's ALTER TABLE … CHANGE COLUMN syntax supports the FIRST and AFTER keywords (there is no BEFORE), so the column is added first and then moved ahead of x_col. If x_col is the first column, the following places new_col before it:

    • ALTER TABLE h_table ADD COLUMNS (new_col INT);
    • ALTER TABLE h_table CHANGE COLUMN new_col new_col INT FIRST;

    Q82.What are the key differences between Hive and Pig?


    • Hive: It uses a declarative language called HiveQL, which is similar to SQL, for reporting. It operates on the server side of the cluster and works with structured data. It does not support the Avro file format by default; this can be enabled using org.apache.hadoop.hive.serde2.avro. Facebook developed it, and it supports partitioning.

    • Pig: It uses a high-level procedural language called Pig Latin for programming. It operates on the client side of the cluster and works with both structured and unstructured data. It supports the Avro file format by default. Yahoo developed it, and it does not support partitioning.

    Pig Interview Questions

    Q83.How is Apache Pig different from MapReduce?


    • Pig: It has fewer lines of code compared to MapReduce, and its high-level language can easily perform join operations. On execution, every Pig operator is converted internally into a MapReduce job. It works with all versions of Hadoop.

    • MapReduce: It has more lines of code, and as a low-level paradigm it cannot perform join operations easily. MapReduce jobs take more time to compile, and a MapReduce program written for one Hadoop version may not work with other versions.

    Q84.What are the different ways of executing a Pig script?


    The different ways of executing a Pig script are as follows:

    • Grunt shell
    • Script file
    • Embedded script.

    Q85. What are the major components of a Pig execution environment?


    The major components of a Pig execution environment are:

    • Pig Scripts: They are written in Pig Latin using built-in operators and UDFs, and submitted to the execution environment.
    • Parser: Completes type checking and checks the syntax of the script. The output of the parser is a Directed Acyclic Graph (DAG).
    • Optimizer: Performs optimization using merge, transform, split, etc. Optimizer aims to reduce the amount of data in the pipeline.
    • Compiler: Converts the optimized code into MapReduce jobs automatically.

    • Execution Engine: MapReduce jobs are submitted to the execution engine to generate the desired results.

    Q86.Explain the different complex data types in Pig.


    Pig has three complex data types: Tuple, an ordered set of fields; Bag, a collection of tuples; and Map, a set of key-value pairs.


    Q87.What are the various diagnostic operators available in Apache Pig?


    Pig has Dump, Describe, Explain, and Illustrate as its diagnostic operators.

    • Dump: The DUMP operator runs the Pig Latin script and displays the results on the screen. Load the data into Pig using the LOAD operator, then display the results using the DUMP operator.

    • Describe: The DESCRIBE operator is used to view the schema of a relation. Load the data into Pig using the LOAD operator, then view the schema of the relation using the DESCRIBE operator.

    Q88.State the usage of the group, order by, and distinct keywords in Pig scripts.


    The group statement collects various records with the same key and groups the data in one or more relations.

    Example: Group_data = GROUP Relation_name BY AGE

    The order statement is used to display the contents of relation in sorted order based on one or more fields.

    Example: Relation_2 = ORDER Relation_name1 BY field_name (ASC|DESC)

    Distinct statement removes duplicate records and is implemented only on entire records, and not on individual records.

    Example: Relation_2 = DISTINCT Relation_name1

    Q89.What are the relational operators in Pig?


    The relational operators in Pig are as follows:

    • COGROUP:It joins two or more tables and then performs GROUP operation on the joined table result.
    • CROSS: This is used to compute the cross product (cartesian product) of two or more relations.
    • FOREACH: This will iterate through the tuples of a relation, generating a data transformation.
    • JOIN: This is used to join two or more tables in a relation.
    • LIMIT: This will limit the number of output tuples.
    • SPLIT: This will split the relation into two or more relations.
    • UNION: It will merge the contents of two relations.
    • ORDER: This is used to sort a relation based on one or more fields.
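    A couple of these operators have direct analogues in plain Python. The relations below are hypothetical; this sketch mirrors what CROSS and LIMIT do to tuples:

```python
# Sketch: CROSS computes the cartesian product of two relations;
# LIMIT keeps only the first n output tuples.
from itertools import islice, product

rel_a = [(1, "x"), (2, "y")]
rel_b = [("p",), ("q",)]

# CROSS: every tuple of rel_a paired with every tuple of rel_b
crossed = [a + b for a, b in product(rel_a, rel_b)]
print(len(crossed))  # 4

# LIMIT: retain only the first 3 tuples
limited = list(islice(crossed, 3))
print(len(limited))  # 3
```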

    Q90.What is the use of having filters in Apache Pig?


    The FILTER operator is used to select the required tuples from a relation based on a condition. It also allows you to remove unwanted records from the data file.

    Example: Filter the products with a quantity greater than 1000:

    A = LOAD '/user/Hadoop/phone_sales' USING PigStorage(',') AS (year:int, product:chararray, quantity:int);

    B = FILTER A BY quantity > 1000;


    Q91.Suppose there's a file called test.txt having 150 records in HDFS. Write the Pig command to retrieve the first 10 records of the file.


    To do this, we need to use the limit operator to retrieve the first 10 records from a file.

    • Load the data in Pig:

    test_data = LOAD '/user/test.txt' USING PigStorage(',') AS (field1, field2,….);

    • Limit the data to the first 10 records:

    Limit_test_data = LIMIT test_data 10;

    Q92.What are the key components of HBase?


    This is one of the most common interview questions. 

    • Region Server:

    The region server contains HBase tables that are divided horizontally into Regions based on their key values. It runs on every node and decides the size of the region. Each region server is a worker node that handles read, write, update, and delete requests from clients.


    • HMaster:

    This assigns regions to RegionServers for load balancing, and monitors and manages the Hadoop cluster. Whenever a client wants to change the schema and any of the metadata operations, HMaster is used.


    • ZooKeeper:

    This provides a distributed coordination service to maintain server state in the cluster. It tracks which servers are alive and available, and provides server failure notifications. Region servers send their statuses to ZooKeeper, indicating whether they are ready for read and write operations.

    Q93.Explain what row key and column families in HBase is.


    The row key is a primary key for an HBase table. It also allows logical grouping of cells and ensures that all cells with the same row key are located on the same server.

    Column families consist of a group of columns that are defined during table creation, and each column family has certain column qualifiers separated from it by a delimiter.

    Q94. Why do we need to disable a table in HBase and how do you do it?


    The HBase table is disabled to allow modifications to its settings. When a table is disabled, it cannot be accessed through the scan command.


    • To disable the employee table, use the command:

    disable 'employee_table'

    • To check if the table is disabled, use the command:

    is_disabled 'employee_table'
    Q95.Write the code needed to open a connection in HBase.


    The following code is used to open a connection in HBase:

    Configuration myConf = HBaseConfiguration.create();

    HTableInterface usersTable = new HTable(myConf, "users");

    Q95.If you have an input file of 350 MB, how many input splits would HDFS create and what would be the size of each input split?


    By default, each block in HDFS is 128 MB. The size of every block except the last will be 128 MB. For an input file of 350 MB, there are three input splits in total: two of 128 MB and one of 94 MB.
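    The split arithmetic can be sketched directly, assuming the default 128 MB block size and one split per block (the function below is illustrative, not Hadoop's API):

```python
import math

# Sketch: derive input splits from a file size, one split per HDFS
# block; every split is block-sized except possibly the last.
def input_splits(file_mb, block_mb=128):
    n = math.ceil(file_mb / block_mb)
    return [block_mb] * (n - 1) + [file_mb - block_mb * (n - 1)]

print(input_splits(350))        # [128, 128, 94]
print(len(input_splits(1024)))  # 8
```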

    Before moving into the Hive interview questions, let us summarize what Hive is all about. Facebook developed Hive to overcome MapReduce's limitations. MapReduce proved difficult for users, who found it challenging to code because not all of them were well-versed in the required programming languages. Users wanted a language similar to SQL, which was well known to all of them. This gave rise to Hive.

    Q96.What does replication mean in terms of HBase?


    The replication feature in HBase provides a mechanism to copy data between clusters. This feature can be used as a disaster recovery solution that provides high availability for HBase.

    The following commands alter the hbase1 table and set the replication_scope to 1. A replication_scope of 0 indicates that the table is not replicated.

    • disable 'hbase1'
    • alter 'hbase1', {NAME => 'family_name', REPLICATION_SCOPE => '1'}
    • enable 'hbase1'

    Q97.Can you import/export in an HBase table?


    Yes, it is possible to import and export tables from one HBase cluster to another. 

    • HBase export utility:

    hbase org.apache.hadoop.hbase.mapreduce.Export <table name> <target export location>

    • HBase import utility: first create the target table on the destination cluster, then run the import.

    Example: create 'emp_table_import', {NAME => 'myfam', VERSIONS => 10}

    hbase org.apache.hadoop.hbase.mapreduce.Import <table name> <target import location>

    Q98.What is compaction in HBase?


    Compaction is the process of merging HBase files into a single file. This is done to reduce the amount of memory required to store the files and the number of disk seeks needed. Once the files are merged, the original files are deleted.

    Q99.How does Bloom filter work?


    The HBase Bloom filter is a mechanism to test whether an HFile contains a specific row or row-col cell. The Bloom filter is named after its creator, Burton Howard Bloom. It is a data structure that predicts whether a given element is a member of a set of data. These filters provide an in-memory index structure that reduces disk reads and determines the probability of finding a row in a particular file.
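    A minimal Bloom filter can be sketched in a few lines. This is not HBase's implementation, just an illustration of the idea: k hash functions set k bits per key, membership tests can yield false positives but never false negatives:

```python
import hashlib

# Sketch of a Bloom filter: an in-memory bit array that answers
# "definitely absent" or "probably present" for a key.
class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))               # True
print(BloomFilter().might_contain("row-42"))    # False: empty filter
```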

    Q100.Does HBase have any concept of the namespace?


    A namespace is a logical grouping of tables, analogous to a database in an RDBMS. An HBase namespace plays a role similar to the schema of an RDBMS database.

    • To create a namespace, use the command:

    create_namespace 'namespace_name'

    • To list all the tables that are members of a namespace, use the command:

    list_namespace_tables 'default'

    • To list all the namespaces, use the command:

    list_namespace
