Apache Impala: A Concise Tutorial Just An Hour | ACTE

Last updated on 09th Jul 2020, Blog, Tutorials

About author

Kumar (Hadoop Developer)

Kumar is a Python developer with 7+ years of experience in the Hadoop ecosystem, including Sqoop, Hive, Spark, Scala, and MapReduce, as well as NoSQL databases such as HBase, Cassandra, and MongoDB. He spends most of his time researching technology and startups.


Impala is the open source, native analytic database for Apache Hadoop. It is shipped by vendors such as Cloudera, MapR, Oracle, and Amazon. The examples in this tutorial were developed using Cloudera Impala.

Apache Impala

  • Apache Impala is a massively parallel processing (MPP) query engine that runs on the Hadoop platform. It is open source software inspired by Google’s Dremel paper.
  • It is an interactive, SQL-like query engine that runs on top of the Hadoop Distributed File System (HDFS). Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries against data stored in HDFS and Apache HBase without requiring data movement or transformation.
  • Impala integrates with the Apache Hive metastore database so that both components share the same table information. Analysts and data scientists use Impala to analyze data stored in Hadoop through SQL tools. As a result, large-scale data processing (via MapReduce) and interactive queries can run on the same system using the same data and metadata, which eliminates the need to move data sets into specialized systems.

    Apache Impala installation

    General build requirements

    • sudo apt-get install build-essential git maven
    • sudo apt-get install openjdk-8-jdk
    • sudo apt-get install libpython-dev cmake
    • sudo apt-get install libssl-dev libsasl2-dev
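Before moving on, it can help to confirm that the packages above actually put their tools on the PATH. The following is a minimal sketch, not part of the Impala build scripts. The function name missing_tools and the list of commands checked (git, mvn, java, cmake, gcc, make) are assumptions based on what these packages normally install.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every command that is NOT on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed mapping: build-essential -> gcc/make, maven -> mvn, openjdk-8-jdk -> java.
missing="$(missing_tools git mvn java cmake gcc make)"
if [ -n "$missing" ]; then
    echo "Still missing:"
    echo "$missing"
else
    echo "All build tools found."
fi
```

If anything is listed as missing, re-run the matching apt-get line before starting the build.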

    Build the test environment

    Install Postgres

    • sudo apt-get install postgresql
    • sudo service postgresql start

    Edit pg_hba.conf to change “peer” to “trust”

    • sudo vim /etc/postgresql/*/main/pg_hba.conf
    • Go to the end of the file (Shift+G) and replace peer, ident, and md5 with trust in the last 4 lines
    • sudo service postgresql restart
    • sudo -u postgres psql postgres 
    • CREATE ROLE hiveuser LOGIN PASSWORD 'password';
    • ALTER ROLE hiveuser WITH CREATEDB;
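The manual pg_hba.conf edit described above can also be scripted. The sketch below assumes sed with the -E and -i options (as in GNU sed); the function name set_trust_auth and the sample file contents are illustrative only. It runs against a throwaway copy, so nothing under /etc is touched.

```shell
#!/bin/sh
# Hypothetical helper: switch the auth method of local/host rules to "trust".
# It edits whatever file path it is given and leaves a .bak backup beside it.
set_trust_auth() {
    conf_file="$1"
    # Only touch rule lines that start with "local" or "host";
    # replace a trailing peer, ident, or md5 method with trust.
    sed -E -i.bak '/^(local|host)[[:space:]]/ s/(peer|ident|md5)[[:space:]]*$/trust/' "$conf_file"
}

# Demo on a temporary file whose lines mimic a default pg_hba.conf.
demo_conf="$(mktemp)"
cat > "$demo_conf" <<'EOF'
# TYPE  DATABASE  USER  ADDRESS       METHOD
local   all       all                 peer
host    all       all   127.0.0.1/32  md5
host    all       all   ::1/128       md5
EOF
set_trust_auth "$demo_conf"
cat "$demo_conf"
```

To apply it for real, the function would need to be run with root privileges on the actual pg_hba.conf, followed by a PostgreSQL restart. Note that "trust" disables password checks, so this is only appropriate for a local test environment.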

    Set up passwordless SSH for HBase

    • sudo apt-get install openssh-server
    • sudo service ssh start
    • ssh-keygen -t dsa

    Do not enter a passphrase when prompted; just press Enter.

    • cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
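Running the cat command above more than once appends the same key again each time. As a hedged alternative, the sketch below appends a public key only when it is not already present and then restricts the file permissions, which sshd normally expects. The function name authorize_key and the placeholder key text are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: add a public key to an authorized_keys file only once.
authorize_key() {
    pub_key_file="$1"
    auth_file="$2"
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # -x: match whole lines, -F: fixed strings, -f: read the pattern from the key file.
    if ! grep -qxF -f "$pub_key_file" "$auth_file"; then
        cat "$pub_key_file" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
}

# Demo with a fake key in a temporary directory (not a usable key).
demo_dir="$(mktemp -d)"
echo "ssh-rsa AAAAplaceholderkeydata demo@localhost" > "$demo_dir/id_demo.pub"
authorize_key "$demo_dir/id_demo.pub" "$demo_dir/authorized_keys"
authorize_key "$demo_dir/id_demo.pub" "$demo_dir/authorized_keys"
# The second call is a no-op, so the file still holds a single line.
wc -l < "$demo_dir/authorized_keys"
```

With the key generated earlier, the equivalent call would be authorize_key ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys.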

    Set up ntpd for Kudu

    • sudo apt-get install ntp
    • sudo systemctl restart ntp.service

    Add a path for HDFS domain sockets

    • sudo mkdir /var/lib/hadoop-hdfs/

    Build native toolchain

    • sudo apt-get install build-essential git
    • sudo apt-get install bison
    • sudo apt-get install autoconf automake libtool
    • sudo apt-get install libz-dev
    • sudo apt-get install libssl-dev
    • sudo apt-get install libncurses-dev
    • sudo apt-get install libsasl2-dev libkrb5-dev

    Check out and build the toolchain

    • git clone https://github.com/cloudera/native-toolchain.git
    • cd native-toolchain
    • ./buildall.sh

    Clone Impala

    • cd /etc
    • git clone https://git-wip-us.apache.org/repos/asf/incubator-impala.git Impala
    • cd Impala

    Export variables

    • export JAVA_HOME=/usr/lib/jvm/java-7-oracle
    • export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu
    • export LC_ALL="en_US.UTF-8"
    • export M2_HOME=/usr/share/maven
    • export M2=$M2_HOME/bin
    • export PATH=$M2:$PATH
    • export IMPALA_HOME=/etc/Impala/
    • export BOOST_LIBRARYDIR=/usr/lib/x86_64-linux-gnu
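A mistyped path in any of these exports tends to surface only much later in the build. The following pre-flight check is a sketch under the assumption that JAVA_HOME, M2_HOME, IMPALA_HOME, and BOOST_LIBRARYDIR should each point at an existing directory; the function name check_build_env is hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-flight check: report variables that are unset or that
# point at a path which is not a directory. Returns the number of problems.
check_build_env() {
    problems=0
    for name in JAVA_HOME M2_HOME IMPALA_HOME BOOST_LIBRARYDIR; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name"
            problems=$((problems + 1))
        elif [ ! -d "$value" ]; then
            echo "not a directory: $name=$value"
            problems=$((problems + 1))
        fi
    done
    return "$problems"
}

check_build_env || echo "Fix the items listed above before running buildall.sh"
```

The function prints nothing and returns 0 when all four variables look sane, so it can be placed in front of the build command in a script.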

    Using a custom toolchain with Impala

    • cd ${IMPALA_HOME}
    • (mkdir -p toolchain && cd toolchain && ln -s ${NATIVE_TOOLCHAIN_HOME}/build/* .)
    • export SKIP_TOOLCHAIN_BOOTSTRAP=true
    • export IMPALA_TOOLCHAIN=${IMPALA_HOME}/toolchain
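The symlink step above creates a broken link without any error if NATIVE_TOOLCHAIN_HOME is unset or its build directory is empty. The sketch below performs the same linking with basic checks. The function name link_toolchain and the pkg-a/pkg-b directory names are placeholders, not real toolchain package names.

```shell
#!/bin/sh
# Hypothetical helper: link every entry of <native_home>/build into
# <impala_home>/toolchain, failing early if the build directory is absent.
link_toolchain() {
    native_home="$1"
    impala_home="$2"
    if [ ! -d "$native_home/build" ]; then
        echo "no build directory under $native_home" >&2
        return 1
    fi
    mkdir -p "$impala_home/toolchain"
    for pkg in "$native_home"/build/*; do
        [ -e "$pkg" ] || continue    # the glob matched nothing
        # -f replaces an existing link, -n avoids descending into it.
        ln -sfn "$pkg" "$impala_home/toolchain/$(basename "$pkg")"
    done
}

# Demo with temporary directories standing in for the two checkouts.
demo_native="$(mktemp -d)"
demo_impala="$(mktemp -d)"
mkdir -p "$demo_native/build/pkg-a" "$demo_native/build/pkg-b"
link_toolchain "$demo_native" "$demo_impala"
ls "$demo_impala/toolchain"
```

In the real setup the call would be link_toolchain "$NATIVE_TOOLCHAIN_HOME" "$IMPALA_HOME", and it can be re-run safely after rebuilding the toolchain.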

    Build the packages

    • ./buildall.sh -noclean -notests -skiptests

    Building local mini cluster (first time only)

    • ${IMPALA_HOME}/buildall.sh -noclean -skiptests -build_shared_libs -format

    Start services of cluster

    • source ${IMPALA_HOME}/bin/impala-config.sh
    • ${IMPALA_HOME}/bin/start-impala-cluster.py

    Verify installation

    source ${IMPALA_HOME}/bin/impala-config.sh # If you didn’t already source impala-config.sh in this shell

    • impala-shell.sh -q "SELECT version()"

    Starting Impala Shell without Kerberos authentication

    Connected to localhost:21000

    Server version: impalad version 2.2.0-INTERNAL DEBUG (build 47c90e004aecb928a37b926080098d30b96b4330)

    Query: select version()

    version()
    impalad version 2.2.0-INTERNAL DEBUG (build 47c90e004aecb928a37b926080098d30b96b4330)
    Built on Sun, Mar 22 15:22:57 PDT 2015
    Fetched 1 row(s) in 0.05s
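For scripted checks it is often enough to extract just the version number from this output. The following is a minimal sketch that assumes the text contains the phrase "impalad version" followed by a dotted number, as in the sample above; the function name impala_version is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin and print the dotted version
# number that follows "impalad version" (first match only).
impala_version() {
    sed -n 's/^.*impalad version \([0-9][0-9.]*\).*$/\1/p' | head -n 1
}

# Sample line copied from the output shown above.
sample='impalad version 2.2.0-INTERNAL DEBUG (build 47c90e004aecb928a37b926080098d30b96b4330)'
printf '%s\n' "$sample" | impala_version
```

Against a running cluster, the same function could be fed from impala-shell.sh -B -q "SELECT version()", where -B asks the shell for plain delimited output.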
    Starting Impala Shell
    • To start the Impala shell, open a terminal and run the following command.
    • [cloudera@quickstart ~] $ impala-shell
    • This starts the Impala shell, which displays the following message.

    Starting Impala Shell without Kerberos authentication 

    Connected to quickstart.cloudera:21000 

    Server version: impalad version 2.3.0-cdh5.5.0 RELEASE (build 0c891d79aa38f297d244855a32f1e17280e2129b)

    Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.

    (Impala Shell v2.3.0-cdh5.5.0 (0c891d7) built on Mon Nov 9 12:18:12 PST 2015)

    Press TAB twice to see a list of available commands.

    [quickstart.cloudera:21000] >

    Note: We will discuss all the impala-shell commands in later chapters.

    Impala Query editor

    In addition to the Impala shell, you can communicate with Impala using the Hue browser. After installing CDH5 and starting Impala, open your browser to reach the Cloudera homepage.

    [Screenshot: Cloudera homepage]

    Click the Hue bookmark to open the Hue browser. The Hue login page appears; log in with the username cloudera and the password cloudera.

    [Screenshot: Hue login page]

    As soon as you log in, Hue shows its Quick Start Wizard.

    [Screenshot: Hue Quick Start Wizard]

    Open the Query Editors drop-down menu to see the list of available query editors.

    [Screenshot: Query Editors drop-down menu]

    Select Impala in the drop-down menu to open the Impala query editor.

    [Screenshot: Impala query editor]

    Impala Features

    Impala provides support for:

    1. The most common SQL-92 features of Hive Query Language (HiveQL), including SELECT, joins, and aggregate functions.

    2. HDFS, HBase, and Amazon Simple Storage Service (S3) storage, including:

    • HDFS file formats: delimited text files, Parquet, Avro, SequenceFile, and RCFile.
    • Compression codecs: Snappy, GZIP, Deflate, BZIP.

    3. Common data access interfaces, including:

    • JDBC driver.
    • ODBC driver.

    4. Hue Beeswax and the Impala Query UI.

    5. The impala-shell command-line interface.

    6. Kerberos authentication.
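To make a few of these features concrete (SELECT with aggregate functions, the Parquet file format, and impala-shell), the sketch below writes a small query file and runs it only if impala-shell is installed. The table demo_sales and its columns are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Write a small query file; demo_sales and its columns are hypothetical.
query_file="$(mktemp)"
cat > "$query_file" <<'EOF'
CREATE TABLE IF NOT EXISTS demo_sales (region STRING, amount DOUBLE)
  STORED AS PARQUET;
INSERT INTO demo_sales VALUES ('north', 10.5), ('south', 4.0), ('north', 2.5);
SELECT region, COUNT(*) AS orders, SUM(amount) AS total
  FROM demo_sales
  GROUP BY region
  ORDER BY region;
EOF

# The -f option makes impala-shell read its statements from a file.
if command -v impala-shell >/dev/null 2>&1; then
    impala-shell -f "$query_file"
else
    echo "impala-shell not found; query file written to $query_file"
fi
```

Inserting rows with VALUES is convenient for a demo but creates many small files on HDFS, so real tables are usually loaded in bulk instead.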

    Why use Impala
    • Impala brings parallel processing database technology to the Hadoop ecosystem, so it can run low-latency queries interactively.
    • Impala saves time: it typically returns results in seconds, whereas Hive on MapReduce spends time launching jobs before it processes a query.
    • Impala lets analysts and data scientists run analytics on data stored in the Hadoop file system through a real-time query engine.
    • Because it returns results in near real time, it works well with reporting and visualization tools such as Pentaho.
    • Impala has built-in support for common Hadoop file formats (ORC, Parquet, etc.) and provides high-performance, low-latency SQL queries on data stored in them.
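When a reporting tool cannot connect to Impala directly, one simple hand-off is a delimited text file. The following sketch wraps impala-shell for that purpose, using its -B (delimited output), --output_delimiter, --print_header, -q, and -o options. The function name export_csv and the output file name are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: run one query and save the result as comma-delimited
# text with a header row.
export_csv() {
    query="$1"
    out_file="$2"
    impala-shell -B --output_delimiter=',' --print_header -q "$query" -o "$out_file"
}

# Run the demo only when impala-shell is actually installed.
if command -v impala-shell >/dev/null 2>&1; then
    export_csv "SELECT version()" "version.csv"
else
    echo "impala-shell not found; skipping the export demo"
fi
```

Values that themselves contain commas are not quoted by this approach, so a different delimiter may be safer for free-text columns.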
    Advantages of Impala

    Impala offers several advantages:

    • Speed: Impala processes data in HDFS very quickly.
    • No data migration: Data stored in Hadoop does not need to be transformed or moved, because processing is carried out where the data resides.
    • Big data: Users can store and manage large amounts of data.
    • Languages: Impala can be reached from many programming languages through its JDBC and ODBC drivers.
    • High performance: It offers high-performance, low-latency queries on Hadoop.
    • Distributed: A query is distributed across the nodes of a cluster, which spreads the workload and makes scaling convenient.
    • Easy access: Data stored in HDFS, HBase, and Amazon S3 can be accessed with SQL, without requiring knowledge of Java.
    Conclusion:
    • In this Impala tutorial for beginners, we covered what Impala is, how to build and start it, and its main features and advantages. If you have any questions about this tutorial, feel free to ask in the comment section.
