Apache Impala: A Concise Tutorial Just An Hour – FREE

Last updated on 09th Jul 2020

Impala is the open source, native analytic database for Apache Hadoop. It is shipped by vendors such as Cloudera, MapR, Oracle, and Amazon. The examples provided in this tutorial were developed using Cloudera Impala.

Apache Impala

Apache Impala is a massively parallel processing (MPP) query engine that runs on the Hadoop platform. It is open source software, developed on the basis of Google’s Dremel paper.
It is an interactive, SQL-like query engine that runs on top of the Hadoop Distributed File System (HDFS). Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries against data stored in HDFS and Apache HBase without requiring data movement or transformation.
Impala integrates with the Apache Hive metastore database so that both components share the same database and table information. Analysts and data scientists can perform analytics on data stored in Hadoop through SQL tools. As a result, large-scale data processing (via MapReduce) and interactive queries can run on the same system using the same data and metadata, which eliminates the need to move data sets into specialized systems.

Apache Impala installation

General build requirements

sudo apt-get install build-essential git maven
sudo apt-get install openjdk-8-jdk
sudo apt-get install libpython-dev cmake
sudo apt-get install libssl-dev libsasl2-dev

Build the test environment

Install Postgres

sudo apt-get install postgresql
sudo service postgresql start

Edit pg_hba.conf to change "peer" to "trust"

sudo vim /etc/postgresql/*/main/pg_hba.conf
Go to the end of the file (Shift+G) and replace peer, ident, and md5 with trust in the last 4 lines.
sudo service postgresql restart
sudo -u postgres psql postgres
CREATE ROLE hiveuser LOGIN PASSWORD 'password';
ALTER ROLE hiveuser WITH CREATEDB;
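
The manual pg_hba.conf edit described above can also be scripted. The following is a minimal sketch, not part of the original steps: the function name set_trust_auth is made up for illustration, and it assumes the authentication method is the last field on each active "local" or "host" line. Unlike the manual instruction, it changes every active entry rather than only the last 4 lines.

```shell
# Rewrite the auth method (last field) of active pg_hba.conf entries to "trust".
# Commented-out lines are left untouched; a backup copy is kept as <file>.bak.
# set_trust_auth is a hypothetical helper name used only in this sketch.
set_trust_auth() {
    sed -E -i.bak '/^(local|host)/ s/(peer|ident|md5)[[:space:]]*$/trust/' "$1"
}
```

Because the real file is owned by the postgres user, the sed command would have to run with sudo against the actual path, and PostgreSQL must be restarted or reloaded before the change takes effect.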

Set up passwordless SSH for HBase

sudo apt-get install openssh-server
sudo service ssh start
ssh-keygen -t dsa

Do not type in any passkey. Just press enter.

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Set up ntpd for Kudu

sudo apt-get install ntp
sudo systemctl restart ntp.service

Add a path for HDFS domain sockets

sudo mkdir /var/lib/hadoop-hdfs/

Build native toolchain

sudo apt-get install build-essential git
sudo apt-get install bison
sudo apt-get install autoconf automake libtool
sudo apt-get install libz-dev
sudo apt-get install libssl-dev
sudo apt-get install libncurses-dev
sudo apt-get install libsasl2-dev libkrb5-dev
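
Before starting the long toolchain build, it can help to confirm that the command-line tools installed above are actually on the PATH. The sketch below is one possible check; the function name missing_tools is an assumption of this example, and it only covers packages that install an executable (the -dev library packages cannot be checked this way).

```shell
# Print each command that cannot be found on PATH, one per line.
# Returns 1 if at least one command is missing, 0 otherwise.
# missing_tools is a hypothetical helper name used only in this sketch.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example call covering the packages installed in the steps above:
# missing_tools git mvn java cmake gcc g++ make bison autoconf automake
```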

Checkout and build the toolchain

git clone https://github.com/cloudera/native-toolchain.git
cd native-toolchain
./buildall.sh

Clone Impala

cd /etc
git clone https://git-wip-us.apache.org/repos/asf/incubator-impala.git Impala
cd Impala

Export variables

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu
export LC_ALL="en_US.UTF-8"
export M2_HOME=/usr/share/maven
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
export IMPALA_HOME=/etc/Impala/
export BOOST_LIBRARYDIR=/usr/lib/x86_64-linux-gnu
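
A wrong path in one of these exports tends to surface only deep into the build. As a hedged sketch, the helper below (the name check_dir_vars is made up for this example) reports every exported variable that is unset or does not point to an existing directory.

```shell
# Print the name of each variable that is unset, empty, or not a directory.
# Variables must be exported, because printenv only sees the environment.
# check_dir_vars is a hypothetical helper name used only in this sketch.
check_dir_vars() {
    rc=0
    for name in "$@"; do
        value=$(printenv "$name" || true)
        if [ -z "$value" ] || [ ! -d "$value" ]; then
            echo "$name"
            rc=1
        fi
    done
    return "$rc"
}

# Example call for the directory-valued variables exported above:
# check_dir_vars JAVA_HOME M2_HOME IMPALA_HOME BOOST_LIBRARYDIR
```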

Using custom tool chain with Impala

cd ${IMPALA_HOME}
(mkdir -p toolchain && cd toolchain && ln -s ${NATIVE_TOOLCHAIN_HOME}/build/* .)
export SKIP_TOOLCHAIN_BOOTSTRAP=true
export IMPALA_TOOLCHAIN=${IMPALA_HOME}/toolchain

Build the packages

./buildall.sh -noclean -notests -skiptests

Build the local mini cluster (first time only)

${IMPALA_HOME}/buildall.sh -noclean -skiptests -build_shared_libs -format

Start services of cluster

source ${IMPALA_HOME}/bin/impala-config.sh
${IMPALA_HOME}/bin/start-impala-cluster.py

Verify installation

source ${IMPALA_HOME}/bin/impala-config.sh # If you didn’t already source impala-config.sh in this shell

impala-shell.sh -q "SELECT version()"

Starting Impala Shell without Kerberos authentication

Connected to localhost:21000

Server version: impalad version 2.2.0-INTERNAL DEBUG (build 47c90e004aecb928a37b926080098d30b96b4330)

Query: select version()

version()
impalad version 2.2.0-INTERNAL DEBUG (build 47c90e004aecb928a37b926080098d30b96b4330)
Built on Sun, Mar 22 15:22:57 PDT 2015
Fetched 1 row(s) in 0.05s
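
If a script needs to act on the reported version, the number can be pulled out of this output. The following is a small sketch that assumes the version line has the form shown above ("impalad version X.Y.Z-..."); the function name impalad_version is an assumption of this example.

```shell
# Read impala-shell output on stdin and print the numeric version (e.g. 2.2.0)
# taken from the first line that starts with "impalad version".
# impalad_version is a hypothetical helper name used only in this sketch.
impalad_version() {
    sed -n 's/^impalad version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

One way to use it would be to pipe the verification command into the function, for example impala-shell.sh -q "SELECT version()" | impalad_version.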

Starting Impala Shell

To start the Impala shell, open a terminal and execute the following command.

[cloudera@quickstart ~] $ impala-shell

This will start the Impala Shell, displaying the following message.

Starting Impala Shell without Kerberos authentication

Connected to quickstart.cloudera:21000

Server version: impalad version 2.3.0-cdh5.5.0 RELEASE (build 0c891d79aa38f297d244855a32f1e17280e2129b)

(Impala Shell v2.3.0-cdh5.5.0 (0c891d7) built on Mon Nov 9 12:18:12 PST 2015)

Press TAB twice to see a list of available commands.

[quickstart.cloudera:21000] >

Note: We will discuss all the impala-shell commands in later chapters.

Impala Query editor

In addition to the Impala shell, you can communicate with Impala using the Hue browser. After installing CDH5 and starting Impala, open your browser to reach the Cloudera homepage.

Now, click the Hue bookmark to open the Hue browser. This brings up the Hue login page, where you can log in with the credentials cloudera and cloudera.

As soon as you log on to the Hue browser, you see its Quick Start Wizard.

Click the Query Editors drop-down menu to get the list of available editors.

Click Impala in the drop-down menu to open the Impala query editor.

Impala Features

Impala provides support for:

1. Most common SQL-92 features of Hive Query Language (HiveQL), including SELECT, joins, and aggregate functions.

2. HDFS, HBase, and Amazon Simple Storage Service (S3) storage, including:

HDFS file formats: delimited text files, Parquet, Avro, SequenceFile, and RCFile.
Compression codecs: Snappy, GZIP, Deflate, BZIP.

3. Common data access interfaces, including:

JDBC driver.
ODBC driver.

4. Hue Beeswax and the Impala Query UI.

5. The impala-shell command-line interface.

6. Kerberos authentication.
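
As an illustration of the impala-shell interface listed above, the sketch below wraps a single non-interactive query. It relies on the impala-shell options -i (impalad host and port), -B (delimited output), and -q (query string). The function name run_query and the IMPALA_SHELL and IMPALA_HOST override variables are assumptions of this example, not Impala settings.

```shell
# Run one SQL statement through impala-shell and print delimited rows.
# IMPALA_SHELL (binary to call) and IMPALA_HOST (host:port) are hypothetical
# override variables for this sketch; the defaults match the earlier examples.
run_query() {
    "${IMPALA_SHELL:-impala-shell}" -i "${IMPALA_HOST:-localhost:21000}" -B -q "$1"
}

# Example call (the table name is made up):
# run_query "SELECT COUNT(*) FROM my_table"
```

Keeping the binary name overridable makes it easy to point the same wrapper at the development build's impala-shell.sh.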

Use Impala

Impala brings parallel processing database technology to the Hadoop ecosystem, so it can run low-latency queries interactively.
Impala saves time: it often returns results in seconds, whereas Hive on MapReduce spends time launching and processing jobs.
Impala also lets analysts and data scientists run analytics on data stored in the Hadoop file system through a real-time query engine.
Because it returns results in near real time, it works well with reporting and visualization tools such as Pentaho.
Impala has built-in support for the common Hadoop file formats (ORC, Parquet, etc.). This project provides high-performance, low-latency SQL queries on data stored in popular Apache Hadoop file formats.
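
To make the file-format support concrete, the following sketch writes a small SQL script that creates a Parquet-backed table and runs an aggregate query over it, then passes the script to impala-shell with the -f option. The table and column names are invented for illustration, and the script is only executed where impala-shell is installed.

```shell
# Write a small SQL script; the table and column names are hypothetical.
sql_file=$(mktemp)
cat > "$sql_file" <<'SQL'
CREATE TABLE IF NOT EXISTS page_views_parquet (
  user_id BIGINT,
  url STRING,
  view_time TIMESTAMP
) STORED AS PARQUET;

SELECT url, COUNT(*) AS views
FROM page_views_parquet
GROUP BY url
ORDER BY views DESC
LIMIT 10;
SQL

# Run the script only where impala-shell is actually available.
if command -v impala-shell >/dev/null 2>&1; then
    impala-shell -f "$sql_file"
fi
```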

Advantages of Impala

Impala has several advantages:

Fast Speed: Impala can process data in HDFS at very high speed.
Migrating data is not necessary: Data stored on Hadoop does not need to be transformed or moved, because processing is carried out where the data resides.
Big Data: A user can easily store and manage a large amount of data.
Languages: Because Impala exposes JDBC and ODBC drivers, it can be queried from most programming languages.
High Performance: It offers high-performance, low-latency queries on Hadoop.
Distributed: It provides a distributed environment in which a query is spread across the nodes of a cluster to share the workload, which also makes scaling convenient.
Easy Access: Data stored in HDFS, HBase, and Amazon S3 can be accessed easily, without requiring knowledge of Java.

Conclusion:

In this Impala tutorial for beginners, we have covered what Impala is, how to install it, how to start the Impala shell and query editor, and its main features and advantages. If you have any questions about this Impala tutorial, feel free to ask in the comment section.
