Note: These introductory notes on Apache Hive and its closely related components were compiled from various sources during the learning process. The main purpose is to give a quick overview of Hive and related topics in a single place, so that the reader can quickly grasp what it is. It tries to answer (or at least cover) the common questions and topics related to Hive. Please follow the links for more details on a specific topic. It is very informally presented (sorry about that :)).
Keywords: Hive, HCatalog, HiveQL, Hadoop, YARN, Hortonworks, HBase, Pig, SerDe (Serializer/Deserializer), MapReduce, Apache Tez, Beeline Hive client, RCFile, ORCFile, Hortonworks Sandbox, External Table, Thrift.
Suggested reading:
It is assumed that the reader has a basic understanding of the Hadoop framework, HDFS, HBase, and MapReduce.
Contents
1 What is Hive?
1.1 What Hive is NOT
2 Hive and HCatalog – are they different components?
3 Data Units
4 Data Types
5 Operators and functions
6 Why do Pig and Hive exist when they seem to do much of the same thing?
7 SerDe – Serializer/Deserializer
8 External table
9 Is it possible to use HBase to store data?
10 Hive Configurations
11 How can we run HiveQL from other programs – such as Java programs?
12 Miscellaneous
12.1 Hortonworks Sandbox
12.2 Improving performance of Hive x100 (Stinger)
1 What is Hive?
Hive is a data warehouse infrastructure based on the Hadoop framework. It is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to do ad-hoc querying, summarization, and data analysis easily. At the same time, HiveQL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
Apache Hive is a data warehouse
infrastructure built on top of Hadoop for providing data summarization, query,
and analysis. While initially developed by Facebook, Apache Hive is now used
and developed by other companies such as Netflix.[2][3] Amazon maintains a
software fork of Apache Hive that is included in Amazon Elastic MapReduce on
Amazon Web Services.[4]
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 filesystem.
By default, Hive stores metadata
in an embedded Apache Derby database, and other client/server databases like
MySQL can optionally be used.
(Do not confuse Apache Hive with Apache Beehive, a Java framework on top of Struts that helps reuse legacy code; Beehive has been retired since 2011.)
1.1 What Hive is NOT
Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (like web logs): ad-hoc queries, summarization, and so on.
(Figure: ad-hoc versus predefined queries, and where Hive sits in the Hadoop ecosystem. Image source: Hortonworks.com)
2 Hive and HCatalog – are they different components?
HCatalog is a component of Hive. It stores/collects the metadata of tables and supports many DDL queries as well. HCatalog can also be used without Hive, by other components such as Pig and MapReduce jobs.
A workflow where data is loaded and normalized using MapReduce or Pig and then analyzed via Hive is very common.
WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, and Hive jobs, or to perform Hive metadata operations, over an HTTP (REST-style) interface.
The function of HCatalog is to hold the location of, and metadata about, the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from the data location and from metadata such as the schema. Additionally, since HCatalog supports many tools, like Hive and Pig, the location and metadata can be shared between tools. Using the open APIs of HCatalog, other tools like Teradata Aster can also use the location and metadata in HCatalog. In the tutorials we will see how we can reference data by name and inherit the location and metadata.
3 Data Units
In order of granularity, Hive data is organized into: databases, tables, partitions, and buckets (or clusters). Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.
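To make this concrete, here is a minimal sketch of a partitioned and bucketed table (the table and column names are invented for illustration):
-- Each distinct log_date value becomes its own directory, so queries
-- that filter on log_date can skip the other partitions entirely;
-- each partition is further split into 32 files by hash of user_id.
CREATE TABLE web_logs (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (log_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;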
4 Data Types
Hive supports primitive types (integer, double, etc.) and complex types: structs, maps (key-value tuples), and arrays (indexable lists). For example, a JSON document with a nested struct field can be modeled like this:
create external table books (
  verb string,
  object struct<bookID:string>,
  published string
)
row format serde 'org.openx.data.jsonserde.JsonSerDe';
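A struct field is then read with dot notation; for example, to pull the nested bookID out of each record:
SELECT object.bookID FROM books;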
5 Operators and functions
Operators and functions come in two flavors: built-in and user-defined (UDFs). See https://cwiki.apache.org/confluence/display/Hive/Tutorial and https://cwiki.apache.org/confluence/display/Hive/Home. An example of writing a user-defined function (toupper) and a SerDe: http://snowplowanalytics.com/blog/2013/02/08/writing-hive-udfs-and-serdes/
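As a sketch of how such a UDF is wired in from HiveQL (the jar path and class name below are placeholders; the linked tutorial covers the Java side):
-- Hypothetical jar and class implementing a toupper UDF.
ADD JAR /tmp/toupper-udf.jar;
CREATE TEMPORARY FUNCTION toupper AS 'com.example.hive.ToUpper';
SELECT toupper(verb) FROM books;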
6 Why do Pig and Hive exist when they seem to do much of the same thing?
Hive, because of its SQL-like query language, is often used as the interface to an Apache Hadoop based data warehouse. Hive is considered friendlier and more familiar to users who are used to using SQL for querying data. Pig fits in through its data-flow strengths, where it takes on the tasks of bringing data into Apache Hadoop and working with it to get it into the form for querying. A good overview of how this works is Alan Gates' posting on the Yahoo Developer blog titled Pig and Hive at Yahoo!. From a technical point of view, both Pig and Hive are feature complete, so you can do tasks in either tool. However, you will find that one tool or the other will be preferred by the different groups that have to use Apache Hadoop. The good part is that they have a choice, and both tools work together.
Hive – when you want to query the data, when you need an answer to specific questions, and if you are familiar with SQL.
Pig – for ETL, for preparing data for easier analysis, and when you have a long series of steps to perform.
For example, Hive is commonly used at Facebook for analytical purposes. Facebook promotes the Hive language, and its employees frequently speak about Hive at Big Data and Hadoop conferences. Yahoo!, however, is a big advocate of Pig Latin: Yahoo! has one of the biggest Hadoop clusters in the world, and its data engineers use Pig for data processing on their Hadoop clusters.
7 SerDe – Serializer/Deserializer
SerDe stands for Serializer/Deserializer (Se – serializer, De – deserializer). By implementing a custom Serializer and Deserializer, you can write any format of data into a table (file) and read it back. The custom serializer/deserializer can be specified while loading data into the table, so you do not need to worry about the data format; we can write our own. When loading data such as plain text files (logs, news texts, etc.), the deserializer is used.
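As an illustration (the table layout and regex are invented; RegexSerDe itself ships with Hive), a plain-text log file can be deserialized into columns like this:
-- Hypothetical log table: each line looks like "host status".
-- Each regex capture group is deserialized into one column, in order.
CREATE TABLE access_logs (
  host   STRING,
  status STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*)");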
8 External table
External tables are often used for staging purposes. Unless a table is declared EXTERNAL, it is stored inside a folder specified by the configuration property hive.metastore.warehouse.dir. An EXTERNAL table can point to any HDFS location for its storage. You still have to make sure that the data format specified matches the data.
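A minimal sketch (the HDFS path and columns are placeholders); the LOCATION clause points the table at existing data instead of the warehouse directory:
-- Dropping an EXTERNAL table removes only the metadata; the files
-- under /data/staging/books are left untouched in HDFS.
CREATE EXTERNAL TABLE staged_books (
  verb      STRING,
  published STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/staging/books';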
9 Is it possible to use HBase to store data?
Yes, but there are certain reasons why HBase (a column-oriented database) is not the preferred data store for Hive.
The HBase interaction module is completely optional, so you have to make sure it and its HBase dependencies are available on Hive's classpath.
The most glaring issue barring real application development is the impedance mismatch between Hive's typed, dense schema and HBase's untyped, sparse schema.
When response time must stay within an acceptable limit defined by a service-level agreement (SLA), users may prefer to use HBase directly.
See also: Using Hive to interact with HBase.
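For reference, this is the documented storage-handler pattern for backing a Hive table with HBase (the table and column names are illustrative):
-- ":key" maps the Hive column key to the HBase row key;
-- "cf1:val" maps value to column family cf1, qualifier val.
CREATE TABLE hbase_books (
  key   INT,
  value STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "books");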
10 Hive Configurations
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties lists the Hive configuration properties, which can also be set from the query window. One of the properties is hive.execution.engine (mr – the default – or tez).
The Hive execution engine is either MapReduce or Tez (see Hive on Tez: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez). Tez (http://incubator.apache.org/projects/tez.html, http://hortonworks.com/hadoop/tez/) is a new application framework built on Hadoop YARN (YARN itself arrived with Hadoop 2.x) that can execute complex directed acyclic graphs of general data-processing tasks. In many ways it can be thought of as a more flexible and powerful successor of the map-reduce framework. It was added in February 2013. Tez is part of the Stinger initiative – making 100x performance improvements at petabyte scale with familiar SQL semantics.
Tez generalizes map and reduce tasks by exposing interfaces for generic data-processing tasks, which consist of a triplet of interfaces: input, output, and processor. These tasks are the vertices in the execution graph. Edges (i.e., data connections between tasks) are first-class citizens in Tez and, together with the input/output interfaces, greatly increase the flexibility of how data is transferred between tasks. Tez also greatly extends the possible ways in which individual tasks can be linked together; in fact, any arbitrary DAG can be executed directly in Tez.
Another property (parameter) is hive.exec.reducers.bytes.per.reducer (default value: 1000000000). It sets the input size handled per reducer: the default is about 1 GB, so if the input size is 10 GB then 10 reducers will be used.
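Both properties can be changed per session from the query window; for example (the 256 MB figure is only an illustration):
-- Run this session on Tez instead of classic MapReduce.
SET hive.execution.engine=tez;
-- Lower the per-reducer input size so more reducers are used.
SET hive.exec.reducers.bytes.per.reducer=256000000;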
11 How can we run HiveQL from other programs – such as Java programs?
The original HiveServer supports only embedded mode (i.e., a local HiveServer) from the command line. The command-line tool supported in HiveServer2 is Beeline (a JDBC client) – it can work with a local as well as a remote server.
HiveServer2 (HS2) is a server interface (introduced in Hive version 0.11) that enables remote clients to execute queries against Hive and retrieve the results.
For your information, HiveServer2 exposes a Thrift interface (Thrift was developed by Facebook for scalable cross-language services development; services are defined using an Interface Definition Language, IDL), so clients written in many languages can connect to the HiveServer directly.
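A minimal Java/JDBC sketch (host, port, credentials, and the books table are placeholders to adapt to your cluster; the driver class and the jdbc:hive2 URL scheme are the standard HiveServer2 ones):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver, shipped in the hive-jdbc artifact.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // 10000 is HiveServer2's default Thrift port.
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hive", "");
    try (Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT verb, published FROM books LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getString(2));
      }
    }
    con.close();
  }
}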
12 Miscellaneous
Why are ACID properties not sought in distributed and big-data systems? See the CAP theorem (Brewer's theorem): http://en.wikipedia.org/wiki/CAP_theorem
CAP is often considered a justification for using weaker consistency models. Popular among these is BASE, an acronym for Basically Available, Soft-state services with Eventual consistency. In summary, the BASE methodology is characterized by high availability for first-tier services, leaving some kind of background cleanup mechanism to resolve any problems created by optimistic actions that later turn out to have violated consistency.
12.1 Hortonworks Sandbox
Sandbox is a personal, portable
Hadoop environment that comes with a dozen interactive Hadoop tutorials.
Sandbox includes many of the most exciting developments from the latest HDP
distribution, packaged up in a virtual environment that you can get up and running
in 15 minutes! (http://hortonworks.com/products/hortonworks-sandbox/)
12.2 Improving performance of Hive x100 (Stinger)
(Image: Hortonworks.com)
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html (ORCFile – Optimized Row Columnar file format)
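A small sketch of adopting ORC (the table name is illustrative; STORED AS ORC is the documented syntax):
-- Rewrites the books data into ORC's columnar layout, which is
-- where much of the Stinger speedup comes from.
CREATE TABLE books_orc
STORED AS ORC
AS SELECT verb, published FROM books;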