Wednesday, February 26, 2014

Understanding Apache Hive and Related Concepts: A Casual Compilation


Note: This introductory article on Apache Hive and its closely related components was compiled from various sources during my learning process. The main purpose is to give a quick overview of Hive and related pieces in a single place so that the reader can quickly grasp what it is. It tries to answer (or at least cover) the common questions and topics related to Hive. Please follow the links for more details on each specific topic. It's all very informally presented (sorry about that :-)).

Keywords: Hive, HCatalog, HiveQL, Hadoop, YARN, Hortonworks, HBase, Pig, SerDe (serializer/deserializer), MapReduce, Apache Tez, Beeline Hive client, RCFile, ORCFile, Hortonworks Sandbox, External Table, Thrift.

Suggested reading: It is assumed that the reader has a basic understanding of the Hadoop framework, HDFS, HBase, and MapReduce.

Contents
1       What is Hive?
1.1         What Hive is NOT?
2       Hive and HCatalog – are they different components?
3       Data Units
4       Data Types
5       Operators and functions
6       Why do Pig and Hive exist when they seem to do much of the same thing?
7       SerDe: Serializer/Deserializer
8       External table
9       Is it possible to use HBase to store data?
10          Hive Configurations
11          How can we run HiveQL from other programs – such as Java programs?
12          Miscellaneous
12.1      Hortonworks Sandbox
12.2      Improving performance of Hive x100 (Stinger)




1         What is Hive?

Hive is a data warehouse infrastructure built on top of the Hadoop framework. It is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to do ad-hoc querying, summarization, and data analysis easily. At the same time, HiveQL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 filesystem.
By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used instead.
(Don't confuse Apache Hive with Apache Beehive, a Java framework on top of Struts that helped reuse legacy code; it has been retired since 2011.)

1.1       What Hive is NOT?

Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (like web logs) – ad-hoc queries, summarization, and so on; think ad-hoc rather than predefined workloads.

Where does Hive sit in the Hadoop ecosystem?


Image source (Hortonworks.com)

2         Hive and HCatalog – are they different components?

HCatalog is a component of Hive. It stores/collects the metadata of tables and supports many DDL queries as well. Even without Hive, though, HCatalog can be used by other components, such as Pig and MapReduce jobs.
A workflow where data is loaded and normalized using MapReduce or Pig and then analyzed via Hive is very common.
WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, and Hive jobs, or to perform Hive metadata operations, using an HTTP (REST-style) interface.

The function of HCatalog is to hold the location of, and metadata about, the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from the data's location and from metadata like the schema. Additionally, since HCatalog supports many tools, like Hive and Pig, the location and metadata can be shared between them. Using the open APIs of HCatalog, other tools like Teradata Aster can also use the location and metadata in HCatalog. In the (Hortonworks) tutorials we will see how we can reference data by name and inherit the location and metadata.

3         Data Units

In increasing order of granularity, Hive data is organized into: databases, tables, partitions, and buckets (or clusters).
Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.
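For illustration, here is a minimal sketch of a partitioned and bucketed table; the table and column names are made up:

create table page_views (
  user_id string,
  url string
)
-- each distinct view_date becomes its own directory under the table
partitioned by (view_date string)
-- rows are hashed on user_id into 32 files per partition
clustered by (user_id) into 32 buckets;

A query that filters on view_date then only reads the matching partition directories instead of scanning the whole table.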

4         Data Types

Hive supports primitive types (integer, double, etc.) and complex types: structs, maps (key-value tuples), and arrays (indexable lists). For example:
create external table books (
  verb string,
  object struct<bookID:string>,
  published string
)
row format serde 'org.openx.data.jsonserde.JsonSerDe';
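The complex types can also be combined with a delimited text format. A minimal sketch with made-up names and delimiters:

create table users (
  name string,
  -- an indexable list
  favorites array<string>,
  -- key-value tuples
  properties map<string, string>
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by ':';

Elements are then addressed as favorites[0] and properties['some_key'] in queries.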

5         Operators and functions

Both built-in and user-defined operators and functions are supported.
http://snowplowanalytics.com/blog/2013/02/08/writing-hive-udfs-and-serdes/ (example of a user-defined function – toupper)
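As a rough sketch, built-in functions are called directly in a query, while a user-defined function packaged in a jar has to be registered first; the jar path and class name below are hypothetical:

-- built-in functions
select upper(verb), length(verb) from books;

-- register and use a custom UDF (jar path and class name are made up)
add jar /tmp/my-hive-udfs.jar;
create temporary function to_upper as 'com.example.hive.udf.ToUpper';
select to_upper(verb) from books;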


6         Why do Pig and Hive exist when they seem to do much of the same thing?

Hive, because of its SQL-like query language, is often used as the interface to an Apache Hadoop based data warehouse. Hive is considered friendlier and more familiar to users who are used to SQL for querying data. Pig fits in through its data-flow strengths, taking on the tasks of bringing data into Apache Hadoop and working it into the form needed for querying. A good overview of how this works is Alan Gates's posting on the Yahoo Developer blog titled Pig and Hive at Yahoo!. From a technical point of view, both Pig and Hive are feature complete, so you can do tasks in either tool. However, you will find that one tool or the other is preferred by the different groups that have to use Apache Hadoop. The good part is that they have a choice, and both tools work together.
Hive – when you want to query the data, when you need an answer to specific questions, and if you are familiar with SQL.
Pig – for ETL and for preparing data for easier analysis, when you have a long series of steps to perform.

For example, Hive is commonly used at Facebook for analytical purposes. Facebook promotes the Hive language, and its employees frequently speak about Hive at Big Data and Hadoop conferences. Yahoo!, on the other hand, is a big advocate of Pig Latin. Yahoo! has one of the biggest Hadoop clusters in the world, and its data engineers use Pig for data processing on their Hadoop clusters.



7         SerDe: Serializer/Deserializer

SerDe stands for Serializer/Deserializer (Se – serializer, De – deserializer). By implementing a custom serializer and deserializer, you can write data in any format into a table (file) and read it back.
The custom SerDe is specified when the table is defined, so you don't need to worry about the data format in advance. We can write our own custom serializer/deserializer.
When data is read from a table (such as plain text files – logs, news texts, etc.), the deserializer is used; when data is written, the serializer is used.
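For instance, here is a sketch using the built-in RegexSerDe to read simple web-server logs; the columns, regex, and path are simplified and hypothetical (note that older Hive versions ship this class in the contrib package instead):

create external table access_logs (
  host string,
  request string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties (
  "input.regex" = "([^ ]*) \"([^\"]*)\""
)
location '/data/access_logs';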



8         External table

External tables are often used for staging data.
Unless a table is specified as EXTERNAL, it is stored inside a folder specified by the configuration property hive.metastore.warehouse.dir. An EXTERNAL table instead points to any HDFS location of your choosing for its storage; dropping it removes only the metadata, not the underlying data. You still have to make sure that the data format specified matches the data.
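A minimal sketch, with a made-up path and schema:

-- the directory should already exist in HDFS; DROP TABLE will leave it untouched
create external table staged_logs (
  line string
)
location '/user/hive/staging/logs';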

9         Is it possible to use HBase to store data?

Yes. But there are certain reasons why HBase (a column-oriented database) is not the preferred data store for Hive.
The HBase interaction module is completely optional, so you have to make sure that it and its HBase dependencies are available on Hive's classpath.
The most glaring issue barring real application development is the impedance mismatch between Hive's typed, dense schema and HBase's untyped, sparse schema.
Also, if response times must stay within an acceptable limit – a service level agreement (SLA) – users may prefer to use HBase directly.
Using Hive to interact with HBase
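As a sketch, the storage handler maps a Hive table onto an HBase table; the table and column family names below are made up:

create table hbase_books (
  key string,
  title string
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
-- map the Hive columns to the HBase row key and the info:title column
with serdeproperties ("hbase.columns.mapping" = ":key,info:title")
tblproperties ("hbase.table.name" = "books");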

  

10    Hive Configurations


https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties lists the Hive configuration properties; many of them can also be set per session from the query window.
One of the properties is hive.execution.engine (mr – the default – or tez).
The Hive execution engine is either MapReduce or Tez (see Hive on Tez: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez; Tez was introduced on Hadoop YARN (v ?)).
Tez (http://incubator.apache.org/projects/tez.html, http://hortonworks.com/hadoop/tez/) is a new application framework built on Hadoop YARN that can execute complex directed acyclic graphs of general data processing tasks. In many ways it can be thought of as a more flexible and powerful successor of the map-reduce framework. It was added in February 2013. Tez is part of the Stinger initiative – making 100x performance improvements at petabyte scale with familiar SQL semantics.
It generalizes map and reduce tasks by exposing interfaces for generic data processing tasks, which consist of a triplet of interfaces: input, output, and processor. These tasks are the vertices in the execution graph. Edges (i.e., data connections between tasks) are first-class citizens in Tez and, together with the input/output interfaces, greatly increase the flexibility of how data is transferred between tasks.
Tez also greatly extends the possible ways in which individual tasks can be linked together; in fact, any arbitrary DAG can be executed directly in Tez.

Another property (parameter) is:
hive.exec.reducers.bytes.per.reducer (default value: 1000000000)
This is the data size per reducer. The default is about 1 GB; that is, if the input size is 10 GB then 10 reducers will be used.
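For example, both properties can be overridden per session from the query window before running a statement; the values and the query below are only illustrative:

set hive.execution.engine=tez;
-- ~500 MB per reducer instead of the ~1 GB default
set hive.exec.reducers.bytes.per.reducer=500000000;
select verb, count(*) from books group by verb;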

11    How can we run HiveQL from other programs – such as Java programs?

The original HiveServer supports only embedded mode (i.e., a local HiveServer) from the command line. The command-line tool for HiveServer2 is Beeline (a JDBC client), which can work with a local as well as a remote server.
HiveServer2 (HS2) is a server interface (introduced in Hive version 0.11) that enables remote clients to execute queries against Hive and retrieve the results.


For your information, HiveServer2 has a Thrift interface (Thrift was developed by Facebook for scalable cross-language services development, with services defined using an Interface Definition Language – IDL), so clients written in many languages can connect to the HiveServer directly, with Thrift taking care of the serialization and deserialization.
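As a small sketch, a Beeline session against HiveServer2 might look like the following; the host and query are hypothetical, and 10000 is the usual HiveServer2 default port:

beeline> !connect jdbc:hive2://localhost:10000/default
0: jdbc:hive2://localhost:10000/default> select count(*) from books;
0: jdbc:hive2://localhost:10000/default> !quit

From a Java program, the same jdbc:hive2:// URL can be used with the Hive JDBC driver through the standard java.sql API.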

12    Miscellaneous

Why are the ACID properties not always sought in distributed and big-data systems?
See the CAP theorem (Brewer's theorem): http://en.wikipedia.org/wiki/CAP_theorem
CAP is often considered a justification for using weaker consistency models. Popular among these is BASE, an acronym for Basically Available, Soft-state, Eventually-consistent services. In summary, the BASE methodology is characterized by high availability for first-tier services, leaving some kind of background cleanup mechanism to resolve any problems created by optimistic actions that later turn out to have violated consistency.

12.1  Hortonworks Sandbox

The Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. It includes many of the most exciting developments from the latest HDP distribution, packaged up in a virtual environment that you can get up and running in 15 minutes! (http://hortonworks.com/products/hortonworks-sandbox/)



12.2  Improving performance of Hive x100 (Stinger)

The Stinger initiative mentioned above combines Tez, the ORCFile format, and other query-engine improvements to target roughly 100x performance gains at petabyte scale while keeping familiar SQL semantics.

Image: Hortonworks.com




