Monday, October 12, 2009

Hive vs. Pig

While I was looking at Hive and Pig for processing large amounts of data without having to write MapReduce code, I found that there is no easy way to compare the two against each other without reading into both in greater detail.

In this post I am trying to give you a 10,000ft view of both and compare some of the more prominent and interesting features. The following table - which is discussed below - compares what I deemed to be such features:

Feature                        | Hive              | Pig
-------------------------------|-------------------|----------------
Language                       | SQL-like          | PigLatin
Schemas/Types                  | Yes (explicit)    | Yes (implicit)
Partitions                     | Yes               | No
Server                         | Optional (Thrift) | No
User Defined Functions (UDF)   | Yes (Java)        | Yes (Java)
Custom Serializer/Deserializer | Yes               | Yes
DFS Direct Access              | Yes (implicit)    | Yes (explicit)
Join/Order/Sort                | Yes               | Yes
Shell                          | Yes               | Yes
Streaming                      | Yes               | Yes
Web Interface                  | Yes               | No
JDBC/ODBC                      | Yes (limited)     | No



Let us now look at each of these in a bit more detail.

General Purpose

The question is "What does Hive or Pig solve?". Both - and I think this lucky for us in regards to comparing them - have a very similar goal. They try to ease the complexity of writing MapReduce jobs in a programming language like Java by giving the user a set of tools that they may be more familiar with (more on this below). The raw data is stored in Hadoop's HDFS and can be any format although natively it usually is a TAB separated text file, while internally they also may make use of Hadoop's SequenceFile file format. The idea is to be able to parse the raw data file, for example a web server log file, and use the contained information to slice and dice them into what is needed for business needs. Therefore they provide means to aggregate fields based on specific keys. In the end they both emit the result again in either text or a custom file format. Efforts are also underway to have both use other systems as a source for data, for example HBase.

The features I am comparing were chosen pretty much at random, simply because they stood out to me when I read up on each of the two frameworks. So keep in mind that this is a subjective list.

Language

Hive leans on SQL. But since it can only read already existing files in HDFS, it lacks UPDATE or DELETE support, for example, and focuses primarily on the query part of SQL. Even there it has its own spin on things to better reflect the underlying MapReduce process. Overall it seems that someone familiar with SQL can very quickly learn Hive's version of it and get results fast.
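
To give you an idea, here is a small hypothetical example - table, column, and partition names are made up for illustration - counting status codes in a web server log table:

    -- hypothetical HiveQL: count requests per status code for one day
    SELECT status, COUNT(1) AS cnt
    FROM access_log
    WHERE dt = '2009-10-12'
    GROUP BY status;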

Pig on the other hand looks more like a very simplistic scripting language. As with those (and this is a nearly religious topic), some are more intuitive and some are less so. With PigLatin I was able to see what the samples do, but lacking full knowledge of its syntax I found myself wondering whether I would really be able to get what I needed without too many trial-and-error loops. Sure, Hive's SQL probably needs as many iterations to fully grasp - but there you at least have a greater understanding of what to expect.
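
For comparison, here is a sketch of the same aggregation in PigLatin - again with made-up paths and field names:

    -- hypothetical PigLatin: the same count per status code
    logs    = LOAD '/logs/2009-10-12' AS (host, status, bytes);
    grouped = GROUP logs BY status;
    counts  = FOREACH grouped GENERATE group, COUNT(logs);
    STORE counts INTO '/output/status_counts';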

Schemas/Types

Hive once more uses a specific variation of SQL's Data Definition Language (DDL). The "tables" are defined beforehand and the schema is stored in either a shared or a local database. Any JDBC-compatible database will do, but Hive also comes with a built-in Derby instance to get you started quickly. If the database is local then only you can run specific Hive commands; if you share the database then others can run them too - or they would have to set up their own local database copy. Types are also defined up front; supported types are INT, BIGINT, BOOLEAN, STRING and so on. There are also array types that let you handle specific fields in the raw data files as a group.
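
As a sketch, a table definition for the hypothetical log data could look like this - the names are made up, but the DDL constructs are Hive's:

    -- hypothetical Hive DDL with simple and array types
    CREATE TABLE access_log (
      host   STRING,
      status INT,
      bytes  BIGINT,
      tags   ARRAY<STRING>
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';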

Pig has no such metadata database. Datatypes and schemas are defined within each script. Types, furthermore, are usually determined automatically by their use: if you use a field as an integer, Pig handles it that way. You do have the option, though, to override this with explicit type definitions, again within the script that needs them. Pig has a similar set of types compared to Hive; for example it also has a collection type called a "bag".
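
In Pig the same schema would be declared - or omitted - right in the script; this sketch again uses made-up names:

    -- explicit types at load time
    logs = LOAD '/logs/2009-10-12' AS (host:chararray, status:int, bytes:long);
    -- or no schema at all; fields are then addressed by position
    raw  = LOAD '/logs/2009-10-12';
    st   = FOREACH raw GENERATE $1;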

Partitions

Hive has a notion of partitions, which are basically subdirectories in HDFS. They allow, for example, processing a subset of the data by alphabet or by date. It is up to the user to create these "partitions", as they are neither enforced nor required.
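
A sketch of how such a partition is declared and used - again with hypothetical names; note that the partition column is declared separately from the data columns:

    -- hypothetical partitioned table, one subdirectory per date
    CREATE TABLE access_log (host STRING, status INT)
    PARTITIONED BY (dt STRING);

    -- restricting on the partition column limits the files being read
    SELECT COUNT(1) FROM access_log WHERE dt = '2009-10-12';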

Pig does not seem to have such a feature. It may be that filters can achieve the same but it is not immediately obvious to me.

Server

Hive can start an optional server, which is Thrift based. With the server you can presumably send queries from anywhere to the Hive server, which in turn executes them.
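
For reference, the server is started from the Hive distribution itself - the exact invocation may differ per version, so treat this as an assumption on my part:

    # start the optional Thrift server (listens on port 10000 by default)
    $ bin/hive --service hiveserver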

Pig does not seem to have such a facility yet.

User Defined Functions

Both Hive and Pig allow for user-defined functionality by supplying Java code to the query process. These functions can add whatever feature is needed to crunch the numbers as required.
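
As a minimal sketch of what such a function looks like on the Hive side - the class here is made up, but it follows Hive's UDF contract of extending the UDF base class with an evaluate() method:

    // hypothetical Hive UDF that lower-cases a string
    package example;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public final class Lower extends UDF {
      public Text evaluate(final Text s) {
        if (s == null) return null;  // pass NULL values straight through
        return new Text(s.toString().toLowerCase());
      }
    }

Once packaged into a jar, the class can be registered from the Hive shell via ADD JAR and CREATE TEMPORARY FUNCTION. Pig's equivalent, as far as I can tell, is extending its EvalFunc class and registering the jar with REGISTER in the script.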

Custom Serializer/Deserializer

Again, both Hive and Pig allow for custom Java classes that can read or write any file format required. I also assume that is how they will connect to HBase eventually (just a guess). You can write a parser for Apache log files or, for example, for the binary Tokyo Tyrant Ulog format. The same goes for the output: write a database output class and you can write the results back into a database.

DFS Direct Access

Hive is smart about how to access the raw data. A "select * from table limit 10", for example, does a direct read from the file. If the query is too complicated it will fall back to a full MapReduce run to determine the outcome, just as expected.

With Pig I am not sure whether it does the same to speed up simple PigLatin scripts. At least it does not seem to be mentioned anywhere as an important feature.

Join/Order/Sort

Hive and Pig both support joining, ordering and sorting data dynamically. These features serve the same purpose in both, allowing you to aggregate and sort the result as needed. Pig also has a COGROUP feature that allows you to do OUTER JOINs and so on. I think this is where you will spend most of your time with either package - especially when you start out. But from a cursory look it seems both can do pretty much the same.
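
A quick sketch of a join in both languages, using made-up tables of page views and users:

    -- hypothetical HiveQL join
    SELECT u.name, COUNT(1)
    FROM page_view p JOIN users u ON (p.userid = u.id)
    GROUP BY u.name;

    -- hypothetical PigLatin join, plus the COGROUP variant mentioned above
    joined  = JOIN page_view BY userid, users BY id;
    grouped = COGROUP page_view BY userid, users BY id;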

Shell

Both Hive and Pig have a shell that allows you to query specific things or run the actual queries. Pig also passes on DFS commands such as "cat" to allow you to quickly check what the outcome of a specific PigLatin script was.
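
A few lines of what a session in either shell might look like - the paths and table names are made up:

    hive> SHOW TABLES;
    hive> SELECT * FROM access_log LIMIT 10;

    grunt> cat /output/status_counts
    grunt> ls /logs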

Streaming

Once more, both frameworks seem to provide streaming interfaces so that you can process data with external tools or languages, such as Ruby or Python. How the streaming performs, and whether the two differ here, I do not know. This is for you to tell me :)
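
Sketched out, streaming looks roughly like this in both - the parse.py script is hypothetical:

    -- Hive: pipe columns through an external script via TRANSFORM
    SELECT TRANSFORM (host, status)
    USING 'python parse.py'
    AS (host, category)
    FROM access_log;

    -- Pig: the same idea with STREAM
    DEFINE parser `python parse.py`;
    parsed = STREAM logs THROUGH parser;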

Web Interface

Only Hive has a web interface or UI that can be used to visualize the various schemas and issue queries. This is different from the above-mentioned server, as it is an interactive web UI for a human operator, while the Hive server is meant to be used from another programming or scripting language, for example.

JDBC/ODBC

Another Hive-only feature is the availability of a - again functionally limited - JDBC/ODBC driver. It is another way for programmers to use Hive without having to bother with its shell or web interface, or even the Hive server. Since only a subset of features is available, it will require small adjustments on the programmer's side of things, but otherwise it seems like a nice-to-have feature.
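
A minimal sketch of using that driver from Java - the driver class and URL scheme are the ones the Hive documentation gives as far as I know, but treat the details as assumptions since they may change between versions:

    // hypothetical query via Hive's JDBC driver
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
      public static void main(String[] args) throws Exception {
        // register the driver and connect to a (hypothetical) local server
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery("SELECT * FROM access_log LIMIT 10");
        while (res.next()) {
          System.out.println(res.getString(1));
        }
        con.close();
      }
    }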

Conclusion

Well, it seems to me that both can help you achieve the same goals, while Hive comes more naturally to database developers and Pig to "script kiddies" (just kidding). Hive has more features as far as access choices are concerned. Both projects also reportedly have roughly the same number of committers and are going strong development-wise.

That is it from me. If you have a different opinion or a comment on the above, please feel free to reply below. Over and out!

8 comments:

  1. Hey Lars,

    Good comparison. Some other areas to compare features:

    1) Join optimizations. There are many in both, so I'm not going to list them here.

    2) User-defined aggregate functions. Both systems have them.

    3) Pig has a very nice ILLUSTRATE command that won best paper at SIGMOD 2009. Hive has a TABLESAMPLE command for grabbing samples that is far less sophisticated but usually quite useful. Maybe something about "debugging" or iterative query construction?

    4) Data model. Both Pig and Hive have data models which are slightly different from the relational data model. Hive has maps and lists; Pig has bags and some other stuff.

    5) DDL. Hive has some useful statements like "create table as select ..." (CTAS) and CREATE TABLE LIKE. Pig, since it lacks metadata (until Owl hits trunk), does not.

    6) Columnar storage: Hive has RCFile, Pig has Zebra.

    7) Multi-table insert. Both Hive and Pig have this.

    8) Interface with MapReduce. Hive allows arbitrary streaming scripts to be embedded in HiveQL commands using the TRANSFORM operator. I think Pig allows this as well, though not sure of the syntax.

    9) Types of JOINs supported. Outer, cartesian, etc. Not sure of the comparison here.

    10) Support for indexing. There's been some work in Hive on this problem, e.g. https://issues.apache.org/jira/browse/HIVE-678 and https://issues.apache.org/jira/browse/HIVE-417. I believe Zebra has some notion of index, though I'm not sure.

    11) Other optimizations: predicate pushdown, map-side aggregates, etc. Note that there's some work on a cost-based optimizer in Pig land (http://tr.im/BG1P), while Hive has some initial ideas about statistics gathering: http://issues.apache.org/jira/browse/HIVE-33.

    12) Performance. See https://issues.apache.org/jira/browse/HIVE-396.

    13) Integration with HBase. I thought you would have been all over this one! Both projects have tickets for HBase integration, but I'm not sure where either stands currently. Currently it's easy to load data from Hive into HBase: https://issues.apache.org/jira/browse/HIVE-758, but I don't think you can query data stored in HBase via Hive yet (https://issues.apache.org/jira/browse/HIVE-806). Pig has a resolved ticket for querying data stored in HBase via Pig, but I'm not sure if it still works against trunk: https://issues.apache.org/jira/browse/PIG-6.

    Thanks for starting this discussion!

  2. One last thought: expanding on your schemas/types, it would be good to explore support for common data types like date/time, geo, etc. The Sawzall paper made a big deal about their support for date/time stuff...

  3. Hi Jeff, thank you for the feedback! As far as HBase support is concerned I am certainly very much interested. But I see this as rather a secondary use case myself, as I am using HBase directly. Having these two nice front-end systems available certainly adds to the quality of HBase itself.

    I will read through all the great input you gave, this is a great way to gain more insight into both frameworks. Again, thank you for sharing!

  4. Another feature to add to the list is the handling of skewed data. If your data is not evenly distributed (e.g. across join or sort keys), this can greatly affect the runtime of a query - a few of the tasks can get a much larger share of the processing.
    For example, Pig has a special join mode (skew-join) which users can use to query data whose distribution across the join keys is not even. It samples the data and uses that information to distribute the load evenly. Pig's order-by command similarly samples the data first. (Pig's 'order by' statement does a global sort of the data in a scalable fashion (multiple map/reduce tasks), while Hive's sort-by sorts within each reduce task.)

    Something to keep in mind is that both systems are rapidly evolving and new features keep getting added. For example, SQL will also be supported in Pig, in addition to PigLatin.

  5. I think Pig can support partitioning in exactly the same way that Hive does: HDFS subdirectories. Pig's load function accepts globs, which you can construct to target files anywhere in your HDFS cluster.

  6. I would like to add one more difference:

    In Hive, we have to define the metadata before loading the data.
    In Pig, there is no need to define the metadata up front; we specify it while querying, so it is flexible to change the metadata types.

    Thanks,
    chandra

  7. Very good analysis and summary.

    I look at Hive as being an open-sourced DWS (data warehouse system) for the Hadoop framework.

    Pig / PigLatin is a dataflow language that is geared up for big-data information streams.

    VBR/ Wallis Dudhnath

  8. Nice comparison/discussion.

    Regarding metadata, HCatalog (merged into Hive as of v0.11) can be used to make the metadata defined in the Hive metastore available to Pig. It is helpful for managing your metadata across Pig, Hive and MapReduce.
