Apache Pig
- Alternative way to write MapReduce/Tez code
- Uses Pig Latin as a scripting language with a SQL-like syntax
- Extensible with user-defined functions (UDFs)
- Running Pig
  - From the Grunt CLI
  - From a script file
  - From the Ambari/Hue UI
- Running Pig on Tez is usually faster than on MapReduce because Tez executes the whole script as a single directed acyclic graph (DAG) of tasks instead of a chain of separate MapReduce jobs
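
As a rough sketch of the ways of running Pig listed above, the same few statements can be saved in a script file or typed into Grunt. The file names `ratings.pig` and `u.data` and the column names are assumptions made for illustration; the `-x` execution-mode flag is the standard way to choose between local, MapReduce, and Tez.

```pig
-- ratings.pig (hypothetical script name)
-- Run as a script locally:   pig -x local ratings.pig
-- Run the script on Tez:     pig -x tez ratings.pig
-- Or start Grunt with `pig -x local` and type the statements one by one.

-- Assumed input: tab-separated file with four columns
ratings = LOAD 'u.data' AS (userID:int, movieID:int, rating:int, ratingTime:long);

-- Keep a handful of rows and print them
firstRows = LIMIT ratings 5;
DUMP firstRows;
```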
Commands
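- FILTER: keep only the rows that match a condition
- DISTINCT: select unique rows
- FOREACH/GENERATE: map each row to a new row
- MAPREDUCE: run a native MapReduce jar with explicit mappers and reducers
- STREAM: pipe a relation through an external program using stdin/stdout
- SAMPLE: take a random sample from a relation
- JOIN BY: join relations on a key
- COGROUP: group two or more relations by key, keeping a separate bag of tuples per relation for each key
- GROUP BY: group rows by key, usually as the step before aggregating
- CROSS: cross join (Cartesian product)
- CUBE: compute aggregates for every combination of the given dimensions
- ORDER: order by one or more fields
- RANK: assigns a number to each row
- LIMIT: max number of results
- UNION: merge relations
- SPLIT: split a relation into several relations by condition
- DESCRIBE: describes the columns (schema) of a relation
- EXPLAIN: explains how the query will be executed
- ILLUSTRATE: like EXPLAIN but more detailed, showing sample data at each step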
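
Several of the commands above are typically chained into one pipeline. The sketch below is only an illustration: the input files `u.data` and `movies.txt`, their delimiters, and all column names are assumptions, not something the notes specify.

```pig
-- Assumed inputs (hypothetical files):
--   u.data      tab-separated: userID, movieID, rating, ratingTime
--   movies.txt  pipe-separated: movieID, title
ratings = LOAD 'u.data' AS (userID:int, movieID:int, rating:int, ratingTime:long);
movies  = LOAD 'movies.txt' USING PigStorage('|') AS (movieID:int, title:chararray);

-- FILTER: keep only ratings of 4 or higher
good = FILTER ratings BY rating >= 4;

-- GROUP BY + FOREACH/GENERATE: one row per movie with its aggregates
byMovie = GROUP good BY movieID;
stats = FOREACH byMovie GENERATE group AS movieID,
                                 COUNT(good) AS numRatings,
                                 AVG(good.rating) AS avgRating;

-- JOIN BY: attach the title; :: disambiguates fields after the join
named = JOIN stats BY movieID, movies BY movieID;
titled = FOREACH named GENERATE movies::title AS title,
                                stats::numRatings AS numRatings,
                                stats::avgRating AS avgRating;

-- ORDER + LIMIT: the ten movies with the most good ratings
sorted = ORDER titled BY numRatings DESC;
top10 = LIMIT sorted 10;

-- DESCRIBE prints the schema, DUMP prints the rows
DESCRIBE top10;
DUMP top10;
```

Running `EXPLAIN top10;` or `ILLUSTRATE top10;` in place of the final `DUMP` shows the execution plan or a small worked sample without printing the full result.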
Input/Output
- LOAD: read data from a file
- STORE: write data to a file (STORE ratings INTO 'outRatings' USING PigStorage(':');)
- DUMP: print result
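
A minimal sketch of the three input/output statements together, built around the STORE example above. The input file `u.data` and its columns are assumed for illustration.

```pig
-- LOAD: PigStorage splits on tab by default; the explicit '\t' is only for clarity
ratings = LOAD 'u.data' USING PigStorage('\t')
          AS (userID:int, movieID:int, rating:int, ratingTime:long);

-- STORE: writes part files under the directory 'outRatings', fields separated by ':'
STORE ratings INTO 'outRatings' USING PigStorage(':');

-- DUMP: prints the relation to the console; best kept for small results
DUMP ratings;
```

STORE fails if the output directory already exists, so the directory has to be removed or renamed between runs.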
User Defined Functions
- REGISTER: load UDFs from jar files
- DEFINE: assign names to UDFs
- IMPORT: import macros from other Pig scripts
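
The three statements usually appear at the top of a script. In the sketch below every name (`myudfs.jar`, `com.example.pig.Normalize`, `macros.pig`, `movies.txt`) is hypothetical and stands in for whatever jar, UDF class, and macro file a project actually has.

```pig
-- All jar, class, and file names here are hypothetical placeholders.

-- REGISTER: make the classes in a jar available to the script
REGISTER myudfs.jar;

-- DEFINE: give the fully qualified UDF class a short alias
DEFINE Normalize com.example.pig.Normalize();

-- IMPORT: pull in macros defined in another Pig script
IMPORT 'macros.pig';

-- The alias can then be called like a built-in function
titles = LOAD 'movies.txt' USING PigStorage('|') AS (movieID:int, title:chararray);
clean = FOREACH titles GENERATE movieID, Normalize(title) AS title;
DUMP clean;
```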
Aggregation functions
- AVG
- CONCAT
- COUNT
- MAX
- MIN
- SIZE
- SUM
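
These functions are most often applied to the bag produced by a GROUP. The following sketch assumes the same hypothetical tab-separated `u.data` file with `userID`, `movieID`, `rating`, and `ratingTime` columns. Note that CONCAT and SIZE are not aggregates in the strict sense: CONCAT joins values within a row, and SIZE returns the number of elements in a bag, map, or string.

```pig
ratings = LOAD 'u.data' AS (userID:int, movieID:int, rating:int, ratingTime:long);

-- One group (and one bag of ratings) per user
byUser = GROUP ratings BY userID;

perUser = FOREACH byUser GENERATE
    CONCAT('user-', (chararray)group) AS label,      -- string concatenation
    COUNT(ratings)                    AS numRatings, -- tuples in the bag
    SIZE(ratings)                     AS bagSize,    -- also the bag's tuple count
    SUM(ratings.rating)               AS total,
    AVG(ratings.rating)               AS mean,
    MIN(ratings.rating)               AS lowest,
    MAX(ratings.rating)               AS highest;

DUMP perUser;
```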
Pig Storage Classes
- PigStorage: set the delimiter (; , |) of the input/output data
- TextLoader: loads one line at a time
- JsonLoader: loads from JSON
- AvroStorage: loads from Avro
- ParquetLoader: loads from Parquet (a column-oriented data format)
- OrcStorage: loads from ORC (compressed format)
- HBaseStorage: integrates Pig with HBase
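
Each storage class is selected with a USING clause on LOAD or STORE. The sketch below shows a few of them; all file names, the HBase table `users`, its column family `info`, and every field name are assumptions for illustration.

```pig
-- TextLoader: each input line becomes a single chararray field
lines = LOAD 'server.log' USING TextLoader() AS (line:chararray);

-- PigStorage: delimited text with a chosen separator
ratings = LOAD 'ratings.csv' USING PigStorage(',')
          AS (userID:int, movieID:int, rating:int);

-- JsonLoader: the schema is passed as a string
fromJson = LOAD 'ratings.json'
           USING JsonLoader('userID:int, movieID:int, rating:int');

-- AvroStorage: the schema is read from the Avro file itself
fromAvro = LOAD 'ratings.avro' USING AvroStorage();

-- OrcStorage: write the relation out in ORC format
STORE ratings INTO 'ratingsOrc' USING OrcStorage();

-- HBaseStorage: list the columns to read; -loadKey adds the row key as the first field
users = LOAD 'hbase://users'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'info:age info:gender', '-loadKey true')
        AS (userID:bytearray, age:chararray, gender:chararray);
```

ParquetLoader is left out of the sketch because it normally ships in a separate parquet-pig jar that has to be registered first, and its package name depends on the Parquet version in use.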