Apache Pig
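- Alternative way to write MapReduce/Tez code
- Uses Pig Latin as its scripting language, with a SQL-like syntax
- Extensible with user-defined functions (UDFs)
- Running Pig
  - From the Grunt CLI
  - By passing a script filename to the pig command
  - From the Ambari/Hue UI
- Running Pig on Tez is usually faster than on MapReduce because Tez executes the whole script as one directed acyclic graph (DAG) of tasks instead of a chain of separate MapReduce jobs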
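A minimal word-count sketch to show what Pig Latin looks like and how the same script can be run in different ways. The file names (wordcount.pig, input.txt) are assumptions for illustration only.

```pig
-- wordcount.pig (hypothetical file name)
-- Run as a script on the default engine:  pig wordcount.pig
-- Run on Tez:                             pig -x tez wordcount.pig
-- Run locally while testing:              pig -x local wordcount.pig
-- Or type the statements one by one at the grunt> prompt.

-- Read each line of the (hypothetical) input file as a single string.
lines   = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- Split every line into words, one word per output tuple.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Collect identical words together, then count each bag.
by_word = GROUP words BY word;
counts  = FOREACH by_word GENERATE group AS word, COUNT(words) AS n;

DUMP counts;
```

Nothing runs until DUMP (or STORE) is reached; Pig plans the whole pipeline first, which is what lets Tez execute it as a single DAG.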
Commands
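- FILTER: keeps only the tuples that match a condition
- DISTINCT: selects unique values (removes duplicate tuples)
- FOREACH/GENERATE: transforms each tuple, which allows mapping data
- MAPREDUCE: runs a native MapReduce jar with explicit mappers and reducers
- STREAM: pipes data through an external program using stdin/stdout
- SAMPLE: takes a random sample from a relation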
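- JOIN BY: joins relations on a key
- COGROUP: groups two or more relations by key, giving one tuple per key with a separate bag for each relation
- GROUP BY: groups tuples that share a key so they can be aggregated
- CROSS: cross join (Cartesian product)
- CUBE: groups the data by every combination of the given dimensions, so aggregates can be computed for each combination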
- ORDER: order by
- RANK: assigns a rank number to each row
- LIMIT: max number of results
- UNION: merges relations
- SPLIT: splits a relation into several relations
- DESCRIBE: describes the columns (schema) of a relation
- EXPLAIN: explains how the query will be executed
- ILLUSTRATE: like EXPLAIN but more detailed, showing each step on example data
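The commands above could be combined roughly as follows. This is a sketch only: the input files, delimiters, and column names (ratings.tsv, movies.psv, userID, movieID, rating, title) are assumptions, not part of these notes.

```pig
-- Hypothetical inputs: tab-separated ratings, pipe-separated movie titles.
ratings = LOAD 'ratings.tsv' AS (userID:int, movieID:int, rating:int);
movies  = LOAD 'movies.psv' USING PigStorage('|')
          AS (movieID:int, title:chararray);

-- FILTER: keep only ratings of 4 or more.
good = FILTER ratings BY rating >= 4;

-- FOREACH/GENERATE: project the two columns needed downstream.
pairs = FOREACH good GENERATE movieID, rating;

-- GROUP BY: one tuple per movieID, holding a bag named "pairs".
by_movie = GROUP pairs BY movieID;
stats = FOREACH by_movie GENERATE group AS movieID,
                                  AVG(pairs.rating) AS avgRating,
                                  COUNT(pairs) AS numRatings;

-- JOIN BY: attach titles; joined fields are prefixed with relation::.
named = JOIN stats BY movieID, movies BY movieID;

-- ORDER and LIMIT: top 10 by average rating.
sorted = ORDER named BY stats::avgRating DESC;
top10  = LIMIT sorted 10;

-- COGROUP: per movieID, one bag of ratings and one bag of movies.
both = COGROUP ratings BY movieID, movies BY movieID;

-- SPLIT and UNION: separate low and high ratings, then merge them again.
SPLIT ratings INTO low IF rating <= 2, high IF rating >= 4;
extremes = UNION low, high;

-- DISTINCT: unique user IDs.
users      = FOREACH ratings GENERATE userID;
uniq_users = DISTINCT users;

-- CUBE: group by every combination of userID and movieID (with nulls
-- standing for "all"), then aggregate the bag, which Pig names "cube".
cubed  = CUBE ratings BY CUBE(userID, movieID);
totals = FOREACH cubed GENERATE FLATTEN(group), COUNT(cube) AS n;

-- Diagnostics: schema, execution plan, then the result itself.
DESCRIBE named;
EXPLAIN top10;
DUMP top10;
```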
Input/Output
- LOAD: read data from a file
- STORE: write data to a file (STORE ratings INTO 'outRatings' USING PigStorage(':');)
- DUMP: print result
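A short sketch of the three statements together, reusing the STORE line from above. The input file name and its columns are assumptions.

```pig
-- Hypothetical comma-separated input with three columns.
ratings = LOAD 'ratings.csv' USING PigStorage(',')
          AS (userID:int, movieID:int, rating:int);

-- DUMP prints to the console, so limit the rows first on large inputs.
preview = LIMIT ratings 5;
DUMP preview;

-- STORE writes a directory of part files, not a single file;
-- the output directory 'outRatings' must not already exist.
STORE ratings INTO 'outRatings' USING PigStorage(':');
```

When USING is left out, both LOAD and STORE fall back to PigStorage with a tab delimiter.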
User Defined Functions
- REGISTER: registers UDFs from jar files
- DEFINE: assigns names (aliases) to UDFs
- IMPORT: imports macros from other Pig scripts
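A sketch of how the three statements might fit together. Everything named here is hypothetical: the jar (myudfs.jar), the UDF class (com.example.pig.ToUpper), the macro file (macros.pig), and the macro top_n.

```pig
-- Assumed contents of macros.pig (hypothetical):
--   DEFINE top_n(rel, key, n) RETURNS out {
--       sorted = ORDER $rel BY $key;
--       $out   = LIMIT sorted $n;
--   };

-- REGISTER: make the classes in the jar available to this script.
REGISTER myudfs.jar;

-- DEFINE: give the fully qualified UDF class a short alias.
DEFINE MyUpper com.example.pig.ToUpper();

-- IMPORT: pull in the macro definitions from another script.
IMPORT 'macros.pig';

names = LOAD 'names.txt' AS (name:chararray);

-- Call the UDF through its alias.
shout = FOREACH names GENERATE MyUpper(name) AS name;

-- Call the imported macro like a function.
first = top_n(shout, name, 3);
DUMP first;
```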
Aggregation functions
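- AVG
- CONCAT (joins values into one string; applied per row rather than aggregating)
- COUNT
- MAX
- MIN
- SIZE (number of elements in a value; applied per row rather than aggregating)
- SUM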
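A sketch of these functions in use, assuming a hypothetical tab-separated ratings file. The aggregates take a bag, so the data is grouped first; GROUP ... ALL puts every row into a single bag.

```pig
-- Hypothetical input file and columns.
ratings = LOAD 'ratings.tsv' AS (userID:int, movieID:int, rating:int);

-- One group containing every row, so the aggregates cover the whole relation.
everything = GROUP ratings ALL;
summary = FOREACH everything GENERATE
              COUNT(ratings)      AS n,
              SUM(ratings.rating) AS total,
              AVG(ratings.rating) AS mean,
              MIN(ratings.rating) AS lowest,
              MAX(ratings.rating) AS highest;
DUMP summary;

-- CONCAT and SIZE operate on the values of each row, not on a group.
labels = FOREACH ratings GENERATE
             CONCAT('movie-', (chararray)movieID) AS label,
             SIZE((chararray)movieID)             AS idLength;
```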
Pig Storage Classes
- PigStorage: set the delimiter (; , |) of the input/output data
- TextLoader: loads one line at a time
- JsonLoader: loads from JSON
- AvroStorage: loads from Avro
- ParquetLoader: loads from Parquet (a column-oriented file format)
- OrcStorage: loads from ORC (a compressed columnar format)
- HBaseStorage: integrates Pig with HBase
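A sketch of how some of these classes appear in LOAD and STORE statements. All paths, the HBase table (users), and its column names are assumptions; ParquetLoader is left out because its class name and jar depend on the Parquet version installed.

```pig
-- PigStorage: delimited text with an explicit delimiter.
csv = LOAD 'data.csv' USING PigStorage(',') AS (id:int, name:chararray);

-- TextLoader: each line becomes one chararray field.
raw = LOAD 'app.log' USING TextLoader() AS (line:chararray);

-- JsonLoader: the schema is passed as a string argument.
js = LOAD 'data.json' USING JsonLoader('id:int, name:chararray');

-- OrcStorage: works for both loading and storing ORC files.
STORE csv INTO 'out_orc' USING OrcStorage();

-- HBaseStorage: read two columns of the "info" family from a
-- hypothetical table; -loadKey true adds the row key as the first field.
users = LOAD 'hbase://users'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'info:name info:email', '-loadKey true')
        AS (id:bytearray, name:chararray, email:chararray);
```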