hive partition

最后更新于:2022-04-01 07:28:22

  #hive partition partition是hive提供的一种机制:用户通过指定一个或多个partition key,决定数据存放方式,进而优化数据的查询 一个表可以指定多个partition key,每个partition在hive中以文件夹的形式存在。 **实例(static partition):** ~~~ 编辑文件:/home/work/data/test3.txt; /home/work/data/test4.txt; $ cat /home/work/data/test3.txt 1,zxm 2,ljz 3,cds 4,mac 5,android 6,symbian 7,wp $ cat /home/work/data/test4.txt 8,zxm 9,ljz 10,cds 11,mac 12,android 13,symbian 14,wp ~~~ 建表: ~~~ hive> create table student_tmp(id INT, name STRING)       > partitioned by(academy STRING, class STRING)       > row format delimited fields terminated by ','; OK Time taken: 6.505 seconds id,name是真实列,partition列academy和class是伪列 ~~~ load数据:(此处直接load数据进partition,在hive 0.6之前的版本,必须先创建好partition,数据才能导入) ~~~ hive> load data local inpath '/home/work/data/test3.txt' into table student_tmp partition(academy='computer', class='034'); Copying data from file:/home/work/data/test3.txt Copying file: file:/home/work/data/test3.txt Loading data to table default.student_tmp partition (academy=computer, class=034) OK Time taken: 0.898 seconds hive>  load data local inpath '/home/work/data/test3.txt' into table student_tmp partition(academy='physics', class='034');  Copying data from file:/home/work/data/test3.txt Copying file: file:/home/work/data/test3.txt Loading data to table default.student_tmp partition (academy=physics, class=034) OK Time taken: 0.256 seconds ~~~ 查看hive文件结构: ~~~ $ hadoop fs -ls  /user/hive/warehouse/student_tmp/ Found 2 items drwxr-xr-x   - work supergroup          0 2012-07-30 18:47 /user/hive/warehouse/student_tmp/academy=computer drwxr-xr-x   - work supergroup          0 2012-07-30 19:00 /user/hive/warehouse/student_tmp/academy=physics $ hadoop fs -ls /user/hive/warehouse/student_tmp/academy=computer Found 1 items drwxr-xr-x   - work supergroup          0 2012-07-30 18:47 /user/hive/warehouse/student_tmp/academy=computer/class=034 ~~~ 查询数据: ~~~ hive> select * from student_tmp where academy='physics'; OK 1       zxm     physics 034 2       ljz     physics 034 3       cds     physics 034 4       mac     physics 034 5       android physics 034 6       symbian physics 034 7       wp      physics 034 Time taken: 0.139 seconds ~~~ 以上是static partition的示例,static partition即由用户指定数据所在的partition,在load数据时,指定partition(academy='computer', class='034'); static partition常适用于使用处理时间作为partition key的例子。 但是,我们也常常会遇到需要向分区表中插入大量数据,并且插入前不清楚数据归宿的partition,此时,我们需要dynamic partition。 使用动态分区需要设置hive.exec.dynamic.partition参数值为true。 可以设置部分列为dynamic partition列,例如:partition(academy='computer', class); 也可以设置所有列为dynamic partition列,例如partition(academy, class); 设置所有列为dynamic partition列时,需要设置hive.exec.dynamic.partition.mode=nonstrict 需要注意的是,主分区为dynamic partition列,而副分区为static partition列是不允许的,例如partition(academy, class=‘034’);是不允许的 **示例(dynamic partition):** ~~~ 建表 hive> create table student(id INT, name STRING)                                                                                 > partitioned by(academy STRING, class STRING)     > row format delimited fields terminated by ','; OK Time taken: 0.393 seconds ~~~ 设置参数 ~~~ hive> set hive.exec.dynamic.partition.mode=nonstrict;  hive> set hive.exec.dynamic.partition=true; ~~~ 导入数据: ~~~ hive> insert overwrite table student partition(academy, class)           > select id,name,academy,class from student_tmp                      > where class='034';                                       Total MapReduce jobs = 2 ......... OK Time taken: 29.616 seconds ~~~ 查询数据: ~~~ hive> select * from student where academy='physics';  OK 1       zxm     physics 034 2       ljz     physics 034 3       cds     physics 034 4       mac     physics 034 5       android physics 034 6       symbian physics 034 7       wp      physics 034 Time taken: 0.165 seconds ~~~ 查看文件: ~~~ $ hadoop fs -ls  /user/hive/warehouse/student/ Found 2 items drwxr-xr-x   - work supergroup          0 2012-07-30 19:22 /user/hive/warehouse/student/academy=computer drwxr-xr-x   - work supergroup          0 2012-07-30 19:22 /user/hive/warehouse/student/academy=physics ~~~ **总结:** hive partition是通过将数据拆分成不同的partition放入不同的文件,从而减少查询操作时数据处理规模的手段。 例如,Hive Select查询中,如果没有建partition,则会扫描整个表内容,这样计算量巨大。如果我们在相应维度做了partition,则处理数据规模可能会大大减少。 附partition相关参数: hive.exec.dynamic.partition(缺省false): 设置为true允许使用dynamic partition hive.exec.dynamic.partition.mode(缺省strick):设置dynamic partition模式(nostrict允许所有partition列都为dynamic partition,strict不允许) hive.exec.max.dynamic.partitions.pernode (缺省100):每一个mapreduce job允许创建的分区的最大数量,如果超过了这个数量就会报错 hive.exec.max.dynamic.partitions (缺省1000):一个dml语句允许创建的所有分区的最大数量 hive.exec.max.created.files (缺省100000):所有的mapreduce job允许创建的文件的最大数量 reference: [Dynamic Partitions](https://cwiki.apache.org/Hive/dynamicpartitions.html) [hive中简单介绍分区表(partition table),含动态分区(dynamic partition)与静态分区(static partition)](http://blog.sina.com.cn/s/blog_6ff05a2c0100tah0.html)
';