hive之student – 源码集中营

现有一份以下格局的数据：
表现有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门：

id course 
1,a 
1,b 
1,c 
1,e 
2,a 
2,c 
2,d 
2,f 
3,a 
3,b 
3,c 
3,e


编写Hive的HQL语句来实现以下成果：
表中的1表现选修，表中的0表现未选修

id    a    b    c    d    e    f
1    1    1    1    0    1    0
2    1    0    1    1    0    1
3    1    1    1    0    1    0


解题思路：

1、创立表  导入数据

create database if not exists exercise;
use exercise;
drop table if exists id_course;
create table if not exists id_course(id int, course string) row format delimited fields terminated by ",";
truncate table id_course;
load data local inpath "/home/hadoop/id_course.txt" into table id_course;
select * from id_course;


成果：
1       a
1       b
1       c
1       e
2       a
2       c
2       d
2       f
3       a
3       b
3       c
3       e


2、先求出所有课程的聚集

第一步：
select collect_set(course) as courses from id_course;

成果：

["a","b","c","e","d","f"]

第二步：
set hive.strict.checks.cartesian.product=false;

create table id_courses as select t1.id as id,t1.course as id_courses,t2.course courses 
from 
( select id as id,collect_set(course) as course from id_course group by id ) t1 
join 
(select collect_set(course) as course from id_course) t2;


成果：id_courses表
id    id_courses            courses
1       ["a","b","c","e"]       ["a","b","c","e","d","f"]
2       ["a","c","d","f"]       ["a","b","c","e","d","f"]
3       ["a","b","c","e"]       ["a","b","c","e","d","f"]


3、得出终究成果：
思路：
拿出course字段中的每一个元素在id_courses中进行断定，看是不是存在。

select id,
case when array_contains(id_courses, courses[0]) then 1 else 0 end as a,
case when array_contains(id_courses, courses[1]) then 1 else 0 end as b,
case when array_contains(id_courses, courses[2]) then 1 else 0 end as c,
case when array_contains(id_courses, courses[3]) then 1 else 0 end as d,
case when array_contains(id_courses, courses[4]) then 1 else 0 end as e,
case when array_contains(id_courses, courses[5]) then 1 else 0 end as f 
from id_courses;

成果：
1       1       1       1       1       0       0
2       1       0       1       0       1       1
3       1       1       1       1       0       0现有一份以下格局的数据：
表现有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门：