现有一份以下格局的数据:
表现有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门:
id course
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e
编写Hive的HQL语句来实现以下成果:
表中的1表现选修,表中的0表现未选修
id a b c d e f
1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0
解题思路:
1、创立表 导入数据
create database if not exists exercise;
use exercise;
drop table if exists id_course;
create table if not exists id_course(id int, course string) row format delimited fields terminated by ",";
truncate table id_course;
load data local inpath "/home/hadoop/id_course.txt" into table id_course;
select * from id_course;
成果:
1 a
1 b
1 c
1 e
2 a
2 c
2 d
2 f
3 a
3 b
3 c
3 e
2、先求出所有课程的聚集
第一步:
select collect_set(course) as courses from id_course;
成果:
["a","b","c","e","d","f"]
第二步:
set hive.strict.checks.cartesian.product=false;
create table id_courses as select t1.id as id,t1.course as id_courses,t2.course courses
from
( select id as id,collect_set(course) as course from id_course group by id ) t1
join
(select collect_set(course) as course from id_course) t2;
成果:id_courses表
id id_courses courses
1 ["a","b","c","e"] ["a","b","c","e","d","f"]
2 ["a","c","d","f"] ["a","b","c","e","d","f"]
3 ["a","b","c","e"] ["a","b","c","e","d","f"]
3、得出终究成果:
思路:
拿出course字段中的每一个元素在id_courses中进行断定,看是不是存在。
select id,
case when array_contains(id_courses, courses[0]) then 1 else 0 end as a,
case when array_contains(id_courses, courses[1]) then 1 else 0 end as b,
case when array_contains(id_courses, courses[2]) then 1 else 0 end as c,
case when array_contains(id_courses, courses[3]) then 1 else 0 end as d,
case when array_contains(id_courses, courses[4]) then 1 else 0 end as e,
case when array_contains(id_courses, courses[5]) then 1 else 0 end as f
from id_courses;
成果:
1 1 1 1 1 0 0
2 1 0 1 0 1 1
3 1 1 1 1 0 0现有一份以下格局的数据:
表现有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门: