COLLECT_SET() in Hive (Hadoop)

I just learned about the collect_set() function in Hive and kicked off a job on a 3-node development cluster. I only have about 10 GB to process, yet the job is taking literally forever. I think there is either a bug in the collect_set() implementation, a bug in my code, or collect_set() is genuinely resource-intensive. Here is my SQL on Hive (no pun intended):
INSERT OVERWRITE TABLE sequence_result_1
SELECT sess.session_key as session_key,
       sess.remote_address as remote_address,
       sess.hit_count as hit_count,
       COLLECT_SET(evt.event_id) as event_set,
       hit.rsp_timestamp as hit_timestamp,
       sess.site_link as site_link
    FROM site_session sess 
        JOIN (SELECT * FROM site_event 
                WHERE event_id = 274 OR event_id = 284 OR event_id = 55 OR event_id = 151) evt 
            ON (sess.session_key = evt.session_key)
        JOIN site_hit hit ON (sess.session_key = evt.session_key)
GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_link
ORDER BY hit_timestamp;
There are 4 MR passes. The first took about 30 seconds. The second map took about 1 minute, and most of the second reduce finished in about 2 minutes. Over the last two hours, though, it has crept from 97.71% to 97.73%. Can that be right? There must be something wrong. I looked at the log, but I can't tell whether it looks normal. [Log sample]
2011-06-21 16:32:22,715 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 120894
2011-06-21 16:32:22,758 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size = 108804
2011-06-21 16:32:23,003 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5142000000 rows
2011-06-21 16:32:23,003 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5142000000 rows
2011-06-21 16:32:24,138 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5143000000 rows
2011-06-21 16:32:24,138 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5143000000 rows
2011-06-21 16:32:24,725 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 120894
2011-06-21 16:32:24,768 INFO org.apache.hadoop.hive.ql.exec.GroupByOperator: 6 forwarding 42000000 rows
2011-06-21 16:32:24,771 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size = 108804
2011-06-21 16:32:25,338 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5144000000 rows
2011-06-21 16:32:25,338 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5144000000 rows
2011-06-21 16:32:26,467 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5145000000 rows
2011-06-21 16:32:26,468 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5145000000 rows
I'm fairly new at this, and trying to use collect_set() and Hive arrays has taken me in over my head. Thanks in advance :)
Epic fail. My solution is below. COLLECT_SET was fine after all; it was simply trying to collect all the items, and there were infinitely many of them. Why? Because I joined on something that wasn't even part of the set. The second JOIN used to have the same ON condition as the first; it now correctly joins on hit.session_key:
INSERT OVERWRITE TABLE sequence_result_1
SELECT sess.session_key as session_key,
       sess.remote_address as remote_address,
       sess.hit_count as hit_count,
       COLLECT_SET(evt.event_id) as event_set,
       hit.rsp_timestamp as hit_timestamp,
       sess.site_link as site_link
    FROM site_session sess 
        JOIN site_event evt ON (sess.session_key = evt.session_key)
        JOIN site_hit hit   ON (sess.session_key = hit.session_key)
    WHERE evt.event_id IN(274,284,55,151)
GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_link
ORDER BY hit_timestamp;
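For anyone hitting the same symptom, a quick sanity check is to count how many rows the join actually emits per session before aggregating. A minimal diagnostic sketch, reusing the table and column names from the post:
-- Hypothetical sanity check: top 10 sessions by number of joined rows.
-- With the old ON clause, hit was unconstrained, so every session fanned
-- out across the entire site_hit table; with the fixed join the counts
-- should stay bounded.
SELECT sess.session_key, COUNT(*) as joined_rows
FROM site_session sess
    JOIN site_event evt ON (sess.session_key = evt.session_key)
    JOIN site_hit hit   ON (sess.session_key = hit.session_key)
WHERE evt.event_id IN (274, 284, 55, 151)
GROUP BY sess.session_key
ORDER BY joined_rows DESC
LIMIT 10;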
    
The first thing I would try is to get rid of the sub-select and join directly against site_event, then move the event_id filter out to the WHERE clause and change it to an IN(). So something like this:
SELECT sess.session_key as session_key,
   sess.remote_address as remote_address,
   sess.hit_count as hit_count,
   COLLECT_SET(evt.event_id) as event_set,
   hit.rsp_timestamp as hit_timestamp,
   sess.site_link as site_link
FROM site_session sess 
    JOIN site_event evt ON (sess.session_key = evt.session_key)
    JOIN site_hit hit ON (sess.session_key = evt.session_key)
WHERE evt.event_id in(274,284,55,151)
GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_link
ORDER BY hit_timestamp;
Also, I don't know the size of each table, but in Hive you generally want to keep the largest table (usually your fact table) on the right-hand side of a join to reduce memory usage. The reason is that Hive attempts to hold the left-hand side of a join in memory and streams the right-hand side to complete the join.
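If rearranging the FROM clause is awkward, Hive also lets you name the table to stream explicitly with a STREAMTABLE hint. A minimal sketch, assuming site_hit is the largest table here:
-- Hypothetical variant: the /*+ STREAMTABLE(...) */ hint tells Hive which
-- table to stream during the join instead of buffering it in memory.
SELECT /*+ STREAMTABLE(hit) */
       sess.session_key,
       COLLECT_SET(evt.event_id) as event_set
FROM site_session sess
    JOIN site_event evt ON (sess.session_key = evt.session_key)
    JOIN site_hit hit ON (sess.session_key = hit.session_key)
WHERE evt.event_id IN (274, 284, 55, 151)
GROUP BY sess.session_key;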
My guess as to what is happening is that it generates the COLLECT_SET() for every row returned, so for every row you return, it returns the entire array produced by COLLECT_SET. That can be taxing and take a long time. Check the performance with COLLECT_SET taken out of the query. If that is fast enough, push the computation of COLLECT_SET into a subquery and then use that column instead of computing it where you are now. I haven't used COLLECT_SET or done any testing; based on your post, that is simply what I would suspect first.
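A rough sketch of that restructuring, assuming the set only needs to be built once per session_key (names reused from the question, with the corrected hit join from the accepted answer):
-- Hypothetical restructuring: build the event set once per session in a
-- subquery, then join the pre-aggregated array to the other tables. The
-- outer query no longer needs COLLECT_SET or a GROUP BY.
SELECT sess.session_key as session_key,
       sess.remote_address as remote_address,
       sess.hit_count as hit_count,
       es.event_set as event_set,
       hit.rsp_timestamp as hit_timestamp,
       sess.site_link as site_link
FROM site_session sess
    JOIN (SELECT session_key, COLLECT_SET(event_id) as event_set
            FROM site_event
            WHERE event_id IN (274, 284, 55, 151)
            GROUP BY session_key) es
        ON (sess.session_key = es.session_key)
    JOIN site_hit hit ON (sess.session_key = hit.session_key)
ORDER BY hit_timestamp;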
