
Cloudera Impala: how does it read data from HDFS blocks?


I have a basic question about Impala. We know that Impala lets you query data stored in HDFS. Now, suppose a file is split into multiple blocks, and a line of text is spread across two blocks. In Hive/MapReduce, the RecordReader takes care of this (sketched below).

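For context on the Hive/MapReduce side, here is a minimal sketch (my own illustration, not Hadoop's actual LineRecordReader code) of the convention that makes split-spanning lines work: every split except the first skips its leading partial line, and every split reads past its own end to finish the last line it started, so each line is emitted exactly once.

def read_lines_for_split(data: bytes, start: int, length: int):
    """Yield the complete lines 'owned' by the byte range [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Back up one byte and skip through the first newline found from there.
        # If the split begins exactly at a line start, that newline is the one
        # just before 'start', so nothing is lost; otherwise the partial leading
        # line is left for the previous split to finish.
        nl = data.find(b"\n", start - 1)
        if nl == -1:
            return
        pos = nl + 1
    while pos < end:                        # stop once we pass the split boundary...
        nl = data.find(b"\n", pos)          # ...but the last line may extend beyond it
        if nl == -1:
            if pos < len(data):
                yield data[pos:]
            return
        yield data[pos:nl]
        pos = nl + 1

if __name__ == "__main__":
    text = b"alpha\nbravo charlie\ndelta\n"
    # Two artificial 13-byte "splits"; the line "bravo charlie" straddles them.
    for off in (0, 13):
        print(off, [line.decode() for line in read_lines_for_split(text, off, 13)])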

How does Impala read the record in such a scenario?


1 solution

#1


Referencing my answer on the Impala user list:


When Impala finds an incomplete record (e.g., this can happen when scanning certain file formats such as text or RC files), it will continue to read incrementally from the next block(s) until it has read the entire record. Note that this may require small amounts of 'remote reads' (reading from a remote datanode), but this is usually a very small amount compared to the entire block, which should have been read locally (ideally via a short-circuit read).

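To make the described behavior concrete, here is a minimal sketch (my own illustration with simulated in-memory "blocks", not Impala's real C++ scanner): the scanner for one block emits the records it owns and, when the last record is cut off at the block boundary, incrementally reads small chunks from the following block, which in a real cluster may be a remote read, until the record is complete.

BLOCKS = [b"id1,foo\nid2,bar ba", b"z\nid3,qux\n"]    # "id2,bar baz" straddles the blocks

def read_range(block_index: int, offset: int, size: int) -> bytes:
    """Stand-in for an HDFS read: local (ideally short-circuit) for the scanner's
    own block, remote when the bytes live on another datanode."""
    return BLOCKS[block_index][offset:offset + size]

def scan_text_block(i: int, chunk: int = 4):
    """Yield the newline-delimited records owned by block i."""
    data = read_range(i, 0, len(BLOCKS[i]))            # the scanner's own block
    start = 0
    # If the previous block did not end on a delimiter, our leading bytes continue
    # a record started there; that block's scanner will emit it, so skip past it.
    if i != 0 and read_range(i - 1, len(BLOCKS[i - 1]) - 1, 1) != b"\n":
        nl = data.find(b"\n")
        start = len(data) if nl == -1 else nl + 1
    records = data[start:].split(b"\n")
    tail = records.pop()                               # possibly an incomplete last record
    yield from (r for r in records if r)
    if tail:
        # Incomplete record at the block boundary: incrementally read small chunks
        # from the following block(s) -- a remote read in a real cluster -- until
        # the delimiter shows up, then stop.
        j, off = i + 1, 0
        while j < len(BLOCKS):
            extra = read_range(j, off, chunk)
            if not extra:
                j, off = j + 1, 0
                continue
            nl = extra.find(b"\n")
            if nl != -1:
                yield tail + extra[:nl]
                return
            tail += extra
            off += chunk
        yield tail                                     # end of file with no delimiter

if __name__ == "__main__":
    for i in range(len(BLOCKS)):
        print("block", i, [r.decode() for r in scan_text_block(i)])

The point is only the control flow: the bulk of each block is scanned locally, and the cross-block read is bounded by the size of a single record rather than a whole block.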

