How can I access each entry of a CompressedSource in Google Cloud Dataflow and get the byte[] of each subfile?

I have a compressed file on Google Storage: a gzip archive made up of multiple text files. I need to access each subfile and do some operations on it, such as regular expression matching. I can do the same thing on my local computer like this:


import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;

public static void untarFile(String filepath) throws IOException {
  try (FileInputStream fin = new FileInputStream(filepath);
       BufferedInputStream in = new BufferedInputStream(fin);
       GzipCompressorInputStream gzIn = new GzipCompressorInputStream(in);
       TarArchiveInputStream tarInput = new TarArchiveInputStream(gzIn)) {
    TarArchiveEntry entry;
    while ((entry = tarInput.getNextTarEntry()) != null) {
      // Read the whole entry; a single read() may return fewer bytes than requested.
      byte[] fileContent = new byte[(int) entry.getSize()];
      IOUtils.readFully(tarInput, fileContent);
    }
  }
}

Therefore, I can do other operations on fileContent, which is a byte[]. So I used CompressedSource on Google Cloud Dataflow, referring to its test code. It seems that I can only get individual bytes from the file instead of the whole byte[] of each subfile, so I am wondering whether there is any way to do this on Google Cloud Dataflow.


1 solution

#1
TextIO does not support this directly, but you can create a new subclass of FileBasedSource to do this. You'll want to override isSplittable() to always return false, and then have readNextRecord() just read the entire file.

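To make that concrete, here is a minimal sketch of such a source (not the answerer's code), assuming the Dataflow SDK 1.x FileBasedSource API; the names WholeFileSource and WholeFileReader are made up for illustration. The source refuses to split, and its reader drains the whole channel into a single byte[] record:

import com.google.cloud.dataflow.sdk.coders.ByteArrayCoder;
import com.google.cloud.dataflow.sdk.coders.Coder;
import com.google.cloud.dataflow.sdk.io.FileBasedSource;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.util.NoSuchElementException;

/** Emits the complete contents of each matched file as one byte[] record. */
class WholeFileSource extends FileBasedSource<byte[]> {

  public WholeFileSource(String fileOrPatternSpec) {
    super(fileOrPatternSpec, 1L /* minBundleSize; irrelevant, the source is never split */);
  }

  private WholeFileSource(String fileName, long start, long end) {
    super(fileName, 1L, start, end);
  }

  @Override
  protected boolean isSplittable() {
    return false; // the whole file is one record, so it must never be split
  }

  @Override
  public FileBasedSource<byte[]> createForSubrangeOfFile(String fileName, long start, long end) {
    // Only ever called with the full range, because isSplittable() returns false.
    return new WholeFileSource(fileName, start, end);
  }

  @Override
  public FileBasedReader<byte[]> createSingleFileReader(PipelineOptions options) {
    return new WholeFileReader(this);
  }

  @Override
  public Coder<byte[]> getDefaultOutputCoder() {
    return ByteArrayCoder.of();
  }

  private static class WholeFileReader extends FileBasedReader<byte[]> {
    private ReadableByteChannel channel;
    private byte[] current;

    WholeFileReader(WholeFileSource source) {
      super(source);
    }

    @Override
    protected void startReading(ReadableByteChannel channel) {
      this.channel = channel;
    }

    @Override
    protected boolean readNextRecord() throws IOException {
      if (current != null) {
        return false; // the single record has already been produced
      }
      // Drain the channel into memory; fine as long as each file fits in memory.
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
      while (channel.read(buf) >= 0) {
        buf.flip();
        out.write(buf.array(), 0, buf.limit());
        buf.clear();
      }
      current = out.toByteArray();
      return true;
    }

    @Override
    protected long getCurrentOffset() {
      return 0; // there is only one record, at the start of the file
    }

    @Override
    public byte[] getCurrent() throws NoSuchElementException {
      if (current == null) {
        throw new NoSuchElementException();
      }
      return current;
    }
  }
}

You could then read with something like Read.from(new WholeFileSource("gs://my-bucket/*.tar.gz")) (an example path) to get a PCollection<byte[]>, and run the GzipCompressorInputStream/TarArchiveInputStream loop from the question inside a DoFn on each element.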

