How to read N lines at a time in Python

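I am writing code to take an enormous text file (several GB) N lines at a time, process that batch, and move on to the next N lines until I have completed the entire file. (I don't care if the last batch isn't the perfect size.) I have been reading about using itertools islice for this operation. I think I am halfway there: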

from itertools import islice
N = 16
infile = open("my_very_large_text_file", "r")
lines_gen = islice(infile, N)

for lines in lines_gen:
    pass  # ...process my lines...

The trouble is that I would like to process the next batch of 16 lines, but I am missing something.

6 Answers

翰冒绢县

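islice() can be used to get the next n items of an iterator. Thus, list(islice(f, n)) returns a list of the next n lines of the file f. Using this inside a loop gives you the file in chunks of n lines. At the end of the file the list may be shorter, and the final call returns an empty list.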

from itertools import islice
with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines
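
The loop above can be wrapped in a small generator so the batching logic is reusable. This is only a minimal sketch, not part of the original answer: the name read_batches is hypothetical, and io.StringIO is assumed here as a stand-in for a real open file.

```python
from io import StringIO
from itertools import islice


def read_batches(lines, n):
    """Yield successive lists of up to n items from the iterable `lines`."""
    lines = iter(lines)  # islice must keep advancing one shared iterator
    while True:
        batch = list(islice(lines, n))
        if not batch:  # an empty list means the input is exhausted
            return
        yield batch


# StringIO stands in for an open text file (hypothetical data).
fake_file = StringIO("".join("line %d\n" % i for i in range(5)))
batches = list(read_batches(fake_file, 2))
print([len(b) for b in batches])
```

With five input lines and n = 2, the last batch holds the single leftover line, which matches the "shorter final list" behaviour described above.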

An alternative is to use the grouper pattern:

from itertools import zip_longest

with open(...) as f:
    for next_n_lines in zip_longest(*[f] * n):
        # process next_n_lines (the last tuple is padded with None)
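
One detail of the grouper pattern is easy to miss: the last chunk is padded rather than shortened. The sketch below illustrates this, assuming Python 3 (where the function is named zip_longest) and using io.StringIO with made-up data in place of a real file.

```python
from io import StringIO
from itertools import zip_longest

n = 3
# StringIO stands in for an open text file (hypothetical data).
fake_file = StringIO("a\nb\nc\nd\n")

# [fake_file] * n is n references to the same iterator, so each tuple
# pulls n consecutive lines; the last tuple is padded with None.
chunks = list(zip_longest(*[fake_file] * n))
print(chunks)

# Drop the padding before processing a chunk.
cleaned = [[line for line in chunk if line is not None] for chunk in chunks]
print(cleaned)
```

If None could be a legitimate item in the input, a unique sentinel object passed as fillvalue would be the safer choice.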

癸痊醒

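The question appears to presume that there is efficiency to be gained by reading an "enormous text file" in blocks of N lines at a time. This adds an application layer of buffering on top of the already highly optimized stdio library, adds complexity, and probably buys you absolutely nothing. Thus: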

with open('my_very_large_text_file') as f:
    for line in f:
        process(line)

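is probably superior to any alternative with respect to time, space, complexity and readability. See also Rob Pike's first two rules, Jackson's Two Rules, and PEP-20 The Zen of Python. If you really just wanted to play with islice, you should have left out the large file stuff.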

凄挡

Here is another way, using groupby:

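from itertools import count, groupby

N = 16
with open('test') as f:
    for g, group in groupby(f, key=lambda _, c=count(): next(c) // N):
        print(list(group))

How it works: groupby() groups the lines by the return value of the key function, here the lambda function lambda _, c=count(): next(c) // N. It relies on the fact that the default argument c is bound to count() once, when the function is defined. Each time groupby() calls the lambda it therefore gets the next counter value, and the integer-divided result determines which group the line falls into:

# 1st iteration.
next(c) => 0
0 // 16 => 0
# 2nd iteration.
next(c) => 1
1 // 16 => 0
...
# Start of the second group.
next(c) => 16
16 // 16 => 1
...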
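
The same grouping can be written without a counter hidden in a default argument by pairing each line with its index via enumerate(). This is a sketch of a variation, not the answer's own code; io.StringIO and the sample data are assumptions for illustration.

```python
from io import StringIO
from itertools import groupby

N = 2
# StringIO stands in for an open text file (hypothetical data).
fake_file = StringIO("a\nb\nc\nd\ne\n")

batches = []
# enumerate() yields (index, line); index // N is constant within a batch,
# so groupby() starts a new group every N lines.
for _, group in groupby(enumerate(fake_file), key=lambda pair: pair[0] // N):
    batches.append([line for index, line in group])
print(batches)
```

Each group must be consumed before advancing to the next one, because groupby() shares the underlying iterator between groups.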

徘廷

Since the requirement was added that the lines selected from the file be statistically uniformly distributed, I offer this simple approach.

"""randsamp - extract a random subset of n lines from a large file"""

import random

def scan_linepos(path):
    """return a list of seek offsets of the beginning of each line"""
    linepos = []
    offset = 0
    # binary mode, so that len(line) is a byte count usable with seek()
    with open(path, 'rb') as inf:
        # WARNING: tell() is unreliable while iterating (inaccurate in
        # CPython 2.7, disabled for text files in Python 3)
        for line in inf:
            linepos.append(offset)
            offset += len(line)
    return linepos

def sample_lines(path, linepos, nsamp):
    """return nsamp lines from path where line offsets are in linepos"""
    offsets = random.sample(linepos, nsamp)
    offsets.sort()  # this may make file reads more efficient

    lines = []
    with open(path, 'rb') as inf:
        for offset in offsets:
            inf.seek(offset)
            lines.append(inf.readline().decode())
    return lines

dataset = 'big_data.txt'
nsamp = 5
linepos = scan_linepos(dataset) # the scan only need be done once

lines = sample_lines(dataset, linepos, nsamp)
print('selecting %d lines from a file of %d' % (nsamp, len(linepos)))
print(''.join(lines))

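I tested it on a mock data file of 3 million lines occupying 1.7GB on disk. scan_linepos dominated the runtime, taking about 20 seconds on my not-so-hot desktop. To check the performance of sample_lines, I used the timeit module: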

import timeit
t = timeit.Timer('sample_lines(dataset, linepos, nsamp)',
        'from __main__ import sample_lines, dataset, linepos, nsamp')
trials = 10 ** 4
elapsed = t.timeit(number=trials)
print('%dk trials in %.2f seconds, %.2fµs per trial' % (trials // 1000,
        elapsed, (elapsed / trials) * (10 ** 6)))

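I ran this for various values of nsamp. With nsamp at 100, a single sample_lines call completed in 460µs, and the time scaled linearly up to 10k samples at 47ms per call. The natural next question is "is random really random at all?", and the answer is "sub-cryptographic, but certainly fine for bioinformatics".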

黎喊病

Using the chunker function from "What is the most "pythonic" way to iterate over a list in chunks?":

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


with open(filename) as f:
    for lines in grouper(f, chunk_size, ""): #for every chunk_sized chunk
        """process lines like 
        lines[0], lines[1] , ... , lines[chunk_size-1]"""
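
On Python 3.12 and later, the standard library offers itertools.batched, which returns a shorter final chunk instead of padding it as grouper does. The sketch below is an assumption-laden illustration rather than part of the original answer: it falls back to a hand-written equivalent on older interpreters and uses io.StringIO with made-up data in place of a file.

```python
from io import StringIO
from itertools import islice

try:
    from itertools import batched  # available in Python 3.12 and later
except ImportError:
    def batched(iterable, n):
        """Fallback sketch: yield tuples of up to n items."""
        it = iter(iterable)
        while True:
            chunk = tuple(islice(it, n))
            if not chunk:
                return
            yield chunk

# StringIO stands in for an open text file (hypothetical data).
fake_file = StringIO("a\nb\nc\nd\ne\n")
chunks = list(batched(fake_file, 2))
print(chunks)
```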

篮肥炼皖

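Assuming "batch" means processing all 16 records at once rather than individually: read the file one record at a time and update a counter. When the counter reaches 16, process that group.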

interim_list = []
infile = open("my_very_large_text_file", "r")
ctr = 0
for rec in infile:
    interim_list.append(rec)
    ctr += 1
    if ctr > 15:
        process_list(interim_list)
        interim_list = []
        ctr = 0

# the final group (may hold fewer than 16 records; skipped if empty)
if interim_list:
    process_list(interim_list)
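
The counter approach can also be packaged as a function that takes the group size and the handler as parameters. This is a minimal sketch under assumptions: process_in_groups and its parameters are hypothetical names, and len(group) replaces the separate counter.

```python
def process_in_groups(records, size, handler):
    """Call handler on each full group, then on any leftover records."""
    group = []
    for rec in records:
        group.append(rec)
        if len(group) == size:
            handler(group)
            group = []  # start a fresh list; the handled one is untouched
    if group:  # final, possibly shorter, group
        handler(group)


seen = []
process_in_groups(range(5), 2, seen.append)
print(seen)
```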
