Reading a very large CSV file

맨날치킨 2023. 1. 27. 09:05

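I am collecting problems that are frequently searched for and posted on Stack Overflow, together with problems I looked up during my own development work, that I expect to look up again later.

For each question I have summarized two answers: the highest-scored solution, which is the one you tend to check first on Stack Overflow, and the most recently updated solution (with at least a minimum score), which may be more helpful at the present time.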

Reading a huge .csv file

Question

I'm currently trying to read data from .csv files in Python 2.7 with up to 1 million rows and 200 columns (files range from 100 MB to 1.6 GB). I can do this (very slowly) for files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:

def getdata(filename, criteria):
    data=[]
    for criterion in criteria:
        data.append(getstuff(filename, criterion))
    return data

def getstuff(filename, criterion):
    import csv
    data=[]
    with open(filename, "rb") as csvfile:
        datareader=csv.reader(csvfile)
        for row in datareader: 
            if row[3]=="column header":
                data.append(row)
            elif len(data)<2 and row[3]!=criterion:
                pass
            elif row[3]==criterion:
                data.append(row)
            else:
                return data

The reason for the else clause in the getstuff function is that all the elements that fit the criterion are listed together in the csv file, so I leave the loop once I get past them to save time.

My questions are:

1. How can I get this to work with the bigger files?
2. Is there any way I can make it faster?

My computer has 8 GB of RAM, runs 64-bit Windows 7, and has a 3.40 GHz processor (I'm not certain what information you need).

Highest-scored solution

You are reading all rows into a list, then processing that list. Don't do that.

Process your rows as you produce them. If you need to filter the data first, use a generator function:

import csv

def getstuff(filename, criterion):
    # "rb" is the Python 2 idiom; on Python 3 use open(filename, newline="")
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                # done when having read a consecutive series of rows 
                return

I also simplified your filter test; the logic is the same but more concise.
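As a minimal sketch of the generator approach on Python 3, the snippet below runs the same kind of filter against a small in-memory CSV. The helper name `iter_matching_rows`, its `column` parameter, and the sample data are assumptions made for illustration and are not part of the original answer; a real file would be opened with `open(filename, newline="")` rather than wrapped in `io.StringIO`.

```python
import csv
import io


def iter_matching_rows(csvfile, criterion, column=3):
    """Yield the header row, then the first consecutive run of rows whose
    value in `column` equals `criterion` (hypothetical helper)."""
    datareader = csv.reader(csvfile)
    yield next(datareader)  # header row
    seen_match = False
    for row in datareader:
        if row[column] == criterion:
            seen_match = True
            yield row
        elif seen_match:
            # the matching run has ended, so stop reading the file
            return


# Made-up sample data; the fourth column (index 3) holds the criterion.
SAMPLE = "a,b,c,kind\n1,2,3,x\n4,5,6,y\n7,8,9,y\n10,11,12,z\n"

rows = list(iter_matching_rows(io.StringIO(SAMPLE), "y"))
print(rows)
```

Calling `list()` here is only for display on a tiny sample; with a large file you would loop over the generator instead so that rows are never collected.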

Because you are only matching a single sequence of rows that match the criterion, you could also use:

import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    # `yield from` needs Python 3, where csv files are opened in text mode
    # with newline=""; on Python 2 open the file with "rb" instead.
    with open(filename, newline="") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        # skip rows until the first match, yield it plus any subsequent
        # rows that match, then stop reading altogether
        # Python 2: use `for row in takewhile(...): yield row`
        # instead of `yield from takewhile(...)`.
        yield from takewhile(
            lambda r: r[3] == criterion,
            dropwhile(lambda r: r[3] != criterion, datareader))
        return
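To see how `dropwhile` and `takewhile` combine, here is a small sketch on an in-memory list instead of a CSV reader. The sample tuples are made up for illustration; the second element plays the role of `row[3]`.

```python
from itertools import dropwhile, takewhile

# Made-up rows: (row id, criterion value)
rows = [("1", "a"), ("2", "b"), ("3", "b"), ("4", "c"), ("5", "b")]

# dropwhile skips the leading rows that do not match "b" ...
after_skip = dropwhile(lambda r: r[1] != "b", rows)
# ... and takewhile stops at the first row that no longer matches.
first_run = list(takewhile(lambda r: r[1] == "b", after_skip))
print(first_run)
```

Only the first consecutive run is returned: row "5" also matches but is never reached, which is the same early-exit behaviour as the `else` branch in the question.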

You can now loop over getstuff() directly. Do the same in getdata():

def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row

Now loop directly over getdata() in your code:

for row in getdata(somefilename, sequence_of_criteria):
    pass  # process row here

You now only hold one row in memory, instead of your thousands of lines per criterion.
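One way to benefit from this is to fold each row into a running result as it arrives, so nothing but the current row and the totals stays in memory. The following is a sketch under assumptions: the helper `iter_rows`, the column layout, and the sample data are hypothetical and only stand in for a real file.

```python
import csv
import io


def iter_rows(csvfile):
    """Yield data rows one at a time, skipping the header (hypothetical helper)."""
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row
    for row in reader:
        yield row


# Made-up sample data standing in for a large file.
SAMPLE = "name,amount\nspam,1.5\neggs,2.0\nham,4.5\n"

total = 0.0
count = 0
for row in iter_rows(io.StringIO(SAMPLE)):
    # Only the current row and the running totals are held in memory.
    total += float(row[1])
    count += 1

print(count, total)
```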

yield makes a function a generator function, which means it won't do any work until you start looping over it.
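The lazy behaviour can be observed directly. In this small sketch (the function `numbers` and the `log` list are made up for illustration), calling the generator function runs none of its body; the body starts only when the first value is requested.

```python
log = []


def numbers():
    log.append("started")  # runs only once iteration begins
    yield 1
    yield 2


gen = numbers()              # creates the generator; the body has not run yet
log_after_call = list(log)   # still empty at this point

first = next(gen)            # now the body runs up to the first yield
log_after_next = list(log)

rest = list(gen)             # consumes the remaining values
print(log_after_call, log_after_next, first, rest)
```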

Most recent solution

For anyone who lands on this question: using pandas with 'chunksize' and 'usecols' helped me read a huge compressed file faster than the other proposed options.

import pandas as pd

sample_cols_to_keep = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5']

# First set up a DataFrame iterator: 'usecols' selects the columns to load and
# 'chunksize' sets the number of rows per chunk (change both as you wish).
df_iter = pd.read_csv('../data/huge_csv_file.csv.gz', compression='gzip',
                      chunksize=20000, usecols=sample_cols_to_keep)

# This list stores the filtered DataFrames for later concatenation.
df_lst = []

# Iterate over the file chunk by chunk, filter each chunk, and append it to the list.
for df_ in df_iter:
    tmp_df = (df_.rename(columns={col: col.lower() for col in df_.columns})
                 # e.g. keep only the rows where 'col_1' is greater than zero
                 .pipe(lambda x: x[x.col_1 > 0]))
    df_lst.append(tmp_df.copy())

# Finally, combine the filtered chunks into the final, larger DataFrame 'df_final'.
df_final = pd.concat(df_lst)
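If even the filtered rows are too many to concatenate, a possible variation is to reduce each chunk to a small result and combine only those. This is a sketch under assumptions, not part of the original answer: the column names, the tiny in-memory sample, and the choice of a sum as the aggregate are all illustrative.

```python
import io

import pandas as pd

# Made-up sample data standing in for a large file on disk.
SAMPLE = "col_1,col_2\n1,10\n-2,20\n3,30\n4,40\n0,50\n"

total = 0
rows_kept = 0
for chunk in pd.read_csv(io.StringIO(SAMPLE), chunksize=2,
                         usecols=["col_1", "col_2"]):
    # Filter the chunk, then keep only the aggregate instead of the rows.
    filtered = chunk[chunk["col_1"] > 0]
    total += int(filtered["col_2"].sum())
    rows_kept += len(filtered)

print(rows_kept, total)
```

With this pattern, peak memory is bounded by the chunk size rather than by the number of rows that pass the filter.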

Source: https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file
