Importing multiple CSV files into pandas and combining them into one DataFrame

맨날치킨 2022. 12. 23. 17:05

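I collect questions that are frequently searched for and asked on Stack Overflow, along with problems I looked up during my own development work and expect to need again later.

For each question I summarize two answers: the highest-scored solution, which is usually the first one you check on Stack Overflow, and the most recently updated solution (above a minimum score), which may be more useful today.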

Import multiple CSV files into pandas and concatenate into one DataFrame

Question

I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame, but I have not been able to figure it out. Here is what I have so far:

import glob
import pandas as pd

# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

I guess I need some help within the for loop?

Highest-scored solution

See pandas: IO tools for all of the available .read_ methods.

Try the following code if all of the CSV files have the same columns.

header=0 is passed so that the first row of each CSV file is read and used as the column names.

import pandas as pd
import glob
import os

path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))  # no leading slash: os.path.join would discard path

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
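
The caveat about identical columns matters because pd.concat aligns frames by column name. Here is a minimal sketch of what happens otherwise, using two hypothetical in-memory CSVs (csv_a, csv_b and their contents are made up for illustration, not taken from the answer). A column missing from one file is filled with NaN instead of raising an error:

```python
import io

import pandas as pd

# Hypothetical CSV contents, used here instead of files on disk
csv_a = "id,value\n1,10\n2,20\n"
csv_b = "id,score\n3,0.5\n"

frames = [pd.read_csv(io.StringIO(text), header=0) for text in (csv_a, csv_b)]

# concat keeps the union of the column names (join='outer' is the default)
combined = pd.concat(frames, axis=0, ignore_index=True)

print(combined.columns.tolist())
print(combined.isna().sum().to_dict())
```

If only the shared columns are wanted, passing join='inner' to pd.concat keeps just those.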

Or, with attribution to a comment from Sid:

all_files = glob.glob(os.path.join(path, "*.csv"))

df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)

It's often necessary to identify which file each sample of data came from, which can be done by adding a new column to the dataframe.
pathlib from the standard library is used for this example. It treats paths as objects with methods, instead of strings to be sliced.

Imports and Setup

from pathlib import Path
import pandas as pd
import numpy as np

path = r'C:\DRO\DCL_rawdata_files'  # or unix / linux / mac path

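# Get the files from the path provided in the OP
# glob() returns a one-shot generator, so store it in a list to reuse it across the options below
files = list(Path(path).glob('*.csv'))  # .rglob to get subdirectories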
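
To make "paths as objects" concrete, here is a small sketch of the attributes a path object exposes. The file name sample_01.csv is hypothetical, and PureWindowsPath is used only so the example behaves the same on any operating system:

```python
from pathlib import PureWindowsPath

# Hypothetical file name inside the directory from the question
p = PureWindowsPath(r'C:\DRO\DCL_rawdata_files') / 'sample_01.csv'

print(p.name)    # sample_01.csv
print(p.stem)    # sample_01
print(p.suffix)  # .csv
print(p.parent)  # C:\DRO\DCL_rawdata_files
```

The options below rely on .stem to label each row with the name of its source file.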

Option 1:

Add a new column with the file name

dfs = list()
for f in files:
    data = pd.read_csv(f)
    # .stem is a pathlib attribute that returns the filename without its extension
    data['file'] = f.stem
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

Option 2:

Add a new column with a generic name using enumerate

dfs = list()
for i, f in enumerate(files):
    data = pd.read_csv(f)
    data['file'] = f'File {i}'
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

Option 3:

Create the dataframes with a list comprehension, then use np.repeat to add a new column.
- [f'S{i}' for i in range(len(dfs))] creates a list of strings to name each dataframe.
- [len(df) for df in dfs] creates a list of the dataframe lengths, so each name is repeated once per row.
Attribution for this option goes to this plotting answer.

# Read the files into dataframes
dfs = [pd.read_csv(f) for f in files]

# Combine the list of dataframes
df = pd.concat(dfs, ignore_index=True)

# Add a new column
df['Source'] = np.repeat([f'S{i}' for i in range(len(dfs))], [len(df) for df in dfs])
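
A small worked example may make the np.repeat step easier to follow. The two dataframes below are hypothetical stand-ins for files read from disk; everything else mirrors the code above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the dataframes read from disk
dfs = [pd.DataFrame({'x': [1, 2, 3]}), pd.DataFrame({'x': [4, 5]})]

labels = [f'S{i}' for i in range(len(dfs))]  # one label per dataframe
lengths = [len(d) for d in dfs]              # one row count per dataframe

df = pd.concat(dfs, ignore_index=True)

# Each label is repeated as many times as its dataframe has rows
df['Source'] = np.repeat(labels, lengths)

print(df['Source'].tolist())  # ['S0', 'S0', 'S0', 'S1', 'S1']
```

This works because pd.concat keeps the rows in the same order as the list, so the repeated labels line up with the rows they describe.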

Option 4:

One-liners using .assign to create the new column, with attribution to a comment from C8H10N4O2

df = pd.concat((pd.read_csv(f).assign(filename=f.stem) for f in files), ignore_index=True)

Or

df = pd.concat((pd.read_csv(f).assign(Source=f'S{i}') for i, f in enumerate(files)), ignore_index=True)
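
To try the .assign approach without touching real data, the sketch below writes two hypothetical CSV files (jan.csv and feb.csv, made up for this example) into a temporary directory and combines them. It is only an illustration under those assumptions:

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical sample files written only for this demonstration
    (folder / 'jan.csv').write_text('id,amount\n1,10\n2,20\n')
    (folder / 'feb.csv').write_text('id,amount\n3,30\n')

    # sorted() turns the glob generator into a list with a fixed file order
    files = sorted(folder.glob('*.csv'))

    # assign() returns a copy of each frame with the extra column added
    df = pd.concat(
        (pd.read_csv(f).assign(filename=f.stem) for f in files),
        ignore_index=True,
    )

print(df)
```

Because the files are sorted by name, the row from feb.csv comes before the rows from jan.csv in the result.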

Most recent solution

Inspired by MrFun's answer:

import glob
import pandas as pd

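# Assumption: directory_path is not defined in the answer; the directory from the question is used here
directory_path = r'C:\DRO\DCL_rawdata_files'

list_of_csv_files = glob.glob(directory_path + '/*.csv')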
list_of_csv_files.sort()

df = pd.concat(map(pd.read_csv, list_of_csv_files), ignore_index=True)
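
One thing to watch with list.sort() is that it orders names character by character, which decides the row order of the combined DataFrame. The sketch below uses hypothetical file names and a hypothetical helper, numeric_key, to show the difference when names contain numbers without zero padding:

```python
import re

# Hypothetical file names; any numbered naming scheme behaves the same way
names = ['data_10.csv', 'data_2.csv', 'data_1.csv']

# Plain sorting compares character by character, so '10' lands before '2'
lexicographic = sorted(names)


def numeric_key(name):
    """Hypothetical helper: sort by the first run of digits in the name."""
    match = re.search(r'\d+', name)
    return int(match.group()) if match else -1


natural = sorted(names, key=numeric_key)

print(lexicographic)  # ['data_1.csv', 'data_10.csv', 'data_2.csv']
print(natural)        # ['data_1.csv', 'data_2.csv', 'data_10.csv']
```

When sorting full paths from glob.glob, a key like this should probably be applied to os.path.basename(path), since digits in the directory name would otherwise be picked up first.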

Source: https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe
