Shuffling DataFrame Rows into Random Order

맨날치킨 2022. 12. 17. 21:05

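I am collecting problems that are frequently searched for and posted on Stack Overflow, along with problems I have looked up during development and expect to look up again.

For each problem I summarize two answers: the highest-scored Solution, which is usually the first one checked on Stack Overflow, and the most recently updated Solution (above a minimum score), which may be more helpful at the present time.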

Shuffle DataFrame rows

Problem

I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

The DataFrame is read from a CSV file. All rows with Type 1 come first, followed by the rows with Type 2, then the rows with Type 3, and so on.

I would like to shuffle the order of the DataFrame's rows so that all Types are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

How can I achieve this?

Highest-Scored Solution

The idiomatic way to do this with pandas is to use the .sample method of your DataFrame to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 returns all rows (in random order).
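To make the effect of frac concrete, here is a minimal sketch. The small frame is a hypothetical stand-in for the CSV data in the question, and the variable names shuffled and half are only for illustration.

```python
import pandas as pd

# Hypothetical frame mirroring the question: rows grouped by Type.
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Col2": [2, 5, 8, 11, 14, 17],
    "Col3": [3, 6, 9, 12, 15, 18],
    "Type": [1, 1, 2, 2, 3, 3],
})

shuffled = df.sample(frac=1)    # every row, in random order
half = df.sample(frac=0.5)      # a random half of the rows

# The shuffled frame holds the same rows; only their order differs.
print(len(shuffled), len(half))
print(shuffled.sort_index().equals(df))
```

Note that the shuffled frame keeps the original index labels, so sorting by index restores the original order.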

Note: If you wish to shuffle your DataFrame, rebind the result to the same name, and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries (that is, the old index is discarded).
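The difference that drop=True makes can be seen in a short sketch. The frame below is hypothetical, and the names kept and dropped exist only for this illustration.

```python
import pandas as pd

# Hypothetical frame with the default 0..3 index.
df = pd.DataFrame({"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]})

# Without drop=True the old labels are saved in a new "index" column.
kept = df.sample(frac=1).reset_index()

# With drop=True the old labels are discarded.
dropped = df.sample(frac=1).reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
print(list(dropped.index))
```

Keeping the old index can be useful if you later need to map shuffled rows back to their original positions; otherwise drop=True keeps the frame clean.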

Follow-up note: The above operation is not in-place: the name df is rebound to a new object, so id(df_old) is not the same as id(df_new). The original answer argues that Python/pandas is nevertheless smart enough not to do another malloc for the shuffled object, and offers the memory profile below as evidence. A more cautious reading is that the profile only shows the net memory after the line finishes: once df is rebound, the old frame can be released, so the increment is close to zero even if a temporary copy existed while the line ran.

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
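Separately from the profiler, the objects can be compared directly. The sketch below is an illustration under assumptions, not part of the original answer: the frame size, the seeds, and the names df_old and df_new are all hypothetical, and np.shares_memory is used to test whether the two frames' NumPy buffers overlap.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame of floats (sizes and seeds chosen for illustration).
rng = np.random.default_rng(0)
df_old = pd.DataFrame(rng.standard_normal((100, 4)))

df_new = df_old.sample(frac=1, random_state=0).reset_index(drop=True)

# Is the result the same Python object as the input?
print(df_new is df_old)

# Do the two frames share an underlying NumPy buffer?
print(np.shares_memory(df_old.to_numpy(), df_new.to_numpy()))

# Is the original frame still in its original order?
print(df_old.index.equals(pd.RangeIndex(100)))
```

If the shuffled frame is a separate copy, as expected here, the original stays untouched until its name is rebound or it goes out of scope.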

Most Recent Solution

The following is one way to do it:

dataframe = dataframe.sample(frac=1, random_state=42).reset_index(drop=True)

frac=1 selects all rows of the DataFrame.

random_state=42 fixes the random seed, so every execution produces the same shuffled order.

reset_index(drop=True) rebuilds the index of the shuffled DataFrame as 0, 1, 2, ... and discards the old one.
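The reproducibility that random_state provides can be checked with a short sketch. The frame contents and the names first and second are hypothetical; 42 is simply the seed used in the answer.

```python
import pandas as pd

# Hypothetical frame: ten rows, two Types.
dataframe = pd.DataFrame({"Col1": range(10), "Type": [1] * 5 + [2] * 5})

# Two independent shuffles with the same seed.
first = dataframe.sample(frac=1, random_state=42).reset_index(drop=True)
second = dataframe.sample(frac=1, random_state=42).reset_index(drop=True)

# The same seed gives the same row order.
print(first.equals(second))
```

Omitting random_state gives a different order on each run, which is usually what you want outside of tests and reproducible experiments.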

Source: https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows
