Sorting a DataFrame by two or more columns in Python pandas

맨날치킨 2023. 2. 20. 10:05

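I collect problems that are frequently searched for and posted on Stack Overflow, along with problems I looked up during development and expect to need again.

For each question, I summarize two answers: the highest-scored solution, which is usually the first one people check on Stack Overflow, and the most recently updated solution (above a minimum score), which may be more useful today.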

How to sort a dataFrame in python pandas by two or more columns?

Question

Suppose I have a dataframe with columns a, b and c. I want to sort the dataframe by column b in ascending order and by column c in descending order. How do I do this?

Highest-scored solution

As of the 0.17.0 release, the sort method was deprecated in favor of sort_values, and sort was removed completely in the 0.20.0 release. The arguments (and results) remain the same:

df.sort_values(['a', 'b'], ascending=[True, False])
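Applied to the question as asked (b ascending, then c descending), a minimal sketch could look like the following. The sample values are made up for illustration; only the column names a, b and c come from the question.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "a": [10, 20, 30, 40, 50],
    "b": [2, 1, 2, 1, 2],
    "c": [5, 7, 9, 3, 1],
})

# Sort by b ascending, then by c descending within each value of b.
result = df.sort_values(["b", "c"], ascending=[True, False])
print(result)
```

The ascending list is matched to the column list by position, so the two must have the same length.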

In pandas versions before 0.17.0, the same result came from the ascending argument of sort:

df.sort(['a', 'b'], ascending=[True, False])  # legacy API, removed in pandas 0.20.0

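For example (written with sort_values so it runs on current pandas; the data is random, so your rows will differ):

In [10]: import numpy as np; import pandas as pd

In [11]: df1 = pd.DataFrame(np.random.randint(1, 5, (10,2)), columns=['a','b'])

In [12]: df1.sort_values(['a', 'b'], ascending=[True, False])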
Out[12]:
   a  b
2  1  4
7  1  3
1  1  2
3  1  2
4  3  2
6  4  4
0  4  3
9  4  3
5  4  1
8  4  1

As @renadeen commented:

"Sort isn't in place by default! So you should assign result of the sort method to a variable or add inplace=True to method call."

That is, if you want to reuse df1 as a sorted DataFrame:

df1 = df1.sort_values(['a', 'b'], ascending=[True, False])

or

df1.sort_values(['a', 'b'], ascending=[True, False], inplace=True)
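The difference between the two forms can be checked with a short sketch. The three-row frame below is a made-up example, and the calls use sort_values, the current name of the method.

```python
import pandas as pd

# Hypothetical three-row frame, small enough to follow by eye.
df1 = pd.DataFrame({"a": [4, 1, 1], "b": [3, 2, 4]})

# Without assignment or inplace=True, df1 keeps its original row order.
sorted_copy = df1.sort_values(["a", "b"], ascending=[True, False])
order_before = df1.index.tolist()

# With inplace=True the call returns None and reorders df1 itself.
returned = df1.sort_values(["a", "b"], ascending=[True, False], inplace=True)
order_after = df1.index.tolist()

print(order_before, order_after, returned)
```

The sorted frame keeps the original index labels. If they are not needed, calling reset_index(drop=True) on the result renumbers the rows.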

Most recent solution

For large dataframes of numeric data, you may see a significant performance improvement via numpy.lexsort, which performs an indirect sort using a sequence of keys:

import pandas as pd
import numpy as np

np.random.seed(0)

df1 = pd.DataFrame(np.random.randint(1, 5, (10,2)), columns=['a','b'])
df1 = pd.concat([df1]*100000)

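def pdsort(df1):
    # pandas: sort by a ascending, then b descending
    return df1.sort_values(['a', 'b'], ascending=[True, False])

def lex(df1):
    arr = df1.values
    # np.lexsort uses the last key as the primary key, so a goes last;
    # negating b gives descending order. Column names and index are not kept.
    return pd.DataFrame(arr[np.lexsort((-arr[:, 1], arr[:, 0]))])

# both approaches produce the same row order
assert (pdsort(df1).values == lex(df1).values).all()

# IPython magics; the timings are those reported in the original answer
%timeit pdsort(df1)  # 193 ms per loop
%timeit lex(df1)     # 143 ms per loop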

One peculiarity is that numpy.lexsort takes its keys in reverse order of priority: the last key is the primary one, so (-'b', 'a') sorts by series a first. We negate series b so that it is sorted in descending order.
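The key order is easy to get backwards, so here is a small sketch on two made-up arrays that shows both orderings side by side.

```python
import numpy as np

# Hypothetical values standing in for columns a and b.
a = np.array([4, 1, 1, 3])
b = np.array([3, 2, 4, 2])

# The LAST key is the primary one: sort by a ascending, ties broken by b descending.
order = np.lexsort((-b, a))

# Swapping the keys makes -b the primary key instead (b descending, ties by a).
swapped = np.lexsort((a, -b))

print(a[order], b[order])
print(a[swapped], b[swapped])
```

np.lexsort returns row positions rather than sorted data, which is why the result is used to index the arrays.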

Be aware that this negation trick only works with numeric values, while pd.DataFrame.sort_values works with either string or numeric values. Negating a column of strings for np.lexsort gives: TypeError: bad operand type for unary -: 'str'.
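For string columns, sort_values handles mixed directions directly. If np.lexsort is still wanted, one possible workaround is to replace each string column with integer codes from pd.factorize(..., sort=True) and negate the codes. This workaround is an assumption of this write-up and is not part of the original answer; the sample data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical string data.
df = pd.DataFrame({
    "name": ["kim", "lee", "kim", "park"],
    "city": ["Seoul", "Busan", "Busan", "Daegu"],
})

# pandas: name ascending, city descending, no negation needed.
by_pandas = df.sort_values(["name", "city"], ascending=[True, False])

# Workaround sketch: sort=True makes the integer codes follow sorted string order,
# so negating the codes reverses the direction for that column.
name_codes, _ = pd.factorize(df["name"], sort=True)
city_codes, _ = pd.factorize(df["city"], sort=True)
order = np.lexsort((-city_codes, name_codes))
by_lexsort = df.iloc[order]

print(by_pandas)
```

Whether the extra factorize step still leaves lexsort ahead on speed would need to be measured on the actual data.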

Source: https://stackoverflow.com/questions/17141558/how-to-sort-a-dataframe-in-python-pandas-by-two-or-more-columns
