00. Pandas Basic Data Structures

https://pandas.pydata.org/docs/reference/index.html

API reference — pandas 2.2.3 documentation

1) Series

The two main data structures in pandas are the Series and the DataFrame.

https://pandas.pydata.org/docs/reference/api/pandas.Series.html

pandas.Series — pandas 2.2.3 documentation

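A pandas Series is a one-dimensional array with the following characteristics.

It has a one-dimensional array structure that holds data.
Its elements can be accessed through an index.
It has a data type (dtype).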
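The three characteristics above can be checked directly on a small Series. This is only a minimal sketch: the variable name s_demo and its values are made up for illustration and do not come from the examples in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not from the post)
s_demo = pd.Series([10, 20, 30], index=['x', 'y', 'z'], dtype='int64')

values = s_demo.values        # the data, as a one-dimensional NumPy array
labels = list(s_demo.index)   # the index labels
kind = s_demo.dtype           # the data type shared by the elements

print(values, labels, kind)
```

Each of these attributes is covered in more detail in the sections that follow.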

Creating a Series

Creating from a NumPy array

import numpy as np
import pandas as pd

# Create a NumPy array
arr = np.arange(100, 105)
arr

[Output]

array([100, 101, 102, 103, 104])

s = pd.Series(arr)
s

[Output]

0    100
1    101
2    102
3    103
4    104
dtype: int64

Creating with a specified dtype

s = pd.Series(arr, dtype='int32')
s

[Output]

0    100
1    101
2    102
3    103
4    104
dtype: int32

Creating from a list

s = pd.Series(['부장', '차장', '대리', '사원', '인턴'])
s

[Output]

0    부장
1    차장
2    대리
3    사원
4    인턴
dtype: object

Mixing data of different types

When a Series is created from data of mixed types, its dtype becomes object.

s = pd.Series([91, 2.5, '스포츠', 4, 5.16])
s

[Output]

0      91
1     2.5
2     스포츠
3       4
4    5.16
dtype: object
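As a rough illustration of what the object dtype means here, the sketch below rebuilds the same mixed Series and inspects the Python type of each element. The names mixed and element_types are introduced only for this example.

```python
import pandas as pd

# Same mixed-type data as in the example above
mixed = pd.Series([91, 2.5, '스포츠', 4, 5.16])

# With dtype object, each element keeps its original Python type
element_types = [type(v).__name__ for v in mixed]

print(mixed.dtype)
print(element_types)
```

Because every element is stored as a generic Python object, numeric operations on such a Series tend to be slower and may fail on the string entries, so it is usually better to keep one type per Series.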

Indexing

s = pd.Series(['부장', '차장', '대리', '사원', '인턴'])
s

[Output]

0    부장
1    차장
2    대리
3    사원
4    인턴
dtype: object

When a Series is created, it is given an index that counts up sequentially from 0.

This is called a RangeIndex.

s.index

[Output]

RangeIndex(start=0, stop=5, step=1)

Indexing example

s[0]

[Output]

'부장'

fancy indexing

s[[1, 3]]

[Output]

1    차장
3    사원
dtype: object

s[np.arange(1, 4, 2)]

[Output]

1    차장
3    사원
dtype: object

boolean indexing

By writing a condition, you can keep only the values for which the condition is True.

np.random.seed(0)
s = pd.Series(np.random.randint(10000, 20000, size=(10,)))
s

[Output]

0    12732
1    19845
2    13264
3    14859
4    19225
5    17891
6    14373
7    15874
8    16744
9    13468
dtype: int64

First create a boolean Series, then use it as an index to filter.

s > 15000

[Output]

0    False
1     True
2    False
3    False
4     True
5     True
6    False
7     True
8     True
9    False
dtype: bool

# Filter values greater than 15000
s[s > 15000]

[Output]

1    19845
4    19225
5    17891
7    15874
8    16744
dtype: int64
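Conditions can also be combined. The sketch below uses the same ten values shown in the output above, written out as a literal list so it does not depend on the random seed, and combines conditions with & (and) and ~ (not). The names s_vals, in_range and not_above are made up for this example.

```python
import pandas as pd

# The ten values from the example above, written out explicitly
s_vals = pd.Series([12732, 19845, 13264, 14859, 19225,
                    17891, 14373, 15874, 16744, 13468])

# Each condition needs its own parentheses when combined with & or |
in_range = s_vals[(s_vals >= 13000) & (s_vals <= 16000)]

# ~ inverts a boolean Series: keep values that are NOT greater than 15000
not_above = s_vals[~(s_vals > 15000)]

print(in_range)
print(not_above)
```

Python's and, or and not keywords do not work element-wise on a Series, which is why the operators &, | and ~ are used instead.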

Instead of the default RangeIndex, you can specify a custom index.

s = pd.Series(['마케팅', '경영', '개발', '기획', '인사'], index=['a', 'b', 'c', 'd', 'e'])
s

[Output]

a    마케팅
b     경영
c     개발
d     기획
e     인사
dtype: object

s.index

[Output]

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Once a custom index is set, values can be looked up by the new index labels.

s['c']

[Output]

'개발'

s[['a', 'd']]

[Output]

a    마케팅
d     기획
dtype: object

Alternatively, you can create the Series first and then set the index by assigning a new index to its index attribute.

s = pd.Series(['마케팅', '경영', '개발', '기획', '인사'])
s.index

[Output]

RangeIndex(start=0, stop=5, step=1)

s.index = list('abcde')
s

[Output]

a    마케팅
b     경영
c     개발
d     기획
e     인사
dtype: object
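One detail worth knowing when assigning to the index attribute is that the new index must have the same length as the data. The following is a small sketch of that behavior; the names s_idx, picked and mismatch_rejected are introduced only for illustration.

```python
import pandas as pd

s_idx = pd.Series(['마케팅', '경영', '개발', '기획', '인사'])

# Five labels for five values: this assignment succeeds
s_idx.index = list('abcde')
picked = s_idx['c']

# Three labels for five values: pandas raises a ValueError
try:
    s_idx.index = list('abc')
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(picked, mismatch_rejected)
```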

Attributes

values

values returns only the data values of the Series, as a NumPy array.

s.values

[Output]

array(['마케팅', '경영', '개발', '기획', '인사'], dtype=object)

ndim - number of dimensions

Because a Series is a one-dimensional data structure, ndim returns 1.

s.ndim

[Output]

1

shape

shape is used to inspect the shape of the data. For a Series, shape gives the number of elements.

It is returned as a tuple.

s.shape

[Output]

(5,)
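To connect shape with other ways of counting elements, here is a minimal sketch. The name s_attr and the three n_from_* variables are made up for this example.

```python
import pandas as pd

s_attr = pd.Series(['마케팅', '경영', '개발', '기획', '인사'])

n_from_shape = s_attr.shape[0]  # first (and only) entry of the shape tuple
n_from_len = len(s_attr)        # the built-in len() gives the same count
n_from_size = s_attr.size       # size is the total number of elements

print(s_attr.ndim, s_attr.shape, n_from_shape, n_from_len, n_from_size)
```

For a one-dimensional Series all three counts agree; they only start to differ for two-dimensional structures such as a DataFrame.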

NaN (Not a Number)

In pandas, a NaN value represents empty, missing data.

To insert an empty value on purpose, use NumPy's nan (np.nan).

s = pd.Series(['선화', '강호', np.nan, '소정', '우영'])
s

[Output]

0     선화
1     강호
2    NaN
3     소정
4     우영
dtype: object
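The example above uses strings, so the dtype stays object. As a side note, when np.nan is placed among integers, the Series is typically stored as float64, because np.nan itself is a floating-point value. The sketch below illustrates this; the names nums and nan_equals_itself are made up for the example.

```python
import numpy as np
import pandas as pd

# Integers mixed with a missing value
nums = pd.Series([1, 2, np.nan, 4])

# np.nan is a float, and it is not equal to itself
nan_equals_itself = (np.nan == np.nan)

print(nums.dtype)
print(nan_equals_itself)
```

Since NaN never compares equal to anything, a comparison such as s == np.nan cannot be used to find missing values, which is why pandas offers dedicated methods for this.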

Handling missing values (NaN)

isnull() and isna() are methods that find NaN values.

isnull() and isna() return the same result.

s.isnull()

[Output]

0    False
1    False
2     True
3    False
4    False
dtype: bool

s.isna()

[Output]

0    False
1    False
2     True
3    False
4    False
dtype: bool

The result can be used for boolean indexing.

s[s.isnull()]

[Output]

2    NaN
dtype: object

s[s.isna()]

[Output]

2    NaN
dtype: object

notnull() and notna() find values that are not NaN, that is, data that is not missing.

s.notnull()

[Output]

0     True
1     True
2    False
3     True
4     True
dtype: bool

s.notna()

[Output]

0     True
1     True
2    False
3     True
4     True
dtype: bool

s[s.notnull()]

[Output]

0    선화
1    강호
3    소정
4    우영
dtype: object
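Because the boolean results behave like 0 and 1 when summed, the same methods can be used to count missing values. This is a small sketch on the same data; the variable names are introduced only for illustration.

```python
import numpy as np
import pandas as pd

names = pd.Series(['선화', '강호', np.nan, '소정', '우영'])

n_missing = int(names.isna().sum())    # True counts as 1
n_present = int(names.notna().sum())

# isnull() and isna() produce identical boolean Series
same_mask = names.isnull().equals(names.isna())

print(n_missing, n_present, same_mask)
```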

Slicing

(Note) When slicing with integer positions, the end index is not included.

s = pd.Series(np.arange(100, 150, 10))
s

[Output]

0    100
1    110
2    120
3    130
4    140
dtype: int64

s[1:3]

[Output]

1    110
2    120
dtype: int64

With a custom (string) index, a slice includes both the start label and the end label.

s.index = list('가나다라마')
s

[Output]

가    100
나    110
다    120
라    130
마    140
dtype: int64

s['나':'라']

[출력]

나    110
다    120
라    130
dtype: int64
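The difference between the two kinds of slicing can be made explicit with iloc (position-based) and loc (label-based), which are standard pandas accessors that this post has not introduced yet. The following is a minimal sketch using the same data; the names s_sl, by_position and by_label are made up.

```python
import numpy as np
import pandas as pd

s_sl = pd.Series(np.arange(100, 150, 10), index=list('가나다라마'))

# Position-based slicing: the end position (3) is excluded
by_position = s_sl.iloc[1:3]

# Label-based slicing: the end label ('라') is included
by_label = s_sl.loc['나':'라']

print(by_position)
print(by_label)
```

Writing iloc or loc explicitly makes it clear to the reader whether a slice is meant by position or by label.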

* Exercises *

Create the following Series.

Create it so that its dtype is float32.

# Enter your code here
pd.Series(np.arange(3, 12, 2), dtype='float32')

[Output]

0     3.0
1     5.0
2     7.0
3     9.0
4    11.0
dtype: float32

# Enter your code here
pd.Series(list('가나다라마'))

[Output]

0    가
1    나
2    다
3    라
4    마
dtype: object

Create the following Series, assign it to the variable sample, and display it.

# Enter your code here
sample = pd.Series(np.arange(10, 60, 10), index=list('가나다라마'))
sample

[Output]

가    10
나    20
다    30
라    40
마    50
dtype: int64

Look up the data for '나' and '라' in sample.

# Enter your code here
sample[['나', '라']]

[Output]

나    20
라    40
dtype: int64

np.random.seed(20)
sample2 = pd.Series(np.random.randint(100, 200, size=(15,)))
sample2

[Output]

0     199
1     190
2     115
3     195
4     128
5     190
6     109
7     120
8     175
9     122
10    171
11    134
12    196
13    140
14    185
dtype: int64

Filter only the values in sample2 that are 160 or less.

# Enter your code here
sample2[sample2 <= 160]

[Output]

2     115
4     128
6     109
7     120
9     122
11    134
13    140
dtype: int64

Filter only the values in sample2 that are between 130 and 170, inclusive.

# Enter your code here
sample2[(sample2 >= 130) & (sample2 <= 170)]

[Output]

11    134
13    140
dtype: int64

Create the following Series.

# Enter your code here
pd.Series(['apple', np.nan, 'banana', 'kiwi', 'gubong'], index=list('가나다라마'))

[Output]

가     apple
나       NaN
다    banana
라      kiwi
마    gubong
dtype: object

sample = pd.Series(['IT서비스', np.nan, '반도체', np.nan, '바이오', '자율주행'])
sample

[Output]

0    IT서비스
1      NaN
2      반도체
3      NaN
4      바이오
5     자율주행
dtype: object

Filter only the missing values in sample.

# Enter your code here
sample[sample.isnull()]

[Output]

1    NaN
3    NaN
dtype: object

Filter only the non-missing values in sample.

# Enter your code here
sample[sample.notnull()]

[Output]

0    IT서비스
2      반도체
4      바이오
5     자율주행
dtype: object

np.random.seed(0)
sample = pd.Series(np.random.randint(100, 200, size=(10,)))
sample

[Output]

0    144
1    147
2    164
3    167
4    167
5    109
6    183
7    121
8    136
9    187
dtype: int64

Slice sample so that it produces the following result.

# Enter your code here
sample[2:7]

[Output]

2    164
3    167
4    167
5    109
6    183
dtype: int64

np.random.seed(0)
sample2 = pd.Series(np.random.randint(100, 200, size=(10,)), index=list('가나다라마바사아자차'))
sample2

[Output]

가    144
나    147
다    164
라    167
마    167
바    109
사    183
아    121
자    136
차    187
dtype: int64

Slice sample2 so that it produces each of the following results.

# Enter your code here
sample2['바':]

[Output]

바    109
사    183
아    121
자    136
차    187
dtype: int64

# Enter your code here
sample2[:'다']

[Output]

가    144
나    147
다    164
dtype: int64

# Enter your code here
sample2['나':'바']

[Output]

나    147
다    164
라    167
마    167
바    109
dtype: int64

https://pandas.pydata.org/docs/reference/frame.html

DataFrame — pandas 2.2.3 documentation

2) DataFrame

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

pandas.DataFrame — pandas 2.2.3 documentation

Of the two main pandas data structures, Series and DataFrame, we have already covered Series, so now let's look at DataFrame.

DataFrame documentation

pd.DataFrame

It is a two-dimensional data structure (think of an Excel spreadsheet).
It consists of rows and columns.
Each column has its own data type (dtype).

Creation

A DataFrame can be created from a list. To build a DataFrame, pass a two-dimensional list.

pd.DataFrame([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

[Output]

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

When columns is specified, as in the example below, each column of the DataFrame is given a name.

pd.DataFrame([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], columns=['가', '나', '다'])

[Output]

   가  나  다
0  1  2  3
1  4  5  6
2  7  8  9

A DataFrame can also be created from a dictionary.

Conveniently, the dictionary keys are automatically used as the column names.

data = {
    'name': ['Kim', 'Lee', 'Park'],
    'age': [24, 27, 34],
    'children': [2, 1, 3]
}

pd.DataFrame(data)

[Output]

   name  age  children
0   Kim   24         2
1   Lee   27         1
2  Park   34         3
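To relate the two ways of creating a DataFrame, the sketch below builds the same table once from a dictionary (one entry per column) and once from a two-dimensional list (one inner list per row) and compares them. The names from_dict and from_rows are made up for this example.

```python
import pandas as pd

# Column-oriented: each key is a column name, each list is that column's values
from_dict = pd.DataFrame({
    'name': ['Kim', 'Lee', 'Park'],
    'age': [24, 27, 34],
    'children': [2, 1, 3],
})

# Row-oriented: each inner list is one row, column names are given separately
from_rows = pd.DataFrame(
    [['Kim', 24, 2], ['Lee', 27, 1], ['Park', 34, 3]],
    columns=['name', 'age', 'children'],
)

same_columns = list(from_dict.columns) == list(from_rows.columns)
same_values = from_dict.values.tolist() == from_rows.values.tolist()

print(same_columns, same_values)
```

Which form is more convenient usually depends on whether the data arrives column by column or record by record.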

Attributes

A DataFrame has the following attributes.

index: the index (a RangeIndex by default)
columns: the column names
values: the data values as a NumPy array
dtypes: the data type of each column
T: the transpose of the DataFrame

data = {
    'name': ['Kim', 'Lee', 'Park'],
    'age': [24, 27, 34],
    'children': [2, 1, 3]
}

df = pd.DataFrame(data)
df

[Output]

   name  age  children
0   Kim   24         2
1   Lee   27         1
2  Park   34         3

df.index

[Output]

RangeIndex(start=0, stop=3, step=1)

df.columns

[Output]

Index(['name', 'age', 'children'], dtype='object')

df.values

[Output]

array([['Kim', 24, 2],
       ['Lee', 27, 1],
       ['Park', 34, 3]], dtype=object)

For a DataFrame, the dtypes attribute shows the dtype of each column.

df.dtypes

[Output]

name        object
age          int64
children     int64
dtype: object

df.T

[Output]

            0    1     2
name      Kim  Lee  Park
age        24   27    34
children    2    1     3
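A couple of these attributes interact with the column dtypes in a way that is easy to miss. The sketch below is only an illustration, with made-up variable names: it shows that values falls back to an object array when column types are mixed, and that T swaps the row index and the column names.

```python
import pandas as pd

df_attr = pd.DataFrame({
    'name': ['Kim', 'Lee', 'Park'],
    'age': [24, 27, 34],
    'children': [2, 1, 3],
})

# Strings and integers together: the combined array has dtype object
mixed_values = df_attr.values

# Only the integer columns: the array keeps an integer dtype
numeric_values = df_attr[['age', 'children']].values

# After transposing, the old column names become the index
transposed_index = list(df_attr.T.index)

print(mixed_values.dtype, numeric_values.dtype, transposed_index)
```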

Setting the index

df

[Output]

   name  age  children
0   Kim   24         2
1   Lee   27         1
2  Park   34         3

df.index = list('abc')
df

[Output]

   name  age  children
a   Kim   24         2
b   Lee   27         1
c  Park   34         3

(Note) DataFrame indexing and slicing will be covered in detail later.

Working with columns

You can select a column by using its name as a key on the DataFrame.

Selecting a single column returns a Series.

df['name']

[Output]

a     Kim
b     Lee
c    Park
Name: name, dtype: object

type(df['name'])

[Output]

pandas.core.series.Series

Two or more columns can be selected with fancy indexing.

df[['name', 'children']]

[Output]

   name  children
a   Kim         2
b   Lee         1
c  Park         3
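The type of the result depends on whether a single name or a list of names is passed. As a minimal sketch (the names df_cols, one_column and one_column_frame are made up), even a one-element list keeps the result a DataFrame:

```python
import pandas as pd

df_cols = pd.DataFrame({
    'name': ['Kim', 'Lee', 'Park'],
    'age': [24, 27, 34],
    'children': [2, 1, 3],
}, index=list('abc'))

one_column = df_cols['name']          # a single key returns a Series
one_column_frame = df_cols[['name']]  # a list of keys returns a DataFrame

print(type(one_column).__name__, type(one_column_frame).__name__)
```

This matters when later code expects a two-dimensional object, for example when passing a single feature column to a function that requires a DataFrame.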

(Note) Slicing over columns is also possible, but this will be covered later as well.

Column names can be changed with rename.

DataFrame.rename(columns={'current column name': 'new column name'})

df.rename(columns={'name': '이름'})

[Output]

     이름  age  children
a   Kim   24         2
b   Lee   27         1
c  Park   34         3

With the inplace=True option, the change is applied directly to the DataFrame itself.

df.rename(columns={'name': '이름'}, inplace=True)
df

[Output]

     이름  age  children
a   Kim   24         2
b   Lee   27         1
c  Park   34         3
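Without inplace=True, rename returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a variable to keep it. The sketch below illustrates this with made-up variable names.

```python
import pandas as pd

df_r = pd.DataFrame({
    'name': ['Kim', 'Lee', 'Park'],
    'age': [24, 27, 34],
    'children': [2, 1, 3],
})

# rename returns a new DataFrame; df_r itself is not modified here
renamed = df_r.rename(columns={'name': '이름'})
columns_before = list(df_r.columns)

# Reassigning the result is an alternative to inplace=True
df_r = df_r.rename(columns={'name': '이름'})
columns_after = list(df_r.columns)

print(columns_before, columns_after)
```

Both styles end with the same column names; reassignment simply makes the update visible in the code as an ordinary assignment.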

* Practice exercises *

Create the following DataFrame.

Assign the created DataFrame to the variable df.

# Enter your code here
data = {
    'food': ['KFC', 'McDonald', 'SchoolFood'],
    'price': [1000, 2000, 2500],
    'rating': [4.5, 3.9, 4.2]
}

df = pd.DataFrame(data)
df

[Output]

         food  price  rating
0         KFC   1000     4.5
1    McDonald   2000     3.9
2  SchoolFood   2500     4.2

Select and display only the food and rating columns.

# Enter your code here
df[['food', 'rating']]

[Output]

         food  rating
0         KFC     4.5
1    McDonald     3.9
2  SchoolFood     4.2

Rename the food column to place.

# Enter your code here
df.rename(columns={'food': 'place'}, inplace=True)
df

[Output]

        place  price  rating
0         KFC   1000     4.5
1    McDonald   2000     3.9
2  SchoolFood   2500     4.2
