[Pytorch]
[Pandas Basics] Section 04. Pandas Exercise
이산이
2022. 1. 7. 18:57
In [1]:
import pandas as pd
In [2]:
# Load the data
df = pd.read_csv('/content/drive/MyDrive/Study/Pytorch/PYTORCH_NOTEBOOKS/00-Crash-Course-Topics/01-Crash-Course-Pandas/bank.csv')
In [3]:
# Show only the first 5 rows
df.head()
Out[3]:
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
In [4]:
# Compute the mean of the 'age' column
df['age'].mean()
Out[4]:
41.17009511170095
In [5]:
# Find the marital status of the youngest person
df['marital'][df['age'].idxmin()]
Out[5]:
'single'
In [6]:
# Solution
# Find the youngest age
df['age'].min()
Out[6]:
19
In [10]:
df[df['age']==19]
Out[10]:
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
503 | 19 | student | single | primary | no | 103 | no | no | cellular | 10 | jul | 104 | 2 | -1 | 0 | unknown | yes |
1900 | 19 | student | single | unknown | no | 0 | no | no | cellular | 11 | feb | 123 | 3 | -1 | 0 | unknown | no |
2780 | 19 | student | single | secondary | no | 302 | no | no | cellular | 16 | jul | 205 | 1 | -1 | 0 | unknown | yes |
3233 | 19 | student | single | unknown | no | 1169 | no | no | cellular | 6 | feb | 463 | 18 | -1 | 0 | unknown | no |
Everyone at the minimum age (19) is single.
The first of them has index 503.
So looking up index label 503 in the 'marital' column returns the answer.
In [11]:
df['marital'][503]
Out[11]:
'single'
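The index 503 can be found through the steps above, but the idxmin() method also returns it directly: it gives the index label of the first row that holds the minimum value.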
In [12]:
df['marital'][df['age'].idxmin()]
Out[12]:
'single'
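The tie-breaking behaviour of idxmin() is easier to see on a small example. The sketch below uses a made-up DataFrame (the `toy` name, its values, and its index labels are assumptions for illustration, not rows read from bank.csv) and looks up the value with `.loc`, which selects row and column by label in one step.

```python
import pandas as pd

# Hypothetical toy data, not the real bank.csv
toy = pd.DataFrame(
    {
        "age": [30, 19, 45, 19],
        "marital": ["married", "single", "divorced", "single"],
    },
    index=[10, 503, 900, 1900],
)

# idxmin() returns the index label of the FIRST row holding the minimum
youngest_label = toy["age"].idxmin()

# .loc picks the row label and the column label in a single lookup
youngest_marital = toy.loc[youngest_label, "marital"]

# Every row tied for the minimum age, in case the tie matters
all_youngest = toy[toy["age"] == toy["age"].min()]
```

If the people tied for the minimum age had different marital statuses, idxmin() alone would only report the first one, so checking `all_youngest` as in cell In [10] is a reasonable habit.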
In [13]:
# Count the number of unique job categories
df['job'].nunique()
Out[13]:
12
In [14]:
# Count how many people fall into each job category
df['job'].value_counts()
Out[14]:
management       969
blue-collar      946
technician       768
admin.           478
services         417
retired          230
self-employed    183
entrepreneur     168
unemployed       128
housemaid        112
student           84
unknown           38
Name: job, dtype: int64
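As a side note, value_counts() can also return shares instead of raw counts through its `normalize` parameter. A minimal sketch on a made-up Series (the `jobs` values are assumptions, not the real column):

```python
import pandas as pd

# Hypothetical toy column, not the real 'job' column
jobs = pd.Series(["management", "student", "management", "services"], name="job")

# Raw counts per category, sorted from most to least frequent
counts = jobs.value_counts()

# normalize=True returns each category's share of the total instead
shares = jobs.value_counts(normalize=True)

# The number of entries in value_counts() matches nunique()
n_categories = jobs.nunique()
```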
In [16]:
# Compute the percentage of married people in the dataset
(len(df[df['marital']=='married'])/len(df))*100
Out[16]:
61.86684361866843
In [17]:
# Solution
# Build a boolean condition that is True where 'marital' is 'married'
df['marital']=='married'
Out[17]:
0        True
1        True
2       False
3        True
4        True
        ...
4516     True
4517     True
4518     True
4519     True
4520    False
Name: marital, Length: 4521, dtype: bool
This is now a boolean condition, so putting it inside df[] keeps only the rows that satisfy it.
In [18]:
df[df['marital']=='married']
Out[18]:
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
6 | 36 | self-employed | married | tertiary | no | 307 | yes | no | cellular | 14 | may | 341 | 1 | 330 | 2 | other | no |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4514 | 38 | blue-collar | married | secondary | no | 1205 | yes | no | cellular | 20 | apr | 45 | 4 | 153 | 1 | failure | no |
4516 | 33 | services | married | secondary | no | -333 | yes | no | cellular | 30 | jul | 329 | 5 | -1 | 0 | unknown | no |
4517 | 57 | self-employed | married | tertiary | yes | -3313 | yes | yes | unknown | 9 | may | 153 | 1 | -1 | 0 | unknown | no |
4518 | 57 | technician | married | secondary | no | 295 | no | no | cellular | 19 | aug | 151 | 11 | -1 | 0 | unknown | no |
4519 | 28 | blue-collar | married | secondary | no | 1137 | no | no | cellular | 6 | feb | 129 | 4 | 211 | 3 | other | no |
2797 rows × 17 columns
The filtered result has 2797 rows, so len() turns it into a count.
In [19]:
len(df[df['marital']=='married'])
Out[19]:
2797
Now that the number of married people is known, dividing it by the total number of people gives the percentage.
In [20]:
100*len(df[df['marital']=='married'])/len(df)
Out[20]:
61.86684361866843
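The same percentage can be obtained from the boolean condition itself, because True counts as 1 and False as 0. A minimal sketch on made-up data (the `toy` DataFrame is an assumption, not the real dataset):

```python
import pandas as pd

# Hypothetical toy data, not the real bank.csv
toy = pd.DataFrame({"marital": ["married", "single", "married", "divorced"]})

# Boolean mask: True where the person is married
is_married = toy["marital"] == "married"

# sum() of a boolean Series counts the True values
married_count = is_married.sum()

# mean() of a boolean Series is the fraction of True values
married_pct = is_married.mean() * 100

# The len()-based approach from the cells above gives the same number
same_pct = 100 * len(toy[is_married]) / len(toy)
```

This skips building the filtered DataFrame just to count its rows, which is mostly a readability choice at this data size.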
In [22]:
# Use .apply to create a 'marital code' column that encodes marital status by its first letter, e.g. 's' or 'm'
df['marital code']=df['marital'].apply(lambda status: status[0])
df
Out[22]:
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | marital code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no | m |
1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no | m |
2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no | s |
3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no | m |
4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no | m |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4516 | 33 | services | married | secondary | no | -333 | yes | no | cellular | 30 | jul | 329 | 5 | -1 | 0 | unknown | no | m |
4517 | 57 | self-employed | married | tertiary | yes | -3313 | yes | yes | unknown | 9 | may | 153 | 1 | -1 | 0 | unknown | no | m |
4518 | 57 | technician | married | secondary | no | 295 | no | no | cellular | 19 | aug | 151 | 11 | -1 | 0 | unknown | no | m |
4519 | 28 | blue-collar | married | secondary | no | 1137 | no | no | cellular | 6 | feb | 129 | 4 | 211 | 3 | other | no | m |
4520 | 44 | entrepreneur | single | tertiary | no | 1136 | yes | yes | cellular | 3 | apr | 345 | 2 | 249 | 7 | other | no | s |
4521 rows × 18 columns
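For this particular task, taking the first character, the `.str` accessor is a possible alternative to .apply with a lambda. A minimal sketch on made-up data (the `toy` DataFrame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, not the real bank.csv
toy = pd.DataFrame({"marital": ["married", "single", "divorced"]})

# .apply calls the lambda once per element
via_apply = toy["marital"].apply(lambda status: status[0])

# .str[0] takes the first character of every string in the column
via_str = toy["marital"].str[0]
```

Both give one letter per row. .apply stays the more general tool when the per-row logic is more than a simple string operation.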
In [23]:
# Find the largest value in the 'duration' column
df['duration'].max()
Out[23]:
3025
In [24]:
# Get the education breakdown of unemployed people
df[df['job']=='unemployed']['education'].value_counts()
Out[24]:
secondary    68
tertiary     32
primary      26
unknown       2
Name: education, dtype: int64
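The filter-then-select pattern above can also be written as a single `.loc` lookup with a row mask and a column label. A minimal sketch on made-up data (the `toy` DataFrame is an assumption, not the real dataset):

```python
import pandas as pd

# Hypothetical toy data, not the real bank.csv
toy = pd.DataFrame(
    {
        "job": ["unemployed", "student", "unemployed", "unemployed"],
        "education": ["secondary", "primary", "secondary", "tertiary"],
    }
)

# Row mask: True for unemployed people
unemployed = toy["job"] == "unemployed"

# .loc[rows, column] filters rows and selects the column in one step
education_counts = toy.loc[unemployed, "education"].value_counts()
```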
In [25]:
# Compute the mean age of unemployed people
df[df['job']=='unemployed']['age'].mean()
Out[25]:
40.90625
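If the mean age is needed for every job category rather than just one, groupby can compute all of them at once. A minimal sketch on made-up data (the `toy` DataFrame and its values are assumptions, not the real dataset):

```python
import pandas as pd

# Hypothetical toy data, not the real bank.csv
toy = pd.DataFrame(
    {
        "job": ["unemployed", "student", "unemployed", "unemployed"],
        "age": [40, 20, 50, 30],
    }
)

# Mean age per job category; the result is a Series indexed by job
mean_age_by_job = toy.groupby("job")["age"].mean()

# Picking one category matches the filter-based approach above
unemployed_mean = mean_age_by_job["unemployed"]
filtered_mean = toy[toy["job"] == "unemployed"]["age"].mean()
```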