
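Data Analysis

What is Exploratory Data Analysis (EDA)?

The process of observing and understanding collected data through a variety of methods
The process of gaining intuitive insight into the data before the main analysis begins
The process of understanding the phenomenon the data represents by examining its distribution and values

Why it is needed

Examining the distribution and values of the data helps you understand the phenomenon it represents, recognize latent problems in the data, and derive solutions
- If a problem is found, you can decide before the main analysis whether additional data needs to be collected
Looking at the data from multiple angles can reveal new aspects and patterns that were not recognized during the problem definition stage
- When a new aspect is found, you can revise the initial hypothesis or formulate a new one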

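Analysis process and steps

1. Confirm the analysis goal and the variables

Check the name and characteristics of each variable

2. Check the data for problems

Check for missing values and outliers
Check for abnormal shapes in the distribution

3. Check the distribution of each attribute's values

Use basic statistics to check whether the data has the expected range and distribution

4. Check the relationships between attributes

Check for correlations that are not visible when looking at attributes one at a time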
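Observing individual data points

Scan the data values by eye to observe overall trends and anything unusual
Look at the first and last rows, and draw random samples
Identify the analysis goal and the variables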
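
As a rough illustration of the observation step above, here is a minimal sketch using pandas. The DataFrame `df` and its columns are made up for this example and are not part of the datasets used later in the post.

```python
import pandas as pd

# Hypothetical toy data standing in for a freshly collected dataset.
df = pd.DataFrame({
    "order_id": range(1, 11),
    "quantity": [1, 2, 1, 1, 3, 1, 2, 1, 1, 4],
})

first_rows = df.head(3)                       # first 3 rows
last_rows = df.tail(3)                        # last 3 rows
random_rows = df.sample(n=3, random_state=0)  # 3 random rows, seeded so the result is repeatable

print(first_rows)
print(last_rows)
print(random_rows)
```

Looking at the head, the tail, and a random sample together reduces the chance of missing a problem that only appears in one part of the file.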

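Checking the data for problems

Check for missing values and outliers
How to check the data for problems
- Finding missing values: inspect individual records, use dedicated functions, use correlations
- Finding outliers: inspect individual records, use summary statistics, use visualization, use machine learning techniques
It is important to understand why and how the missing values and outliers occurred
Then decide how to handle them
- Handling missing values: single imputation, multiple imputation
- Handling outliers: remove, replace, keep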
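
The sketch below shows one possible way to find and handle missing values with pandas. The data is invented for illustration, and filling with the median or a placeholder label is only one form of single imputation, chosen here as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "servings": [10.0, np.nan, 30.0, 20.0],
    "continent": ["EU", None, "AS", None],
})

# Finding missing values: count them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Single imputation: numeric gaps get the column median,
# categorical gaps get a placeholder label.
filled = df.copy()
filled["servings"] = filled["servings"].fillna(filled["servings"].median())
filled["continent"] = filled["continent"].fillna("OT")

# Alternative: drop every row that has at least one missing value.
dropped = df.dropna()
print(filled)
print(dropped)
```

Which option is appropriate depends on why the values are missing, which is why that question comes first.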

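Checking the distribution of individual attribute values

Use appropriate summary statistics to understand the data
- Central tendency: mean, median, mode
- Dispersion: range, variance, standard deviation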
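
A small sketch of these summary statistics with pandas follows. The numbers in `prices` are made up; note that pandas computes the sample variance and standard deviation (dividing by n - 1) by default.

```python
import pandas as pd

# Hypothetical values chosen so the results are easy to check by hand.
prices = pd.Series([2, 4, 4, 5, 10])

center = {
    "mean": prices.mean(),
    "median": prices.median(),
    "mode": prices.mode().tolist(),   # mode() can return several values
}
spread = {
    "range": prices.max() - prices.min(),
    "variance": prices.var(),          # sample variance (ddof=1)
    "std": prices.std(),               # sample standard deviation (ddof=1)
}

print(center)
print(spread)
```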

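Using the interquartile range (IQR) method

Sort the data in ascending order and split it into four equal parts; the IQR is the difference between the value at the 75% point (the third quartile, Q3) and the value at the 25% point (the first quartile, Q1)
- Upper bound = Q3 + 1.5 * IQR
- Lower bound = Q1 - 1.5 * IQR
Values greater than the upper bound or smaller than the lower bound are treated as outliers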
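
The rule above can be sketched as follows. The series `values` is a made-up example, and the quartiles come from pandas' default (linearly interpolated) `quantile`, so other quartile conventions may give slightly different bounds.

```python
import pandas as pd

# Hypothetical data with one obviously extreme value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1

lower = q1 - 1.5 * iqr       # lower bound
upper = q3 + 1.5 * iqr       # upper bound

# Anything outside [lower, upper] is treated as an outlier.
outliers = values[(values < lower) | (values > upper)]
print(lower, upper)
print(outliers)
```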

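Checking the distribution of individual attribute values

Using the normal distribution
- Remove outliers using the mean and the variance
Use visualization to understand the individual attributes of the data
- Probability density function, histogram, box plot, scatter plot
- Word cloud, time series chart, map
Using machine learning techniques
- k-means clustering, etc.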
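
One common way to apply the mean-and-variance idea is a z-score filter, sketched below. The cutoff of 3 standard deviations is a conventional choice assumed here, not something stated in the material, and the data is synthetic.

```python
import numpy as np

# Synthetic data: 200 roughly normal values plus one extreme value at the end.
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=200), [120.0])

# z-score: how many standard deviations each value lies from the mean.
z = (values - values.mean()) / values.std()

threshold = 3                      # assumed cutoff
mask = np.abs(z) > threshold       # True where a value is flagged as an outlier
cleaned = values[~mask]            # data with flagged values removed

print("flagged:", values[mask])
print("remaining:", len(cleaned))
```

This approach assumes the data is approximately normal; for strongly skewed data the IQR rule is often the safer starting point.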

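Identifying relationships between attributes

Correlation analysis
- A method for analyzing whether two variables have a linear relationship
- If there is no linear relationship the variables are uncorrelated; if there is one, they are correlated
Simple correlation analysis: measures how strongly two variables are related
Multiple correlation analysis: measures the strength of the relationship among three or more variables
Basic assumptions of correlation analysis
- Linearity, homoscedasticity (equal variance), normality of both variables, random independent samples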
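
To make the idea of a linear relationship concrete, here is a small sketch that computes the Pearson correlation coefficient by hand and compares it with NumPy's result. The helper name `pearson_r` and the arrays are made up for this example.

```python
import numpy as np

# Hypothetical paired measurements.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]   # same quantity from NumPy

print(r_manual, r_numpy)

# A perfectly linear relationship gives a coefficient of 1 (up to rounding).
r_perfect = pearson_r(x, 2 * x + 1)
print(r_perfect)
```

A coefficient near 0 only says there is no linear relationship; a curved relationship can still be present, which is one reason to look at a scatter plot as well.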

1. Reading the Chipotle dataset

Download from Google Drive

import gdown

google_path = '###############'

file_id = '##############'

output_name = 'chipotle.tsv'

gdown.download(google_path+file_id,output_name,quiet=False)

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

chipo = pd.read_csv('./chipotle.tsv', sep = '\t')

chipo.head() # check that the file was read correctly and preview its contents

CSV files separate values with a comma (,)

TSV files separate values with a tab
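
The difference matters when calling `pd.read_csv`. The sketch below uses small in-memory strings (invented for this example) instead of real files to show what happens with and without `sep='\t'`.

```python
import io
import pandas as pd

# Hypothetical file contents: the same table as CSV and as TSV.
csv_text = "item,price\nChips,2.39\nIzze,3.39\n"
tsv_text = "item\tprice\nChips\t2.39\nIzze\t3.39\n"

from_csv = pd.read_csv(io.StringIO(csv_text))            # default separator is a comma
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tab separator for TSV

# Reading a TSV without sep="\t" puts every line into a single column.
wrong = pd.read_csv(io.StringIO(tsv_text))

print(from_csv.shape, from_tsv.shape, wrong.shape)
```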

chipo # display the whole DataFrame; notebooks render it readably even without calling info()

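2. Data analysis

2.1 Exploration: looking at basic information about the data

Basic information about the Chipotle dataset

print(chipo.shape) # shape: (number of rows, number of columns)

print(chipo.info()) # detailed information: dtypes and non-null counts

print(chipo.columns) # column labels

print(chipo.index) # index information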

Numeric characteristics of the Chipotle dataset

chipo['order_id'] = chipo['order_id'].astype(str) # order_id is an identifier with no numeric meaning, so convert it to str

print(chipo.head(10))

print(chipo.describe()) # summary statistics for the numeric features of the chipo DataFrame

print(len(chipo['order_id'].unique())) # number of distinct order_id values

print(len(chipo['item_name'].unique())) # number of distinct item_name values

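2.2 Finding insights: exploration and visualization

Most frequently ordered items

item_count = chipo['item_name'].value_counts()[:10] # top 10 items by row count; because items are ordered repeatedly, this can differ from other counts

print(item_count)

# Series.iteritems() was removed in pandas 2.0; items() works in both old and new versions
for idx, (val, cnt) in enumerate(item_count.items(), 1):
    print("Top", idx, ":", val, cnt)

chipo['item_name'].value_counts().index.tolist()[0] # convert the index to a list and take element 0 (the most frequent item)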

Order count and total quantity per item

# number of orders per item

order_count = chipo.groupby('item_name')['order_id'].count() # group the rows by item and count them

print(order_count)

order_count[:10]

# total quantity ordered per item

item_quantity = chipo.groupby('item_name')['quantity'].sum() # "per item", so group by item_name

print(item_quantity)

item_quantity[:10]

Reviewing the analysis results with a visualization

item_name_list = item_quantity.index.tolist()

x_pos = np.arange(len(item_name_list))

order_cnt = item_quantity.values.tolist()

plt.bar(x_pos, order_cnt, align='center')

plt.ylabel('ordered_item_count')

plt.title('Distribution of all ordered items')

plt.show()

2.3 Data preprocessing: define your own helper

Preprocessing data with apply and lambda functions

print(chipo.info())

chipo['item_price'].head()

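# item_price is read as a string because of the leading dollar sign, so it has to be converted to a number

# apply a preprocessing function to the whole column with apply, removing the dollar sign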

chipo['item_price'] = chipo['item_price'].apply(lambda x: float(x[1:]))

print(chipo['item_price'])

chipo['item_price'].head()

chipo.describe()
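
For reference, here is a standalone sketch of the same conversion on a toy Series (the values are made up), together with a vectorized alternative based on `str.replace` and `pd.to_numeric`. The alternative is offered as an option, not as the method used in this post.

```python
import pandas as pd

# Hypothetical price strings in the same "$<number>" form as item_price.
raw = pd.Series(["$2.39", "$3.39", "$16.98"])

# apply + lambda: drop the first character ("$") and convert to float.
as_float = raw.apply(lambda x: float(x[1:]))

# Vectorized alternative: remove the literal "$", then convert the strings to numbers.
vectorized = pd.to_numeric(raw.str.replace("$", "", regex=False))

print(as_float.tolist())
print(vectorized.tolist())
```

The lambda version assumes every value starts with exactly one symbol to strip; the `str.replace` version only assumes that "$" should be removed wherever it appears.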

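2.4 Exploratory analysis: conceptual exploration as a game of twenty questions

Practice exploratory data analysis with slightly more complex questions that help you understand the data
- Print the average bill per order
- Print the ids of orders that spent 10 dollars or more
- Find the price of each item
- Find how many items were sold in the most expensive order
- Find how many times 'Veggie Salad Bowl' was ordered
- Find the number of orders with 2 or more 'Chicken Bowl'

Printing the average bill per order
The question is per order, so group by order_id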

# average bill per order

chipo.groupby('order_id')['item_price'].sum().mean()

chipo.groupby('order_id')['item_price'].sum().describe()[:10] # check the summary statistics

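Printing the ids of orders that spent 10 dollars or more

# ids of orders that spent 10 dollars or more

chipo_orderid_group = chipo.groupby('order_id').sum(numeric_only=True) # sum only the numeric columns (quantity, item_price)

chipo_orderid_group

results = chipo_orderid_group[chipo_orderid_group.item_price >= 10] # keep orders whose total is 10 dollars or more

results

print(results[:10]) # first 10 rows only

print(results.index.values) # list only the order ids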

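Finding the price of each item

# compute the price of each item

chipo_one_item = chipo[chipo.quantity == 1] # keep only the rows where a single unit was ordered

print(chipo_one_item)

price_per_item = chipo_one_item.groupby('item_name').min(numeric_only=True) # minimum of the numeric columns per item

print(price_per_item)

Arithmetic between columns is applied element by element.

chipo["new_item_price"] = chipo["item_price"]/chipo["quantity"]

Take the minimum value for each item, then sort by that value

price_per_item = chipo.groupby('item_name').min(numeric_only=True)

price_per_item.sort_values(by=["item_price"], ascending=False)[:20]

Sort again, keep the top 10, and then draw the graph

price_per_item.sort_values(by=["item_price"], ascending=False)[:10]

price_per_item

For data this simple, a graph does little to reveal relationships, but it helps in other cases, so it is worth practicing once.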

# plot the distribution of item prices

item_name_list = price_per_item.index.tolist()

x_pos = np.arange(len(item_name_list))

item_price = price_per_item['item_price'].tolist() # convert the item prices to a list

plt.bar(x_pos, item_price, align='center')

plt.ylabel('item price($)')

plt.title('Distribution of item price')

plt.show()

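Drawing a histogram: plot the counts for the intervals (bins) we specify
A chart without intervals is a bar chart, not a histogram.

# histogram of item prices

plt.hist(item_price) # pass bins, e.g. plt.hist(item_price, bins=50), to split the range into 50 finer intervals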

plt.ylabel('counts')

plt.title('Histogram of item price')

plt.show()
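
To see what the intervals mean without drawing anything, the sketch below counts values per bin with `np.histogram`. The prices and bin edges are made up for illustration.

```python
import numpy as np

# Hypothetical item prices.
prices = [1.5, 2.0, 2.5, 4.0, 4.5, 8.0, 9.0, 9.5]

# Explicit bin edges: [0, 2.5), [2.5, 5), [5, 7.5), [7.5, 10]
# (every bin is half-open except the last, which includes its right edge).
counts, edges = np.histogram(prices, bins=[0, 2.5, 5, 7.5, 10])

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {count}")
```

`plt.hist` accepts the same `bins` argument, either as a number of equal-width bins or as a list of edges like the one above.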

Finding how many items were sold in the most expensive order
Group by order id

chipo.groupby('order_id').sum(numeric_only=True).sort_values(by='item_price', ascending=False)[:5]

Finding how many times 'Veggie Salad Bowl' was ordered

# count how many times "Veggie Salad Bowl" was ordered

chipo_salad = chipo[chipo['item_name'] == "Veggie Salad Bowl"] # the comparison is case-sensitive

print(chipo_salad)

print(len(chipo_salad)) # len gives the total number of rows

chipo_salad = chipo_salad.drop_duplicates(['item_name', 'order_id']) # drop item_name entries counted more than once within the same order

print(chipo_salad)

print(len(chipo_salad))
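
The effect of `drop_duplicates` here is the difference between counting rows and counting orders. A toy sketch with invented data:

```python
import pandas as pd

# Hypothetical orders: order "1" contains the salad on two separate rows.
orders = pd.DataFrame({
    "order_id": ["1", "1", "2", "3"],
    "item_name": ["Veggie Salad Bowl", "Veggie Salad Bowl", "Veggie Salad Bowl", "Chicken Bowl"],
    "quantity": [1, 1, 1, 2],
})

salad = orders[orders["item_name"] == "Veggie Salad Bowl"]

row_count = len(salad)                                              # rows containing the item
order_count = len(salad.drop_duplicates(["item_name", "order_id"])) # distinct orders containing it
order_count_alt = salad["order_id"].nunique()                       # same number, computed directly

print(row_count, order_count, order_count_alt)
```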

chipo_salad.head(5)

Finding the number of orders with 2 or more "Chicken Bowl"

# count the orders with 2 or more "Chicken Bowl"

chipo_chicken = chipo[chipo['item_name'] == "Chicken Bowl"]

print(chipo_chicken)

chipo_chicken_result = chipo_chicken[chipo_chicken['quantity'] >= 2]

print(chipo_chicken_result)

# total "Chicken Bowl" quantity per order, for customers who ordered 2 or more "Chicken Bowl"

chipo_chicken = chipo[chipo['item_name'] == "Chicken Bowl"]

print(chipo_chicken)

chipo_chicken_ordersum = chipo_chicken.groupby('order_id')['quantity'].sum() # select quantity first, then sum per order

print(chipo_chicken_ordersum)

chipo_chicken_result = chipo_chicken_ordersum[chipo_chicken_ordersum >= 2] # orders with a total of two or more

print(chipo_chicken_result)

print(len(chipo_chicken_result))

chipo_chicken_result.head(5)

2. Analyzing worldwide alcohol consumption data

1. Reading the drinks dataset

import gdown

google_path = '###############'

file_id = '##############'

output_name = 'drinks.csv'

gdown.download(google_path+file_id,output_name,quiet=False)

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

drinks = pd.read_csv('./drinks.csv')

drinks.head(10)

2. Data analysis

2.1 Exploration: looking at basic information about the data

print(drinks.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
#   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
0   country                       193 non-null    object
1   beer_servings                 193 non-null    int64
2   spirit_servings               193 non-null    int64
3   wine_servings                 193 non-null    int64
4   total_litres_of_pure_alcohol  193 non-null    float64
5   continent                     170 non-null    object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB
None

drinks.describe() # summary focused on the numeric columns

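2.2 Finding insights: exploration and visualization

Exploring correlations between features

Practice
Compute the correlation coefficient between the two features 'beer_servings' and 'wine_servings'
pearson is one of the methods for computing a correlation coefficient, and it is the most widely used one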

corr = drinks[['beer_servings', 'wine_servings']].corr(method = 'pearson')

corr

# correlation matrix between the features

cols = ['beer_servings', 'spirit_servings', 'wine_servings', 'total_litres_of_pure_alcohol']

corr = drinks[cols].corr(method = 'pearson')

corr

Visualizing the corr matrix as a heatmap

import seaborn as sns

cols_view = ['beer', 'spirit', 'wine', 'alcohol'] # shortened column names for the plot labels

sns.set(font_scale=1.5)

hm = sns.heatmap(corr.values,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f', # show annotations as floats with 2 decimal places
                 annot_kws={'size': 15},
                 yticklabels=cols_view,
                 xticklabels=cols_view)

plt.tight_layout()

plt.show()

# scatter plots between the features, drawn with the visualization library

sns.set(style='whitegrid', context='notebook')

sns.pairplot(drinks[['beer_servings', 'spirit_servings',
                     'wine_servings', 'total_litres_of_pure_alcohol']], height=2.5)

plt.show()

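2.3 Exploratory analysis: conceptual exploration as a game of twenty questions

Preprocessing missing data
- Handling the missing data in the continent column

print(drinks.isnull().sum()) # check for missing values

print(drinks.dtypes) # check the data types

Handling the missing values

# handle the missing data: merge it into an "other" continent -> 'OT'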

drinks['continent'] = drinks['continent'].fillna('OT')

drinks.head(10)

Pie chart visualization

labels = drinks['continent'].value_counts().index.tolist()

fracs1 = drinks['continent'].value_counts().values.tolist()

explode = (0, 0, 0, 0.25, 0, 0)

plt.pie(fracs1, explode=explode, labels=labels, autopct='%.0f%%', shadow=True)

plt.title('null data to \'OT\'')

plt.show()

Group-level data analysis: analysis by continent
- Analysis by continent using the apply and agg functions

# mean, min, max, and sum of spirit_servings by continent

result = drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max', 'sum'])

result.head()

# find the continents that consume more alcohol than the overall mean

total_mean = drinks.total_litres_of_pure_alcohol.mean()

print(total_mean)

continent_mean = drinks.groupby('continent')['total_litres_of_pure_alcohol'].mean()

print(continent_mean)

continent_over_mean = continent_mean[continent_mean >= total_mean]

print(continent_over_mean)

# find the continent with the highest mean beer_servings

beer_continent = drinks.groupby('continent').beer_servings.mean().idxmax()

print(beer_continent)

Visualizing the analysis results (bar chart, bar width 0.1)

# visualize the mean, min, max, and sum of spirit_servings by continent

n_groups = len(result.index)

means = result['mean'].tolist()

mins = result['min'].tolist()

maxs = result['max'].tolist()

sums = result['sum'].tolist()

index = np.arange(n_groups)

bar_width = 0.1

rects1 = plt.bar(index, means, bar_width,
                 color='r',
                 label='Mean')

rects2 = plt.bar(index + bar_width, mins, bar_width,
                 color='g',
                 label='Min')

rects3 = plt.bar(index + bar_width * 2, maxs, bar_width,
                 color='b',
                 label='Max')

rects4 = plt.bar(index + bar_width * 3, sums, bar_width,
                 color='y',
                 label='Sum')

plt.xticks(index, result.index.tolist())

plt.legend()

plt.show()

# visualize total_litres_of_pure_alcohol by continent

continents = continent_mean.index.tolist()

continents.append('mean')

x_pos = np.arange(len(continents))

alcohol = continent_mean.tolist()

alcohol.append(total_mean)

bar_list = plt.bar(x_pos, alcohol, align='center', alpha=0.5)

bar_list[len(continents) - 1].set_color('r')

plt.plot([0., 6], [total_mean, total_mean], "k--")

plt.xticks(x_pos, continents)

plt.ylabel('total_litres_of_pure_alcohol')

plt.title('total_litres_of_pure_alcohol by Continent')

plt.show()

# visualize beer_servings by continent

beer_group = drinks.groupby('continent')['beer_servings'].sum()

continents = beer_group.index.tolist()

y_pos = np.arange(len(continents))

alcohol = beer_group.tolist()

bar_list = plt.bar(y_pos, alcohol, align='center', alpha=0.5)

bar_list[continents.index("EU")].set_color('r')

plt.xticks(y_pos, continents)

plt.ylabel('beer_servings')

plt.title('beer_servings by Continent')

plt.show()

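2.4 Statistical analysis: testing for statistical differences between groups

Testing the difference in beer consumption between Africa and Europe

drinks['continent']=='AF'

# test the difference in beer consumption between Africa and Europe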

africa = drinks.loc[drinks['continent']=='AF']

europe = drinks.loc[drinks['continent']=='EU']

from scipy import stats

tTestResult = stats.ttest_ind(africa['beer_servings'],
                              europe['beer_servings'])

tTestResultDiffVar = stats.ttest_ind(africa['beer_servings'],
                                     europe['beer_servings'], equal_var=False)

# each result exposes the t-statistic as .statistic and the p-value as .pvalue

print("The t-statistic and p-value assuming equal variances is %.3f and %.3f."
      % (tTestResult.statistic, tTestResult.pvalue))

print("The t-statistic and p-value not assuming equal variances is %.3f and %.3f."
      % (tTestResultDiffVar.statistic, tTestResultDiffVar.pvalue))
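
To show how the two test variants are typically read, here is a self-contained sketch on synthetic data. The group means, spreads, sample sizes, and the 0.05 significance level are all assumptions made for this example and are not taken from the drinks dataset.

```python
import numpy as np
from scipy import stats

# Synthetic samples standing in for two groups with clearly different means.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=60, scale=80, size=50)
group_b = rng.normal(loc=190, scale=100, size=45)

# Student's t-test assumes equal variances; Welch's t-test (equal_var=False) does not.
student = stats.ttest_ind(group_a, group_b)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = welch.pvalue < alpha

print("Student: t=%.3f, p=%.3g" % (student.statistic, student.pvalue))
print("Welch:   t=%.3f, p=%.3g" % (welch.statistic, welch.pvalue))
print("Reject the null hypothesis of equal means:", reject_null)
```

When the two groups have visibly different spreads or sizes, the Welch variant is usually the more cautious choice.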

How strong is the alcohol that South Korea drinks?

# create the total_servings feature: the total amount consumed

drinks['total_servings'] = drinks['beer_servings'] + drinks['wine_servings'] + drinks['spirit_servings']

# create the alcohol_rate feature: pure alcohol relative to total servings

drinks['alcohol_rate'] = drinks['total_litres_of_pure_alcohol'] / drinks['total_servings']

drinks['alcohol_rate'] = drinks['alcohol_rate'].fillna(0)

# build the ranking

country_with_rank = drinks[['country', 'alcohol_rate']]

country_with_rank = country_with_rank.sort_values(by=['alcohol_rate'], ascending=False)

country_with_rank.head(5)

# visualize the ranking by country as a graph

country_list = country_with_rank.country.tolist()

x_pos = np.arange(len(country_list))

rank = country_with_rank.alcohol_rate.tolist()

bar_list = plt.bar(x_pos, rank)

bar_list[country_list.index("South Korea")].set_color('r')

plt.ylabel('alcohol rate')

plt.title('liquor drink rank by country')

plt.axis([0, 200, 0, 0.3])

korea_rank = country_list.index("South Korea")

korea_alc_rate = country_with_rank[country_with_rank['country'] ==
                                   'South Korea']['alcohol_rate'].values[0]

plt.annotate('South Korea : ' + str(korea_rank + 1),
             xy=(korea_rank, korea_alc_rate),
             xytext=(korea_rank + 10, korea_alc_rate + 0.05),
             arrowprops=dict(facecolor='red', shrink=0.05))

plt.show()

Astero Blog

#Python Basics Day 13
