

Breast Cancer Wisconsin (Diagnostic) Data Set


df.info()

Check the data format (column names, dtypes, and non-null counts).
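As a rough illustration of what that check shows, here is a minimal sketch on a made-up three-row frame. The values and the missing entry are invented for the example and are not taken from data.csv; only the column names mirror the real data.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for data.csv (values are made up)
df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, None],
})

# info() prints to stdout by default; a buffer lets us keep the text
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()
print(summary)

# Missing values per column, a common follow-up to info()
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Calling df.info() with no arguments prints the same summary directly, which is usually enough in a notebook.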

metrics.accuracy_score(y_test, pre_1) * 100

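Check performance by comparing the predictions against the test labels. accuracy_score takes the true labels first and the predictions second.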
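A small sketch of the accuracy calculation, using invented labels rather than the model's real output. The lists y_true and y_pred are assumptions made up for the example.

```python
from sklearn import metrics

# Made-up labels for illustration: 3 of the 4 predictions match
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# accuracy_score returns the fraction of matching labels
acc = metrics.accuracy_score(y_true, y_pred) * 100
print("%.2f" % acc)

# Accuracy only counts matches, so swapping the arguments gives the same number
swapped = metrics.accuracy_score(y_pred, y_true) * 100
```

Other metrics such as precision and recall are not symmetric, so keeping the true labels first is a safer habit.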

pd.Series -> a one-dimensional labeled array.

After passing in the values, you can assign your own index. sort_values sorts by value, not by index.
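A minimal sketch of those two points, with made-up numbers and hypothetical feature names (feat_a, feat_b, feat_c) that do not come from the dataset.

```python
import pandas as pd

# Values go in first; the default index is 0, 1, 2
scores = pd.Series([0.2, 0.5, 0.3])

# The index can be replaced afterwards with any labels of the same length
scores.index = ["feat_a", "feat_b", "feat_c"]

# sort_values orders by value; the labels travel with their values
ranked = scores.sort_values(ascending=False)
print(ranked)

# Slicing the sorted index gives the labels of the largest values
top2 = list(ranked.index[:2])
```

This is the same pattern the script uses to rank feature importances and pick the top five column names.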


import pandas as pd
import numpy as np
from sklearn import metrics


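train = pd.read_csv("data.csv")


# diagnosis: B = benign, M = malignant

# Drop the unused "Unnamed: 32" column (axis must be given by keyword in pandas 2.0+)
train = train.drop(columns=["Unnamed: 32"])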

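from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# LabelEncoder assigns codes in sorted order: B -> 0, M -> 1
train["diagnosis"] = le.fit_transform(train["diagnosis"])


# Target labels
y = train["diagnosis"]

# Remove the target and the id column from the features
train = train.drop(columns=["diagnosis", "id"])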



from sklearn.model_selection import train_test_split
# 80/20 split, stratified so both sets keep the same class ratio
x_train, x_test, y_train, y_test = train_test_split(train, y, test_size=0.2, random_state=2019, stratify=y)


from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)

rf.fit(x_train,y_train)
pre_1 = rf.predict(x_test)

# Accuracy (%) with all features; true labels go first
print("%.2f" % (metrics.accuracy_score(y_test, pre_1) * 100))

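# Rank features by importance, highest first
importance = pd.Series(rf.feature_importances_, index=train.columns).sort_values(ascending=False)

# Names of the five most important features
result = importance.index[:5]

# Refit on the top five features only and score again
rf.fit(x_train[result], y_train)
pre_1 = rf.predict(x_test[result])
print("%.2f" % (metrics.accuracy_score(y_test, pre_1) * 100))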
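For readers without data.csv at hand, the same all-features versus top-five comparison can be sketched with the copy of this dataset that ships with scikit-learn. This is a rough sketch under a few assumptions, not the script above: the helper fit_and_score and the random_state=0 on the forest are additions for repeatability, the column names differ from the Kaggle CSV, and scikit-learn encodes the target as 0 = malignant, 1 = benign, which is the reverse of the LabelEncoder mapping.

```python
import pandas as pd
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Built-in copy of the Wisconsin Diagnostic data (569 rows, 30 features)
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2019, stratify=y
)


def fit_and_score(columns):
    """Hypothetical helper: fit a forest on the given columns, return (model, accuracy %)."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(x_train[columns], y_train)
    predictions = model.predict(x_test[columns])
    return model, metrics.accuracy_score(y_test, predictions) * 100


# Baseline with every feature
full_model, full_acc = fit_and_score(list(X.columns))

# Rank features by importance and keep the five highest
importance = pd.Series(
    full_model.feature_importances_, index=X.columns
).sort_values(ascending=False)
top5 = list(importance.index[:5])

# Same model settings, restricted to the top five features
_, top5_acc = fit_and_score(top5)

print("all features: %.2f" % full_acc)
print("top 5 features: %.2f" % top5_acc)
```

Building a fresh model for each run keeps the two scores independent, whereas the script above reuses and refits the same rf object.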
