Clustering

ROSEV 2021. 7. 31. 20:28

Definition of Clustering

Clustering is a type of unsupervised learning algorithm.

Purpose of Clustering

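Because clustering is unsupervised learning with no answer key (labels), it is one way to see how similar data points are to one another.
It is also a way to summarize and organize data, but it has the drawback that it does not guarantee a correct answer.
For that reason it is used more often as an EDA technique than for actual prediction.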

Types of Clustering

Hierarchical

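Agglomerative: starts from individual points and progressively merges them into larger clusters
Divisive: starts from one large cluster and progressively splits it into smaller clusters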
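
The agglomerative direction can be sketched in a few lines of plain Python. This is only an illustration: the merge rule used here (single linkage, meaning the distance between the closest members of two clusters) and the helper name agglomerative are assumptions, since the post does not name a linkage rule.

```python
import math

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (single linkage, assumed)."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair of clusters found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members of the two clusters.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i, then drop cluster j.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0)]
merged = agglomerative(points, n_clusters=3)
```

Here the two tight pairs are merged first and the isolated point at (10.0, 0.0) stays alone. A divisive method would run in the opposite direction, starting from all five points in one cluster.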

Point Assignment

The number of clusters is fixed at the start, and then the data points are assigned to clusters one by one

Hard vs Soft Clustering

Hard clustering assigns each data point to exactly one cluster
Soft clustering assigns each data point to several clusters, each with a probability
Today, "clustering" generally refers to hard clustering.
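
The difference can be seen on a single point. The sketch below assumes two fixed centroids and turns distances into probabilities with a softmax over negative squared distance. That weighting is just one simple choice made for illustration (mixture models are another), and the function names are hypothetical.

```python
import math

def hard_assign(point, centroids):
    """Return the index of the single nearest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def soft_assign(point, centroids):
    """Return one membership probability per centroid.

    Assumed weighting: softmax over negative squared distance.
    """
    scores = [math.exp(-math.dist(point, c) ** 2) for c in centroids]
    total = sum(scores)
    return [s / total for s in scores]

centroids = [(0.0, 0.0), (2.0, 0.0)]
point = (0.5, 0.0)

hard = hard_assign(point, centroids)   # a single cluster index
soft = soft_assign(point, centroids)   # probabilities that sum to 1
```

The hard result names only the nearest cluster, while the soft result keeps a smaller but non-zero probability for the farther one.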

Similarity

Euclidean, cosine, Jaccard, edit distance, etc.

Euclidean
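
Euclidean distance is the straight-line distance between two points. As a rough sketch, it and the other measures listed above can be written in plain Python as follows. The function names are illustrative, and the edge cases noted in the comments (zero vectors, empty sets) are not handled.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (undefined for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_similarity(a, b):
    """Size of the intersection over size of the union (undefined if both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(euclidean_distance((0, 0), (3, 4)))         # 5.0
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))   # 0.5
print(edit_distance("kitten", "sitting"))         # 3
```

Euclidean and edit distance are distances (smaller means more similar), while cosine and Jaccard are similarities (larger means more similar).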

K-Means Clustering

The K-Means Clustering Process

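Place K centroids at random
Assign each data point to the cluster of its nearest centroid
Move each centroid to the mean of the data points assigned to it
Repeat the assignment and update steps until the centroids stop moving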
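
The process can be sketched from scratch in plain Python. This is a minimal illustration, not what scikit-learn does internally. Picking the initial centroids from the data points, the fixed seed, and the parameter names n_iter and seed are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means sketch: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Pick k distinct data points as the initial centroids (assumed initialization).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: give every point the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centroids, labels = kmeans(points, k=2)
```

Because the starting centroids are random, different seeds can lead to different final clusters on less cleanly separated data, which is one reason library implementations usually run several initializations.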

K-Means Clustering Practice (with Python)

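import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate 100 two-dimensional sample points around 3 centers
# (y holds the generated blob labels and is not used for clustering)
x, y = make_blobs(n_samples=100, centers=3, n_features=2)
# Put the two features into a DataFrame with columns "x" and "y"
df = pd.DataFrame(x, columns=["x", "y"])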

from sklearn.cluster import KMeans

# Run k-means clustering with 3 clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(df)

# Check the result: attach each point's cluster label
result = df.copy()
result["cluster"] = kmeans.labels_
result.head()

import seaborn as sns

# Plot the points, colored by their assigned cluster
sns.scatterplot(x="x", y="y", hue="cluster", data=result, palette="Set2")
plt.show()

How to Choose K in K-Means

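On the elbow plot, the number of clusters is chosen at the point where the curve bends sharply, like an elbow.
The y-axis is the sum of squared errors within each cluster, which measures the within-cluster variance.
In general, as k increases the data points lie closer to their centroids, so this variance keeps shrinking.
The Elbow method therefore looks for the point where the rate of decrease changes sharply and uses that k.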
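
To make the y-axis quantity concrete, here is a small sketch that computes the within-cluster sum of squared errors by hand for a few fixed labelings. The helper name within_cluster_sse and the four sample points are made up for illustration.

```python
def within_cluster_sse(points, labels):
    """Sum of squared distances from each point to the mean of its own cluster."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, label in zip(points, labels) if label == cluster]
        # The centroid is the per-dimension mean of the cluster's members.
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, centroid))
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

sse_k1 = within_cluster_sse(points, [0, 0, 0, 0])  # one cluster       -> 104.0
sse_k2 = within_cluster_sse(points, [0, 0, 1, 1])  # left vs right     -> 4.0
sse_k4 = within_cluster_sse(points, [0, 1, 2, 3])  # every point alone -> 0.0
```

The drop from 104.0 to 4.0 is large and the drop from 4.0 to 0.0 is small, so k = 2 is where the elbow would sit for this toy data.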

How to Choose K in K-Means: Practice (Python)

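# Fit k-means for each k and record the within-cluster sum of squares
sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df)
    # inertia_: sum of squared distances of samples to their closest centroid
    sum_of_squared_distances.append(km.inertia_)

# Plot the sum of squared distances against k and look for the elbow
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()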

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(1, 12))

visualizer.fit(x)        # Fit the data to the visualizer (x is the feature array from make_blobs)
visualizer.show()        # Draw the elbow plot

This is the same Elbow Method as above, but the plot also shows, in light green, how long the model took to fit for each k.
