Data science training in pune

3/22/2012

K-means Algorithm
Cluster Analysis in Data Mining

Presented by Zijun Zhang

Algorithm Description 

What is Cluster Analysis? Cluster analysis groups data objects based only on information found in data that describes the objects and their relationships.

Goal of Cluster Analysis: The objects within a group should be similar to one another and different from the objects in other groups.


Types of Clustering: Partitioning and Hierarchical Clustering

- Hierarchical Clustering: a set of nested clusters organized as a hierarchical tree
- Partitioning Clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

[Figure: points p1, p2, p3, and p4 shown as a partitional clustering and as a hierarchical clustering]

What is K-means?
1. Partitional clustering approach
2. Each cluster is associated with a centroid (center point)
3. Each point is assigned to the cluster with the closest centroid
4. Number of clusters, K, must be specified

Algorithm Statement 

Basic Algorithm of K-means
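
A minimal Python sketch of the basic loop: pick K initial centroids, assign each point to its closest centroid, recompute the centroids, and repeat until they stop changing. It assumes Euclidean distance and initial centroids sampled at random from the data. The names `kmeans`, `max_iter`, and `seed` are illustrative and do not come from the slides.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples into k groups (illustrative sketch)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points drawn from the data.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to the closest centroid.
        labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[i])
        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

On this toy data the first three points and the last three points end up in separate clusters; which cluster gets label 0 depends on the random start.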


Details of K-means

1. Initial centroids are often chosen randomly.
   - Clusters produced vary from one run to another.
2. The centroid is (typically) the mean of the points in the cluster.
3. ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
4. K-means will converge for the common similarity measures mentioned above.
5. Most of the convergence happens in the first few iterations.
   - Often the stopping condition is changed to ‘Until relatively few points change clusters’.

Euclidean Distance

A simple example: find the distance between two points, the origin and the point (3,4).
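
The example can be worked directly: the Euclidean distance is the square root of the sum of squared coordinate differences, so here it is sqrt(3^2 + 4^2) = 5. A short Python check, with variable names chosen only for illustration:

```python
import math

origin = (0.0, 0.0)
point = (3.0, 4.0)

# Euclidean distance: square root of the sum of squared coordinate differences.
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(origin, point)))
print(distance)  # 5.0
```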


Update Centroid: The n-dimensional centroid of k n-dimensional points is their coordinate-wise mean; each coordinate of the centroid is the sum of that coordinate over the k points, divided by k.

Example: find the centroid of three 2D points, (2,4), (5,2), and (8,9).
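
Worked out, the centroid of the three points is ((2+5+8)/3, (4+2+9)/3) = (5, 5). The same calculation as a small Python sketch, where the helper name `centroid` is an assumption for illustration:

```python
def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length tuples."""
    k = len(points)
    return tuple(sum(coords) / k for coords in zip(*points))


print(centroid([(2, 4), (5, 2), (8, 9)]))  # (5.0, 5.0)
```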

Example of K-means

Select three initial centroids.

[Figure: Iteration 1, scatter plot with x from -2 to 2 and y from 0 to 3]

Assign the points to the nearest of the K clusters and recompute the centroids.

[Figure: Iteration 3, scatter plot with x from -2 to 2 and y from 0 to 3]

K-means terminates since the centroids converge to certain points and do not change.

[Figure: Iteration 6, scatter plot with x from -2 to 2 and y from 0 to 3]

[Figure: six panels, Iterations 1 to 6, each a scatter plot with x from -2 to 2 and y from 0 to 3]

Demo of K-means

Evaluating K-means Clusters 

Most common measure is Sum of Squared Error (SSE).
- For each point, the error is the distance to the nearest cluster.
- To get SSE, we square these errors and sum them:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

- x is a data point in cluster C_i and m_i is the representative point for cluster C_i.
- One can show that m_i corresponds to the center (mean) of the cluster.
- Given two clusterings, we can choose the one with the smallest error.
- One easy way to reduce SSE is to increase K, the number of clusters.
- A good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
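
A minimal sketch of the SSE computation, assuming Euclidean distance and taking each cluster's mean as its representative point. The function name `sse` and the toy clusters are illustrative only.

```python
def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean.

    `clusters` is a list of clusters; each cluster is a list of tuples.
    """
    total = 0.0
    for cluster in clusters:
        # Representative point m_i: the coordinate-wise mean of the cluster.
        mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))
        for x in cluster:
            total += sum((a - b) ** 2 for a, b in zip(mean, x))
    return total


clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (10.0, 4.0)]]
print(sse(clusters))  # 10.0
```

The first cluster contributes 1 + 1 and the second 4 + 4, giving 10 in total.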

Problem of choosing K

How to choose K?
1. Use another clustering method, like EM.
2. Run the algorithm on data with several different values of K.
3. Use prior knowledge about the characteristics of the problem.
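
Option 2 can be sketched as follows: run K-means for several values of K and compare the resulting SSE, looking for the point where adding clusters stops helping much (often called the elbow heuristic). This is a hedged illustration on one-dimensional toy data; the function `kmeans_1d_sse` and its evenly spaced deterministic start are assumptions made to keep the example reproducible.

```python
def kmeans_1d_sse(values, k, max_iter=100):
    """Run a small 1-D k-means and return the final SSE (illustrative sketch)."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: deterministic start, centroids spread evenly over the sorted data.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Recompute each centroid; an empty group keeps its old centroid.
        updated = [
            sum(g) / len(g) if g else centroids[i] for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    # SSE: squared distance from each value to its nearest centroid.
    return sum(min((v - c) ** 2 for c in centroids) for v in ordered)


values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
sse_by_k = {k: kmeans_1d_sse(values, k) for k in range(1, 6)}
for k, error in sse_by_k.items():
    print(k, round(error, 2))
```

On this data, which has three well-separated groups, the SSE drops sharply up to K = 3 and only marginally afterwards.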


Problem of initializing centers

How to initialize centers?
- Random points in feature space
- Random points from the data set
- Look for dense regions of space
- Space them uniformly around the feature space
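
The first two strategies can be sketched in a few lines of Python. This is an illustration under stated assumptions: "feature space" is taken to mean the bounding box of the data, and the function names `init_from_data` and `init_in_feature_space` are hypothetical.

```python
import random


def init_from_data(points, k, rng):
    """Random points from the data set: sample k distinct data points."""
    return rng.sample(points, k)


def init_in_feature_space(points, k, rng):
    """Random points in feature space: uniform draws inside the data's bounding box."""
    lows = [min(coords) for coords in zip(*points)]
    highs = [max(coords) for coords in zip(*points)]
    return [
        tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        for _ in range(k)
    ]


rng = random.Random(42)
points = [(0.0, 0.0), (1.0, 3.0), (2.0, 1.0), (4.0, 2.0)]
print(init_from_data(points, 2, rng))
print(init_in_feature_space(points, 2, rng))
```

Sampling from the data guarantees every initial center sits on an actual point, while sampling the bounding box can place a center in an empty region.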

Cluster Quality


Limitation of K-means

K-means has problems when clusters are of differing
- Sizes
- Densities
- Non-globular shapes

K-means has problems when the data contains outliers.
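
The sensitivity to outliers follows from the centroid being a mean. A small illustration with made-up one-dimensional values shows a single outlier pulling the centroid far from the bulk of the cluster:

```python
def mean(values):
    """Arithmetic mean, i.e. the 1-D centroid of a cluster."""
    return sum(values) / len(values)


cluster = [1.0, 2.0, 3.0]
print(mean(cluster))            # 2.0
print(mean(cluster + [100.0]))  # 26.5
```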

[Figure: Original Points compared with K-means (3 Clusters)]

Application of K-means 

Image Segmentation The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are used to aid border detection and object recognition.


K-means in Wind Energy

Clustering can be applied to detect abnormality in wind data (abnormal vibration).
- Monitor wind turbine conditions
- Beneficial to preventative maintenance
- K-means can be more powerful and applicable after appropriate modifications

Modified K-means

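Clustering cost function

$$d(k, x, c) = \frac{1}{n} \sum_{i=1}^{k} \left( \sum_{x_j \in C_i} \lVert x_j - c_i \rVert^2 \right), \qquad n = \sum_{i=1}^{k} m_i$$

$$d(k, x, c) = \frac{1}{\sum_{i=1}^{k} m_i} \sum_{i=1}^{k} \left( \sum_{x_j \in C_i} \lVert x_j - c_i \rVert^2 \right)$$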
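
Reading the cost function as the total squared Euclidean distance from each point to its cluster center, divided by the total number of points n, a Python sketch could look like this. Taking m_i to be the number of points in cluster C_i and the norm to be Euclidean are interpretations of the slide's formula, and the name `clustering_cost` is hypothetical.

```python
def clustering_cost(clusters, centers):
    """Cost d(k, x, c): total squared distance to the centers, divided by n.

    `clusters[i]` lists the points assigned to `centers[i]`.
    """
    # Assumption: m_i is the size of cluster i, so n = m_1 + ... + m_k.
    n = sum(len(cluster) for cluster in clusters)
    total = 0.0
    for cluster, center in zip(clusters, centers):
        for x in cluster:
            # Squared Euclidean norm of x_j - c_i.
            total += sum((a - b) ** 2 for a, b in zip(x, center))
    return total / n


clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (10.0, 4.0)]]
centers = [(1.0, 0.0), (10.0, 2.0)]
print(clustering_cost(clusters, centers))  # 2.5
```

Dividing by n makes the cost an average per point, so values stay comparable across data sets of different sizes.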

Determination of k value

[Figure: cost of clustering (0 to 0.09) plotted against number of clusters (2 to 13)]

Summary of clustering result

No. of Cluster   c1 (Drive train acc.)   c2 (Wind speed)   Number of points   Percentage (%)
1                71.9612                 9.97514           313                8.75524
2                65.8387                 9.42031           295                8.25175
3                233.9184                9.57990           96                 2.68531
4                17.4187                 7.13375           240                6.71329
5                3.3706                  8.99211           437                12.22378
6                0.3741                  0.40378           217                6.06993
7                18.1361                 8.09900           410                11.46853
8                0.7684                  10.56663          419                11.72028
9                62.0493                 8.81445           283                7.91608
10               81.7522                 10.67867          181                5.06294
11               83.8067                 8.10663           101                2.82517
12               0.9283                  9.78571           583                16.30769
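
The Percentage column appears to be each cluster's share of the total number of points. A quick check in Python, using only the point counts from the summary above:

```python
counts = [313, 295, 96, 240, 437, 217, 410, 419, 283, 181, 101, 583]
total = sum(counts)
print(total)  # 3575

# Each cluster's share of all points, as a percentage.
for cluster_no, count in enumerate(counts, start=1):
    print(cluster_no, round(100 * count / total, 5))
```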

[Figure: Visualization of monitoring result]

[Figure: Visualization of vibration under normal condition, with wind speed (m/s, 0 to 14) on the vertical axis and drive train acceleration (0 to 140) on the horizontal axis]



Appendix One

[Figure: Original Points compared with K-means (2 Clusters)]

Appendix Two

[Figure: Original Points compared with K-means Clusters]

One solution is to use many clusters: K-means then finds parts of clusters, which need to be put together afterwards.
