machine learning - K-Means centroids getting marginalized to having no data points [Matlab] -
So I have a somewhat unusual problem. I have a dataset of 240 points, and I'm trying to use k-means to cluster it into 100 clusters. I'm using Matlab, but I don't have access to the Statistics Toolbox, so I had to write my own k-means function. It's pretty simple, so it shouldn't be that hard, right? Well, it seems something is wrong with my code:
    function result=kmeans(x,c)
    [N,n]=size(x);
    index=randperm(N);
    ctrs = x(index(1:c),:);
    old_label = zeros(1,N);
    label = ones(1,N);
    iter = 0;
    while ~isequal(old_label, label)
        old_label = label;
        label = assign_labels(x, ctrs);
        for i = 1:c
            ctrs(i,:) = mean(x(label == i,:));
            if sum(isnan(ctrs(i,:))) ~= 0
                ctrs(i,:) = zeros(1,n);
            end
        end
        iter = iter + 1;
    end
    result = ctrs;

    function label = assign_labels(x, ctrs)
    [N,~]=size(x);
    [c,~]=size(ctrs);
    dist = zeros(N,c);
    for i = 1:c
        dist(:,i) = sum((x - repmat(ctrs(i,:),[N,1])).^2,2);
    end
    [~,label] = min(dist,[],2);
What seems to happen is that when I go to recompute the centroids, some centroids have no data points assigned to them, and I'm not sure what to do about that. After doing some research on this, I found that this can happen if you supply arbitrary initial centroids, but in this case the initial centroids are taken from the data points themselves, so this doesn't make sense. I've tried re-assigning these centroids to random data points, but that causes the code to not converge (or at least, after letting it run all night, the code never converged). Basically they get re-assigned, but that causes other centroids to get marginalized, and the cycle repeats. I'm not sure what's wrong with my code, but I ran this same dataset through R's k-means function with k=100 for 1000 iterations and it managed to converge. Does anyone know what I'm messing up here? Thank you.
Let's step through your code one piece at a time and discuss what you're doing with respect to what I know about the k-means algorithm.
    function result=kmeans(x,c)
    [N,n]=size(x);
    index=randperm(N);
    ctrs = x(index(1:c),:);
    old_label = zeros(1,N);
    label = ones(1,N);
This looks like a function that takes in a data matrix of size N x n, where N is the number of points you have in your dataset, while n is the dimension of a point in your dataset. This function also takes in c: the desired number of output clusters. index provides a random permutation of the integers from 1 up to the number of data points you have, and you then select c points at random from this permutation, which you use to initialize your cluster centres.
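As a small illustration of this initialization step, here is a sketch on a made-up 6 x 2 data matrix. The toy values are an assumption for illustration only; the variable names follow the code in the question.

```matlab
% Hypothetical toy data: N = 6 points in n = 2 dimensions.
x = [1 1; 2 2; 3 3; 4 4; 5 5; 6 6];
c = 3;                      % desired number of clusters

[N, n] = size(x);           % N points, n dimensions
index = randperm(N);        % random ordering of the row indices 1..N
ctrs = x(index(1:c), :);    % the first c shuffled rows become the initial centres

% The c selected row indices are always distinct, although in other
% datasets two different rows could still hold identical values.
disp(size(ctrs));
```

Note that distinct row indices do not by themselves guarantee distinct centres if the dataset contains repeated points.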
    iter = 0;
    while ~isequal(old_label, label)
        old_label = label;
        label = assign_labels(x, ctrs);
        for i = 1:c
            ctrs(i,:) = mean(x(label == i,:));
            if sum(isnan(ctrs(i,:))) ~= 0
                ctrs(i,:) = zeros(1,n);
            end
        end
        iter = iter + 1;
    end
    result = ctrs;
For k-means, we keep iterating until the cluster membership of each point from the previous iteration matches the current iteration, which is what you have going with your while loop. Now, label determines the cluster membership of each point in your dataset. Then, for each cluster that exists, you determine what the mean data point is and assign this mean data point as the new cluster centre for that cluster. If, for some reason, you experience any NaN in any dimension of a cluster centre, you set the new cluster centre to all zeroes instead. This looks very abnormal to me, and I'll provide a suggestion later. Edit: Now I understand why you did this. Should you have any clusters that are empty, you set that cluster centre to all zeroes, because you wouldn't be able to find the mean of an empty cluster. This can be solved with my suggestion about duplicate initial clusters towards the end of this post.
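To see where the NaN comes from, here is a minimal sketch, assuming a made-up dataset in which one label is never used. The toy data and the names members and centre are hypothetical.

```matlab
% Hypothetical toy data: 4 points in 2 dimensions.
x = [0 0; 1 1; 2 2; 3 3];
label = [1; 1; 3; 3];            % no point carries label 2

members = x(label == 2, :);      % 0-by-2 empty matrix
centre = mean(members, 1);       % mean over zero rows: 0/0 in each column

disp(size(members));
disp(isnan(centre));
```

The mean of an empty selection is NaN in every dimension, which is what the isnan check in the loop is reacting to.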
    function label = assign_labels(x, ctrs)
    [N,~]=size(x);
    [c,~]=size(ctrs);
    dist = zeros(N,c);
    for i = 1:c
        dist(:,i) = sum((x - repmat(ctrs(i,:),[N,1])).^2,2);
    end
    [~,label] = min(dist,[],2);
This function takes in the dataset x and the current cluster centres for this iteration, and it should return a label list recording which cluster each point belongs to. This also looks right, because for each column of dist you are calculating the squared Euclidean distance between each point and one cluster centre, where the distances in the ith column are those to the ith cluster. One optimization trick you can use is to avoid repmat here and use bsxfun, which handles the replication internally. Therefore, do this instead:
    function label = assign_labels(x, ctrs)
    [N,~]=size(x);
    [c,~]=size(ctrs);
    dist = zeros(N,c);
    for i = 1:c
        dist(:,i) = sum(bsxfun(@minus, x, ctrs(i,:)).^2, 2);
    end
    [~,label] = min(dist,[],2);
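As a quick sanity check that the two forms compute the same thing, here is a sketch on made-up inputs. The toy values and the names ctr, d_repmat and d_bsxfun are hypothetical.

```matlab
% Hypothetical toy inputs: 3 points in 2 dimensions and one centre.
x = [1 2; 3 4; 5 6];
ctr = [1 1];
N = size(x, 1);

d_repmat = sum((x - repmat(ctr, [N, 1])).^2, 2);   % explicit replication
d_bsxfun = sum(bsxfun(@minus, x, ctr).^2, 2);      % replication handled internally

disp(isequal(d_repmat, d_bsxfun));
```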
Now, this all looks correct. I ran some tests myself and it all seems to work out, provided that the initial cluster centres are unique. One small problem with k-means is that we implicitly assume that all cluster centres are unique. Should they not be unique, you'll run into a problem where two (or more) clusters have the exact same initial cluster centres... so which cluster should a data point be assigned to? When you're doing the min in your assign_labels function, should you have two identical cluster centres, the cluster label the point gets assigned to will be the smaller of the two cluster indices, because min returns the first index on ties. This is why you end up with a cluster that has no points in it: all of the points that should have been assigned to this cluster get assigned to the other.
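The tie-breaking behaviour can be reproduced with a small sketch. The data and centres below are made up for illustration, with centres 1 and 2 deliberately identical.

```matlab
% Hypothetical toy data: 4 points, with centres 1 and 2 identical.
x = [0 0; 0 1; 5 5; 5 6];
ctrs = [0 0; 0 0; 5 5];

N = size(x, 1);
c = size(ctrs, 1);
dist = zeros(N, c);
for i = 1:c
    % squared Euclidean distance from every point to centre i
    dist(:, i) = sum(bsxfun(@minus, x, ctrs(i, :)).^2, 2);
end

% On ties, min reports the first index that attains the minimum,
% so cluster 2 never wins against the identical cluster 1.
[~, label] = min(dist, [], 2);
disp(label');
```

Here no point ever receives label 2, so the mean for cluster 2 would be taken over an empty set.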
As such, you may have two (or more) initial cluster centres that are the same upon randomization. Even though the permutation of indices you select from is unique, the actual data points themselves may not be unique upon selection. One thing you can do is loop over the permutation until you get a unique set of initial clusters without repeats. As such, try doing this at the beginning of your code instead.
    [N,n]=size(x);
    index=randperm(N);
    ctrs = x(index(1:c),:);
    while size(unique(ctrs, 'rows'), 1) ~= c
        index=randperm(N);
        ctrs = x(index(1:c),:);
    end
    old_label = zeros(1,N);
    label = ones(1,N);
    iter = 0;
    %// while loop appears here
This will ensure that you have a unique set of initial clusters before you go on in your code. Now, going back to your NaN stuff within the for loop. I honestly don't see how any dimension could result in NaN after you compute the mean if your data doesn't have any NaN to begin with. I would suggest you get rid of this in your code, as (to me) it doesn't look very useful. Edit: You can now remove the NaN check, as the initial cluster centres should now be unique.
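One possible alternative to the retry loop, sketched here as an assumption rather than as part of the suggestion above, is to draw the initial centres from the de-duplicated rows directly. The toy data and the name ux are hypothetical.

```matlab
% Hypothetical toy data containing repeated rows.
x = [1 1; 1 1; 2 2; 3 3; 3 3; 4 4];
c = 3;

ux = unique(x, 'rows');              % distinct data points only
if size(ux, 1) < c
    error('Fewer than c distinct points: cannot pick unique centres.');
end
index = randperm(size(ux, 1));       % shuffle the distinct rows
ctrs = ux(index(1:c), :);            % c centres, guaranteed distinct
disp(ctrs);
```

The explicit size check matters for either approach: if the data held fewer than c distinct points, a loop that retries until the centres are unique would never terminate.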
This should hopefully fix the problems you're experiencing. Good luck!
matlab machine-learning cluster-analysis k-means