Xiao Wu's Algorithm Learning Journey

Clustering Algorithms: K-means, C++ Implementation

Last updated: 2022-04-01 20:31:26

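Program flowchart:

![](https://docs.gechiui.com/gc-content/uploads/sites/kancloud/2016-04-21_57187d6e09c55.jpg)

The core K-means routine works as follows. First, K centroids are chosen at random. (A centroid's coordinates are the mean x and mean y of all points in its cluster; it only records a position and is not itself a member of the original dataset.) The main loop then checks whether the centroids have stopped changing. If they have, the 2D points and their cluster numbers are written to the clustering file and the program ends. Otherwise, each 2D data point is matched to its nearest centroid and the point's cluster number is set to that centroid's cluster number. The K centroids are then recomputed from the K clusters, and the loop test runs again. Whether the centroids have stopped changing can be judged from the SSE (sum of squared errors), as the code below does, or the loop can simply be run a fixed number of times (not recommended).

~~~
/*
	K-means Algorithm
	15S103182
	Ethan
*/
#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <cmath>
#include <cstdlib>
#include <ctime>
using namespace std;
/* run this program using the console pauser or add your own getch, system("pause") or input loop */
typedef struct Point{
	float x;
	float y;
	int cluster;
	Point (){}
	Point (float a,float b,int c){
		x = a;
		y = b;
		cluster = c;
	}
}point;

float stringToFloat(string i){
	stringstream sf;
	float score=0;
	sf<<i;
	sf>>score;
	return score;
}
//read one "x,y" record per line
vector<point> openFile(const char* dataset){
	fstream file;
	file.open(dataset,ios::in);
	vector<point> data;
	string temp;
	//test the read itself: looping on !file.eof() appends a bogus last point
	while(file>>temp){
		int split = temp.find(',',0);
		point p(stringToFloat(temp.substr(0,split)),stringToFloat(temp.substr(split+1)),0);
		data.push_back(p);
	}
	file.close();
	return data;
}
float squareDistance(point a,point b){
	return (a.x-b.x)*(a.x-b.x)+(a.y-b.y)*(a.y-b.y);
}
void k_means(vector<point> dataset,int k){
	vector<point> centroid;
	int n=1;
	int len = dataset.size();
	if(len==0||k<=0) return;
	srand((int)time(0));
	//random select centroids
	while(n<=k){
		int cen = rand()%len; //RAND_MAX+1 can overflow int
		point cp(dataset[cen].x,dataset[cen].y,n);
		centroid.push_back(cp);
		n++;
	}
	//oSSE starts at least 1 away from nSSE so the first pass always runs
	float oSSE = -1;
	float nSSE = 0;
	while(fabs(oSSE-nSSE)>=1){
//	while(time){ //alternative: fixed iteration counter (not recommended)
		oSSE = nSSE;
		nSSE = 0;
		//update cluster for all the points
		for(int i=0;i<len;i++){
			int best = 0;
			for(int j=1;j<k;j++){
				if(squareDistance(dataset[i],centroid[j])<squareDistance(dataset[i],centroid[best])) best = j;
			}
			dataset[i].cluster = centroid[best].cluster;
		}
		//update cluster centroids
		vector<int> count(k,0);
		vector<point> sum(k,point(0,0,0));
		for(int i=0;i<len;i++){
			int c = dataset[i].cluster-1;
			sum[c].x += dataset[i].x;
			sum[c].y += dataset[i].y;
			count[c]++;
		}
		for(int i=0;i<k;i++){
			if(count[i]==0) continue; //empty cluster: keep the old centroid
			centroid[i].x = sum[i].x/count[i];
			centroid[i].y = sum[i].y/count[i];
		}
		//SSE
		for(int i=0;i<len;i++){
			nSSE += squareDistance(dataset[i],centroid[dataset[i].cluster-1]);
		}
	}
	//write (x,y,cluster) records; the output file name is assumed
	fstream clustering;
	clustering.open("clustering.txt",ios::out);
	for(int i=0;i<len;i++){
		clustering<<dataset[i].x<<","<<dataset[i].y<<","<<dataset[i].cluster<<"\n";
	}
	clustering.close();
}
int main(int argc, char** argv) {
	vector<point> dataset = openFile("dataset3.txt");
	k_means(dataset,7);
	return 0;
}
~~~

Data file format: (x,y). Output format: (x,y,cluster). For the exact file formats, see the DBSCAN post: http://blog.csdn.net/k76853/article/details/50440182

Visualization:

![](https://docs.gechiui.com/gc-content/uploads/sites/kancloud/2016-04-21_57187d6e26247.jpg)

Summary: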
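K-means runs fast and is simple to implement. However, it has limitations on data whose clusters vary in size, vary in density, or are not roughly circular in shape. One workaround is to increase K, that is, to use more clusters, so that the structure of the data shows up more clearly. As for choosing the initial centroids, random selection may fail to produce a good clustering; in that case bisecting K-means or hierarchical clustering can be used instead.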
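
To make the bisecting K-means idea concrete, here is a minimal sketch, not the author's implementation. It starts with all points in one cluster and repeatedly splits the cluster with the largest SSE in two until K clusters exist. The names `Pt`, `Cluster`, `bisect` and `bisectingKMeans` are hypothetical, and two choices are assumptions of this sketch: each split is seeded deterministically with a farthest-point heuristic rather than at random, and the cluster to split is picked by largest SSE (other criteria, such as largest size, are also used).

~~~
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Pt { double x, y; };          // hypothetical point type for this sketch
using Cluster = std::vector<Pt>;

double sqDist(const Pt& a, const Pt& b) {
    return (a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y);
}

// Mean of a non-empty cluster.
Pt mean(const Cluster& c) {
    assert(!c.empty());
    Pt m{0.0, 0.0};
    for (const Pt& p : c) { m.x += p.x; m.y += p.y; }
    m.x /= c.size();
    m.y /= c.size();
    return m;
}

// Sum of squared distances from every point to the cluster mean.
double sse(const Cluster& c) {
    if (c.empty()) return 0.0;
    const Pt m = mean(c);
    double s = 0.0;
    for (const Pt& p : c) s += sqDist(p, m);
    return s;
}

// Index of the point in c that is farthest from ref (first one on ties).
std::size_t farthest(const Cluster& c, const Pt& ref) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < c.size(); ++i)
        if (sqDist(c[i], ref) > sqDist(c[best], ref)) best = i;
    return best;
}

// Split one cluster in two with plain 2-means.
// Assumption: seeds come from a deterministic farthest-point heuristic.
std::pair<Cluster, Cluster> bisect(const Cluster& c, int maxIter = 100) {
    Pt a = c[farthest(c, mean(c))];
    Pt b = c[farthest(c, a)];
    Cluster left, right;
    for (int it = 0; it < maxIter; ++it) {
        left.clear();
        right.clear();
        for (const Pt& p : c)
            (sqDist(p, a) <= sqDist(p, b) ? left : right).push_back(p);
        if (left.empty() || right.empty()) break;   // cannot be split
        const Pt na = mean(left), nb = mean(right);
        if (sqDist(na, a) == 0.0 && sqDist(nb, b) == 0.0) break;  // centroids stopped moving
        a = na;
        b = nb;
    }
    return {left, right};
}

// Repeatedly split the cluster with the largest SSE until k clusters exist.
std::vector<Cluster> bisectingKMeans(const Cluster& data, std::size_t k) {
    std::vector<Cluster> clusters{data};
    while (clusters.size() < k) {
        std::size_t worst = 0;
        for (std::size_t i = 1; i < clusters.size(); ++i)
            if (sse(clusters[i]) > sse(clusters[worst])) worst = i;
        if (clusters[worst].size() < 2) break;      // nothing left to split
        std::pair<Cluster, Cluster> halves = bisect(clusters[worst]);
        if (halves.first.empty() || halves.second.empty()) break;
        clusters[worst] = halves.first;
        clusters.push_back(halves.second);
    }
    return clusters;
}
~~~

Calling `bisectingKMeans(data, 3)` on three well-separated groups of points returns three `Cluster` vectors. Because every split is a 2-means problem, a bad split affects only one cluster, which makes the result less sensitive to initialization than choosing all K centroids at random. Splits are never undone, though, so an early cut through a natural group is not repaired later.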