{"id":657,"date":"2023-11-01T16:51:24","date_gmt":"2023-11-01T08:51:24","guid":{"rendered":"https:\/\/www.wennroy.com\/?p=657"},"modified":"2023-11-01T16:51:25","modified_gmt":"2023-11-01T08:51:25","slug":"clustering-is-always-complicated","status":"publish","type":"post","link":"https:\/\/wennroy.com\/index.php\/2023\/11\/01\/clustering-is-always-complicated\/","title":{"rendered":"Clustering is always Complicated"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Most of this post is drawn from <a href=\"https:\/\/towardsdatascience.com\/tuning-with-hdbscan-149865ac2970\">How to tune with hdbscan<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>A Brief Introduction<\/strong><\/h2>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Clustering is &#8220;strongly dependent on contexts, aims and decisions of the researcher&#8221; which adds fire to the argument that there is no such thing as a &#8220;universally optimal method that will just produce natural clusters&#8221;<\/p><cite>Hennig in&nbsp;<a rel=\"noreferrer noopener\" href=\"https:\/\/arxiv.org\/abs\/1502.02555\" target=\"_blank\">What Are True Clusters? 
Hennig 2015<\/a><\/cite><\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">Clustering takes a lot of trial and tuning in practice, and what counts as a good result depends on what the clustering is meant to achieve. A clustering that works on one dataset will not necessarily work on another. Clustering algorithms are also subject to the <a rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/No_free_lunch_in_search_and_optimization\" target=\"_blank\">no free lunch<\/a> theorem, so it matters to be clear about the assumptions and reasoning behind a clustering.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Some Mistakes I Ran Into Before<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Generally, we judge how good a clustering is in one of two ways. One is to compute the <strong>WCSS<\/strong> (<strong>within cluster sum of squares<\/strong>); the other is to compute the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Silhouette_(clustering)\">Silhouette Score<\/a> (BTW, <a href=\"https:\/\/www.youtube.com\/watch?v=zVgKnfN9i34\" data-type=\"URL\" data-id=\"https:\/\/www.youtube.com\/watch?v=zVgKnfN9i34\">Silhouette &#8211; KANA BOON<\/a> really is a great song). Both are metrics for showing how good a clustering is, though we must keep in mind that good or bad by these measures only holds for a particular dataset and purpose.<\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\">The mistake I ran into before was mainly using the Silhouette Score to evaluate density-based clustering methods such as HDBSCAN, along with other density-based methods such as OPTICS and DENCLUE. In fact, both metrics measure the cohesiveness and separation of clusters, and both are computed entirely from distances. That does not work for density-based methods. Density-based clustering does not rely so heavily on distance, and judging its results by distance alone overlooks two things. First, points treated as noise are not taken into account. Second, a point may sit in a non-convex cluster whose shape is irregular, so the within-cluster cohesion looks poor and the score comes out poor.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A supplement: how to compute WCSS (From GPT4)<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import numpy as np\n# WCSS based on Euclidean distance; assumes labels in cluster_assignment run 0..k-1 and centroids[i] is the centroid of label i\nwcss = sum(np.sum(np.linalg.norm(data[cluster_assignment == i] - centroid, axis=1) ** 2) for i, centroid in enumerate(centroids))<\/pre>\n\n\n\n<h2 
class=\"wp-block-heading\">Validation for Density-Based Clustering<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The authors of the <a href=\"https:\/\/www.dbs.ifi.lmu.de\/~zimek\/publications\/SDM2014\/DBCV.pdf\">DBCV (Density-Based Clustering Validation)<\/a> paper propose the Validity Index of a Clustering to address this problem.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">$$ DBCV(C) = \\sum_{i=1}^{l}\\frac{|C_i|}{|O|}V_C(C_i)$$<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here, for a clustering solution $C = \\{C_i, 1\\leq i\\leq l\\}$, the index is the weighted average of the Validity Index $V_C(C_i)$ of all clusters in $C$, with each cluster weighted by its share $|C_i|\/|O|$ of the objects; per the paper, $|O|$ counts all objects under evaluation, noise included. See the definition section of the paper for the details.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Applications of the Density-Based Clustering Validity Index<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">First, there is a package developed by Amazon for clustering mixed Categorical+Numerical data, named <a href=\"https:\/\/github.com\/awslabs\/amazon-denseclus\">Amazon-DenseClus<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Second, the index can be used for Hyperparameter Tuning; the detailed results and code can be found in the original post.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Most of this post is drawn from How to tune wit 
&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_import_markdown_pro_load_document_selector":0,"_import_markdown_pro_submit_text_textarea":"","footnotes":""},"categories":[63,41,46],"tags":[64,51],"class_list":["post-657","post","type-post","status-publish","format-standard","hentry","category-cluster","category-python","category-statistics","tag-cluster","tag-python"],"_links":{"self":[{"href":"https:\/\/wennroy.com\/index.php\/wp-json\/wp\/v2\/posts\/657","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wennroy.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wennroy.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wennroy.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wennroy.com\/index.php\/wp-json\/wp\/v2\/comments?post=657"}],"version-history":[{"count":8,"href":"https:\/\/wennroy.com\/index.php\/wp-json\/wp\/v2\/posts\/657\/revisions"}],"predecessor-version":[{"id":665,"href":"https:\/\/wennroy.com\/index.php\/wp-json\/wp\/v2\/posts\/657\/revisions\/665"}],"wp:attachment":[{"href":"https:\/\/wennroy.com\/index.php\/wp-json\/wp\/v2\/media?parent=657"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wennroy.com\/index.php\/wp-json\/wp\/v2\/categories?post=657"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wennroy.com\/index.php\/wp-json\/wp\/v2\/tags?post=657"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}