difference between correlation and independence, k-means clustering pros and cons
Sigiloso
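correlation: built on covariance, cov(x, y) = E[xy] - E[x]E[y] (correlation is this divided by std(x) * std(y)). It can be zero even if the variables are not independent, because it only measures linear dependence. You can usually set up a tricky rv that satisfies this: take x uniform on {-1, 0, 1} and y = x^2. Then E[xy] = E[x^3] = 0 and E[x] = 0, so cov = 0, yet y is completely determined by x.

independence: p(x, y) = p(x)p(y). This automatically implies zero covariance (plug it in: E[xy] = E[x]E[y]), as long as the expectations exist. So independent implies uncorrelated, but not the other way around.

k-means pros: good when you know the number of distinct clusters and there is not too much overlap between them. Assigning a point at run time is pretty fast: just compare it to each centroid, O(num_means * num_dimension). The result is interpretable, since each cluster is summarized by its centroid. Variants such as k-medoids can use custom distance functions, though standard k-means assumes squared Euclidean distance.

k-means cons: needs a meaningful distance function, so it is hard when features are on differing magnitudes (scale or standardize them first). It has to be trained, and training is always an approximation, because finding the optimal solution is NP-hard. The usual iterative training (Lloyd's algorithm) does converge, but only to a local optimum, not necessarily the global one (and with a custom distance even that convergence is not guaranteed), so bad initial points can make the clusters bad. It is hard to tell how many clusters is sufficient. It cannot model complex cluster shapes (think clusters of concentric rings).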
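Two of the points above are easy to check by hand, so here is a minimal sketch in plain Python. The random variable (x uniform on {-1, 0, 1}, y = x^2), the helper names `expect` and `nearest_centroid`, and the sample centroids are all assumptions chosen for illustration, not part of the original notes. The first half shows covariance exactly zero while the independence condition p(x, y) = p(x)p(y) fails; the second half shows the run-time assignment step of k-means, which only compares a point to each centroid.

```python
from fractions import Fraction

# Assumed example: x is uniform on {-1, 0, 1} and y = x**2.
xs = [-1, 0, 1]
p = Fraction(1, 3)  # probability of each value of x


def expect(f):
    """Exact expectation of f(x, y) under the joint distribution of (x, x**2)."""
    return sum(p * f(x, x * x) for x in xs)


# cov(x, y) = E[xy] - E[x]E[y]
cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Independence would require p(x=0, y=0) == p(x=0) * p(y=0).
p_joint = p        # x = 0 forces y = 0, so p(x=0, y=0) = 1/3
p_product = p * p  # p(x=0) = 1/3 and p(y=0) = 1/3, product = 1/9


def nearest_centroid(point, centroids):
    """Index of the closest centroid, O(num_means * num_dimension)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


# Hypothetical centroids, as if training had already finished.
centroids = [(0.0, 0.0), (10.0, 10.0)]
label = nearest_centroid((9.0, 8.5), centroids)
```

Here `cov` comes out exactly zero while `p_joint` and `p_product` differ, which is the "uncorrelated but dependent" case. `Fraction` is used only to keep the arithmetic exact; floats would work too, up to rounding.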