Welcome to the k-means algorithm documentation!
Hands-on Job Submission:
To make this example work, we first need to set up a virtual environment and install the following:
virtualenv $HOME/myenv
source $HOME/myenv/bin/activate
Install Radical-Pilot API:
pip install radical.pilot
Install MongoDB (only if you want to run locally):
Linux Users:
apt-get -y install scons libssl-dev libboost-filesystem-dev libboost-program-options-dev libboost-system-dev libboost-thread-dev
git clone -b r2.6.3 https://github.com/mongodb/mongo.git
cd mongo
scons --64 --ssl all
scons --64 --ssl --prefix=/usr install
Mac Users:
brew install mongodb
mkdir -p /data/db
chmod 755 /data/db
mongod
Finally, download the source files of the k-means algorithm:
curl -O https://raw.githubusercontent.com/georgeha/k-means-algorithm/master/k-means.py
curl -O https://raw.githubusercontent.com/georgeha/k-means-algorithm/master/clustering_the_elements.py
curl -O https://raw.githubusercontent.com/georgeha/k-means-algorithm/master/finding_new_centroids.py
curl -O https://raw.githubusercontent.com/georgeha/k-means-algorithm/master/dataset4.data
Run the Code:
To give it a test drive, run the following from the command line:
python k-means.py 3
where 3 is the number of clusters the user wants to create.
More About this algorithm:
This algorithm clusters the elements found in the dataset2.data file. You can supply your own file or generate a new dataset file using the following generator:
curl -O https://raw.githubusercontent.com/georgeha/k-means-algorithm/master/creating_dataset.py
Run it from the command line:
python creating_dataset.py (number_of_elements)
The algorithm reads the elements from the dataset2.data file and chooses the first k centroids using the quickselect algorithm. It then splits the initial file into number_of_cores files and passes each file as an argument to a Compute Unit. Every Compute Unit writes its elements into the matching centroid_cu_k.data file, and these files are then merged into k centroid_k.data files. To find the new centroids, we pass each centroid_k file as an argument to a Compute Unit, and each Compute Unit calculates the new centroid for its file. If the centroids have converged, the algorithm stops; otherwise it starts a new iteration.
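The iteration described above can be sketched in plain Python. This is a minimal illustration, not the code in k-means.py: it assumes the elements are one-dimensional numbers compared by absolute distance, it keeps everything in in-memory lists instead of data files and Compute Units, and it takes the initial centroids as an argument rather than reproducing the quickselect step. All function names and the tol and max_iter parameters are hypothetical.

```python
def split_into_chunks(elements, number_of_cores):
    """Split the elements into number_of_cores nearly equal chunks
    (stands in for splitting the initial file)."""
    size, extra = divmod(len(elements), number_of_cores)
    chunks, start = [], 0
    for i in range(number_of_cores):
        end = start + size + (1 if i < extra else 0)
        chunks.append(elements[start:end])
        start = end
    return chunks


def assign_chunk(chunk, centroids):
    """Work of one clustering task: group a chunk's elements by nearest centroid."""
    groups = [[] for _ in centroids]
    for x in chunk:
        nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
        groups[nearest].append(x)
    return groups


def merge_groups(per_chunk_groups, k):
    """Compose the per-chunk groups into k clusters."""
    merged = [[] for _ in range(k)]
    for groups in per_chunk_groups:
        for j in range(k):
            merged[j].extend(groups[j])
    return merged


def new_centroid(cluster, old_centroid):
    """Work of one centroid task: the mean of a cluster.
    An empty cluster keeps its old centroid (an assumption of this sketch)."""
    return sum(cluster) / len(cluster) if cluster else old_centroid


def kmeans(elements, initial_centroids, number_of_cores=2, tol=1e-9, max_iter=100):
    """Repeat assign / merge / update until the centroids stop moving."""
    centroids = list(initial_centroids)
    k = len(centroids)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        chunks = split_into_chunks(elements, number_of_cores)
        per_chunk = [assign_chunk(chunk, centroids) for chunk in chunks]
        clusters = merge_groups(per_chunk, k)
        updated = [new_centroid(clusters[j], centroids[j]) for j in range(k)]
        if all(abs(a - b) <= tol for a, b in zip(updated, centroids)):
            return updated, clusters  # convergence: stop the algorithm
        centroids = updated  # otherwise start a new iteration
    return centroids, clusters


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]
    centroids, clusters = kmeans(data, [1.0, 10.0, 20.0], number_of_cores=3)
    print(centroids)  # [2.0, 11.0, 21.0]
```

In this sketch the calls to assign_chunk and new_centroid are the parts that would run as separate Compute Units; here they run sequentially so the example stays self-contained.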