Clustering database

This page is not fully up to date. If you want to use this feature, contact us and we will update the page to match the current operation of the program.

This is an unsupervised classification scheme that uses a k-means algorithm.

You can also do an unsupervised classification of satellite imagery using the same algorithm.
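As a rough illustration of what a k-means classification does, here is a minimal Python sketch of the standard (Lloyd's) iteration. It is not the program's Delphi code; the function name kmeans, its parameters, and the random choice of starting centers are assumptions made for this example only.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Standard k-means on a list of equal-length numeric tuples.

    Returns (labels, centers); labels run from 1 to k.
    Hypothetical helper for illustration, not the program's routine.
    """
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen records.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each record joins the nearest center
        # (squared Euclidean distance over all selected fields).
        new_labels = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            new_labels.append(dists.index(min(dists)) + 1)
        if new_labels == labels:
            break  # no record changed cluster, so the result is stable
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j + 1]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Two well separated groups of records with two fields each.
records = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(records, k=2)
print(labels)
```

The first three records end up in one cluster and the last three in the other; which group receives number 1 depends on the starting centers.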

Select the fields for clustering; all fields that are not hidden will be used.  You should at least remove LAT, LONG, and MASK.

If the fields differ greatly in their range and variability, you could normalize them first (option on the "EDIT" button of the database table form). Otherwise distances will largely depend on the variable with the largest range.
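To show why normalization matters, here is a small Python sketch that rescales each field to mean 0 and standard deviation 1 (a z-score). This is only one common choice; the program's normalization option may use a different formula. The field names and values are hypothetical.

```python
import statistics

def zscore(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # a constant field carries no information
    return [(v - mean) / sd for v in values]

# Hypothetical fields with very different ranges.
elev = [100.0, 900.0, 1700.0]   # meters
slope = [0.02, 0.30, 0.04]      # fraction

# Squared distance between the first two records, before normalizing:
# the elevation term (640000) swamps the slope term (about 0.08).
raw = (elev[0] - elev[1]) ** 2 + (slope[0] - slope[1]) ** 2

elev_z = zscore(elev)
slope_z = zscore(slope)
# After normalizing, both fields contribute on a comparable scale.
scaled = (elev_z[0] - elev_z[1]) ** 2 + (slope_z[0] - slope_z[1]) ** 2
print(raw, scaled)
```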

Select clustering options:
  A "CLUSTER" field will be added to the database if it does not already exist.  Its values will be the cluster number, from 1 to the maximum you specified.  Map the results with the "Color code by DB field" option on the "Plot" button menu of the database table display window.
You can get 2D scatterplots, color coded by assigned clusters.

The fields are arrayed in the rows and columns.  Each graph is displayed twice, but the ability to look at each variable on the same axis helps in the interpretation.

This can be slow with a large number of clusters.

If you have masked the database and created a MASK field (for example, with the irregular area option on the Map query button of the database table display), you can get a scatterplot colored from the MASK field.  The MASK points will be colored red, and the other points gray.
Histograms, with each cluster a separate color.
Histograms, with the MASK points in a different color from the non-mask points.

 

If you have more than 5 selected parameters, you will have an option to search for the best set of 5 parameters.
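The number of possible 5-parameter sets grows quickly with the number of selected fields. The Python sketch below simply enumerates them, assuming the search considers every subset of five (this page does not say how the program actually searches). The field names are hypothetical.

```python
from itertools import combinations
from math import comb

# Hypothetical list of seven selected fields.
fields = ["ELEV", "SLOPE", "ASPECT", "RELIEF", "CURV", "ROUGH", "OPEN"]

# Every way of choosing 5 of the 7 fields, ignoring order.
subsets = list(combinations(fields, 5))
print(len(subsets))   # 21
print(comb(10, 5))    # 252 sets for ten fields
```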

If you select the option, you will get no graphical output; instead you will get a text file, which you should save and import into something like Excel for analysis.

The columns will be:

  • Five parameters used.
  • The number of clusters found by the algorithm.
  • The number of clusters which contain points in the MASK.
  • The number of MASK points in the cluster with the most MASK points.
  • The percentage of MASK points in the cluster with the largest number of MASK points.  The higher this value, the better the cluster did at isolating MASK points from others.  It is (Number of MASK Points in Cluster) / (Total points in the cluster).
  • The percentage of all MASK points located in this cluster.  The higher this value, the better the cluster did at combining all the MASK points in a single cluster.  It is (Number of MASK Points in Cluster) / (All MASK points).

The "best" set of parameters would produce high values in the last two columns.  You can sort by column in Excel (ensure that you expand the selection to sort the entire data set, and not just the single column).
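The summary columns described above can be reproduced from the cluster numbers and the MASK flags. The Python sketch below is an illustration only; the function name mask_summary, the dictionary keys, and the True/False encoding of MASK are assumptions, not the program's output format.

```python
from collections import Counter

def mask_summary(labels, in_mask):
    """Summarize how well a clustering isolates the MASK records.

    labels  : cluster number for each record
    in_mask : True/False for each record (hypothetical encoding of MASK)
    """
    cluster_sizes = Counter(labels)
    mask_counts = Counter(lab for lab, m in zip(labels, in_mask) if m)
    if not mask_counts:
        raise ValueError("no MASK records")
    # Cluster holding the most MASK points, and how many it holds.
    best, n_mask_best = mask_counts.most_common(1)[0]
    total_mask = sum(mask_counts.values())
    return {
        "clusters_found": len(cluster_sizes),
        "clusters_with_mask": len(mask_counts),
        "mask_in_best": n_mask_best,
        # (MASK points in cluster) / (total points in the cluster)
        "pct_of_cluster": 100.0 * n_mask_best / cluster_sizes[best],
        # (MASK points in cluster) / (all MASK points)
        "pct_of_mask": 100.0 * n_mask_best / total_mask,
    }

labels = [1, 1, 1, 1, 1, 2, 2, 3, 3]
in_mask = [True, True, True, False, False, True, False, False, False]
print(mask_summary(labels, in_mask))
```

In this made-up example, cluster 1 holds 3 of the 4 MASK points (75%), and MASK points make up 3 of its 5 members (60%).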

 

The Delphi clustering code is from Fred Edberg, 11/30/02 (fedberg@teleport.com), and uses k-means clustering.


Last revision 6/13/2015