java.lang.Object
org.apache.lucene.sandbox.codecs.quantization.KMeans
KMeans clustering algorithm for vectors
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static enum | KMeans.KmeansInitializationMethod | Kmeans initialization methods
static final record | KMeans.Results | Results of KMeans clustering
-
Field Summary
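Fields
Modifier and Type | Field | Description
static final int | DEFAULT_ITRS |
static final int | DEFAULT_RESTARTS |
static final int | DEFAULT_SAMPLE_SIZE |
private final KMeans.KmeansInitializationMethod | initializationMethod |
private final int | iters |
static final int | MAX_NUM_CENTROIDS |
private final int | numCentroids |
private final int | numVectors |
private final Random | random |
private final int | restarts |
private final FloatVectorValues | vectors |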
-
Constructor Summary
Constructors
Modifier | Constructor | Description
private | KMeans(FloatVectorValues vectors, int numCentroids, Random random, KMeans.KmeansInitializationMethod initializationMethod, int restarts, int iters) |
-
Method Summary
Modifier and Type | Method | Description
(package private) static void | assignCentroids(FloatVectorValues vectors, float[][] centroids, List<Integer> unassignedCentroidsIdxs) | For centroids that did not get any points, assign outlying points to them, choosing points by descending distance to the current centroid set
static KMeans.Results | cluster(FloatVectorValues vectors, int numClusters, boolean assignCentroidsToVectors, long seed, KMeans.KmeansInitializationMethod initializationMethod, boolean normalizeCenters, int restarts, int iters, int sampleSize) | Expert: Cluster vectors into a given number of clusters
static KMeans.Results | cluster(FloatVectorValues vectors, VectorSimilarityFunction similarityFunction, int numClusters) | Cluster vectors into a given number of clusters
private float[][] | computeCentroids(boolean normalizeCenters) |
private float[][] | initializeForgy | Initialize centroids using Forgy method: randomly select numCentroids vectors for initial centroids
private float[][] | initializePlusPlus | Initialize centroids using Kmeans++ method
private float[][] | initializeReservoirSampling | Initialize centroids using a reservoir sampling method
private static double | runKMeansStep(FloatVectorValues vectors, float[][] centroids, short[] docCentroids, boolean useKahanSummation, boolean normalizeCentroids) | Run kmeans step
-
Field Details
-
MAX_NUM_CENTROIDS
public static final int MAX_NUM_CENTROIDS
-
DEFAULT_RESTARTS
public static final int DEFAULT_RESTARTS
-
DEFAULT_ITRS
public static final int DEFAULT_ITRS
-
DEFAULT_SAMPLE_SIZE
public static final int DEFAULT_SAMPLE_SIZE
-
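vectors
private final FloatVectorValues vectors
-
numVectors
private final int numVectors
-
numCentroids
private final int numCentroids
-
random
private final Random random
-
initializationMethod
private final KMeans.KmeansInitializationMethod initializationMethod
-
restarts
private final int restarts
-
iters
private final int iters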
-
-
Constructor Details
-
KMeans
private KMeans(FloatVectorValues vectors, int numCentroids, Random random, KMeans.KmeansInitializationMethod initializationMethod, int restarts, int iters)
-
-
Method Details
-
cluster
public static KMeans.Results cluster(FloatVectorValues vectors, VectorSimilarityFunction similarityFunction, int numClusters) throws IOException
Cluster vectors into a given number of clusters
Parameters:
vectors - float vectors
similarityFunction - vector similarity function. For COSINE similarity, vectors must be normalized.
numClusters - number of clusters to group the vectors into
Returns:
results of clustering: the produced centroids and, for each vector, its centroid
Throws:
IOException - if there is an error accessing vectors
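Because COSINE similarity expects normalized input, callers may need to scale each vector to unit length before clustering. The following is a minimal sketch on plain float[] arrays; UnitNormalizeSketch and l2Normalize are hypothetical names for illustration and are not part of Lucene.

```java
final class UnitNormalizeSketch {
  /** Scales v in place to unit L2 norm; an all-zero vector is left unchanged. */
  static void l2Normalize(float[] v) {
    double sumSquares = 0;
    for (float x : v) {
      sumSquares += (double) x * x;
    }
    if (sumSquares == 0) {
      return; // a zero vector has no direction to preserve
    }
    float inverseNorm = (float) (1.0 / Math.sqrt(sumSquares));
    for (int i = 0; i < v.length; i++) {
      v[i] *= inverseNorm;
    }
  }
}
```

The same scaling applied to centroids after each update is what turns ordinary k-means into the spherical variant mentioned for normalizeCenters.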
-
cluster
public static KMeans.Results cluster(FloatVectorValues vectors, int numClusters, boolean assignCentroidsToVectors, long seed, KMeans.KmeansInitializationMethod initializationMethod, boolean normalizeCenters, int restarts, int iters, int sampleSize) throws IOException
Expert: Cluster vectors into a given number of clusters
Parameters:
vectors - float vectors
numClusters - number of clusters to group the vectors into
assignCentroidsToVectors - if true, assign centroids for all vectors. Centroids are computed on a sample of vectors. If this parameter is true, the results also report, for every vector, which centroid it belongs to.
seed - random seed
initializationMethod - Kmeans initialization method
normalizeCenters - for cosine distance, set to true to use spherical k-means, where centers are normalized
restarts - how many times to run the Kmeans algorithm
iters - how many iterations to perform within a single run
sampleSize - size of the sample drawn from all vectors on which to run the Kmeans algorithm
Returns:
results of clustering: the produced centroids and, if assignCentroidsToVectors == true, also the centroid of each vector
Throws:
IOException - if there is an error accessing vectors
-
computeCentroids
private float[][] computeCentroids(boolean normalizeCenters) throws IOException
Throws:
IOException
-
initializeForgy
Initialize centroids using Forgy method: randomly select numCentroids vectors for initial centroids
Throws:
IOException
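The Forgy method described above can be sketched on plain float[][] data as follows. This is an illustration under stated assumptions, not Lucene's implementation: ForgyInitSketch and initialize are hypothetical names, and the sketch assumes the chosen vectors must be distinct.

```java
import java.util.Random;

final class ForgyInitSketch {
  /** Copies k distinct, uniformly chosen input vectors to serve as initial centroids. */
  static float[][] initialize(float[][] vectors, int k, Random random) {
    int n = vectors.length;
    if (k < 0 || k > n) {
      throw new IllegalArgumentException("k must be between 0 and " + n);
    }
    int[] order = new int[n];
    for (int i = 0; i < n; i++) {
      order[i] = i;
    }
    float[][] centroids = new float[k][];
    // Partial Fisher-Yates shuffle: after step i, order[0..i] is a uniform sample without repeats.
    for (int i = 0; i < k; i++) {
      int j = i + random.nextInt(n - i);
      int tmp = order[i];
      order[i] = order[j];
      order[j] = tmp;
      centroids[i] = vectors[order[i]].clone();
    }
    return centroids;
  }
}
```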
-
initializeReservoirSampling
Initialize centroids using a reservoir sampling method
Throws:
IOException
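Reservoir sampling selects k items uniformly in a single pass, which suits vectors that are read one at a time. The sketch below shows the classic variant (Algorithm R) over an Iterator<float[]>; ReservoirInitSketch and initialize are hypothetical names, and whether Lucene uses this exact variant is an assumption.

```java
import java.util.Iterator;
import java.util.Random;

final class ReservoirInitSketch {
  /** Returns k vectors sampled uniformly without replacement from a single pass over the input. */
  static float[][] initialize(Iterator<float[]> vectors, int k, Random random) {
    float[][] reservoir = new float[k][];
    int seen = 0;
    while (vectors.hasNext()) {
      float[] v = vectors.next();
      if (seen < k) {
        reservoir[seen] = v.clone(); // fill the reservoir with the first k vectors
      } else {
        // Keep the new vector with probability k / (seen + 1), evicting a random slot.
        int j = random.nextInt(seen + 1);
        if (j < k) {
          reservoir[j] = v.clone();
        }
      }
      seen++;
    }
    if (seen < k) {
      throw new IllegalArgumentException("fewer than k vectors were supplied");
    }
    return reservoir;
  }
}
```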
-
initializePlusPlus
Initialize centroids using Kmeans++ method
Throws:
IOException
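In the standard Kmeans++ scheme, the first centroid is drawn uniformly and each later one is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The sketch below illustrates that scheme on plain arrays; PlusPlusInitSketch and its methods are hypothetical names, and the details may differ from Lucene's implementation.

```java
import java.util.Arrays;
import java.util.Random;

final class PlusPlusInitSketch {
  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Picks k initial centroids, favoring points far from the centroids already chosen. */
  static float[][] initialize(float[][] vectors, int k, Random random) {
    int n = vectors.length;
    if (k < 1 || k > n) {
      throw new IllegalArgumentException("k must be between 1 and " + n);
    }
    float[][] centroids = new float[k][];
    centroids[0] = vectors[random.nextInt(n)].clone();
    double[] minDist = new double[n]; // squared distance to the nearest chosen centroid
    Arrays.fill(minDist, Double.POSITIVE_INFINITY);
    for (int c = 1; c < k; c++) {
      double total = 0;
      for (int i = 0; i < n; i++) {
        minDist[i] = Math.min(minDist[i], squaredDistance(vectors[i], centroids[c - 1]));
        total += minDist[i];
      }
      int chosen;
      if (total == 0) {
        chosen = random.nextInt(n); // every point coincides with a centroid
      } else {
        double r = random.nextDouble() * total;
        double cumulative = 0;
        chosen = -1;
        for (int i = 0; i < n; i++) {
          if (minDist[i] == 0) {
            continue; // zero weight: never selected
          }
          cumulative += minDist[i];
          chosen = i; // last eligible point doubles as a rounding fallback
          if (cumulative > r) {
            break;
          }
        }
      }
      centroids[c] = vectors[chosen].clone();
    }
    return centroids;
  }
}
```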
-
runKMeansStep
private static double runKMeansStep(FloatVectorValues vectors, float[][] centroids, short[] docCentroids, boolean useKahanSummation, boolean normalizeCentroids) throws IOException
Run kmeans step
Parameters:
vectors - float vectors
centroids - centroids; newly calculated centroids are written here
docCentroids - for each document, the centroid it belongs to; results are written here
useKahanSummation - for large datasets, use Kahan summation to calculate centroids, since plain summation can easily reach the limits of float precision
normalizeCentroids - whether centroids should be normalized; used for cosine similarity only
Throws:
IOException - if there is an error accessing vector values
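A single k-means step with Kahan-compensated sums might look like the sketch below, written against plain arrays rather than FloatVectorValues. Several points are assumptions: KMeansStepSketch is a hypothetical class, the returned double is taken to be the sum of squared distances to the assigned centroids (the documentation above does not say what it represents), Kahan summation is always on, and an empty centroid is simply left where it was.

```java
final class KMeansStepSketch {
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Assigns each vector to its nearest centroid, then overwrites centroids with cluster means. */
  static double step(float[][] vectors, float[][] centroids, short[] assignments) {
    int k = centroids.length;
    int dim = centroids[0].length;
    float[][] sums = new float[k][dim];
    float[][] compensation = new float[k][dim]; // running low-order error per component
    int[] counts = new int[k];
    double totalSquaredDistance = 0;
    for (int i = 0; i < vectors.length; i++) {
      int best = 0;
      float bestDist = Float.MAX_VALUE;
      for (int c = 0; c < k; c++) {
        float dist = squaredDistance(vectors[i], centroids[c]);
        if (dist < bestDist) {
          bestDist = dist;
          best = c;
        }
      }
      assignments[i] = (short) best; // short mirrors the docCentroids parameter type
      totalSquaredDistance += bestDist;
      counts[best]++;
      for (int j = 0; j < dim; j++) {
        // Kahan summation: re-inject the rounding error lost by the previous addition.
        float y = vectors[i][j] - compensation[best][j];
        float t = sums[best][j] + y;
        compensation[best][j] = (t - sums[best][j]) - y;
        sums[best][j] = t;
      }
    }
    for (int c = 0; c < k; c++) {
      if (counts[c] == 0) {
        continue; // assumption: empty centroids are handled separately
      }
      for (int j = 0; j < dim; j++) {
        centroids[c][j] = sums[c][j] / counts[c];
      }
    }
    return totalSquaredDistance;
  }
}
```

Repeating this step until the assignments stop changing, or for a fixed number of iterations, gives one run of the algorithm.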
-
assignCentroids
static void assignCentroids(FloatVectorValues vectors, float[][] centroids, List<Integer> unassignedCentroidsIdxs) throws IOException
For centroids that did not get any points, assign outlying points to them, choosing points by descending distance to the current centroid set
Throws:
IOException
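One way to realize the behavior described above is sketched here on plain arrays: each empty centroid is moved onto the point that lies farthest from every non-empty centroid, taking points in descending order of that distance. EmptyCentroidSketch and reassignEmpty are hypothetical names, and how Lucene measures the distance to the current centroid set is an assumption.

```java
import java.util.List;

final class EmptyCentroidSketch {
  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Moves each centroid listed in unassigned onto the most outlying point not yet used. */
  static void reassignEmpty(float[][] vectors, float[][] centroids, List<Integer> unassigned) {
    boolean[] isEmpty = new boolean[centroids.length];
    for (int c : unassigned) {
      isEmpty[c] = true;
    }
    // Squared distance from each point to its nearest non-empty centroid.
    double[] nearest = new double[vectors.length];
    for (int i = 0; i < vectors.length; i++) {
      double min = Double.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) {
        if (!isEmpty[c]) {
          min = Math.min(min, squaredDistance(vectors[i], centroids[c]));
        }
      }
      nearest[i] = min;
    }
    for (int c : unassigned) {
      int farthest = -1;
      for (int i = 0; i < vectors.length; i++) {
        if (nearest[i] >= 0 && (farthest == -1 || nearest[i] > nearest[farthest])) {
          farthest = i;
        }
      }
      if (farthest == -1) {
        return; // more empty centroids than available points
      }
      centroids[c] = vectors[farthest].clone();
      nearest[farthest] = -1; // mark the point as used
    }
  }
}
```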
-