Package 'ktaucenters' reference manual

Title:	Robust Clustering Procedures
Description:	A clustering algorithm similar to K-Means is implemented, it has two main advantages, namely (a) The estimator is resistant to outliers, that means that results of estimator are still correct when there are atypical values in the sample and (b) The estimator is efficient, roughly speaking, if there are no outliers in the sample, results will be similar to those obtained by a classic algorithm (K-Means). Clustering procedure is carried out by minimizing the overall robust scale so-called tau scale. (see Gonzalez, Yohai and Zamar (2019) <arxiv:1906.08198>).
Authors:	Juan Domingo Gonzalez [cre, aut], Victor J. Yohai [aut], Ruben H. Zamar [aut], Douglas Alberto Carmona Guanipa [aut]
Maintainer:	Juan Domingo Gonzalez <[email protected]>
License:	GPL (>= 2)
Version:	1.0.0
Built:	2025-02-14 04:26:47 UTC
Source:	https://github.com/jdgonzalezwork/ktaucenters

Distance Matrix Computation

Description

Computes and returns the distance matrix using euclidean distance measure to compute the distances between the rows of a data matrix.

Usage

.distance(x)
.distance(x)

Arguments

`x`	a numeric matrix.

Value

A numeric matrix with the distances between the rows of a matrix.

Flag outliers

Description

Flag outliers

Usage

.flag_outliers(cutoff, b, ktau)
.flag_outliers(cutoff, b, ktau)

Arguments

`cutoff`	quantile of chi-square to be used as a threshold for outliers detection.
`b`	break down point.
`ktau`	ktaucenters results.

Value

Numeric vector with the weight factor for each observation

Robust Clustering algorithm based on centers, a robust and efficient version of kmeans.

Description

Robust Clustering algorithm based on centers, a robust and efficient version of kmeans.

Usage

.ktaucenters_run(x, centers, tolerance, max_iter)
.ktaucenters_run(x, centers, tolerance, max_iter)

Arguments

`x`	numeric matrix of size n x p with all observations.
`centers`	numeric matrix with initial cluster centers.
`tolerance`	maximum difference between current and new computed clusters. Parameter used for the algorithm stopping rule.
`max_iter`	a maximum number of iterations used for the algorithm stopping rule.

Value

A list with the following components:

`tau`	$\tau$ scale value.
`iter`	number of iterations until convergence is achieved or maximum number of iteration is reached.
`di`	distance of each observation to its nearest cluster center.
`centers`	numeric matrix of size K x p, with the estimated K centers.
`clusters`	integer vector of size n with the cluster location for each observation.

References

[1] Gonzalez, J. D., Yohai, V. J., & Zamar, R. H. (2019). Robust Clustering Using Tau-Scales. arXiv preprint arXiv:1906.08198.

[2] Maronna, R. A. and Yohai, V. J. (2017). Robust and efficient estimation of multivariate scatter and location.Computational Statistics &Data Analysis, 109 : 64–75.

Second derivative of the quasi $\rho$ function

Description

Second derivative of the quasi $\rho$ function

Usage

derpsiOpt(x, cc)
derpsiOpt(x, cc)

Arguments

`x`	numeric vector with positive values.
`cc`	tunning constant.

Value

Numeric vector with the second derivative of the quasi optimal $\rho$ computation for each element of x.

improvedktaucenters

Description

Robust Clustering algorithm for non-spherical data. This function estimate clusters taking into account that clusters may have different size, volume or orientation.

Usage

improvedktaucenters(X, K, cutoff = 0.999, nstart = 5, INITcenters = NULL)
improvedktaucenters(X, K, cutoff = 0.999, nstart = 5, INITcenters = NULL)

Arguments

`X`	numeric matrix of size n x p.
`K`	number of clusters.
`cutoff`	argument for outliers detection - quantiles of chi-square to be used as a threshold for outliers detection, defaults to 0.999.
`nstart`	number of trials that the base ktaucenters is run at the first stage. If it is greater than 1 and center is not set as NULL, a random set of (distinct) rows in x is chosen as the initial centres for each trial.
`INITcenters`	numeric matrix of size K x p indicating the initial centers for that clusters and robust covariance matrices will be computed, if it is set as NULL the algorithm will compute from ktaucenters routine. Set to NULL by default.

Value

A list with the following components:

`centers`	: Matrix of size K x p, with the estimated K centers.
`cluster`	: A vector of integer (from 1:k) indicating the cluster to which each point is allocated.
`sigmas`	: A list containing the k covariance matrices found by the procedure at its second step.
`outliers`	: indices observation that can be considered as outliers.

References

Gonzalez, J. D., Yohai, V. J., & Zamar, R. H. (2019). Robust Clustering Using Tau-Scales. arXiv preprint arXiv:1906.08198.

Examples


# Generate synthetic data (three normal cluster in two dimensions)
# Clusters have different shapes and orientation.
# The data is contaminated uniformly (level 20%).

# Generates base clusters
set.seed(1)
Z1 <- c(rnorm(100, 0), rnorm(100, 0), rnorm(100, 0))
Z2 <- rnorm(300)
X <- matrix(0, ncol = 2, nrow = 300)
X[, 1] <- Z1
X[, 2] <- Z2
true.cluster <- c(rep(1, 100), rep(2, 100), rep(3, 100))

# Rotate, expand and translate base clusters
theta <- pi/3
aux1 <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), nrow = 2)
aux2 <- sqrt(4) * diag(c(1, 1/4))
B <- aux1 %*% aux2 %*% t(aux1)
X[true.cluster == 3, ] <-
  X[true.cluster == 3, ] %*% aux2 %*% aux1 + matrix(c(5, 2),
                                                  byrow = TRUE,
                                                  nrow = 100,
                                                  ncol = 2)
X[true.cluster == 2, 2] <- X[true.cluster == 2, 2] * 5
X[true.cluster == 1, 2] <- X[true.cluster == 1, 2] * 0.1
X[true.cluster == 1, ] <- X[true.cluster == 1, ] + matrix(c(-5, -1),
                                                          byrow = TRUE,
                                                          nrow = 100,
                                                          ncol = 2)

# Generate 60 synthetic outliers (contamination level 20%)

outliers <- sample(1:300, 60)
X[outliers, ] <- matrix(runif( 40, 2 * min(X), 2 * max(X) ),
                                ncol = 2, nrow = 60)

# Applying the algorithm
robust <- improvedktaucenters(X, K = 3, cutoff = 0.999)

# Plotting results
oldpar <- par(mfrow = c(2, 1))
plot(X, main = "Actual clusters")
for (j in 1:3){
 points(X[true.cluster == j, ], pch = 19, col = j + 1)
}
points(X[outliers, ], pch = 19, col = 1)
plot(X, main = "Clusters estimation")
for (j in 1:3){
 points(X[robust$cluster == j,], pch = 19, col = j + 1)
}
points(X[robust$outliers, ], pch = 19)

par(oldpar)
# Generate synthetic data (three normal cluster in two dimensions)
# Clusters have different shapes and orientation.
# The data is contaminated uniformly (level 20%).

# Generates base clusters
set.seed(1)
Z1 <- c(rnorm(100, 0), rnorm(100, 0), rnorm(100, 0))
Z2 <- rnorm(300)
X <- matrix(0, ncol = 2, nrow = 300)
X[, 1] <- Z1
X[, 2] <- Z2
true.cluster <- c(rep(1, 100), rep(2, 100), rep(3, 100))

# Rotate, expand and translate base clusters
theta <- pi/3
aux1 <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), nrow = 2)
aux2 <- sqrt(4) * diag(c(1, 1/4))
B <- aux1 %*% aux2 %*% t(aux1)
X[true.cluster == 3, ] <-
  X[true.cluster == 3, ] %*% aux2 %*% aux1 + matrix(c(5, 2),
                                                  byrow = TRUE,
                                                  nrow = 100,
                                                  ncol = 2)
X[true.cluster == 2, 2] <- X[true.cluster == 2, 2] * 5
X[true.cluster == 1, 2] <- X[true.cluster == 1, 2] * 0.1
X[true.cluster == 1, ] <- X[true.cluster == 1, ] + matrix(c(-5, -1),
                                                          byrow = TRUE,
                                                          nrow = 100,
                                                          ncol = 2)

# Generate 60 synthetic outliers (contamination level 20%)

outliers <- sample(1:300, 60)
X[outliers, ] <- matrix(runif( 40, 2 * min(X), 2 * max(X) ),
                                ncol = 2, nrow = 60)

# Applying the algorithm
robust <- improvedktaucenters(X, K = 3, cutoff = 0.999)

# Plotting results
oldpar <- par(mfrow = c(2, 1))
plot(X, main = "Actual clusters")
for (j in 1:3){
 points(X[true.cluster == j, ], pch = 19, col = j + 1)
}
points(X[outliers, ], pch = 19, col = 1)
plot(X, main = "Clusters estimation")
for (j in 1:3){
 points(X[robust$cluster == j,], pch = 19, col = j + 1)
}
points(X[robust$outliers, ], pch = 19)

par(oldpar)

ktaucenters

Description

Robust and efficient version of Kmeans algorithm for clustering based on centers.

Usage

ktaucenters(
  X,
  K,
  centers = NULL,
  tolmin = 1e-06,
  NiterMax = 100,
  nstart = 1,
  startWithKmeans = TRUE,
  startWithROBINPD = TRUE,
  cutoff = 0.999
)
ktaucenters(
  X,
  K,
  centers = NULL,
  tolmin = 1e-06,
  NiterMax = 100,
  nstart = 1,
  startWithKmeans = TRUE,
  startWithROBINPD = TRUE,
  cutoff = 0.999
)

Arguments

`X`	numeric matrix of size n x p.
`K`	number of clusters.
`centers`	a matrix of size K x p containing the K initial centers, one at each matrix-row. If centers is NULL a random set of (distinct) rows in `X` are chosen as the initial centers.
`tolmin`	a tolerance parameter used for the algorithm stopping rule.
`NiterMax`	a maximum number of iterations used for the algorithm stopping rule.
`nstart`	the number of trials that the base algorithm is run. If it is greater than 1 and centers is not set as NULL, a random set of (distinct) rows in `X` will be chosen as the initial centers.
`startWithKmeans`	if positive (or true) kmeans estimated centers are included as starting point.
`startWithROBINPD`	if positive (or true) ROBINDEN estimated centers are included as starting point.
`cutoff`	optional argument for outliers detection - quantiles of chi-square to be used as a threshold for outliers detection, defaults to 0.999.

Value

A list with the following components:

`centers`	: Matrix of size K x p with the estimated K centers.
`cluster`	: A vector of integer (from 1:K) indicating the cluster to which each point is allocated.
`iter`	: Number of iterations until convergence is achieved or maximum number of iterations reached.
`di`	: Distance of each observation to its assigned cluster-center.
`outliers`	: A vector of integers with indices for each observation considered as outlier.

References

Gonzalez, J. D., Yohai, V. J., & Zamar, R. H. (2019). Robust Clustering Using Tau-Scales. arXiv preprint arXiv:1906.08198.

Examples

# Generate synthetic data (three clusters well separated)
Z <- rnorm(600)
mues <- rep(c(-3, 0, 3), 200)
X <- matrix(Z + mues, ncol = 2)

# Generate 60 synthetic outliers (contamination level 20%)
X[sample(1:300,60), ] <- matrix(runif( 40, 3 * min(X), 3 * max(X) ),
                                ncol = 2, nrow = 60)

robust <- ktaucenters(
     X, K = 3, centers = X[sample(1:300, 3), ],
     tolmin = 1e-3, NiterMax = 100)

oldpar <- par(mfrow = c(1, 2))

plot(X,type = "n", main = "ktaucenters (Robust) \n outliers: solid black dots")
points(X[robust$cluster == 1, ], col = 2)
points(X[robust$cluster == 2, ], col = 3)
points(X[robust$cluster == 3, ], col = 4)
points(X[robust$outliers, 1], X[robust$outliers, 2], pch = 19)

# Classical (non Robust) algorithm
non_robust <- kmeans(X, centers = 3, nstart = 100)

plot(X, type = "n", main = "kmeans (Classical)")
points(X[non_robust$cluster == 1, ], col = 2)
points(X[non_robust$cluster == 2, ], col = 3)
points(X[non_robust$cluster == 3, ], col = 4)

par(oldpar)
# Generate synthetic data (three clusters well separated)
Z <- rnorm(600)
mues <- rep(c(-3, 0, 3), 200)
X <- matrix(Z + mues, ncol = 2)

# Generate 60 synthetic outliers (contamination level 20%)
X[sample(1:300,60), ] <- matrix(runif( 40, 3 * min(X), 3 * max(X) ),
                                ncol = 2, nrow = 60)

robust <- ktaucenters(
     X, K = 3, centers = X[sample(1:300, 3), ],
     tolmin = 1e-3, NiterMax = 100)

oldpar <- par(mfrow = c(1, 2))

plot(X,type = "n", main = "ktaucenters (Robust) \n outliers: solid black dots")
points(X[robust$cluster == 1, ], col = 2)
points(X[robust$cluster == 2, ], col = 3)
points(X[robust$cluster == 3, ], col = 4)
points(X[robust$outliers, 1], X[robust$outliers, 2], pch = 19)

# Classical (non Robust) algorithm
non_robust <- kmeans(X, centers = 3, nstart = 100)

plot(X, type = "n", main = "kmeans (Classical)")
points(X[non_robust$cluster == 1, ], col = 2)
points(X[non_robust$cluster == 2, ], col = 3)
points(X[non_robust$cluster == 3, ], col = 4)

par(oldpar)

ktaucentersfast

Description

Robust and efficient version of Kmeans algorithm for clustering based on centers.

Usage

ktaucentersfast(
  x,
  centers,
  nstart = 1L,
  use_kmeans = TRUE,
  use_robin = TRUE,
  max_iter = 100L,
  max_tol = 1e-06,
  cutoff = 0.999
)
ktaucentersfast(
  x,
  centers,
  nstart = 1L,
  use_kmeans = TRUE,
  use_robin = TRUE,
  max_iter = 100L,
  max_tol = 1e-06,
  cutoff = 0.999
)

Arguments

`x`	numeric matrix of size n x p, or an object that can be coerced to a matrix (such as a numeric vector or a data frame with all numeric columns).
`centers`	either the number of clusters, say k, or a matrix of initial (distinct) cluster centers. If a number, a random set of distinct rows in `x` is chosen as the initial centers.
`nstart`	if centers is a number, how many random sets should be chosen?
`use_kmeans`	use kmeans centers as starting point?
`use_robin`	use robin algorithm centers as starting point?
`max_iter`	the maximum number of iterations allowed.
`max_tol`	maximum tolerance parameter used for the algorithm as stopping rule.
`cutoff`	quantile of chi-square distribution to be used as a threshold for outliers detection, defaults to 0.999.

Value

A list with the following components:

`centers`	: A matrix of cluster centers.
`cluster`	: A vector of integer (from 1:k) indicating the cluster to which each point is allocated.
`tau`	: $\tau$ scale value.
`iter`	: Number of iterations until convergence is achieved or maximum number of iteration reached.
`di`	: Distance of each observation to its assigned cluster-center
`outliers`	: A vector of integers with indices for each observation considered as outlier.

References

Gonzalez, J. D., Yohai, V. J., & Zamar, R. H. (2019). Robust Clustering Using Tau-Scales. arXiv preprint arXiv:1906.08198.

Examples

# Generate synthetic data (three clusters well separated)
Z <- rnorm(600)
mues <- rep(c(-3, 0, 3), 200)
X <- matrix(Z + mues, ncol = 2)

# Generate 60 synthetic outliers (contamination level 20%)
X[sample(1:300,60), ] <- matrix(runif( 40, 3 * min(X), 3 * max(X) ),
                                ncol = 2, nrow = 60)

robust <- ktaucentersfast(
     X, centers = X[sample(1:300, 3), ],
     max_tol = 1e-3, max_iter = 100)

oldpar <- par(mfrow = c(1, 2))

plot(X,type = "n", main = "ktaucenters (Robust) \n outliers: solid black dots")
points(X[robust$cluster == 1, ], col = 2)
points(X[robust$cluster == 2, ], col = 3)
points(X[robust$cluster == 3, ], col = 4)
points(X[robust$outliers, 1], X[robust$outliers, 2], pch = 19)

# Classical (non Robust) algorithm
non_robust <- kmeans(X, centers = 3, nstart = 100)

plot(X, type = "n", main = "kmeans (Classical)")
points(X[non_robust$cluster == 1, ], col = 2)
points(X[non_robust$cluster == 2, ], col = 3)
points(X[non_robust$cluster == 3, ], col = 4)

par(oldpar)
# Generate synthetic data (three clusters well separated)
Z <- rnorm(600)
mues <- rep(c(-3, 0, 3), 200)
X <- matrix(Z + mues, ncol = 2)

# Generate 60 synthetic outliers (contamination level 20%)
X[sample(1:300,60), ] <- matrix(runif( 40, 3 * min(X), 3 * max(X) ),
                                ncol = 2, nrow = 60)

robust <- ktaucentersfast(
     X, centers = X[sample(1:300, 3), ],
     max_tol = 1e-3, max_iter = 100)

oldpar <- par(mfrow = c(1, 2))

plot(X,type = "n", main = "ktaucenters (Robust) \n outliers: solid black dots")
points(X[robust$cluster == 1, ], col = 2)
points(X[robust$cluster == 2, ], col = 3)
points(X[robust$cluster == 3, ], col = 4)
points(X[robust$outliers, 1], X[robust$outliers, 2], pch = 19)

# Classical (non Robust) algorithm
non_robust <- kmeans(X, centers = 3, nstart = 100)

plot(X, type = "n", main = "kmeans (Classical)")
points(X[non_robust$cluster == 1, ], col = 2)
points(X[non_robust$cluster == 2, ], col = 3)
points(X[non_robust$cluster == 3, ], col = 4)

par(oldpar)

Intensity and saturation values of a picture from mars.

Description

A dataset containing the Intensity and Saturation values of a picture from Mars taken from Rover Curiosity.

Usage

mars_screw
mars_screw

Format

A list containing information about pixels of a picture form mars mainly containing red sand and metal form Rover itself. List include:

SI_matrix: A matrix with 5063 rows and 128 columns. Elements 1 to 64 of each row indicate the Saturation values of pixels in a square cell 8 x 8 whereas elements 65 to 128 of each row indicate the cell's Intensity values.
geographic_matrix: An integer matrix of dimension 5063 x 2, each row indicates each square cell's locations (x-axis y-axis) at the picture.
screw_index: the index corresponding to the screw observation (screw_index=4180)

Source

https://www.nasa.gov/wp-content/uploads/2023/03/694811main_pia16225-43_full.jpg

M scale

Description

The M scale of an univariate sample.

Usage

Mscale(u, c, b)
Mscale(u, c, b)

Arguments

`u`	numeric vector with positive values.
`c`	a tuning constant. If consistency to standard normal distribution is desired use `normal_consistency_constants`.
`b`	the desired break down point.

Value

M scale value.

References

Maronna, R. A., Martin, R. D., Yohai, V. J., & Salibian-Barrera, M. (2018). Robust statistics: theory and methods (with R). Wiley.

Examples

Mscale(u = rnorm(100), c = 1, b = 0.5)

Mscale(u = rnorm(100), c = 1, b = 0.5)

Normal Consistency Constants

Description

M scale tuning constants so it is consistent with the standard normal distribution for the quasi optimal $\rho$ function used in rhoOpt. These constants were computed for 1 $\leq$ p $\leq$ 400.

Usage

normal_consistency_constants(p)
normal_consistency_constants(p)

Arguments

`p`	dimension where observation lives.

Value

tuning constant.

References

[1] Maronna, R. A., Martin, R. D., Yohai, V. J., & Salibián-Barrera, M. (2018). 'Robust statistics: theory and methods (with ' R). Wiley.

[2] Salibian-Barrera, M., Willems, G., & Zamar, R. (2008). The fast-tau estimator for regression. 'Journal of Computational and Graphical Statistics, 17(3), 659-682.

Derivative of the quasi optimal $\rho$ function

Description

Derivative of the quasi optimal $\rho$ function

Usage

psiOpt(x, cc)
psiOpt(x, cc)

Arguments

`x`	numeric vector with positive values.
`cc`	tunning constant.

Value

Numeric vector with the derivative of the quasi optimal $\rho$ computation for each element of x.

Quasi optimal $\rho$ function

Description

Quasi optimal $\rho$ function

Usage

rhoOpt(x, cc)
rhoOpt(x, cc)

Arguments

`x`	numeric vector with positive values.
`cc`	tunning constant.

Value

Numeric vector with quasi optimal $\rho$ computation for each element of x.

References

[1] Salibian-Barrera, M., Willems, G., & Zamar, R. (2008). The fast-tau estimator for regression. Journal of Computational and GraphicalStatistics, 17(3), 659-682.

Robust Initialization based on Inverse Density estimator (ROBINDEN)

Description

Searches for k initial cluster seeds for k-means based clustering methods.

Usage

robinden(D, n_clusters, mp)
robinden(D, n_clusters, mp)

Arguments

`D`	a distance matrix, which contains the distances between the rows of a matrix.
`n_clusters`	number of cluster centers to find.
`mp`	number of nearest neighbors to compute point density.

Details

The centers are the observations located in the most dense region and far away from each other at the same time. In order to find the observations in the highly dense region, this function uses point density estimation (instead of Local Outlier Factor, Breunig et al (2000)), see more details.

Value

A list with the following components:

`centers`	: A numeric vector with the initial cluster centers indices.
`idpoints`	: A real vector containing the inverse of point density estimation.

Note

This is a slightly modified version of ROBIN algorithm implementation done by Sarka Brodinova <[email protected]>.

Author(s)

Juan Domingo Gonzalez <[email protected]>

References

Hasan AM, et al. Robust partitional clustering by outlier and density insensitive seeding. Pattern Recognition Letters, 30(11), 994-1002, 2009.

Examples

# Generate synthetic data (7 cluster well separated)
K <- 5
nk <- 100
Z <- rnorm(2 * K * nk)
mues <- rep(5 * -floor(K/2):floor(K/2), 2 * nk * K)
X <-  matrix(Z + mues, ncol = 2)

# Generate synthetic outliers (contamination level 20%)
X[sample(1:(nk * K), (nk * K) * 0.2), ] <-
  matrix(runif((nk * K) * 0.2 * 2, 3 * min(X), 3 * max(X)),
         ncol = 2,
         nrow = (nk * K)* 0.2)
res <- robinden(D = as.matrix(dist(X)), n_clusters = K, mp = 10);
# plot the Initial centers found
plot(X)
points(X[res$centers, ], pch = 19, col = 4, cex = 2)

# Generate synthetic data (7 cluster well separated)
K <- 5
nk <- 100
Z <- rnorm(2 * K * nk)
mues <- rep(5 * -floor(K/2):floor(K/2), 2 * nk * K)
X <-  matrix(Z + mues, ncol = 2)

# Generate synthetic outliers (contamination level 20%)
X[sample(1:(nk * K), (nk * K) * 0.2), ] <-
  matrix(runif((nk * K) * 0.2 * 2, 3 * min(X), 3 * max(X)),
         ncol = 2,
         nrow = (nk * K)* 0.2)
res <- robinden(D = as.matrix(dist(X)), n_clusters = K, mp = 10);
# plot the Initial centers found
plot(X)
points(X[res$centers, ], pch = 19, col = 4, cex = 2)

Package 'ktaucenters'

Help Index

Distance Matrix Computation

Description

Usage

Arguments

Value

Flag outliers

Description

Usage

Arguments

Value

Robust Clustering algorithm based on centers, a robust and efficient version of kmeans.

Description

Usage

Arguments

Value

References

Second derivative of the quasi ρ\rhoρ function

Description

Usage

Arguments

Value

improvedktaucenters

Description

Usage

Arguments

Value

References

Examples

ktaucenters

Description

Usage

Arguments

Value

References

Examples

ktaucentersfast

Description

Usage

Arguments

Value

References

Examples

Intensity and saturation values of a picture from mars.

Description

Usage

Format

Source

M scale

Description

Usage

Arguments

Value

References

Examples

Normal Consistency Constants

Description

Usage

Arguments

Value

References

Derivative of the quasi optimal ρ\rhoρ function

Description

Usage

Arguments

Value

Quasi optimal ρ\rhoρ function

Description

Usage

Arguments

Value

References

Robust Initialization based on Inverse Density estimator (ROBINDEN)

Description

Usage

Arguments

Details

Value

Note

Second derivative of the quasi $\rho$ function

Derivative of the quasi optimal $\rho$ function

Quasi optimal $\rho$ function