| Title: | Investigating New Projection Pursuit Index Functions |
|---|---|
| Description: | Projection pursuit is used to find interesting low-dimensional projections of high-dimensional data by optimizing an index over all possible projections. The 'spinebil' package contains methods to evaluate the performance of projection pursuit index functions using tour methods. A paper describing the methods can be found at <doi:10.1007/s00180-020-00954-8>. |
| Authors: | Ursula Laa [aut] (ORCID: <https://orcid.org/0000-0002-0249-6439>), Dianne Cook [aut] (ORCID: <https://orcid.org/0000-0002-3813-7155>), Tina Rashid Jafari [aut, cre] (ORCID: <https://orcid.org/0009-0008-3605-5341>) |
| Maintainer: | Tina Rashid Jafari <[email protected]> |
| License: | GPL-3 |
| Version: | 1.0.5 |
| Built: | 2026-05-13 05:24:06 UTC |
| Source: | https://github.com/uschilaa/spinebil |
Generate 2-d basis in directions i, j in n dimensions (i,j <= n)
basis_matrix(i, j, n)basis_matrix(i, j, n)
i |
first basis direction |
j |
second basis direction |
n |
number of dimensions |
basis matrix
Generate basis vector in direction i in n dimensions (i <= n)
basis_vector(i, n)basis_vector(i, n)
i |
selected direction |
n |
number of dimensions |
basis vector
Compare traces with different smoothing options.
compare_smoothing(d, tPath, idx, alphaV = c(0.01, 0.05, 0.1), n = 10)compare_smoothing(d, tPath, idx, alphaV = c(0.01, 0.05, 0.1), n = 10)
d |
Data matrix |
tPath |
Interpolated tour path (as list of projections) |
idx |
Index function |
alphaV |
Jitter amounts to compare (for jittering angle or points) |
n |
Number of evaluations entering mean value calculation |
Table of mean index values
d <- as.matrix(spiral_data(30, 3)) tPath <- tourr::save_history(d, max_bases=2) tPath <- as.list(tourr::interpolate(tPath, 0.3)) idx <- scag_index("stringy") compS <- compare_smoothing(d, tPath, idx, alphaV = c(0.01, 0.05), n=2) plot_smoothing_comparison(compS)d <- as.matrix(spiral_data(30, 3)) tPath <- tourr::save_history(d, max_bases=2) tPath <- as.list(tourr::interpolate(tPath, 0.3)) idx <- scag_index("stringy") compS <- compare_smoothing(d, tPath, idx, alphaV = c(0.01, 0.05), n=2) plot_smoothing_comparison(compS)
Generates either:
Structured (x, y) scatter data (linear, sine, circle, etc.), or
A matrix of scaled orthogonal polynomial features.
data_gen(type = "all", n = 500, degree = NULL, seed = NULL)data_gen(type = "all", n = 500, degree = NULL, seed = NULL)
type |
Character string. Options:
|
n |
Integer. Number of samples to generate. Default is 500. |
degree |
Integer. Degree of polynomial features (only for |
seed |
Optional integer. Sets random seed for reproducibility. |
If type = "polynomial", returns a matrix (n x degree).
Otherwise a tibble with columns:
x: Numeric vector of x-values
y: Numeric vector of y-values
structure: Character name of the structure type
data_gen("linear", n = 200) data_gen("polynomial", degree = 4, n = 200) data_gen("all", n = 200)data_gen("linear", n = 200) data_gen("polynomial", degree = 4, n = 200) data_gen("all", n = 200)
The distribution of all pairwise distances is useful to understand the optimisation in a guided tour, to compare e.g. different optimisation methods or different number of noise dimensions.
distance_dist(planes, nn = FALSE)distance_dist(planes, nn = FALSE)
planes |
Input planes (e.g. result of guided tour) |
nn |
Set true to only consider nearest neighbour distances (dummy, not yet implemented) |
numeric vector containing all distances
planes1 <- purrr::rerun(10, tourr::basis_random(5)) planes2 <- purrr::rerun(10, tourr::basis_random(10)) d1 <- distance_dist(planes1) d2 <- distance_dist(planes2) d <- tibble::tibble(dist=c(d1, d2), dim=c(rep(5,length(d1)),rep(10,length(d2)))) ggplot2::ggplot(d) + ggplot2::geom_boxplot(ggplot2::aes(factor(dim), dist))planes1 <- purrr::rerun(10, tourr::basis_random(5)) planes2 <- purrr::rerun(10, tourr::basis_random(10)) d1 <- distance_dist(planes1) d2 <- distance_dist(planes2) d <- tibble::tibble(dist=c(d1, d2), dim=c(rep(5,length(d1)),rep(10,length(d2)))) ggplot2::ggplot(d) + ggplot2::geom_boxplot(ggplot2::aes(factor(dim), dist))
If the optimal view is known, we can use the distance between a given plane and the optimal one as a proxy to diagnose the performance of the guided tour.
distance_to_sp(planes, special_plane)distance_to_sp(planes, special_plane)
planes |
Input planes (e.g. result of guided tour) |
special_plane |
Plane defining the optimal view |
numeric vector containing all distances
planes <- purrr::rerun(10, tourr::basis_random(5)) special_plane <- basis_matrix(1,2,5) d <- distance_to_sp(planes, special_plane) plot(d)planes <- purrr::rerun(10, tourr::basis_random(5)) special_plane <- basis_matrix(1,2,5) d <- distance_to_sp(planes, special_plane) plot(d)
Evaluate mean index value over n jittered views.
get_index_mean(proj, d, alpha, idx, method = "jitter_angle", n = 10)get_index_mean(proj, d, alpha, idx, method = "jitter_angle", n = 10)
proj |
Original projection plane |
d |
Data matrix |
alpha |
Jitter amount (for jittering angle or points) |
idx |
Index function |
method |
Select between "jitterAngle" (default) and "jitterPoints" (otherwise we return original index value) |
n |
Number of evaluations entering mean value calculation |
Mean index value
Tracing is used to test if the index value varies smoothly over an interpolated tour path. The index value is calculated for the data d in each projection in the interpolated sequence. Note that all index functions must take the data in 2-d matrix format and return the index value.
get_trace(d, m, index_list, index_labels)get_trace(d, m, index_list, index_labels)
d |
data |
m |
list of projection matrices for the planned tour |
index_list |
list of index functions to calculate for each entry |
index_labels |
labels used in the output |
index values for each interpolation step
d <- spiral_data(100, 4) m <- list(basis_matrix(1,2,4), basis_matrix(3,4,4)) index_list <- list(tourr::holes(), tourr::cmass()) index_labels <- c("holes", "cmass") trace <- get_trace(d, m, index_list, index_labels) plot_trace(trace)d <- spiral_data(100, 4) m <- list(basis_matrix(1,2,4), basis_matrix(3,4,4)) index_list <- list(tourr::holes(), tourr::cmass()) index_labels <- c("holes", "cmass") trace <- get_trace(d, m, index_list, index_labels) plot_trace(trace)
Re-evaluate index after jittering the projection by an angle alpha.
jitter_angle(proj, d, alpha, idx)jitter_angle(proj, d, alpha, idx)
proj |
Original projection plane |
d |
Data matrix |
alpha |
Jitter angle |
idx |
Index function |
New index value
Re-evaluate index after jittering all points by an amount alpha.
jitter_points(proj_data, alpha, idx)jitter_points(proj_data, alpha, idx)
proj_data |
Original projected data points |
alpha |
Jitter amount (passed into the jitter() function) |
idx |
Index function |
New index value
Generate Synthetic Noise
noise_gen(n = 500, type = "gaussian", level = 0.01, seed = NULL)noise_gen(n = 500, type = "gaussian", level = 0.01, seed = NULL)
n |
Integer. Number samples to generate. Default is 500. |
type |
Character string specifying the type of noise to generate. Supported types:
|
level |
Numeric. Controls the scale (standard deviation, range, or spread) of the noise. Default is |
seed |
Optional integer. Sets a random seed for reproducibility. |
A tibble with two columns:
value: Numeric vector of generated noise samples.
type: Character string indicating the type of noise.
# Gaussian noise with small scale noise_gen(500, type = "gaussian", level = 0.05) # Heavy-tailed noise noise_gen(500, type = "t_distributed", level = 0.1)# Gaussian noise with small scale noise_gen(500, type = "gaussian", level = 0.05) # Heavy-tailed noise noise_gen(500, type = "t_distributed", level = 0.1)
Points are drawn from a uniform distribution between -1 and 1, the pipe structure is generated by rejecting points if they are not on a circle with radius 1 and thickness t in the last two parameters.
pipe_data(n, p, t = 0.1)pipe_data(n, p, t = 0.1)
n |
number of sample points to generate |
p |
sample dimensionality |
t |
thickness of circle, default=0.1 |
sample points in tibble format
pipe_data(100, 4) pipe_data(100, 2, 0.5)pipe_data(100, 4) pipe_data(100, 2, 0.5)
Plot rotation traces of indexes obtained with profileRotation.
plot_rotation(res_mat)plot_rotation(res_mat)
res_mat |
data (result of profileRotation) |
ggplot visualisation of the tracing data
Plotting method for the results of compareSmoothing. The results are mapped by facetting over values of alpha and mapping the method (jitter_angle, jitter_points, no_smoothing) to linestyle and color (black dashed, black dotted, red solid). By default legend drawing is turned off, but can be turned on via the lPos argument, e.g. setting to "bottom" for legend below the plot.
plot_smoothing_comparison(sm_mat, lPos = "none")plot_smoothing_comparison(sm_mat, lPos = "none")
sm_mat |
Result from compare_smoothing |
lPos |
Legend position passed to ggplot2 (default is none for no legend shown) |
ggplot visualisation of the comparison
get_trace.Plot traces of indexes obtained with get_trace.
plot_trace(res_mat, rescY = TRUE)plot_trace(res_mat, rescY = TRUE)
res_mat |
data (result of get_trace) |
rescY |
bool to fix y axis range to [0,1] |
ggplot visualisation of the tracing data
Simulate and Summarize Projection Pursuit Index (PPI) Values
ppi_mean(data, index_fun, n_sim = 100, n_obs = 300)ppi_mean(data, index_fun, n_sim = 100, n_obs = 300)
data |
A data frame or matrix. Must have at least two columns. |
index_fun |
A function taking two numeric vectors ( |
n_sim |
Integer. Number of simulations. Default is 100. |
n_obs |
Integer. Number of observations to sample in each simulation. Default is 300. |
A tibble with:
var_i, var_j: Names of variable pairs
mean_index: Mean index value over simulations
data <- as.data.frame(data_gen(type = "polynomial", degree = 2)) ppi_mean(data, scag_index("stringy"), n_sim = 10)data <- as.data.frame(data_gen(type = "polynomial", degree = 2)) ppi_mean(data, scag_index("stringy"), n_sim = 10)
This function estimates the 95th percentile of a projection pursuit index under synthetic noise data.
ppi_noise_threshold( index_fun, n_sim = 100, n_obs = 500, noise_type = "gaussian", noise_level = 0.01, seed = NULL )ppi_noise_threshold( index_fun, n_sim = 100, n_obs = 500, noise_type = "gaussian", noise_level = 0.01, seed = NULL )
index_fun |
A function that takes either a 2-column matrix or two numeric vectors and returns a scalar index. |
n_sim |
Integer. Number of index evaluations to simulate. Default is 100. |
n_obs |
Integer. Number of observations per noise sample. Default is 500. |
noise_type |
Character. Type of noise to use (e.g., "gaussian", "t_distributed", etc.). Default is "gaussian". |
noise_level |
Numeric. Controls the scale/spread of the generated noise. Default is 0.01. |
seed |
Optional integer. Random seed for reproducibility. |
A single numeric value: the estimated 95th percentile of the index under noise.
ppi_noise_threshold( index_fun = scag_index("stringy"), noise_type = "cauchy", noise_level = 0.1, n_sim = 10, n_obs = 100 )ppi_noise_threshold( index_fun = scag_index("stringy"), noise_type = "cauchy", noise_level = 0.1, n_sim = 10, n_obs = 100 )
For a given index function, simulates how the index behaves across a range of sample sizes. By default, standard bivariate Gaussian noise is used, but the user may instead supply a 2-column dataset to sample from.
ppi_samplesize_effect( index_fun, sample_sizes = 100, n_sim = 100, data = NULL, csv_file = NULL, index_args = list(), workers = 1 )ppi_samplesize_effect( index_fun, sample_sizes = 100, n_sim = 100, data = NULL, csv_file = NULL, index_args = list(), workers = 1 )
index_fun |
A function returning a numeric scalar index. It may accept
either a 2-column matrix/data frame, or two numeric vectors ( |
sample_sizes |
Numeric vector (e.g., 100, c(50,100,200), seq(30,200,5)). Default is 100. |
n_sim |
Integer. Number of simulations per sample size. |
data |
Optional 2-column data frame, tibble, or matrix to sample from.
If |
csv_file |
Optional file path to save results as CSV. |
index_args |
A named list of additional arguments passed to |
workers |
Integer. Number of parallel workers to use. Default is 1. |
A tibble with:
sample_size: sample size used for each simulation block
percentile95: 95th percentile of the index values over simulations
## Not run: ppi_samplesize_effect( cassowaryr::sc_stringy, sample_sizes = c(100, 300), n_sim = 3, index_args = list(binner = "hex"), workers = 2 ) ppi_samplesize_effect( cassowaryr::sc_skinny, sample_sizes = c(100, 300), n_sim = 3, data = data_gen("polynomial", degree = 2), workers = 2 ) ## End(Not run)## Not run: ppi_samplesize_effect( cassowaryr::sc_stringy, sample_sizes = c(100, 300), n_sim = 3, index_args = list(binner = "hex"), workers = 2 ) ppi_samplesize_effect( cassowaryr::sc_skinny, sample_sizes = c(100, 300), n_sim = 3, data = data_gen("polynomial", degree = 2), workers = 2 ) ## End(Not run)
Performs simulations to compute a projection pursuit index on structured (sampled) data and on random noise, allowing a comparison of index scale across contexts.
ppi_scale( data, index_fun, noise_fun = NULL, n_sim = 100, n_obs = 500, index_args = list(), seed = NULL, workers = 1 )ppi_scale( data, index_fun, noise_fun = NULL, n_sim = 100, n_obs = 500, index_args = list(), seed = NULL, workers = 1 )
data |
A data frame or tibble with at least two numeric columns. |
index_fun |
A function returning a numeric scalar index. It may accept
either a 2-column matrix/data frame, or two numeric vectors ( |
noise_fun |
Optional function for generating noise data. It should take
|
n_sim |
Integer. Number of simulations. Default is 100. |
n_obs |
Integer. Number of observations per simulation. Default is 500. |
index_args |
A named list of additional arguments passed to |
seed |
Optional integer seed for reproducibility. |
workers |
Integer. Number of parallel workers to use. Default is 1. |
A tibble with columns:
simulation: simulation number
var_i, var_j: variable names
var_pair: pair name as a string
sigma: "structure" for structured data, "noise" for noisy data
index: index value returned by index_fun
ppi_scale( data_gen("polynomial", degree = 3), cassowaryr::sc_stringy, n_sim = 2, index_args = list(binner = "hex"), workers = 2 ) ppi_scale( data_gen("polynomial", degree = 3), scag_index("stringy"), n_sim = 2 )ppi_scale( data_gen("polynomial", degree = 3), cassowaryr::sc_stringy, n_sim = 2, index_args = list(binner = "hex"), workers = 2 ) ppi_scale( data_gen("polynomial", degree = 3), scag_index("stringy"), n_sim = 2 )
Ideally a projection pursuit index should be roation invariant, we test this explicitly by profiling the index while rotating a distribution.
profile_rotation(d, index_list, index_labels, n = 200)profile_rotation(d, index_list, index_labels, n = 200)
d |
data (2 column matrix containing distribution to be rotated) |
index_list |
list of index functions to calculate for each entry |
index_labels |
labels used in the output |
n |
number of steps in the rotation (default = 200) |
index values for each rotation step
d <- as.matrix(sin_data(30, 2)) index_list <- list(tourr::holes(), scag_index("stringy"), mine_indexE("MIC")) index_labels <- c("holes", "stringy", "mic") pRot <- profile_rotation(d, index_list, index_labels, n = 50) plot_rotation(pRot)d <- as.matrix(sin_data(30, 2)) index_list <- list(tourr::holes(), scag_index("stringy"), mine_indexE("MIC")) index_labels <- c("holes", "stringy", "mic") pRot <- profile_rotation(d, index_list, index_labels, n = 50) plot_rotation(pRot)
These are convenicence functions that format scagnostics and mine index functions for direct use with the guided tour or other functionalities in this package.
scag_index(index_name) mine_index(index_name) mine_indexE(index_name) holesR() cmassR()scag_index(index_name) mine_index(index_name) mine_indexE(index_name) holesR() cmassR()
index_name |
Index name to select from group of indexes. |
function taking 2-d data matrix and returning the index value
scag_index(): Scagnostics index from cassowaryr package
mine_index(): MINE index from minerva package
mine_indexE(): MINE index from minerva package (updated estimator)
holesR(): rescaling the tourr holes index
cmassR(): rescaling the tourr cmass index
n-1 points are drawn from a normal distribution with mean=0, sd=1, the points in the final direction are calculated as the sine of the values of direction n-1 with additional jittering controled by the jitter factor f.
sin_data(n, p, f = 1)sin_data(n, p, f = 1)
n |
number of sample points to generate |
p |
sample dimensionality |
f |
jitter factor, default=1 |
sample points in tibble format
sin_data(100, 4) sin_data(100, 2, 200)sin_data(100, 4) sin_data(100, 2, 200)
n-2 points are drawn from a normal distribution with mean=0, sd=1, the points in the final two direction are sampled along a spiral by samping the angle from a normal distribution with mean=0, sd=2*pi (absolute values are used to fix the orientation of the spiral).
spiral_data(n, p)spiral_data(n, p)
n |
number of sample points to generate |
p |
sample dimensionality |
sample points in matrix format
spiral_data(100, 4)spiral_data(100, 4)
We estimate the squint angle by interpolating from a random starting plane towards the optimal view until the index value of the selected index function is above the selected cutoff. Since this depends on the direction, this is repeated with n randomly selected planes giving a distribution representative of the squint angle.
squint_angle_estimate( data, indexF, cutoff, structure_plane, n = 100, step_size = 0.01 )squint_angle_estimate( data, indexF, cutoff, structure_plane, n = 100, step_size = 0.01 )
data |
Input data |
indexF |
Index function |
cutoff |
Threshold index value above which we assume the structure to be visible |
structure_plane |
Plane defining the optimal view |
n |
Number of random starting planes (default = 100) |
step_size |
Interpolation step size fixing the accuracy (default = 0.01) |
numeric vector containing all squint angle estimates
data <- spiral_data(50, 4) indexF <- scag_index("stringy") cutoff <- 0.7 structure_plane <- basis_matrix(3,4,4) squint_angle_estimate(data, indexF, cutoff, structure_plane, n=1)data <- spiral_data(50, 4) indexF <- scag_index("stringy") cutoff <- 0.7 structure_plane <- basis_matrix(3,4,4) squint_angle_estimate(data, indexF, cutoff, structure_plane, n=1)
Index evaluation timing may depend on the data distribution, we evaluate the computing time for a set of different projections to get an overview of the distribution of computing times.
time_sequence(d, t, idx, pmax)time_sequence(d, t, idx, pmax)
d |
Input data in matrix format |
t |
List of projection matrices (e.g. interpolated tour path) |
idx |
Index function |
pmax |
Maximum number of projections to evaluate (cut t if longer than pmax) |
numeric vector containing all distances
d <- as.matrix(spiral_data(500, 4)) t <- purrr::map(1:10, ~ tourr::basis_random(4)) idx <- scag_index("stringy") time_sequence(d, t, idx, 10)d <- as.matrix(spiral_data(500, 4)) t <- purrr::map(1:10, ~ tourr::basis_random(4)) idx <- scag_index("stringy") time_sequence(d, t, idx, 10)