Package 'CMF' reference manual

Title:	Collective Matrix Factorization
Description:	Collective matrix factorization (CMF) finds joint low-rank representations for a collection of matrices with shared row or column entities. This code learns a variational Bayesian approximation for CMF, supporting multiple likelihood potentials and missing data, while identifying both factors shared by multiple matrices and factors private for each matrix. For further details on the method see Klami et al. (2014) <arXiv:1312.5921>. The package can also be used to learn Bayesian canonical correlation analysis (CCA) and group factor analysis (GFA) models, both of which are special cases of CMF. This is likely to be useful for people looking for CCA and GFA solutions supporting missing data and non-Gaussian likelihoods. See Klami et al. (2013) <https://research.cs.aalto.fi/pml/online-papers/klami13a.pdf> and Virtanen et al. (2012) <http://proceedings.mlr.press/v22/virtanen12.html> for details on Bayesian CCA and GFA, respectively.
Authors:	Arto Klami [aut], Lauri Väre [aut], Felix Held [ctb, cre]
Maintainer:	Felix Held <[email protected]>
License:	GPL (>= 2)
Version:	1.0.3.99
Built:	2025-04-02 04:35:47 UTC
Source:	https://github.com/cyianor/cmf

Collective Matrix Factorization (CMF)

Description

Collective matrix factorization (CMF) finds joint low-rank representations for a collection of matrices with shared row or column entities. This package learns a variational Bayesian approximation for CMF, supporting multiple likelihood potentials and missing data, while identifying both factors shared by multiple matrices and factors private for each matrix.

Details

This package implements a variational Bayesian approximation for CMF, following the presentation in "Group-sparse embeddings in collective matrix factorization" (see references below).

The main functionality is provided by the function CMF() that is used for learning the model, and by the function predictCMF() that estimates missing entries based on the learned model. These functions take as input lists of matrices in a specific sparse format that stores only the observed entries but that explicitly stores zeroes (unlike most sparse matrix representations). For converting between regular matrices and this sparse format see matrix_to_triplets() and triplets_to_matrix().

The package can also be used to learn Bayesian canonical correlation analysis (CCA) and group factor analysis (GFA) models, both of which are special cases of CMF. This is likely to be useful for people looking for CCA and GFA solutions supporting missing data and non-Gaussian likelihoods.

Author(s)

Arto Klami [email protected] and Lauri Väre

Maintainer: Felix Held [email protected]

References

Arto Klami, Guillaume Bouchard, and Abhishek Tripathi. Group-sparse embeddings in collective matrix factorization. arXiv:1312.5921, 2013.

Arto Klami, Seppo Virtanen, and Samuel Kaski. Bayesian canonical correlation analysis. Journal of Machine Learning Research, 14(1):965–1003, 2013.

Seppo Virtanen, Arto Klami, Suleiman A. Khan, and Samuel Kaski. Bayesian group factor analysis. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, volume 22 of JMLR:W&CP, pages 1269-1277, 2012.

Examples

require("CMF")

# Create data for a circular setup with three matrices and three
# object sets of varying sizes.
X <- list()
D <- c(10, 20, 30)
inds <- matrix(0, nrow = 3, ncol = 2)

# Matrix 1 is between sets 1 and 2 and has continuous data
inds[1, ] <- c(1, 2)
X[[1]] <- matrix(
  rnorm(D[inds[1, 1]] * D[inds[1, 2]], 0, 1),
  nrow = D[inds[1, 1]]
)

# Matrix 2 is between sets 1 and 3 and has binary data
inds[2, ] <- c(1, 3)
X[[2]] <- matrix(
  round(runif(D[inds[2, 1]] * D[inds[2, 2]], 0, 1)),
  nrow = D[inds[2, 1]]
)

# Matrix 3 is between sets 2 and 3 and has count data
inds[3, ] <- c(2, 3)
X[[3]] <- matrix(
  round(runif(D[inds[3, 1]] * D[inds[3, 2]], 0, 6)),
  nrow = D[inds[3, 1]]
)

# Convert the data into the right format
triplets <- lapply(X, matrix_to_triplets)

# Missing entries correspond to missing rows in the triple representation
# so they can be removed from training data by simply taking a subset
# of the rows.
train <- list()
test <- list()
keep_for_training <- c(100, 200, 300)
for (m in 1:3) {
  subset <- sample(nrow(triplets[[m]]), keep_for_training[m])
  train[[m]] <- triplets[[m]][subset, ]
  test[[m]] <- triplets[[m]][-subset, ]
}

# Learn the model with the correct likelihoods
K <- 4
likelihood <- c("gaussian", "bernoulli", "poisson")
opts <- getCMFopts()
opts$iter.max <- 500 # Less iterations for faster computation
model <- CMF(train, inds, K, likelihood, D, test = test, opts = opts)

# Check the predictions
# Note that the data created here has no low-rank structure,
# so we should not expect good accuracy.
print(test[[1]][1:3, ])
print(model$out[[1]][1:3, ])

# predictions for the test set using the previously learned model
out <- predictCMF(test, model)
print(out$out[[1]][1:3, ])
print(out$error[[1]])
# ...this should be the same as the output provided by CMF()
print(model$out[[1]][1:3, ])

require("CMF")

# Create data for a circular setup with three matrices and three
# object sets of varying sizes.
X <- list()
D <- c(10, 20, 30)
inds <- matrix(0, nrow = 3, ncol = 2)

# Matrix 1 is between sets 1 and 2 and has continuous data
inds[1, ] <- c(1, 2)
X[[1]] <- matrix(
  rnorm(D[inds[1, 1]] * D[inds[1, 2]], 0, 1),
  nrow = D[inds[1, 1]]
)

# Matrix 2 is between sets 1 and 3 and has binary data
inds[2, ] <- c(1, 3)
X[[2]] <- matrix(
  round(runif(D[inds[2, 1]] * D[inds[2, 2]], 0, 1)),
  nrow = D[inds[2, 1]]
)

# Matrix 3 is between sets 2 and 3 and has count data
inds[3, ] <- c(2, 3)
X[[3]] <- matrix(
  round(runif(D[inds[3, 1]] * D[inds[3, 2]], 0, 6)),
  nrow = D[inds[3, 1]]
)

# Convert the data into the right format
triplets <- lapply(X, matrix_to_triplets)

# Missing entries correspond to missing rows in the triple representation
# so they can be removed from training data by simply taking a subset
# of the rows.
train <- list()
test <- list()
keep_for_training <- c(100, 200, 300)
for (m in 1:3) {
  subset <- sample(nrow(triplets[[m]]), keep_for_training[m])
  train[[m]] <- triplets[[m]][subset, ]
  test[[m]] <- triplets[[m]][-subset, ]
}

# Learn the model with the correct likelihoods
K <- 4
likelihood <- c("gaussian", "bernoulli", "poisson")
opts <- getCMFopts()
opts$iter.max <- 500 # Less iterations for faster computation
model <- CMF(train, inds, K, likelihood, D, test = test, opts = opts)

# Check the predictions
# Note that the data created here has no low-rank structure,
# so we should not expect good accuracy.
print(test[[1]][1:3, ])
print(model$out[[1]][1:3, ])

# predictions for the test set using the previously learned model
out <- predictCMF(test, model)
print(out$out[[1]][1:3, ])
print(out$error[[1]])
# ...this should be the same as the output provided by CMF()
print(model$out[[1]][1:3, ])

Collective Matrix Factorization

Description

Learns the CMF model for a given collection of M matrices. The code learns the parameters of a variational approximation for CMF, and also computes predictions for indices specified in test.

Usage

CMF(X, inds, K, likelihood, D, test = NULL, opts = NULL)
CMF(X, inds, K, likelihood, D, test = NULL, opts = NULL)

Arguments

`X`	List of input matrices.
`inds`	A `length(X)` times 2 matrix that links dimensions of the matrices in `X` to object sets. `inds[m, 1]` tells which object set corresponds to the rows in matrix `X[[m]]`, and `inds[m, 2]` tells the same for the columns.
`K`	The number of factors.
`likelihood`	A list of likelihood choices, one for each matrix in X. Each entry should be a string with possible values of: "gaussian", "bernoulli" or "poisson".
`D`	A vector containing sizes of each object set.
`test`	A list of test matrices. If not NULL, the code will compute predictions for these elements of the matrices. This duplicates the functionality of `predictCMF()`.
`opts`	A list of options as given by `getCMFopts()`. If set to `NULL`, the default values will be used.

Details

The variational approximation is fully factorized over all of the model parameters, including individual elements of the projection matrices. The parameters for the projection matrices are updated jointly by Newton-Raphson method, whereas the rest use closed-form updates.

Note that the input data needs to be given in a specific sparse format. See matrix_to_triplets() for details.

The behavior of the algorithm can be modified via the opts parameter. See getCMFopts() for details. Of particular interest are the elements useBias and method.

For full description of the output parameters, see the referred publication. The notation in the code follows roughly the notation used in the paper.

Value

A list of

`U`	A list of the mean parameters for the rank-K projection matrices, one for each object set.
`covU`	A list of the variance parameters for the rank-K projection matrices, one for each object set.
`tau`	A vector of the precision parameter means.
`alpha`	A vector of the ARD parameter means.
`cost`	A vector of variational lower bound values.
`inds`	The input parameter `inds` stored for further use.
`errors`	A vector containing root-mean-square errors for each iteration, computed over the elements indicated by the `test` parameter.
`bias`	A list (of lists) storing the parameters of the row and column bias terms.
`D`	The sizes of the object sets as given in the parameters.
`K`	The number of components as given in the parameters.
`Uall`	Matrices of U joined into one sum(D) by K matrix, for easier plotting of the results.
`items`	A list containing the running number for each item among all object sets. This corresponds to rows of the `Uall` matrix. Each part of the list contains a vector that has the numbers for each particular object set.
`out`	If test matrices were provided, returns the reconstructed data sets. Otherwise returns `NULL`.
`M`	The number of input matrices.
`likelihood`	The likelihoods of the matrices.
`opts`	The options used for running the code.

Author(s)

Arto Klami and Lauri Väre

References

Arto Klami, Guillaume Bouchard, and Abhishek Tripathi. Group-sparse embeddings in collective matrix factorization. arXiv:1312.5921, 2014.

Examples

# See CMF-package for an example.

# See CMF-package for an example.

Default options for CMF

Description

A helper function that creates a list of options to be passed to CMF. To run the code with other option values, first run this function and then directly modify the entries before passing the list to CMF.

Usage

getCMFopts()
getCMFopts()

Details

Most of the parameters are for controlling the optimization, but some will alter the model itself. In particular, useBias is used for turning the bias terms on and off, and method will change the prior for U.

The default choice for method is "gCMF", providing the group-wise sparse CMF that identifies both shared and private factors (see Klami et al. (2013) for details). The value "CMF" turns off the group-wise sparsity, providing a CMF solution that attempts to learn only factors shared by all matrices. Finally, method="GFA" implements the group factor analysis (GFA) method, by fixing the variance of U[[1]] to one and forcing useBias = FALSE. Then U[[1]] can be interpreted as latent variables with unit variance and zero mean, as assumed by GFA and CCA (special case of GFA with M = 2). Note that as a multi-view learning method "GFA" requires all matrices to share the same rows, the very first entity set.

Value

Returns a list of:

`init.tau`	Initial value for the noise precisions. Only matters for Gaussian likelihood.
`init.alpha`	Initial value for the automatic relevance determination (ARD) prior precisions.
`grad.reg`	The regularization parameter for the under-relaxed Newton iterations. 0 = no regularization, larger values provide increasing regularization. The value must be below 1.
`gradIter`	How many gradient steps for updating the projections are performed during each iteration of the whole algorithm. Default is 1.
`grad.max`	Maximum absolute change for the elements of the projection matrices during one gradient step. Small values help to prevent over-shooting, wheres inf results to no constraints. Default is `inf`.
`iter.max`	Number of iterations for the whole algorithm.
`computeCost`	Should the cost function values be computed or not. Defaults to `TRUE`.
`verbose`	0 = supress all printing, 1 = print current iteration and test RMSE every now and then, 2 = in addition to level 1 print also the current gradient norm.
`freq`	How often progress updates are printed if verbose is `TRUE`. Default is 10.
`useBias`	Set this to `FALSE` to exclude the row and column bias terms. The default is `TRUE`.
`method`	Default value of "gCMF" computes the CMF with group-sparsity. The other possible values are "CMF" for turning off the group-sparsity prior, and "GFA" for implementing group factor analysis (and canonical correlation analysis when `M = 2`).
`prior.alpha_0`	Hyperprior values for the gamma prior for ARD.
`prior.alpha_0t`	Hyperprior values for the gamma prior for tau.

Author(s)

Arto Klami and Lauri Väre

References

Arto Klami, Guillaume Bouchard, and Abhishek Tripathi. Group-sparse embeddings in collective matrix factorization. arXiv:1312.5921, 2014.

Examples


CMF_options <- getCMFopts()
CMF_options$iter.max <- 500 # Change the number of iterations from default
                            # of 200 to 500.
CMF_options$useBias <- FALSE # Do not take row and column means into
                             # consideration.
# These options will be in effect when CMF_options is passed on to CMF.

CMF_options <- getCMFopts()
CMF_options$iter.max <- 500 # Change the number of iterations from default
                            # of 200 to 500.
CMF_options$useBias <- FALSE # Do not take row and column means into
                             # consideration.
# These options will be in effect when CMF_options is passed on to CMF.

Conversion from matrix to coordinate/triplet format

Description

The CMF code requires inputs to be speficied in a specific sparse format. This function converts regular R matrices into that format.

Usage

matrix_to_triplets(orig)
matrix_to_triplets(orig)

Arguments

orig

A matrix of class matrix

Details

The element X[i, j] on the i-th row and j-th column is represented as a triple ⁠(i, j, X[i,k])⁠. The input for CMF is then a matrix where each row specifies one element, and hence the representation is of size ⁠N x 3⁠, where N is the total number of observed entries.

In the original input matrix the missing entries should be marked as NA. In the output they will be completely omitted.

Even though this format reminds the representation often used for representing sparse matrices, it is important to notice that observed zeroes are retained in the representation. The elements missing from this representation are considered unknown, not zero.

Value

The input matrix in triplet/coordinate format.

Author(s)

Arto Klami and Lauri Väre

Examples


x <- matrix(c(1, 2, NA, NA, 5, 6), nrow = 3)
triplet <- matrix_to_triplets(x)
print(triplet)

x <- matrix(c(1, 2, NA, NA, 5, 6), nrow = 3)
triplet <- matrix_to_triplets(x)
print(triplet)

Internal function for checking whether the input is in the right format

Description

Internal function for checking whether the input is in the right format

Usage

p_check_sparsity(mat, max_row, max_col)
p_check_sparsity(mat, max_row, max_col)

Arguments

`mat`	An input matrix of class `matrix`
`max_row`	Maximum row index for `mat`
`max_col`	Maximum column index for `mat`

Value

TRUE if the input is in coordinate/triplet format. FALSE otherwise.

Predict with CMF

Description

Code for predicting missing elements with an existing CMF model. The predictions are made for all of the elements specified in the list of input matrices X. The function also returns the root mean square error (RMSE) between the predicted outputs and the values provided in X.

Usage

predictCMF(X, model)
predictCMF(X, model)

Arguments

`X`	A list of sparse matrices specifying the indices for which to make the predictions. These matrices must correspond to the structure used for `X` when learning the model with `CMF`.
`model`	A list of model parameter values provided by `CMF`.

Details

Note that X needs to be provided as a set of triplets instead of as a regular matrix. See matrix_to_triplets().

Value

A list of

`out`	A list of matrices corresponding to predictions for each matrix in `X`.
`error`	A vector containing the root-mean-square error for each matrix separately.

Author(s)

Arto Klami and Lauri Väre

Examples


# See CMF-package for an example.

# See CMF-package for an example.

Conversion from triplet/coordinate format to matrix

Description

This function is the inverse of matrix_to_triplets(). It converts a matrix represented as a set of triplets into an object of the class matrix. The missing entries (the ones not present in the triplet representation) are filled in as NA.

Usage

triplets_to_matrix(triplets)
triplets_to_matrix(triplets)

Arguments

triplets

A matrix in triplet/coordinate format

Details

See matrix_to_triplets() for a description of the representation.

Value

The input matrix as a normal matrix of class matrix

Author(s)

Arto Klami and Lauri Väre

Examples


x <- matrix(c(1, 2, NA, NA, 5, 6), nrow = 3)
triplet <- matrix_to_triplets(x)
print(triplet)
xnew <- triplets_to_matrix(triplet)
print(xnew)

x <- matrix(c(1, 2, NA, NA, 5, 6), nrow = 3)
triplet <- matrix_to_triplets(x)
print(triplet)
xnew <- triplets_to_matrix(triplet)
print(xnew)

Package 'CMF'

Help Index

Collective Matrix Factorization (CMF)

Description

Details

Author(s)

References

Examples

Collective Matrix Factorization

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples

Default options for CMF

Description

Usage

Details

Value

Author(s)

References

See Also

Examples

Conversion from matrix to coordinate/triplet format

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Internal function for checking whether the input is in the right format

Description

Usage

Arguments

Value

Predict with CMF

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Conversion from triplet/coordinate format to matrix

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples