Skip to contents

The model combines unsupervised topic modeling (LDA) with a Gaussian response on document-level topic proportions.

Usage

stm_vi_parallel(
  mod,
  docs,
  y,
  ndsum,
  NZ,
  V,
  K,
  alpha,
  beta,
  update_sigma = TRUE,
  tau = 20L,
  show_progress = TRUE,
  chunk = 5000L
)

Arguments

mod

A list containing the current model state:

nd

D x K matrix of document-topic counts.

nw

K x V matrix of topic-word counts.

eta

Numeric vector of length K; regression coefficients.

sigma2

Scalar noise variance for the Gaussian response.

docs

IntegerMatrix of size NZ x 3, where each row is a triple (d, v, c) in 0-based indexing: document index d, word index v, and count c = n_dv. Rows with d outside [0, D-1] are ignored.

y

NumericVector of length D; response y_d for each document.

ndsum

IntegerVector of length D; total token count per document (that is, ndsum[d] = sum_v n_dv).

NZ

Integer, number of non-zero entries in docs (rows of docs).

V

Integer, vocabulary size.

K

Integer, number of topics.

alpha

Scalar Dirichlet prior parameter for document-topic distributions theta_d (symmetric prior with parameter alpha).

beta

Scalar Dirichlet prior parameter for topic-word distributions phi_k (symmetric prior with parameter beta).

update_sigma

Logical; if TRUE, update the noise variance sigma2 from residuals y_d - zbar_d^T eta, otherwise keep sigma2 fixed.

tau

Numeric, log-space cutoff used to prune very small topic responsibilities phi[d,i,k] for numerical stability and efficiency.

show_progress

Logical; if TRUE, print simple progress output during the E-step over documents.

chunk

Integer, number of documents to process per parallel block in the E-step. Larger values reduce overhead but may use more memory.

Value

A list with updated variational parameters and diagnostics:

nd

Updated D x K document-topic counts.

nw

Updated K x V topic-word counts.

eta

Updated K-dimensional regression coefficient vector.

sigma2

Updated scalar noise variance.

elbo

Scalar evidence lower bound (approximate).

label_loglik

Gaussian response log-likelihood component.

Details

$$y_d \sim N(zbar_d^T eta, sigma^2).$$

This function performs one variational inference sweep with a parallel document-level E-step and simple updates for the regression parameters.