
This function fits a supervised topic model (STM) by variational inference. It initializes topic assignments from count (optionally using a topic-word prior phi), estimates the regression parameters, and repeatedly calls the C++ routine stm_vi_parallel() until convergence or until max_iter sweeps have run.

Usage

run_stm_vi(
  count,
  y,
  K,
  alpha,
  beta,
  phi = NULL,
  seed = NULL,
  max_iter = 200L,
  min_iter = 50L,
  tol_elbo = 1e-04,
  update_sigma = TRUE,
  tau = 20L,
  show_progress = TRUE,
  chunk = 5000L,
  verbose = TRUE,
  sigma2_init = NULL
)

Arguments

count

Integer matrix with 3 columns (d, v, c). Each row gives a document index d and a word index v, both 0-based, together with the token count c of that word in that document.

y

Numeric response vector of length D, one value per document. Must not contain NA values.

K

Integer, number of topics. Required if phi is NULL; ignored if phi is provided (then K = ncol(phi)).

alpha

Dirichlet prior parameter for document-topic distributions.

beta

Dirichlet prior parameter for topic-word distributions.

phi

Optional V x K topic-word probability matrix used only for initializing topic assignments.

seed

Optional integer random seed used in the initialization step.

max_iter

Maximum number of variational sweeps.

min_iter

Minimum number of sweeps before checking ELBO convergence.

tol_elbo

Numeric tolerance for relative ELBO change.

update_sigma

Logical; if TRUE, update sigma2 each sweep.

tau

Numeric log-space cutoff used in stm_vi_parallel().

show_progress

Logical; print low-level progress inside C++.

chunk

Integer; number of documents per parallel block.

verbose

Logical; print ELBO and relative change per sweep.

sigma2_init

Optional numeric scalar specifying the initial noise variance. If NULL, sigma2 is estimated once by least squares.
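
As an illustration of the count format described above, the following sketch converts a small dense document-term matrix into (d, v, c) triplets with 0-based indices. The toy matrix dtm and the response y are made up for the example, and sorting the rows is only a convenience; this page does not say whether any row order is required.

```r
# Hypothetical toy document-term matrix: 3 documents x 4 vocabulary terms.
dtm <- matrix(
  c(2L, 0L, 1L, 0L,
    0L, 3L, 0L, 1L,
    1L, 0L, 0L, 2L),
  nrow = 3, byrow = TRUE
)

# Locate the non-zero cells; which(arr.ind = TRUE) returns 1-based (row, col).
nz <- which(dtm > 0L, arr.ind = TRUE)

# Assemble the (d, v, c) triplets, shifting both indices to 0-based.
count <- cbind(
  d = nz[, "row"] - 1L,
  v = nz[, "col"] - 1L,
  c = dtm[nz]
)
storage.mode(count) <- "integer"

# Sort by document, then by word (assumption: ordering is for readability only).
count <- count[order(count[, "d"], count[, "v"]), , drop = FALSE]

# One finite response per document (length D = 3).
y <- c(0.5, -1.2, 0.3)
```

With inputs shaped like this, a call would pass count and y along with K, alpha, and beta as listed under Usage.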

Value

A list containing:

nd

D x K document-topic count matrix.

nw

K x V topic-word count matrix.

ndsum

Length-D vector of document token counts.

nwsum

Length-K vector of topic token counts.

eta

K-dimensional regression coefficient vector.

sigma2

Final noise variance.

phi

V x K topic-word posterior mean.

theta

D x K document-topic posterior mean.

elbo

Final ELBO.

label_loglik

Final label log-likelihood (the supervised term).

elbo_trace

ELBO values per sweep.

label_loglik_trace

Label log-likelihood per sweep.

n_iter

Number of iterations actually performed.

D, V, K, NZ

Model dimensions.
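
To show how the returned pieces relate in shape, here is a minimal sketch that derives document-topic and topic-word posterior means from count matrices using the usual symmetric Dirichlet smoothing. This is an assumption about the estimator, not a statement of what run_stm_vi() computes internally; the toy values of nd, nw, alpha, and beta are invented. Note that nw is K x V while phi is V x K, hence the transpose.

```r
# Hypothetical toy counts: D = 2 documents, K = 2 topics, V = 3 words.
nd <- matrix(c(3, 1,
               0, 4), nrow = 2, byrow = TRUE)     # D x K
nw <- matrix(c(2, 1, 0,
               1, 1, 3), nrow = 2, byrow = TRUE)  # K x V

ndsum <- rowSums(nd)  # length-D: tokens per document
nwsum <- rowSums(nw)  # length-K: tokens per topic

alpha <- 0.1   # assumed scalar (symmetric) document-topic prior
beta  <- 0.01  # assumed scalar (symmetric) topic-word prior
K <- ncol(nd)
V <- ncol(nw)

# Smoothed posterior means (assumed form). Dividing a matrix by a vector whose
# length equals the number of rows scales each row by its own denominator.
theta <- (nd + alpha) / (ndsum + K * alpha)      # D x K, rows sum to 1
phi   <- t((nw + beta) / (nwsum + V * beta))     # V x K, columns sum to 1
```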

Details

Convergence is assessed from the relative changes, between consecutive sweeps, in the evidence lower bound (ELBO) and in the supervised label log-likelihood:

$$ \frac{\mathrm{ELBO}_t - \mathrm{ELBO}_{t-1}}{|\mathrm{ELBO}_{t-1}|}, \qquad \frac{\ell_t - \ell_{t-1}}{|\ell_{t-1}|}. $$

Here t indexes sweeps and ℓ_t is the label log-likelihood after sweep t. Once min_iter sweeps have run, the algorithm is declared to have converged when both quantities are non-negative and smaller than the prescribed tolerance (tol_elbo).
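
The convergence rule can be sketched in R as follows. The helper has_converged() is hypothetical and not part of the package; it assumes that a single tolerance applies to both quantities and that the check starts once the trace holds at least min_iter values, neither of which is spelled out on this page.

```r
# Hypothetical helper: applies the two relative-change criteria to the traces.
has_converged <- function(elbo_trace, label_loglik_trace, tol, min_iter) {
  t <- length(elbo_trace)
  # Need at least two sweeps to form a difference, and at least min_iter sweeps.
  if (t < 2L || t < min_iter) {
    return(FALSE)
  }
  # Relative change between the last two entries of a trace.
  rel_change <- function(x) (x[t] - x[t - 1L]) / abs(x[t - 1L])
  r_elbo  <- rel_change(elbo_trace)
  r_label <- rel_change(label_loglik_trace)
  # Both changes must be non-negative and below the tolerance.
  r_elbo >= 0 && r_elbo < tol && r_label >= 0 && r_label < tol
}

# Toy traces (invented numbers) where both quantities have nearly flattened.
elbo  <- c(-1000, -900, -899.99)
label <- c(-50, -45, -44.9999)
has_converged(elbo, label, tol = 1e-4, min_iter = 3L)
```

A sweep in which either quantity decreases never counts as converged under this rule, because the relative change would be negative.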

**Important:** This function assumes that the response vector y contains **no NA** values. The underlying C++ implementation does not skip missing responses and requires y[d] to be finite for all documents.
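
One way to meet this requirement before calling the function is to drop documents whose response is missing and re-index the remaining ones. The helper drop_missing_responses() below is a hypothetical sketch, not part of the package; it assumes that document index d in count corresponds to y[d + 1] in R's 1-based indexing.

```r
# Hypothetical helper: remove documents with non-finite y and re-index d.
drop_missing_responses <- function(count, y) {
  keep <- is.finite(y)              # FALSE for NA, NaN, Inf, and -Inf
  new_d <- cumsum(keep) - 1L        # new 0-based index of each kept document
  rows <- keep[count[, 1] + 1L]     # rows of count that belong to kept documents
  out <- count[rows, , drop = FALSE]
  out[, 1] <- new_d[out[, 1] + 1L]  # map old document indices to new ones
  storage.mode(out) <- "integer"
  list(count = out, y = y[keep])
}

# Toy input (invented): document 1 has a missing response.
count <- cbind(d = c(0L, 0L, 1L, 2L),
               v = c(0L, 1L, 1L, 2L),
               c = c(1L, 2L, 3L, 1L))
y <- c(0.2, NA, 1.5)

clean <- drop_missing_responses(count, y)
stopifnot(all(is.finite(clean$y)))
```

Word indices are left untouched, so vocabulary terms that occurred only in the dropped documents simply no longer appear in the filtered count.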