Calculates the expectation-based scan statistic. See details below.
Usage
scan_eb_zip(
counts,
zones,
baselines = NULL,
probs = NULL,
population = NULL,
n_mcsim = 0,
gumbel = FALSE,
max_only = FALSE,
rel_tol = 0.001
)
Arguments
- counts
Either:
A matrix of observed counts. Rows indicate time and are ordered from least recent (row 1) to most recent (row
nrow(counts)
). Columns indicate locations, numbered from 1 and up. Ifcounts
is a matrix, the optional matrix argumentsbaselines
andprobs
should also be specified.A data frame with columns "time", "location", "count", "baseline", "prob". The baselines are the expected values of the counts, and "prob" are the structural zero probabilities of the counts. If "baseline" and "prob" are not found as columns, their values are estimated in a very heuristic fashion (not recommended). If population numbers are available, they can be included in a column "population" to help with the estimation.
- zones
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone.
- baselines
Optional. A matrix of the same dimensions as
counts
. Holds the Poisson mean parameter of the ZIP distribution for each observed count. These parameters are typically estimated from past data using e.g. ZIP regression.- probs
Optional. A matrix of the same dimensions as
counts
. Holds the structural zero probability of the ZIP distribution for each observed count. These parameters are typically estimated from past data using e.g. ZIP regression.- population
Optional. A matrix or vector of populations for each location. Only needed if
baselines
andprobs
are to be estimated and you want to account for the different populations in each location (and time). If a matrix, should be of the same dimensions ascounts
. If a vector, should be of the same length as the number of columns incounts
.- n_mcsim
A non-negative integer; the number of replicate scan statistics to generate in order to calculate a \(P\)-value.
- gumbel
Logical: should a Gumbel P-value be calculated? Default is
FALSE
.- max_only
Boolean. If
FALSE
(default) the log-likelihood ratio statistic for each zone and duration is returned. IfTRUE
, only the largest such statistic (i.e. the scan statistic) is returned, along with the corresponding zone and duration.- rel_tol
A positive scalar. If the relative change in the incomplete information likelihood is less than this value, then the EM algorithm is deemed to have converged.
Value
A list which, in addition to the information about the type of scan statistic, has the following components:
- MLC
A list containing the number of the zone of the most likely cluster (MLC), the locations in that zone, the duration of the MLC, the calculated score, the relative risk, and the number of iterations until convergence for the EM algorithm. In order, the elements of this list are named
zone_number, locations, duration, score, relative_risk, n_iter
.- observed
A data frame containing, for each combination of zone and duration investigated, the zone number, duration, score, relative risk, number of EM iterations. The table is sorted by score with the top-scoring location on top. If
max_only = TRUE
, only contains a single row corresponding to the MLC.- replicates
A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.
- MC_pvalue
The Monte Carlo \(P\)-value.
- Gumbel_pvalue
A \(P\)-value obtained by fitting a Gumbel distribution to the replicate scan statistics.
- n_zones
The number of zones scanned.
- n_locations
The number of locations.
- max_duration
The maximum duration considered.
- n_mcsim
The number of Monte Carlo replicates made.
Details
For the expectation-based zero-inflated Poisson scan statistic
(Allévius & Höhle 2017), the null hypothesis of no anomaly holds that
the count observed at each location \(i\) and duration \(t\) (the
number of time periods before present) has a zero-inflated Poisson
distribution with expected value parameter \(\mu_{it}\) and structural
zero probability \(p_{it}\):
$$
H_0 : Y_{it} \sim \textrm{ZIP}(\mu_{it}, p_{it}).
$$
This holds for all locations \(i = 1, \ldots, m\) and all durations
\(t = 1, \ldots,T\), with \(T\) being the maximum duration considered.
Under the alternative hypothesis, there is a space-time window \(W\)
consisting of a spatial zone \(Z \subset \{1, \ldots, m\}\) and a time
window \(D \subseteq \{1, \ldots, T\}\) such that the counts in that
window have their Poisson expected value parameters inflated by a factor
\(q_W > 1\) compared to the null hypothesis:
$$
H_1 : Y_{it} \sim \textrm{ZIP}(q_W \mu_{it}, p_{it}), ~~(i,t) \in W.
$$
For locations and durations outside of this window, counts are assumed to
be distributed as under the null hypothesis. The sets \(Z\) considered
are those specified in the argument zones
, while the maximum
duration \(T\) is taken as the maximum value in the column
duration
of the input table
.
For each space-time window \(W\) considered, (the log of) a likelihood ratio is computed using the distributions under the alternative and null hypotheses, and the expectation-based Poisson scan statistic is calculated as the maximum of these quantities over all space-time windows. The expectation-maximization (EM) algorithm is used to obtain maximum likelihood estimates.
References
Allévius, B. and Höhle, M, An expectation-based space-time scan statistic for ZIP-distributed data, (Technical report), Link to PDF.
Examples
if (FALSE) {
set.seed(1)
# Create location coordinates, calculate nearest neighbors, and create zones
n_locs <- 50
max_duration <- 5
n_total <- n_locs * max_duration
geo <- matrix(rnorm(n_locs * 2), n_locs, 2)
knn_mat <- coords_to_knn(geo, 15)
zones <- knn_zones(knn_mat)
# Simulate data
baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs)
probs <- matrix(runif(n_total) / 4, max_duration, n_locs)
counts <- gamlss.dist::rZIP(n_total, baselines, probs)
# Inject outbreak/event/anomaly
ob_dur <- 3
ob_cols <- zones[[10]]
ob_rows <- max_duration + 1 - seq_len(ob_dur)
counts[ob_rows, ob_cols] <- gamlss.dist::rZIP(
ob_dur * length(ob_cols), 2 * baselines[ob_rows, ob_cols],
probs[ob_rows, ob_cols])
res <- scan_eb_zip(counts = counts,
zones = zones,
baselines = baselines,
probs = probs,
n_mcsim = 99,
max_only = FALSE,
rel_tol = 1e-3)
}