Skip to contents

Calculate the expectation-based negative binomial scan statistic devised by Tango et al. (2011).

Usage

scan_eb_negbin(
  counts,
  zones,
  baselines = NULL,
  thetas = 1,
  type = c("hotspot", "emerging"),
  n_mcsim = 0,
  gumbel = FALSE,
  max_only = FALSE
)

Arguments

counts

Either:

  • A matrix of observed counts. Rows indicate time and are ordered from least recent (row 1) to most recent (row nrow(counts)). Columns indicate locations, numbered from 1 and up. If counts is a matrix, the optional matrix arguments baselines and thetas should also be specified.

  • A data frame with columns "time", "location", "count", "baseline", "theta". See the description of the optional arguments baselines and thetas below to see their definition.

zones

A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone.

baselines

Optional. A matrix of the same dimensions as counts. Holds the expected value parameter for each observed count. These parameters are typically estimated from past data using e.g. GLM.

thetas

Optional. A matrix of the same dimensions as counts, or a scalar. Holds the dispersion parameter of the distribution, which is such that if \(\mu\) is the expected value, the variance is \(\mu+\mu^2/\theta\). These parameters are typically estimated from past data using e.g. GLM. If a scalar is supplied, the dispersion parameter is assumed to be the same for all locations and time points.

type

A string, either "hotspot" or "emerging". If "hotspot", the relative risk is assumed to be fixed over time. If "emerging", the relative risk is assumed to increase with the duration of the outbreak.

n_mcsim

A non-negative integer; the number of replicate scan statistics to generate in order to calculate a \(P\)-value.

gumbel

Logical: should a Gumbel P-value be calculated? Default is FALSE.

max_only

Boolean. If FALSE (default) the statistic calculated for each zone and duration is returned. If TRUE, only the largest such statistic (i.e. the scan statistic) is returned, along with the corresponding zone and duration.

Value

A list which, in addition to the information about the type of scan statistic, has the following components:

MLC

A list containing the number of the zone of the most likely cluster (MLC), the locations in that zone, the duration of the MLC, and the calculated score. In order, the elements of this list are named zone_number, locations, duration, score.

observed

A data frame containing, for each combination of zone and duration investigated, the zone number, duration, and score. The table is sorted by score with the top-scoring location on top. If max_only = TRUE, only contains a single row corresponding to the MLC.

replicates

A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.

MC_pvalue

The Monte Carlo \(P\)-value.

Gumbel_pvalue

A \(P\)-value obtained by fitting a Gumbel distribution to the replicate scan statistics.

n_zones

The number of zones scanned.

n_locations

The number of locations.

max_duration

The maximum duration considered.

n_mcsim

The number of Monte Carlo replicates made.

References

Tango, T., Takahashi, K. & Kohriyama, K. (2011), A space-time scan statistic for detecting emerging outbreaks, Biometrics 67(1), 106–115.

Examples

if (FALSE) {
set.seed(1)
# Create location coordinates, calculate nearest neighbors, and create zones
n_locs <- 50
max_duration <- 5
n_total <- n_locs * max_duration
geo <- matrix(rnorm(n_locs * 2), n_locs, 2)
knn_mat <- coords_to_knn(geo, 15)
zones <- knn_zones(knn_mat)

# Simulate data
 baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs)
 thetas <- matrix(runif(n_total, 0.05, 3), max_duration, n_locs)
 counts <- matrix(rnbinom(n_total,  mu = baselines,  size = thetas), 
                  max_duration, n_locs)

# Inject outbreak/event/anomaly
ob_dur <- 3
ob_cols <- zones[[10]]
ob_rows <- max_duration + 1 - seq_len(ob_dur)
counts[ob_rows, ob_cols] <- matrix(
  rnbinom(ob_dur * length(ob_cols), 
          mu = 2 * baselines[ob_rows, ob_cols],
          size = thetas[ob_rows, ob_cols]),
  length(ob_rows), length(ob_cols))
res <- scan_eb_negbin(counts = counts,
                      zones = zones,
                      baselines = baselines,
                      thetas = thetas,
                      type = "hotspot",
                      n_mcsim = 99,
                      max_only = FALSE)
}