Calculate the expectation-based Poisson scan statistic.
Source: R/scan_eb_poisson.R
Calculate the expectation-based Poisson scan statistic devised by Neill et al. (2005).
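For orientation, the statistic can be written out as follows. This is a sketch of the form usually attributed to Neill et al. (2005), not a formula quoted from this page. The notation is an assumption made here: \(y_{it}\) and \(\mu_{it}\) are the count and baseline for location \(i\) at time \(t\), \(T\) is the most recent time, \(Z\) is a zone and \(W\) a duration.

```latex
\[
C = \sum_{t=T-W+1}^{T} \sum_{i \in Z} y_{it}, \qquad
B = \sum_{t=T-W+1}^{T} \sum_{i \in Z} \mu_{it},
\]
\[
\hat{q} = \max\left(1, \frac{C}{B}\right), \qquad
\lambda(Z, W) =
\begin{cases}
C \log\left(\dfrac{C}{B}\right) + B - C, & C > B, \\
0, & \text{otherwise.}
\end{cases}
\]
```

Under this reading, the scan statistic is the maximum of \(\lambda(Z, W)\) over all zones and durations, and \(\hat{q}\) plays the role of the relative risk.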
Usage
scan_eb_poisson(
  counts,
  zones,
  baselines = NULL,
  population = NULL,
  n_mcsim = 0,
  gumbel = FALSE,
  max_only = FALSE
)
Arguments
- counts
Either:
  - A matrix of observed counts. Rows indicate time and are ordered from least recent (row 1) to most recent (row nrow(counts)). Columns indicate locations, numbered from 1 and up. If counts is a matrix, the optional matrix argument baselines should also be specified.
  - A data frame with columns "time", "location", "count", "baseline". Alternatively, the column "baseline" can be replaced by a column "population". The baselines are the expected values of the counts.
- zones
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone.
- baselines
Optional. A matrix of the same dimensions as counts. Not needed if counts is a data frame. Holds the Poisson mean parameter for each observed count. Will be estimated if not supplied (requires the population argument). These parameters are typically estimated from past data using e.g. Poisson (GLM) regression.
- population
Optional. A matrix or vector of populations for each location. Not needed if counts is a data frame. If counts is a matrix, population is only needed if baselines are to be estimated and you want to account for the different populations in each location (and time). If a matrix, should be of the same dimensions as counts. If a vector, should be of the same length as the number of columns in counts.
- n_mcsim
A non-negative integer; the number of replicate scan statistics to generate in order to calculate a \(P\)-value.
- gumbel
Logical: should a Gumbel \(P\)-value be calculated? Default is FALSE.
- max_only
Boolean. If FALSE (default), the log-likelihood ratio statistic for each zone and duration is returned. If TRUE, only the largest such statistic (i.e. the scan statistic) is returned, along with the corresponding zone and duration.
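The two accepted shapes of counts, and the shape of zones, can be illustrated with a small base-R sketch. The numbers and the object name long are made up for illustration; only the column names "time", "location", "count" and "baseline" come from the argument description above.

```r
# Matrix form: rows are time points (row 1 = least recent), columns are locations.
counts <- matrix(c(3, 1, 4, 1, 5, 9), nrow = 2, ncol = 3)
baselines <- matrix(2.5, nrow = 2, ncol = 3)  # expected counts, same dimensions

# Equivalent long data frame form, with one row per (time, location) pair.
# as.vector() unrolls each matrix column by column, so the columns stay aligned.
long <- data.frame(
  time     = as.vector(row(counts)),
  location = as.vector(col(counts)),
  count    = as.vector(counts),
  baseline = as.vector(baselines)
)

# Zones: a list of integer vectors of location numbers (column indices).
zones <- list(1L, 2L, 3L, c(1L, 2L), c(2L, 3L))
```

Either counts together with baselines, or long on its own, would then be passed as described in the arguments above.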
Value
A list which, in addition to the information about the type of scan statistic, has the following components:
- MLC
A list containing the number of the zone of the most likely cluster (MLC), the locations in that zone, the duration of the MLC, the calculated score, and the relative risk. In order, the elements of this list are named zone_number, locations, duration, score, relative_risk.
- observed
A data frame containing, for each combination of zone and duration investigated, the zone number, duration, score, and relative risk. The table is sorted by score, with the top-scoring combination first. If max_only = TRUE, it contains only a single row corresponding to the MLC.
- replicates
A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.
- MC_pvalue
The Monte Carlo \(P\)-value.
- Gumbel_pvalue
A \(P\)-value obtained by fitting a Gumbel distribution to the replicate scan statistics.
- n_zones
The number of zones scanned.
- n_locations
The number of locations.
- max_duration
The maximum duration considered.
- n_mcsim
The number of Monte Carlo replicates made.
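How the two \(P\)-values relate to the replicates can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the function names are hypothetical, the Monte Carlo formula is the common (1 + exceedances) / (1 + replicates) convention, and the Gumbel fit uses the method of moments because this page does not say which fitting method is used.

```r
# Monte Carlo P-value: share of replicates at least as large as the observed
# scan statistic, counting the observed value itself as one extra draw.
mc_pvalue_sketch <- function(observed, replicates) {
  (1 + sum(replicates >= observed)) / (1 + length(replicates))
}

# Gumbel P-value: fit a Gumbel distribution to the replicates by the method
# of moments (an assumption here), then take the upper tail probability.
gumbel_pvalue_sketch <- function(observed, replicates) {
  euler_gamma <- 0.5772156649
  scale <- sd(replicates) * sqrt(6) / pi
  location <- mean(replicates) - euler_gamma * scale
  1 - exp(-exp(-(observed - location) / scale))
}

replicates <- c(1, 2, 3, 4)  # made-up replicate scan statistics
p_mc <- mc_pvalue_sketch(3.5, replicates)
p_gumbel <- gumbel_pvalue_sketch(3.5, replicates)
```

The Gumbel version is mainly useful when the number of replicates is small, since the Monte Carlo \(P\)-value cannot go below 1 / (1 + n_mcsim).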
References
Neill, D. B., Moore, A. W., Sabhnani, M. and Daniel, K. (2005). Detection of emerging space-time clusters. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05), 218.
Examples
if (FALSE) {
  library(scanstatistics)
  set.seed(1)

  # Create location coordinates, calculate nearest neighbors, and create zones
  n_locs <- 50
  max_duration <- 5
  n_total <- n_locs * max_duration
  geo <- matrix(rnorm(n_locs * 2), n_locs, 2)
  knn_mat <- coords_to_knn(geo, 15)
  zones <- knn_zones(knn_mat)

  # Simulate data: rows are time points, columns are locations
  baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs)
  counts <- matrix(rpois(n_total, as.vector(baselines)), max_duration, n_locs)

  # Inject outbreak/event/anomaly: double the expected counts in zone 10
  # during the ob_dur most recent time periods
  ob_dur <- 3
  ob_cols <- zones[[10]]
  ob_rows <- max_duration + 1 - seq_len(ob_dur)
  counts[ob_rows, ob_cols] <- matrix(
    rpois(ob_dur * length(ob_cols), 2 * baselines[ob_rows, ob_cols]),
    length(ob_rows), length(ob_cols))

  # Run the scan with 99 Monte Carlo replicates
  res <- scan_eb_poisson(counts = counts,
                         zones = zones,
                         baselines = baselines,
                         n_mcsim = 99,
                         max_only = FALSE)
}
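As a cross-check on what the score table holds, the statistic can be computed by brute force in base R. This is a minimal sketch, assuming the usual expectation-based Poisson score (C log(C / B) + B - C when C exceeds B, otherwise 0) and strictly positive baselines. The function name eb_poisson_scores is hypothetical and is not part of the package.

```r
# Brute-force score for every zone and duration; base R only.
eb_poisson_scores <- function(counts, baselines, zones) {
  max_duration <- nrow(counts)
  out <- list()
  for (z in seq_along(zones)) {
    locs <- zones[[z]]
    for (d in seq_len(max_duration)) {
      # The d most recent time periods are the last d rows.
      rows <- seq(max_duration - d + 1, max_duration)
      C <- sum(counts[rows, locs])     # aggregated observed count
      B <- sum(baselines[rows, locs])  # aggregated expected count
      score <- if (C > B) C * log(C / B) + B - C else 0
      out[[length(out) + 1]] <- data.frame(
        zone = z, duration = d, score = score,
        relative_risk = max(1, C / B)
      )
    }
  }
  res <- do.call(rbind, out)
  res[order(res$score, decreasing = TRUE), ]  # highest score first
}

# Toy data: location 2 has an excess in the most recent period only.
counts <- rbind(c(2, 2, 2), c(2, 8, 2))
baselines <- matrix(2, nrow = 2, ncol = 3)
zones <- list(1L, 2L, 3L, c(1L, 2L))
scores <- eb_poisson_scores(counts, baselines, zones)
top <- scores[1, ]
```

Here top picks out zone 2 with duration 1, the only cell with an excess. The nested loop is quadratic in the worst case and is meant for checking small examples, not for real data sizes.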