vignettes/estimating-rdbesdataobjects.Rmd
estimating-rdbesdataobjects.Rmd
The aim of this document is to illustrate how population parameters like total and mean can be estimated using the estimMC(…) function. And how a multi level or multi stratum estimation can be run to estimate the population total and mean using doEstimationForAllStrata(…).
First we’ll load some example data from the RDBES and check it’s valid. It’s a good idea to check your RDBESDataObjects are valid after any manipulations you perform.
h1d <- H1Example
validateRDBESDataObject(h1d, verbose = FALSE)
h1d
#> Hierarchy 1 RDBESdataObject:
#> DE: 8
#> SD: 8
#> VS: 1214 (SRSWOR,CENSUS,SRSWR: 2-135/4-1382)
#> FT: 1430 (CENSUS,SRSWR: 1-3/1-100)
#> FO: 1916 (CENSUS,SRSWR: 1-3/1-20)
#> SS: 1916 (CENSUS,SRSWR: 1/1-4)
#> SA: 1916 (CENSUS,SRSWR: 1/1-2)
#> FM: 7290
#> BV: 14580 (SRSWR: 2/2)
#> VD: 311
#> SL: 170
Let’s first estimate the last level values for a single FMid so that estimation is done for one top level record
FMidSel <- "4033243"
BV <- h1d$BV[h1d$BV$FMid == FMidSel,]
estimMC(as.numeric(BV$BVvalueMeas),
BV$BVnumSamp,
BV$BVnumTotal,
method=unique(BV$BVselectMeth))
#> $est.total
#> [1] 4
#>
#> $est.mean
#> [1] 2
#>
#> $est.algorithm
#> [1] "Generalized Horvitz-Thompson aka Mutiple-Count"
#>
#> $var.total
#> [1] 0
#>
#> $var.mean
#> [1] 0
#>
#> $var.algorithm
#> [1] "Sen-Yates-Grundy"
#>
#> $PI
#> [,1] [,2]
#> [1,] 0.5 0.5
#> [2,] 0.5 0.5
This estimMC(…) function is actually the core of the estimation functions running on multiple levels as well. For implementation details see: Variance calculation functions using “Multiple count” estimator
Estimation actually is done on a RDBESEstObject that is generate from RDBESDataObject using createRDBESEstObject(…). The following example uses a dataset that resembles in many ways real data at ICES area 27.3.d.28.1 from the Estonian Baltic Trawling fleet for 2022 sprat (Sprattus sprattus).The dataset has both total landings and commercial sampling data.
H8ExampleEE1
#> Hierarchy 8 RDBESdataObject:
#> DE: 1
#> SD: 1
#> TE: 11 (SRSWOR: 2-3/4)
#> VS: 14 (SRSWR: 1-2/7-12)
#> LE: 15 (SRSWOR: 1-2/1-2)
#> SS: 15 (CENSUS: 2/2)
#> SA: 15 (SRSWOR: 1/794-14268)
#> BV: 3995 (CENSUS: 16-100/16-100)
#> VD: 7
#> SL: 2
#> CL: 71
#> CE: 132
The data has not gone through RDBES upload and download procedure and contains invalid data types but also may have some other validity problems. However the fields for estimation should be present and valid.
#to check data types
validateRDBESDataObject(H8ExampleEE1, checkDataTypes = TRUE, strict = FALSE)
#create the estimation object to estimate values on the SA table
estData <- createRDBESEstObject(H8ExampleEE1, 8, "SA" )
#the SAsampWtMes is in grammes let's estimate in tonnes
estData$SAsampWtMes <- estData$SAsampWtMes / (1000 * 1000)
results <- doEstimationForAllStrata(estData, "SAsampWtMes")
#let's compare the results from 1st quarter with actual catches from Q1
CL <- H8ExampleEE1$CL
Q1months <- c("January", "February", "March")
Q1results <- results[results$recType== "TE" &
results$stratumName %in% Q1months,]
cat("Total estimated SPR catch in Q1: ",
round(sum(Q1results$est.total), 3), "tonnes\n")
Total estimated SPR catch in Q1: 381.262 tonnes
#CLoffWeight is in Kg, converting to tonnes
cat("Total reported SPR catch in Q1: ",
sum(CL[CL$CLquar ==1, "CLoffWeight"])/1000, "tonnes")
Total reported SPR catch in Q1: 396.21 tonnes
#let's add the actual totals
Q1results$actualTotals <- c(February=124.829,January=119.108, March=152.273)
So the total estimated catch (est.total) for the first quarter is near the total of the actual catch. The table below gives the estimates by month with standard errors (se.total). The (est.mean) and its standard error (SE, se.mean) are the estimated mean catch of sprat per week in the given month.
#there are 4 weeks and 2 are sampled in TE so in this case the total is:
#mean * 4
Q1results$est.mean * 4
#> [1] 209.27146 138.99464 32.99554
As you can see from the table for the first 2 months the estimates with se.total are in line with the estimate as for March somehow there is a big discrepancy.
columns <- c("recType","stratumName","actualTotals",
"est.total", "se.total", "est.mean", "se.mean")
captionText <- paste("Estmation results for 1st quarter by months (stratums).",
"Total estimated catch (est.total) and standard error (se.total)",
" as well as estimated mean with se for vessels are given.",
"In addition actual total reported weights are given.")
knitr::kable(Q1results[columns], digits=3, caption=captionText)
recType | stratumName | actualTotals | est.total | se.total | est.mean | se.mean | |
---|---|---|---|---|---|---|---|
56 | TE | February | 124.829 | 209.271 | 97.071 | 52.318 | 24.268 |
57 | TE | January | 119.108 | 138.995 | 41.709 | 34.749 | 10.427 |
58 | TE | March | 152.273 | 32.996 | 7.213 | 8.249 | 1.803 |
In the next sections we will use data from R packages survey and SDAResources that are converted into the RDBESDataObject to demonstrate the estimation procedure.
From SDAResources we use agstrat which is data from a stratified random sample of size 300 from the 1992 U.S. Census of Agriculture agpop data. The datset is converted into RDBESDataObject Pckg_SDAResources_agstrat_H1 so that Table VS and SA contain the core information of used in Lohr examples 3.2 and 3.6. Table VS is stratified with VSstratumName set to region, and VSnumberSampled and VSnumberTotal set according to agstrat VSunitName is set to a combination of original county, state, region and agstrat row numbers. Table SA contains the variable measured acres92 in , and set to 1. Table DE, SD, FT and FO are for the most dummy tables inserted to meet RDBES model requirements to be aggregated during estimation tests. Values of mandatory fields have dummy values taken from an onboard programme, with exception of * - that is set to CENSUS - they should be ignored during estimation. BV, FM, CL, and CE are not provided. SL and VD are subset to the essential rows.
agstrat_H1 <- Pckg_SDAResources_agstrat_H1 #valid
agstrat_H1
#> Hierarchy 1 RDBESdataObject:
#> DE: 1
#> SD: 1
#> VS: 300 (SRSWOR: 21-135/220-1382)
#> FT: 300 (CENSUS: 1/1)
#> FO: 300 (CENSUS: 1/1)
#> SS: 300 (CENSUS: 1/1)
#> SA: 300 (CENSUS: 1/1)
#> FM: 0
#> BV: 0
#> VD: 311
#> SL: 1
#create the estimation object to estimate values on the SA table
estObj <- createRDBESEstObject(agstrat_H1, 1, "SA")
res <- doEstimationForAllStrata(estObj, "SAsampWtMes")
# Get the estimated total and mean for "SAsampWtMes" for the VS stratum "NC"
columns2Get <- c("est.total","est.mean", "se.total","se.mean")
round(unlist(res[res$recType == "VS" & res$stratumName == "NC" ,columns2Get]),1)
#> est.total est.mean se.total se.mean
#> 316731379.7 300504.2 16977399.2 16107.6
How to interpret these results in the above example?
You can explore the result to see the estimated total, mean etc. for other stratums.
From survey we use apiclus2 which is the Academic Performance Index computed for all California schools based on standardised testing of students. The data sets contain information for all schools with at least 100 students and for various probability samples of the data. The design is 2-stage cluster sampling with clusters of unequal sizes An SRS of 40 districts is selected (psus) from the 757 districts in the population and then up to 5 schools (min 1) were selected from each district (ssus) The datset is converted into RDBESDataObject Pckg_survey_apiclus2_H1 so that 1 DE row with DEstratumName == “Pckg_SDAResources_apiclus2_H1”, 1 child SD row, 40 child rows in VS (the 40 districts), VSnumberTotal is 757, VSnumberSampled is 40, 126 child rows in FT (the 126 schools finally observed), each associated to its cluster (dname), FTnumberTotal is the number of schools in district, FTnumberSAmpled is 1…5 schools sampled, tables FO, SS are just 1:1 links to the final data (in SA), in SA SAsampleWeightMeasured is enroll (NB! there are 4 NAs)
apiclus2_H1 <- Pckg_survey_apiclus2_H1 # valid H1
apiclus2_H1
#> Hierarchy 1 RDBESdataObject:
#> DE: 1
#> SD: 1
#> VS: 40 (SRSWOR: 40/757)
#> FT: 126 (SRSWOR: 1-5/1-72)
#> FO: 126 (CENSUS: 10/10)
#> SS: 126 (CENSUS: 1/1)
#> SA: 126 (CENSUS: 1/1)
#> FM: 0
#> BV: 0
#> VD: 311
#> SL: 1
#create the estimation object to estimate values on the SA table
estObj <- createRDBESEstObject(apiclus2_H1, 1, "SA")
# Run estimation on total measured weight: SAtotalWtMes
res <- doEstimationForAllStrata(estObj, "SAtotalWtMes")
# Get the estimated total and mean for "SAtotalWtMes" for the VS
columns2Get <- c("est.total","est.mean", "se.total","se.mean")
round(unlist(res[res$recType == "VS" ,columns2Get]),1)
#> est.total est.mean se.total se.mean
#> 2639272.9 3486.5 798295.7 1054.6
How to interpret these results in the above example?
The SE values are just the standard errors for the total and mean respectively.
See also other package vignettes: