vignettes/generating-NAs-using-the-species-list.Rmd
The aim of this document is to demonstrate how the Species List table (SL) of the RDBES can be used to complement the sample table with NAs in cases where, e.g., a species was not meant to be looked for. This task is made easy by the function `generateNAsUsingSL`, available in the `RDBEScore` package.
# read an example dataset and simplify it to 1 trip and 1 haul [dev note: this section needs to be reworked when data and filterRDBESDataObject are updated]
library(RDBEScore)
data(Pckg_survey_apistrat_H1)
myH1DataObject1 <- Pckg_survey_apistrat_H1
# keep only the species list that belongs to this example dataset
myH1DataObject1$SL <- myH1DataObject1$SL[grepl("Pckg_survey_apistrat_H1", myH1DataObject1$SL$SLspeclistName), ]
#myH1DataObject1<-filterAndTidyRDBESDataObject(myH1DataObject1, fieldsToFilter="FOid",valuesToFilter=70849, killOrphans = TRUE)
# keep a single Species Selection record (SSid 227694)
myH1DataObject1 <- filterRDBESDataObject(myH1DataObject1, fieldsToFilter = "SSid", valuesToFilter = 227694, killOrphans = TRUE)
# check it is a valid RDBESobject
validateRDBESDataObject(myH1DataObject1, checkDataTypes = TRUE)
The example uses data from hierarchy 1. It contains a single trip with a single haul. For simplicity, we restrict our analysis to the tables SL, SS and SA, which are the ones handled by the function whose behaviour we want to demonstrate.
Examining a print of the Species List table (SL) one can conclude that the sampling targeted the landings of only 1 species. In this case the species was Nephrops norvegicus (aphiaId 107254).
myH1DataObject1[c("SL")]
#> $SL
#> SLid SLrecType SLcou SLinst SLspeclistName
#> 1: 47891 SL ZW 4484 WGRDBES-EST_TEST_1_Pckg_survey_apistrat_H1
#> SLyear SLcatchFrac SLcommTaxon SLsppCode
#> 1: 1965 Lan 107254 107254
Examining a print of the Species Selection table (SS), one can confirm that only one fishing operation is present in the data (FOid 70849) and that landings were indeed sampled from it (for simplicity only a subset of columns is printed).
myH1DataObject1[[c("SS")]][,1:15]
#> SSid LEid FOid TEid FTid SLid OSid SSrecType SSseqNum SSstratification
#> 1: 227694 NA 70849 NA NA 47891 NA SS 1 N
#> SSstratumName SSclustering SSclusterName SSobsActTyp SScatchFra
#> 1: U N U Sort Lan
Given the above, it is expected that, if Nephrops norvegicus was sampled, it will appear in the RDBES Sample table (SA). One can confirm that this happened by printing that table (for simplicity only a subset of columns is printed).
Suppose we want to consult the data to produce an estimate for, e.g., cod (aphiaId 126436). That species was not targeted by the sampling programme, and it is impossible to infer from the data whether or not it was present alongside Nephrops norvegicus during the sampling. The total weight measured (SAtotalWtMes) of cod should therefore be considered missing (NA).
The function `generateNAsUsingSL` does that (again, for convenience, only a few columns of the SA table are printed).
myH1DataObject1updte<-generateNAsUsingSL(myH1DataObject1, targetAphiaId = c("126436"))
myH1DataObject1updte$SA[,c(1:9,48:49)]
#> SAid SSid LEid SArecType SAseqNum SAparSequNum SAstratification
#> 1: 572813 227694 NA SA 1.000 NA N
#> 2: 572813 227694 NA SA 1.001 NA N
#> SAstratumName SAspeCode SAtotalWtMes SAsampWtMes
#> 1: U 107254 276 276
#> 2: U 126436 NA NA
Note that the new rows have floating-point values for SAid and SAseqNum (we use sprintf to ensure the decimal places are displayed). This facilitates the ordering of the samples and prevents overlaps when different datasets are joined. Also, an SAunitName was created for the new row; it builds on the SAid and helps to make the row more readily identifiable.
sprintf(myH1DataObject1updte[['SA']]$SAid, fmt = '%.3f')
#> [1] "572813.000" "572813.001"
sprintf(myH1DataObject1updte[['SA']]$SAseqNum, fmt = '%.3f')
#> [1] "1.000" "1.001"
print(myH1DataObject1updte[['SA']]$SAunitName)
#> [1] "1" "NAgen_572813.001"
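The reason sprintf is needed can be seen on a toy vector. The snippet below is a minimal base R illustration and does not use the package: the vector `ids` is hypothetical and simply mirrors the two SAid values above. Default printing works with a limited number of significant digits (the `digits` option, 7 by default in base R), so a fractional part in the sixth decimal position of a six-digit id is not displayed unless a fixed number of decimals is requested.

```r
# Hypothetical ids mirroring the SAid values shown above
ids <- c(572813, 572813 + 0.001)

# Default printing uses getOption("digits") significant digits,
# so the two values look identical on screen
print(ids)

# The values are nevertheless distinct
ids[1] == ids[2]

# Requesting three fixed decimals makes the difference visible
shown <- sprintf("%.3f", ids)
shown
```

The same formatting applies to SAseqNum, where the generated rows are offset by 0.001 from the original sequence number.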
Note that the argument `targetAphiaId` of `generateNAsUsingSL` also accepts a vector, which allows NAs to be generated for several species in one go. In the example below, Pandalus borealis (aphiaId 107649) is added to the call.
myH1DataObject1updte<-generateNAsUsingSL(myH1DataObject1, targetAphiaId = c("126436","107649"))
myH1DataObject1updte$SA[,c(1:9,48:49)]
#> SAid SSid LEid SArecType SAseqNum SAparSequNum SAstratification
#> 1: 572813 227694 NA SA 1.000 NA N
#> 2: 572813 227694 NA SA 1.001 NA N
#> 3: 572813 227694 NA SA 1.002 NA N
#> SAstratumName SAspeCode SAtotalWtMes SAsampWtMes
#> 1: U 107254 276 276
#> 2: U 126436 NA NA
#> 3: U 107649 NA NA
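The row generation shown so far can be mimicked on a toy table. The sketch below is not the package's implementation: the helper `addNaRows` and the toy data frame `sa` are hypothetical, and the rule used (offset the largest existing SAid and SAseqNum by 0.001 per new species) is an assumption inferred from the printouts above.

```r
# Toy sample table with a single sampled species (values mirror the printout)
sa <- data.frame(
  SAid = 572813,
  SAseqNum = 1,
  SAspeCode = "107254",
  SAtotalWtMes = 276,
  SAunitName = "1",
  stringsAsFactors = FALSE
)

# Hypothetical helper: append one NA row per target species absent from `sa`
addNaRows <- function(sa, targets) {
  missingSpp <- setdiff(targets, sa$SAspeCode)
  if (length(missingSpp) == 0) {
    return(sa)
  }
  # one small offset per new row keeps ids unique and ordered
  offsets <- seq_along(missingSpp) / 1000
  newIds <- max(sa$SAid) + offsets
  newRows <- data.frame(
    SAid = newIds,
    SAseqNum = max(sa$SAseqNum) + offsets,
    SAspeCode = missingSpp,
    SAtotalWtMes = NA_real_,
    SAunitName = paste0("NAgen_", sprintf("%.3f", newIds)),
    stringsAsFactors = FALSE
  )
  rbind(sa, newRows)
}

out <- addNaRows(sa, c("126436", "107649"))
out
```

Passing a vector of codes yields one new row per species, each with a missing weight, which is the pattern seen in the three-row table above.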
In practice, diligent observers sometimes record more species than those expected. Such “excess” data are frequently useless from an estimation point of view (because the sampling is observer-dependent and therefore likely non-representative), but in some analyses (e.g., distribution of rare species) or summaries (e.g., totals of biomass sampled) it may be useful to preserve them.
The difference between these two cases can be specified via the argument `overwriteSampled` of `generateNAsUsingSL`. By default (estimation case) the argument is set to TRUE, which makes `generateNAsUsingSL` set the weights of these extra species to NA. By explicitly setting `overwriteSampled=FALSE`, the information collected can instead be kept.
To demonstrate this, we make a small alteration to the example data, removing Nephrops norvegicus from the Species List. This creates a somewhat atypical situation (a haul where nothing was supposed to be looked for but Nephrops norvegicus was still registered) that is used here for the sake of simplifying the example.
# we remove *Nephrops norvegicus*
myH1DataObject1$SL<-myH1DataObject1$SL[-1,]
validateRDBESDataObject(myH1DataObject1, checkDataTypes = TRUE)
Now we call `generateNAsUsingSL` for Nephrops norvegicus with its implicit default `overwriteSampled=TRUE` (regular estimation case). The function sets the weights of that species to NA:
myH1DataObject1updte<-generateNAsUsingSL(myH1DataObject1,
targetAphiaId = c("107254"))
myH1DataObject1updte$SA[,c(1:9,48:49)]
#> SAid SSid LEid SArecType SAseqNum SAparSequNum SAstratification
#> 1: 572813 227694 NA SA 1 NA N
#> SAstratumName SAspeCode SAtotalWtMes SAsampWtMes
#> 1: U 107254 NA NA
If, on the other hand, we are interested in keeping all available data, we set `overwriteSampled=FALSE`:
myH1DataObject1updte<-generateNAsUsingSL(myH1DataObject1,
targetAphiaId = c("107254"),
overwriteSampled=FALSE)
myH1DataObject1updte$SA[,c(1:9,48:49)]
#> SAid SSid LEid SArecType SAseqNum SAparSequNum SAstratification
#> 1: 572813 227694 NA SA 1 NA N
#> SAstratumName SAspeCode SAtotalWtMes SAsampWtMes
#> 1: U 107254 276 276
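The effect of `overwriteSampled` in the two calls above can be summarised with a toy sketch. It is not the package's code: the helper `maskUnlisted` and the toy tables `sl` and `sa` are hypothetical, and the rule (a target species present in SA but absent from SL has its weights set to NA only when `overwriteSampled` is TRUE) is an assumption based on the outputs shown.

```r
# Toy species list with no species, as after removing Nephrops norvegicus
sl <- data.frame(SLsppCode = character(0), stringsAsFactors = FALSE)

# Toy sample table: the species was recorded although it is not listed
sa <- data.frame(
  SAspeCode = "107254",
  SAtotalWtMes = 276,
  SAsampWtMes = 276,
  stringsAsFactors = FALSE
)

# Hypothetical helper reproducing the overwrite rule described above
maskUnlisted <- function(sa, sl, targets, overwriteSampled = TRUE) {
  # rows for target species that are sampled but not in the species list
  unlisted <- sa$SAspeCode %in% targets & !(sa$SAspeCode %in% sl$SLsppCode)
  if (overwriteSampled) {
    sa$SAtotalWtMes[unlisted] <- NA_real_
    sa$SAsampWtMes[unlisted] <- NA_real_
  }
  sa
}

masked <- maskUnlisted(sa, sl, targets = "107254")
kept <- maskUnlisted(sa, sl, targets = "107254", overwriteSampled = FALSE)
```

Here `masked` corresponds to the estimation case, with both weights missing, while `kept` retains the recorded weights for use in summaries or exploratory analyses.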