Rethinking sampling functions to include weights and numbers

Introduction: the problem

The original design of atlantisom conceived at the 2015 Atlantis Summit in Honolulu, Hawaii, USA.had the following workflow:

apply run_truth to a set of Atlantis model outputs, which returns ‘atoutput’ components in the following units by species, agecl (Atlantis cohort, often in units of maxage/10), polygon (model box), layer (model depth), and time (model toutinc or outputstep):
- biomass_eaten$atoutput is eaten biomass in t for each functional group?
  - biomass_eaten$vol is volume aggregated over layers
  - biomass_eaten$prey is the prey item eaten?
  - biomass_eaten$dietcomp is proportion in diet?
  - biomass_eaten$bio_eaten is different from atoutput but how?
- biomass_ages$atoutput is biomass in t from calc_biomass_age
- catch$atoutput is fishery catch in numbers of individuals
- nums$atoutput is numbers of individuals
- resn$atoutput is Nitrogen reserve weight in mg/m^3
- structn$atoutputis Nitrogen structural weight in mg/m^3
additional components of the run_truth return (default name result) are key factors needed for conversions and interpreting the outputs (described elsewhere):
- biolprm
- fgs
apply create_survey to the result output of run_truth
apply sample_survey_biomass to the output of create_survey to get an index
- designed to use nums output with an average weight–but estimating average weight proved problematic the way I did it, which was too imprecise
- can be used directly on biomass_ages output with a vector of average weight=1 (see True Bio Test)
apply sample_fish to the output of create_survey to generate biological samples
apply calc_age2length to output of sample_fish to generate length compositions
- FIRST BIG PROBLEM! calc_age2length needs weight data at the same level of resolution as the nums data to estimate lengths using the length-weight relationship.
- Weight data comes from resn and structn. calc_age2length was originally tested with completely disaggregated outputs of run_truth and it generates length samples correctly that way
- Initial calc_age2length testing with a real dataset revealed that it is entirely too slow to apply at the level of run_truth results for a full Atlantis model run and over many species (TrueLengthCompTest.Rmd)
- Attempts to aggregate structn and resn results to the create_survey and sample_fish level were incorrect, because:
- we cannot apply functions designed for numbers output that aggregate based on sum to resn and structn outputs which are in units of N per volume
- So actually, this problem reaches all the way back to create_survey which is designed for nums output and aggregates using a sum:

The next three biological sampling steps after creating the length composition are taking a subset of the sample_fish for lengths using sample_length (though this could be 100% sample as is often done on a survey), taking a subset of the sample_fish for ages and applying ageing error using sample_ages, and finally taking a sample of individual fish weights, perhaps the same sample that was aged as is done on some surveys) using sample_weight. (So far only sample_ages is drafted and the other two functions remain to be written.)

May 15: True age estimation needs to happen in here?

The sequence create_survey to sample_fish to calc_age2length now works. It can go as pictured below if there are not multiple true ages in an age class. Revisions to work on:

For clarity, calc_age2length should be renamed calc_stage2length if we are planning to use it directly between Atlantis cohorts and lengths as tested here.
Between sample_fish and sample_ages we need the already written calc_stage2age function, which calls calc_Z. Test these next. But see below.
Similarly, between sample_fish and sample_weight we will need to expand to true ages for a weight-at-age vector (for species where stage=age, this is already output as muweight from the calc_(st)age2length function).

BUT calc_stage2age is designed to work on true outputs (it makes no sense to expand age classes from survey-selectivity affected output). In theory this should be integrated into run_truth but I am wary of introducing this approximation so far upstream. On the other hand, what choice to we have if we want true age outputs? We either need the direct output from the most recent codebase(s) or we have to do this approximation to get age input to stock assessment models.

So, lets try integrating calc_stage2age into run_truth as planned and we can compare the outputs to what we already have for the stage data. We will want to compare census biomass, survey biomass, length comps…

+ write another truth function that works on the Atlantis true age output?
+ load YOY file into run_truth (uses load_YOY function)
+ do we have to apply before calc_biomass_age, and sub output of calc_stage2age in for nums?
+ if so, how do we deal with the structn and resn portions? 

May 16: Discussion with Isaac and Patrick: how to split weights to true age bins?
+ Simple interpolation? within a year or across years
+ Is Z calc lagged? doesnt look like it, simple within year Z like catch curve
+ Check whether there is a plus group in any Atlantis model--shouldn't be
+ First try a linear interpolation between weight groups
+ This may be more wrong for smallest fastest growing age classes
+ If so we can write something else for younger classes
+ Midpoint of weight for each age is the goal

Then we need to revise calc_age2length to work for true ages, as it is now hardcoded to expext 10 age classes.

If we do those two things, this sampling flow remains the same, but has calc_age2length as the intermediate step before lengths and weights, which will come out from the true age classes.

Proposed solution (now done for calc_age2length)

I think we should keep the workflow designed at the Atlantis summit because this is the way surveys generally operate.

We need a way to appropriately aggregate the resn and structn outputs which are densitites. A mean or median across layers and polygons would be appropriate. The run_truth function calls calc_biomass_age which uses median to aggregate these outputs to generate the biomass_ages result component.

We can either rewrite aggregateData to take another argument indicating aggregation by sum or median, or create separate functions aggregateDataSum and aggregateDataMedian to be used on the numbers/biomass or density outputs, respectively. Either way we need to ensure that the appropriate aggregation function is used in create_survey and sample_fish

Current aggregateData function, which is strangely not called by sample_fish. Or not so strange, since aggregateData defaults to aggregate only over layer and maintain polygons in the dataset?

While we are at it, catch biomass is apparently calculated by run_truth but not exported, so I think we should export that all at once. Right now only catch in numbers is exported, but catch biomass is most often what management is based on and is an input to most assessment models.

For fisheries, the plan would be to run create_fishery_subset similar to create_survey and then sample_fish with the same functions as above to get the catch at length and catch at age samples.

Current create_fishery_subset function, as yet untested:

To Do List:

write an aggregation function for resn and structn that applies the median
- Done 10 May, aggregateDensityData
rewrite create_survey to use alternate aggregation functions, or write create_survey_nums and create_survey_wt functions that use appropriate aggregation
- the latter may be better, because do we want to apply efficiency and selectivity to median weight densities? I don’t think so?
- the fish still weighs the same, we just don’t have the right proportion in the ecosystem (efficiency) or the right proportion at age (selectivty), both of which are taken care of in the numbers.
- so maybe aggregateDensityData using the standard survey polygons gets us what we need as input to sample_fish?
- Yes, did the latter was done
rewrite sample_fish to use alternate aggregation functions for each input type. Keeping this all in one function makes sense because the biological sampling already needs both numbers and weights as inputs
- Done 10 May, use sample=FALSE option

Rethinking sampling functions to include weights and numbers

Sarah Gaichas

16 May, 2019

Introduction: the problem

May 15: True age estimation needs to happen in here?

Proposed solution (now done for calc_age2length)

To Do List: