Data

JUNE_NZ requires a number of input files.

The script cli_data.py is provided to create all the required inputs for JUNE_NZ.

cli_data --workdir <Working directory>
         --cfg <Data configuration>
         --scale <Population scale>
         --disease_cfg_dir <Disease configuration directory>
         --policy_cfg_path <Policy configuration path>
         [--exclude_super_areas A1, A2]
         [--use_sa3_as_super_area]

The command options are explained below:

--workdir: Specifies the directory where the generated data will be stored. For example, --workdir /tmp/june_data.
--cfg: Sets the configuration for retrieving the source data. For example, --cfg etc/june_data.yml.
--scale: Determines the percentage of the population to be used. For instance, a value of 0.1 means only 10% of the population will be utilized. For example, --scale 0.1
--exclude_super_areas: Allows excluding specific super areas from the model. For example, --exclude_super_areas A1 A2.
--disease_cfg_dir: Disease configuration directory. For example, --disease_cfg_dir etc/cfg/disease/covid-19.
--policy_cfg_path: Policy file. For example, --policy_cfg_path etc/cfg/policy/policy1.yaml.
--simulation_cfg_path: Simulation file. For example, --simulation_cfg_path etc/cfg/simulation/simulation_cfg.yml.
--use_sa3_as_super_area: Uses SA3 as the super area level; otherwise, regional councils are used as the super area level. Note that SA3 is not a standard statistical level, so much of its information is aggregated from SA2 (e.g., super_area_location).
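
As a minimal sketch of how the options above fit together, the following assembles a cli_data command line from the example values listed for each option. The helper name build_cli_data_command is hypothetical and not part of JUNE_NZ; only flags documented above are used.

```python
import shlex


def build_cli_data_command(workdir, cfg, scale, disease_cfg_dir, policy_cfg_path,
                           use_sa3_as_super_area=False):
    """Assemble a cli_data argument list from the documented options.

    Hypothetical helper for illustration only; it is not part of JUNE_NZ.
    """
    args = [
        "cli_data",
        "--workdir", workdir,
        "--cfg", cfg,
        "--scale", str(scale),
        "--disease_cfg_dir", disease_cfg_dir,
        "--policy_cfg_path", policy_cfg_path,
    ]
    if use_sa3_as_super_area:
        # Flag without a value: switches the super area level to SA3.
        args.append("--use_sa3_as_super_area")
    return args


command = build_cli_data_command(
    workdir="/tmp/june_data",
    cfg="etc/june_data.yml",
    scale=0.1,
    disease_cfg_dir="etc/cfg/disease/covid-19",
    policy_cfg_path="etc/cfg/policy/policy1.yaml",
)
# Print the command as a single shell-quoted string.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run if the script is to be launched from Python.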

Note

Most input data are created from raw datasets stored in June_NZ_data (most of which are obtained from NZ.Stat), while some inputs are defined via:

the fixed variable FIXED_DATA (process/__init__.py), or

from external configuration files:

Disease configuration (the directory contains all the information about the population disease, including the viruses we want to investigate), e.g., etc/cfg/disease/covid-19

Policy configuration, e.g., etc/cfg/policy/policy1.yaml

Simulation control configuration, e.g., etc/cfg/simulation/simulation_cfg.yml

Vaccination configuration, e.g., etc/cfg/disease/vaccine/vaccine1.yaml

The following contents show different types of inputs for JUNE_NZ.

1. Population data (demography)

It defines the population (agents) to be used in the model.

Population/demography data
Data	Geography level	Example
The number of people (grouped by age)	area	age_profile.csv
The number of people (grouped by age and ethnicities)	area	ethnicity_profile.csv
The percentage of females (grouped by age)	area	gender_profile_female_ratio.csv

Note

The number of people (grouped by age) determines the total number of people to be used in the model.

2. Geography data

It defines the geography (grid) to be used in the model.

Geography data
Data	Geography level	Example
Area latitude and longitude	area	area_location.csv
Super area latitude and longitude	super area	super_area_location.csv
Super area names	super area	supare_area_name.csv
Super area and area socioeconomic_centile	super area, area	area_socialeconomic_index.csv
Hierarchy of region, super area and area	region, super area, area	geography_hierarchy_definition.csv

3. Group (activities) data

Group data contains different types of activities (e.g., company, household, hospital, school and leisure) that an individual might do every day.

3.1 Company

It defines the companies used in the model

Company data
Data	Geography level	Example
Number of employers by firm size	super area	employers_by_firm_size.csv
Number of employers by sector type	super area	sectors_by_sector.csv
Number of employees by sector and age	area	employees.csv
Who is treated as a key worker, etc., when a company closes	`NULL`	company_closure.yaml
Sub-sector configuration	`NULL`	subsector_cfg.yaml

In the above data, the number of employers by firm size, the number of employers by sector type and the number of employees are obtained from NZ.Stat, while the company closure and sub-sector configurations are defined in the variable FIXED_DATA. For example,

"company": {
    "employees": {"employment_rate": 0.7},
    "company_closure": {
        "company_closure": {
            "sectors": {
                "A": {"key_worker": 1.0, "furlough": 0.0, "random": 0.0},
                "P": {"key_worker": 0.0, "furlough": 0.0833, "random": 0.9167},
                ...
                "S": {"key_worker": 0.0, "furlough": 0.0, "random": 1.0},
            }
        }
    },
    "subsector_cfg": {
        "age_range": [18, 64],
        "sub_sector_ratio": {"P": {"m": 0.4, "f": 0.6}, "Q": {"m": 0.5, "f": 0.5}},
        "sub_sector_distr": {
            "P": {
                "label": ["teacher_secondary", "teacher_primary"],
                "m": [0.72526887, 0.27473113],
                "f": [0.72526887, 0.27473113],
            },
            ...
            "Q": {
                "label": ["doctor", "nurse"],
                "m": [0.65350126, 0.34649874],
                "f": [0.16103136, 0.83896864],
            },
        },
    },
},
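
In the excerpt above, the key_worker, furlough and random shares of each sector appear to add up to 1. Assuming that is intended (the source does not state it), a minimal sketch of a consistency check could look like this. The function name sectors_with_bad_shares is hypothetical.

```python
import math

# Values copied from the excerpt above; the elided sectors ("...") are left out.
company_closure_sectors = {
    "A": {"key_worker": 1.0, "furlough": 0.0, "random": 0.0},
    "P": {"key_worker": 0.0, "furlough": 0.0833, "random": 0.9167},
    "S": {"key_worker": 0.0, "furlough": 0.0, "random": 1.0},
}


def sectors_with_bad_shares(sectors, tol=1e-6):
    """Return the sectors whose shares do not sum to 1 (assumed constraint)."""
    bad = []
    for sector, shares in sectors.items():
        if not math.isclose(sum(shares.values()), 1.0, abs_tol=tol):
            bad.append(sector)
    return bad


# An empty list means every listed sector passes the check.
print(sectors_with_bad_shares(company_closure_sectors))
```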

Note

The number of employees from NZ.Stat is somehow smaller than the value expected from the NZ population. Therefore, FIXED_DATA includes a variable called employment_rate, a factor that makes the number of employees match the assumed number of people in employment.

3.2 Household

It defines the household information used in the model

Household data
Data	Geography level	Example
Age difference for parents-children	super area	age_difference_parent_child.csv
Age difference for couple	super area	age_difference_couple.csv
Number of regular households (e.g., with different household compositions)	area	household.csv
Regular household definition	`NULL`	household_def.yaml
Number of communal households	area	household_commual.csv
Number of student-only households (e.g., dormitory)	area	household_student.csv

The household information comes from both external datasets and FIXED_DATA.

For example, to set up the age differences between couples and between parents and children, we have:

FIXED_DATA = {
    "group": {
        ......
        "household": {
            "age_difference_couple": {
                "age_difference": [-5, 0, 5, 10],
                "frequency": [0.1, 0.7, 0.1, 0.1],
            },
            "age_difference_parent_child": {
                "age_difference": [25, 50],
                "0": [0.1, 0.9],
                "1": [0.1, 0.9],
                "2": [0.2, 0.8],
                "3": [0.3, 0.7],
                "4 or more": [0.3, 0.7],
            },
        },
    ...

The above defines the assumed age differences for couples and for parents and children.
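
As a rough sketch of how the age_difference_couple entry above could be used, the following draws an age difference with the listed frequencies as weights. Treating the age_difference values as discrete outcomes (rather than bin edges) is an assumption, and sample_couple_age_difference is a hypothetical name.

```python
import random

# Values copied from the FIXED_DATA excerpt above.
age_difference_couple = {
    "age_difference": [-5, 0, 5, 10],
    "frequency": [0.1, 0.7, 0.1, 0.1],
}


def sample_couple_age_difference(cfg, rng):
    """Draw one age difference, weighted by the configured frequencies."""
    return rng.choices(cfg["age_difference"], weights=cfg["frequency"], k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_couple_age_difference(age_difference_couple, rng) for _ in range(10_000)]
# Share of couples with no age difference; it should be near the 0.7 frequency.
share_zero = samples.count(0) / len(samples)
print(round(share_zero, 2))
```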

Note

The number of households is obtained from NZ.Stat. However, detailed information is lacking, so only the household type =0 >=0 >=0 >=0 >=0 is used in the model.
We also set the number of communal and student households to zero because of the lack of detailed data.

3.3 Hospital

It defines the hospital information used in the model

Hospital data
Data	Geography level	Example
Hospital information (address, number of beds/ICUs etc.)	area	hospitals.csv
Hospital configuration (the age of worker in this sector etc.)	`NULL`	hospital_config.yaml
Maximum number of hospitals a person could visit	`NULL`	neighbour_hospitals.yaml

The information above includes the hospital address (latitude and longitude), the number of beds and the number of ICU beds. It also includes some affiliated data for hospitals, such as the minimum age of people working in this sector and the number of hospitals that an individual agent could visit.

3.4 School

It defines the school information used in the model

School data
Data	Geography level	Example
School information (address, student age range)	area	schools.csv

The information includes the school address (latitude and longitude) and the student profile (e.g., minimum and maximum age).

3.5 Leisure (cinema, grocery, pub, gym and household visit)

It defines the leisure information used in the model

Leisure data
Data	Geography level	Example
Cinema locations	super area	data/cinema.csv
Cinema configuration (e.g., the chance that people may visit)	`NULL`	data/cinema_cfg.yaml
Grocery locations	super area	data/grocery.csv
Grocery configuration (e.g., the chance that people may visit)	`NULL`	cfg/grocery_cfg.yaml
Gym locations	super area	data/gym.csv
Gym configuration (e.g., the chance that people may visit)	`NULL`	cfg/gym_cfg.yaml
Pub locations	super area	data/pub.csv
Pub configuration (e.g., the chance that people may visit)	`NULL`	cfg/pub_cfg.yaml
Household visit configuration	`NULL`	cfg/household_visit_cfg.yaml

The information covers all the leisure activities.

Note that all the location information is obtained from OpenStreetMap, while all the configurations come from FIXED_DATA. For example, for pubs, we have:

"pub": {
        "times_per_week": {
            "weekday": {
                "male": {
                    "0-9": 0.032,
                    "9-15": 0.106,
                    ...
                    "86-100": 0.033,
                },
                "female": {
                    "0-9": 0.135,
                    ...,
                    "86-100": 0.02,
                },
            },
            "weekend": {
                "male": {
                    "0-9": 0.038,
                    ...
                    "86-100": 0.063,
                },
                "female": {
                    "0-9": 0.043,
                    ...
                    "86-100": 0.06,
                },
            },
        },
        "hours_per_day": {
            "weekday": {
                "male": {"0-65": 3, "65-100": 11},
                "female": {"0-65": 3, "65-100": 11},
            },
            "weekend": {"male": {"0-100": 12}, "female": {"0-100": 12}},
        },
        "drags_household_probability": 0,
        "neighbours_to_consider": 7,
        "maximum_distance": 10,
    },

The above shows how frequently a person might visit a pub (on weekdays and weekends), how many different pubs they might consider (neighbours_to_consider), and how far they might travel to reach one (maximum_distance).
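
The age bands in the configuration above are keyed by strings such as "0-9" and "9-15". A minimal sketch of looking up the value for a given age follows. It assumes each band is half-open (lower bound included, upper bound excluded), which the source does not state, and rate_for_age is a hypothetical helper.

```python
def rate_for_age(bands, age):
    """Return the value of the first 'low-high' band with low <= age < high.

    The half-open interpretation of the bands is an assumption.
    """
    for band, value in bands.items():
        low, high = (int(part) for part in band.split("-"))
        if low <= age < high:
            return value
    raise KeyError(f"no band covers age {age}")


# Only the bands visible in the excerpt above; elided bands ("...") are omitted.
weekday_male = {"0-9": 0.032, "9-15": 0.106, "86-100": 0.033}
hours_weekday_male = {"0-65": 3, "65-100": 11}

print(rate_for_age(weekday_male, 12))        # times_per_week value for a 12-year-old
print(rate_for_age(hours_weekday_male, 70))  # hours_per_day value for a 70-year-old
```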

4. Commute

Commute defines how people move across different areas

Group/commute data
Data	Geography level	Example
How people travel (commute method) in different areas	area	transport_mode.csv
Define if a travel method is public or not	`NULL`	transport_def.yaml
Number of inter-city stations	super area	number_of_inter_city_stations.yaml
Seat-passenger ratio	super area	passage_seats_ratio.yaml
How people travel across different super areas for work	super area	home_and_workplace.csv

Note that transport_def.yaml is defined in the variable FIXED_DATA, e.g.,

FIXED_DATA = {
    ...
    "group": {
        "commute": {
            "transport_def": [
                {"description": "Work mainly at or from home", "is_public": False},
                {"description": "Underground, metro, light rail, tram", "is_public": True},

                ...

                {"description": "On foot", "is_public": False},
                {"description": "Other method of travel to work", "is_public": False},
            ]
        },
    ...

Note that when SA3 is used as the super_area, the number of inter-city stations depends on the population of each SA3: one additional station is added for every increase of 5000 in the population. However, when New Zealand regions are used as the super_area, the number of stations is defined in FIXED_DATA.

5. Interaction

It defines the interaction intensity matrix for all the group members (e.g., school, hospital etc.)

Interaction data
Data	Geography level	Example
Base interaction intensity and susceptibilities	`NULL`	general.yaml
Cinema interaction matrix	`NULL`	cinema.csv
Company interaction matrix	`NULL`	company.csv
Grocery interaction matrix	`NULL`	grocery.csv
Gym interaction matrix	`NULL`	gym.csv
Hospital interaction matrix	`NULL`	hospital.csv
Household interaction matrix	`NULL`	household.csv
Pub interaction matrix	`NULL`	pub.csv
School interaction matrix	`NULL`	school.csv
Commute (city_transport, and inter_city_transport) interaction matrix	`NULL`	school.csv

The above data are defined through FIXED_DATA. (Note that when the activity is household_visit, the contact matrix is borrowed from household, so a separate household_visit contact matrix is not needed.)

6. Disease data

Defines disease properties (population comorbidities, probability of infection, infection outcome, symptom trajectory, and virus intensity):

Disease/Disease data
Data	Geography level	Example
Comorbidities (prevalence) for female (grouped by age)	`NULL`	comorbidities_female.csv
Comorbidities (prevalence) for male (grouped by age)	`NULL`	comorbidities_male.csv
Comorbidities intensity	`NULL`	comorbidity_intensity.yaml
Probability of infection	`NULL`	covid19.yaml
Infection outcome (the ratio of symptoms)	`NULL`	infection_outcome_ratio.csv
Symptom trajectory timing profile	`NULL`	symptom_trajectories.yaml
Virus intensity	`NULL`	virus_intensity.yaml

6.1 Comorbidities

Comorbidities are defined by the variable FIXED_DATA, which is located in process/__init__.py. The comorbidity is one of the parameters determining the severity of the symptoms that an individual may experience.

comorbidities_female: the ratio of females who have certain comorbidities (grouped by age)
comorbidities_male: the ratio of males who have certain comorbidities (grouped by age)
comorbidities_intensity: the intensity of the comorbidities

Note

For example, the average female comorbidity intensity for the age group 50 is 1.02. It is calculated as the sum of the element-wise products of [0, 0.1, 0.9] and [0.8, 1.2, 1.0] (0 * 0.8 + 0.1 * 1.2 + 0.9 * 1.0 = 1.02), where [0, 0.1, 0.9] is the ratio of comorbidities and [0.8, 1.2, 1.0] represents the intensities of comorbidities.

If a person has disease2, which has an intensity of 1.2, then the symptom multiplier for this person is 1.2 / 1.02, or approximately 1.18. This is larger than 1.0 and therefore leads to a higher chance of experiencing severe symptoms.
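
The arithmetic in this note can be reproduced with a short sketch. The function names mean_intensity and symptom_multiplier are hypothetical, and the comorbidity labels follow the comorbidity definition used in this section.

```python
labels = ["disease1", "disease2", "no_condition"]
prevalence_female_50 = [0.0, 0.1, 0.9]  # ratio of each comorbidity at age group 50
intensity = {"disease1": 0.8, "disease2": 1.2, "no_condition": 1.0}


def mean_intensity(prevalence, labels, intensity):
    """Prevalence-weighted average intensity for one age group."""
    return sum(p * intensity[name] for p, name in zip(prevalence, labels))


def symptom_multiplier(condition, prevalence, labels, intensity):
    """Intensity of a person's condition relative to the group average."""
    return intensity[condition] / mean_intensity(prevalence, labels, intensity)


average = mean_intensity(prevalence_female_50, labels, intensity)
multiplier = symptom_multiplier("disease2", prevalence_female_50, labels, intensity)
print(round(average, 2), round(multiplier, 2))
```

A multiplier above 1.0 corresponds to a higher chance of severe symptoms than the age-group average, and a multiplier below 1.0 to a lower one.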

An example of the definition of comorbidities is:

"comorbidities_female": {
    "comorbidity": ["disease1", "disease2", "no_condition"],
    5: [0, 0, 1.0],
    10: [0, 0, 1.0],
    20: [0, 0, 1.0],
    50: [0, 0.1, 0.9],
    75: [0, 0.2, 0.8],
    100: [0.9, 0.0, 0.1],
},
"comorbidities_male": {
    "comorbidity": ["disease1", "disease2", "no_condition"],
    5: [0, 0, 1.0],
    10: [0, 0, 1.0],
    20: [0, 0, 1.0],
    50: [0, 0.1, 0.9],
    75: [0, 0.2, 0.8],
    100: [0.9, 0.0, 0.1],
},
"comorbidities_intensity": {"disease1": 0.8, "disease2": 1.2, "no_condition": 1.0},

6.2 Virus intensity

The virus intensity is a parameter that influences the severity of symptoms. As the intensity value increases, the likelihood of an individual experiencing more severe symptoms also increases. This can be achieved by elevating the probability of severe symptoms in addition to the ‘infection_outcome’ input data.

An example of the virus intensity is:

Covid19: 1.3 # 170852960
B117: 1.5 # 37224668
B16172: 1.5 # 76677444

6.3 Symptom trajectory (infection outcome)

The symptom trajectory is defined by a set of distribution functions (e.g., beta, log-normal etc.). Each distribution function comes with a set of parameters, which determine the timeline of the different symptoms during the infection.

The considered symptom stages include:

Recovered (-3)
Healthy (-2)
Exposed (-1)
Asymptomatic (0)
Mild (1)
Severe (2), which is calculated by 1.0 - [ Hospital + Die (from Home) + Asymptomatic + Mild]
Hospital (3)
ICU (4)
Die (from home, 5)
Die (from hospital, 6)
Die (from ICU, 7)

For example, to create a symptom trajectory for Die (from hospital, 6), we need to go through the stages Exposed (-1), Mild (1), Hospital (3) and Die (from hospital, 6) one by one. Within this trajectory, at the Mild (1) stage, we create samples from a log-normal distribution with specific, predefined parameters (e.g., shape=0.55, loc=0.0, scale=5.0). A random number is drawn from these samples, and it represents the timing of the infection (it can be understood as the end time of that symptom stage).

The chance of having a symptom is determined by:
- Comorbidities (see Section 6.1 for details)
- Input infection outcome statistics (e.g., the percentage of symptoms that a person may experience, see Section 6.3.1)
- The target virus intensity (see Section 6.2)
How long the symptom lasts depends on:
- The symptom trajectory (see Section 6.3.2)

6.3.1 Input infection outcome statistics

An example of the infection outcome statistics is:

Disease/ infection outcome
	gp_asymptomatic_male	gp_mild_male	gp_ifr_male
[0, 50]	0.0	0.3	0.7
[51, 100]	0.3	0.3	0.4
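
As an illustration of how such a table might be read, the sketch below returns the outcome ratios for a given age. It assumes the bracketed age ranges are inclusive on both ends, and outcome_ratios_for_age is a hypothetical helper rather than a JUNE_NZ function.

```python
# Rows copied from the example table above; each age range is treated as inclusive.
infection_outcome = [
    ((0, 50), {"gp_asymptomatic_male": 0.0, "gp_mild_male": 0.3, "gp_ifr_male": 0.7}),
    ((51, 100), {"gp_asymptomatic_male": 0.3, "gp_mild_male": 0.3, "gp_ifr_male": 0.4}),
]


def outcome_ratios_for_age(age, table):
    """Return the outcome ratios of the row whose age range contains `age`."""
    for (low, high), ratios in table:
        if low <= age <= high:
            return ratios
    raise ValueError(f"no age range covers age {age}")


ratios = outcome_ratios_for_age(60, infection_outcome)
print(ratios["gp_asymptomatic_male"])
```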

6.3.2 Symptom trajectory (infection outcome)

An example of the symptom trajectory is:

# exposed => mild => hospitalised => dead
- stages:
    - symptom_tag: exposed
      completion_time:
        type: beta
        a: 2.29
        b: 19.05
        loc: 0.39
        scale: 39.8

    - symptom_tag: mild
      completion_time:
        type: lognormal
        s: 0.55
        loc: 0.0
        scale: 5.

    - symptom_tag: hospitalised
      completion_time:
        type: beta
        a: 1.21
        b: 1.97
        loc: 0.08
        scale: 12.9

    - symptom_tag: dead_hospital
      completion_time:
        type: constant
        value: 0.

Note that the profile can be plotted using etc/test/plot_<profile>.py, where <profile> is the function name (e.g., beta, norm or lognorm).
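
To make the trajectory above more concrete, here is a minimal sketch that samples a completion time for each stage and accumulates them into a timeline, using only the Python standard library. It assumes the parameters follow scipy.stats-style conventions (value = loc + scale * standard draw) and that each completion_time is the duration of its stage; neither is confirmed by the source, and the function names are hypothetical.

```python
import math
import random

# Stage parameters copied from the example trajectory above.
stages = [
    ("exposed", {"type": "beta", "a": 2.29, "b": 19.05, "loc": 0.39, "scale": 39.8}),
    ("mild", {"type": "lognormal", "s": 0.55, "loc": 0.0, "scale": 5.0}),
    ("hospitalised", {"type": "beta", "a": 1.21, "b": 1.97, "loc": 0.08, "scale": 12.9}),
    ("dead_hospital", {"type": "constant", "value": 0.0}),
]


def sample_completion_time(spec, rng):
    """Draw one completion time for a stage (assumed scipy-style parameters)."""
    kind = spec["type"]
    if kind == "constant":
        return spec["value"]
    if kind == "beta":
        return spec["loc"] + spec["scale"] * rng.betavariate(spec["a"], spec["b"])
    if kind == "lognormal":
        # With this convention: loc + exp(Normal(mean=ln(scale), sd=s)).
        return spec["loc"] + rng.lognormvariate(math.log(spec["scale"]), spec["s"])
    raise ValueError(f"unsupported distribution type: {kind}")


def sample_trajectory(stages, rng):
    """Return (symptom_tag, start_time) pairs for one sampled trajectory."""
    timeline, elapsed = [], 0.0
    for tag, spec in stages:
        timeline.append((tag, elapsed))
        elapsed += sample_completion_time(spec, rng)
    return timeline


rng = random.Random(42)  # fixed seed so the sketch is repeatable
timeline = sample_trajectory(stages, rng)
for tag, start in timeline:
    print(f"{tag}: starts at {start:.2f}")
```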

6.4 Transmission profile

6.4.1 Base probability of infection

The transmission profile determines the probability of infection (e.g., the higher the probability, the more infectious an infector is).

The probability of infection is usually taken from a Gamma profile, which is defined by (shape, shift, scale). The following figures show the Gamma profile for different values of shape, shift (loc) and scale. The x-axis is the value of shift (loc), which corresponds to the infection time. The y-axis is the probability of infection.

When a person is infected, the infection time is passed to the above Gamma function (as x) to obtain the corresponding probability of infection.

6.4.2 Adjust max infectiousness

The maximum infectiousness from the probability of infection is adjusted with the argument max_infectiousness. For an infector, a random value is drawn from the lognormal function and multiplied by the probability of infection.

The lognormal is determined by the parameters shape, loc and scale. For example, the following figures show the lognormal profile:

6.4.3 Adjust mild/asymptomatic infectiousness

We can adjust the probability of infection based on a person’s maximum symptom. For example, if the maximum symptom is asymptomatic, we can reduce the probability of infection profile by 50%.

An example for COVID-19 transmission is set up as:

type:
        'gamma'
shape:
        type: normal
        loc: 1.56
        scale: 0.08
rate:
        type: normal
        loc: 0.53
        scale: 0.03
shift:
        type: normal
        loc: -2.12
        scale: 0.1
asymptomatic_infectious_factor:
        type: constant
        value: 0.5
mild_infectious_factor:
        type: constant
        value: 1.
max_infectiousness:
        type: lognormal
        s: 0.5
        loc: 0.0
        scale: 1.
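
The three adjustments described in Sections 6.4.1 to 6.4.3 can be combined in a short sketch: a Gamma curve evaluated at the infection time, multiplied by a max_infectiousness draw and by a symptom-dependent factor. This is an illustration under stated assumptions, not the model's implementation: it uses the central (loc) values from the example above, ignores the per-parameter normal spread, treats shift as the location offset of the curve, and does not reproduce any normalisation the model may apply. The function names are hypothetical.

```python
import math


def gamma_pdf(x, shape, rate):
    """Gamma density with the given shape and rate; zero for x <= 0."""
    if x <= 0:
        return 0.0
    return rate**shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)


def infectiousness(time, shape, rate, shift, max_infectiousness=1.0, symptom_factor=1.0):
    """Base Gamma profile scaled by the two multipliers described above."""
    base = gamma_pdf(time - shift, shape, rate)
    return symptom_factor * max_infectiousness * base


# Central values of shape, rate and shift from the example configuration.
params = {"shape": 1.56, "rate": 0.53, "shift": -2.12}

symptomatic = infectiousness(1.0, **params)
# asymptomatic_infectious_factor is 0.5 in the example configuration.
asymptomatic = infectiousness(1.0, symptom_factor=0.5, **params)
print(symptomatic, asymptomatic)
```

In a fuller version, max_infectiousness would be drawn per infector from the configured lognormal (for example with random.lognormvariate) instead of being fixed at 1.0.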

7. Vaccination data

The vaccine data must be specified if we want to simulate the effect of a vaccination campaign in the model.

Disease/Disease data
Data	Geography level	Example
Vaccine features (age dependent)	`NULL`	vaccine.yaml