6. Complex Survey Data

What is complex survey data?

Complex survey data refers to data that have been collected according to an explicit sampling scheme that deviates from a simple random sample (where everyone in a target population has the same chance of being selected into a sample). This might be done to save data collection resources, or to make sure that there are sufficient numbers of a small group in the sample.

Due to their advantages, complex sampling designs are very common in large-scale survey data. These datasets usually provide variables relating to their complex sampling designs so that users can take them into account in their analyses.

Datasets might include stratification, clustering, and/or sampling weights (sometimes called ‘survey’ or ‘design weights’), which correspond to the different characteristics of complex sampling.

Stratification is when the population is divided into relatively homogeneous groups and a pre-determined number of units is sampled from each. Stratification variables indicate to which stratum a case belongs. For example, a population might be divided into areas with different levels of deprivation and units sampled within each of those to help ensure that different levels of deprivation are well represented.

Clustering is when units are sampled in ‘clumps’: for example, rather than randomly sampling households across a whole large area, people might be sampled within a subset of neighbourhoods within that area. Clustering variables thus indicate to which cluster a case belongs.

Sampling weights reflect the unequal chances of people being selected into a sample. Cases that had a high probability of selection will have low weights and cases that had a low probability of selection will have high weights. This means that when it comes to the analysis, under-represented groups (relative to the population) will be correspondingly up-weighted and over-represented groups will be down-weighted. Note that this is analogous to how weights work in IPTW, except that there the weighting is about making two groups more similar to each other rather than making a sample more similar to an underlying population.
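
For intuition, here is a minimal Python sketch (with made-up data and hypothetical variable names, not anything from DigiCAT) of how sampling weights are formed as the inverse of selection probabilities, and of how the weighted mean recovers the population mean when one group has been deliberately oversampled:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: a small group (20%) with higher outcome values
group = rng.random(100_000) < 0.20
outcome = np.where(group,
                   60 + rng.normal(0, 10, group.size),
                   50 + rng.normal(0, 10, group.size))

# Oversample the small group: its members have a higher chance of selection
p_select = np.where(group, 0.02, 0.005)
sampled = rng.random(group.size) < p_select

y = outcome[sampled]
w = 1.0 / p_select[sampled]   # sampling weight = 1 / probability of selection

print("population mean:       ", round(outcome.mean(), 2))
print("unweighted sample mean:", round(y.mean(), 2))                   # pulled towards the oversampled group
print("weighted sample mean:  ", round(np.sum(w * y) / np.sum(w), 2))  # approximately unbiased
```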

If a dataset has been collected using a complex survey sampling design, it is important to take this into account in any analysis whose research questions relate to the underlying population the sample was drawn from. For example, ignoring clustering will tend to give standard errors that are too small and will inflate statistical significance, while ignoring sampling weights can bias the treatment effect estimates.

DigiCAT currently implements a ‘design-based’ approach to dealing with complex survey data. This means that it uses information about how the data were sampled (i.e., any sampling weight, stratification, and clustering variables) when doing the analysis. It specifically uses a ‘pseudo-maximum likelihood’ (PML) estimation approach with Taylor series linearisation. PML is a form of weighted estimation that replaces the usual sample statistics with weighted versions, and Taylor series linearisation provides a corresponding correction to the standard errors.
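
The following is a rough Python illustration of the idea, not DigiCAT's actual implementation: a weighted (PML-style) estimate of a mean, with a Taylor-series-linearised standard error computed from stratum and cluster membership using the common with-replacement, first-stage approximation. All data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

def weighted_mean_tsl(df, y, w, strata, cluster):
    """Weighted (pseudo-maximum-likelihood style) estimate of a mean, with a
    Taylor-series-linearised standard error under a stratified, clustered
    design (with-replacement, first-stage approximation)."""
    ybar = np.sum(df[w] * df[y]) / np.sum(df[w])

    # Linearised score for the ratio estimator (the weighted mean)
    df = df.assign(_z=df[w] * (df[y] - ybar) / np.sum(df[w]))

    var = 0.0
    for _, s in df.groupby(strata):
        cluster_totals = s.groupby(cluster)["_z"].sum()   # one total per cluster (PSU)
        n_h = len(cluster_totals)
        if n_h > 1:
            var += n_h / (n_h - 1) * np.sum((cluster_totals - cluster_totals.mean()) ** 2)
    return ybar, np.sqrt(var)

# Hypothetical toy data with design variables
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "y": rng.normal(50, 10, 200),
    "weight": rng.uniform(0.5, 2.0, 200),
    "stratum": rng.integers(0, 4, 200),
    "psu": rng.integers(0, 10, 200),
})
est, se = weighted_mean_tsl(toy, "y", "weight", "stratum", "psu")
print(f"weighted mean = {est:.2f}, linearised SE = {se:.3f}")
```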

Complex survey data in PSM

When estimating the population ATT (PATT), it is important to take the survey design into account in the outcome model. This means that any stratification, clustering, and weighting variables must feature in the linear regression outcome model that comes at the end of the PSM workflow. DigiCAT does this using the PML and Taylor series linearisation approach described above.
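
As a rough illustration only (not DigiCAT's internal code), a design-adjusted outcome model can be sketched as a weighted regression in which the matching and survey weights are combined and the standard errors respect the clustering. The sketch below uses cluster-robust standard errors as a simple stand-in for full Taylor series linearisation (which would also use the strata); all variable names and data are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical matched data: 'match_wt' comes from the PSM step,
# 'svy_wt' and 'psu' come from the complex survey design
rng = np.random.default_rng(2)
d = pd.DataFrame({
    "treated": rng.integers(0, 2, 300),
    "outcome": rng.normal(0, 1, 300),
    "match_wt": rng.uniform(0.2, 1.0, 300),
    "svy_wt": rng.uniform(0.5, 2.0, 300),
    "psu": rng.integers(0, 15, 300),
})
d["w"] = d["match_wt"] * d["svy_wt"]   # combine matching and survey weights

X = sm.add_constant(d[["treated"]])
fit = sm.WLS(d["outcome"], X, weights=d["w"]).fit(
    cov_type="cluster", cov_kwds={"groups": d["psu"]}
)
print("PATT estimate:", round(fit.params["treated"], 3),
      "clustered SE:", round(fit.bse["treated"], 3))
```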

Previous research has been less clear on whether complex survey design variables need to be taken into account at earlier stages of the PSM workflow, namely when estimating the propensity model and when assessing balance. Some authors have made conceptual arguments against fitting design-adjusted propensity models, recommending instead that survey weights be included as a covariate (DuGoff et al., 2014). This is based on the idea that propensity scores are inherently concerned with samples rather than populations, so there is no need to generalise them to a population. When researchers have used simulation studies to explore this issue, they have found inconsistent evidence, and this has led some to propose that it doesn’t really make an important difference (Ridgeway et al., 2015; Lenis et al., 2019; Austin et al., 2018). However, one quite comprehensive simulation study did find that, for continuous outcomes like the ones that can be studied with DigiCAT, there were benefits to fitting a design-adjusted model for propensity score estimation and to taking complex survey variables into account when assessing balance (Austin et al., 2018). This was further supported by another comprehensive simulation study (Lenis et al., 2019).

For this reason, if users supply complex survey variables in DigiCAT, these are also taken into account when estimating the propensity score with a logistic regression (using a weighted regression) and when assessing matching variable balance (using weighted standardised mean differences). The way they are taken into account in the logistic regression is very similar to the way they feature in the linear regression outcome model.
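
To make this concrete, here is a hedged Python sketch (hypothetical data and variable names, not DigiCAT's implementation) of a survey-weighted logistic propensity model and survey-weighted standardised mean differences for balance checking:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def weighted_smd(x, treated, w):
    """Survey-weighted standardised mean difference for one matching variable."""
    def wmean(v, wt):
        return np.sum(wt * v) / np.sum(wt)
    def wvar(v, wt):
        return wmean((v - wmean(v, wt)) ** 2, wt)
    x1, w1 = x[treated == 1], w[treated == 1]
    x0, w0 = x[treated == 0], w[treated == 0]
    pooled_sd = np.sqrt((wvar(x1, w1) + wvar(x0, w0)) / 2)
    return (wmean(x1, w1) - wmean(x0, w0)) / pooled_sd

# Hypothetical data: two matching variables and a survey weight
rng = np.random.default_rng(3)
d = pd.DataFrame({
    "x1": rng.normal(0, 1, 500),
    "x2": rng.normal(0, 1, 500),
    "svy_wt": rng.uniform(0.5, 2.0, 500),
})
d["treated"] = (rng.random(500) < 1 / (1 + np.exp(-d["x1"]))).astype(int)

# Design-adjusted propensity model: the survey weight enters as an estimation weight
ps_model = LogisticRegression()
ps_model.fit(d[["x1", "x2"]], d["treated"], sample_weight=d["svy_wt"])
d["pscore"] = ps_model.predict_proba(d[["x1", "x2"]])[:, 1]

# Balance assessed with survey-weighted standardised mean differences
for var in ["x1", "x2"]:
    smd = weighted_smd(d[var].to_numpy(), d["treated"].to_numpy(), d["svy_wt"].to_numpy())
    print(var, round(smd, 3))
```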

Missing data and complex survey data

It is very common in practice for datasets to have both missing data AND complex survey variables. This can be a complicated situation to deal with; however, because it is such a frequent feature of real data, we provide options for dealing with it in DigiCAT.

First, we allow for non-response weights to be used. Non-response weights take into account the fact that some cases are unlikely to be missing completely at random; for example, perhaps the people with the highest levels of mental health issues dropped out of the sample. Provided that the risk of non-response can be estimated from available data, non-response weights can be produced and used to up-weight those who were less likely to have provided outcome data and down-weight those who were more likely to have done so. This helps deal with bias due to non-response. When there are both sampling weights and non-response, dataset providers often produce a combined ‘sampling + non-response’ weight that takes into account both the unequal probabilities of sampling and the non-random non-response. Details of these variables are typically provided in the dataset documentation. DigiCAT allows users to supply these weights to be used in an analysis (perhaps alongside stratification and cluster variables) in the same way that the sampling weights described above are used.
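
The sketch below illustrates, with entirely hypothetical data and variable names, how a non-response weight might be constructed from an estimated response probability and combined with a sampling weight; in practice, combined weights of this kind are usually supplied by the dataset provider rather than computed by the analyst.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 'responded' marks whether the outcome was observed, and
# responding is less likely for people with higher baseline distress scores
rng = np.random.default_rng(4)
d = pd.DataFrame({
    "distress": rng.normal(0, 1, 1000),
    "age": rng.integers(18, 80, 1000),
    "svy_wt": rng.uniform(0.5, 2.0, 1000),
})
d["responded"] = (rng.random(1000) < 1 / (1 + np.exp(0.8 * d["distress"]))).astype(int)

# Estimate each case's probability of responding from variables observed for everyone
resp_model = LogisticRegression().fit(d[["distress", "age"]], d["responded"])
d["p_respond"] = resp_model.predict_proba(d[["distress", "age"]])[:, 1]

# Non-response weight = 1 / estimated response probability;
# combined weight = sampling weight x non-response weight (for responders)
d["nr_wt"] = 1 / d["p_respond"]
d["combined_wt"] = d["svy_wt"] * d["nr_wt"]
print(d.loc[d["responded"] == 1, ["svy_wt", "nr_wt", "combined_wt"]].head())
```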

DigiCAT also offers the option of multiple imputation with complex survey design variables. Here the workflow is similar to the usual multiple imputation workflow, but the design variables are included as predictors in the imputation model and adjusted for in the outcome model.
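
As a final illustration (again hypothetical and simplified, not DigiCAT's implementation), the sketch below fits a weighted, cluster-robust outcome model to each of several completed datasets and pools the treatment effect estimates with Rubin's rules:

```python
import numpy as np
import statsmodels.api as sm

def pool_rubin(estimates, variances):
    """Pool estimates and variances across imputed datasets (Rubin's rules)."""
    m = len(estimates)
    qbar = np.mean(estimates)              # pooled point estimate
    ubar = np.mean(variances)              # average within-imputation variance
    b = np.var(estimates, ddof=1)          # between-imputation variance
    total_var = ubar + (1 + 1 / m) * b
    return qbar, np.sqrt(total_var)

# Hypothetical stand-in for m completed datasets produced by an imputation
# model that included the design variables as predictors
rng = np.random.default_rng(5)
imputed = []
for _ in range(5):
    n = 300
    treated = rng.integers(0, 2, n)
    imputed.append({
        "treated": treated,
        "outcome": 0.3 * treated + rng.normal(0, 1, n),
        "w": rng.uniform(0.5, 2.0, n),   # survey (or combined) weight
        "psu": rng.integers(0, 15, n),   # cluster identifier
    })

ests, variances = [], []
for d in imputed:
    X = sm.add_constant(d["treated"])
    fit = sm.WLS(d["outcome"], X, weights=d["w"]).fit(
        cov_type="cluster", cov_kwds={"groups": d["psu"]}
    )
    ests.append(fit.params[1])        # treatment effect in this imputed dataset
    variances.append(fit.bse[1] ** 2)

est, se = pool_rubin(ests, variances)
print(f"pooled treatment effect = {est:.3f}, pooled SE = {se:.3f}")
```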