DHS surveys use nationally representative samples to estimate important demographic and health indicators. This means that the sample selected for the survey is meant to represent a larger population. Sampling weights are adjustment factors applied to each case in tabulations to adjust for differences in probability of selection and interview, due to either survey design or nonresponse.
In most DHS surveys, the sample is selected with unequal probability to expand the number of cases available (and hence reduce sampling variability) for certain areas or subgroups for which statistics are needed.
For more on weights, see Chapter 1 of the Guide to DHS Statistics DHS-7.
For help coding weights using Stata or SPSS, visit The DHS Program Code Library on GitHub.
Yes! If you are tabulating data, weights need to be applied to produce the proper representation.
There are four main sampling weights in DHS surveys:
1) Household weights
2) Household weights for the men's subsample
3) Individual weights for women
4) Individual weights for men
The household weight (the variable name is hv005) for a particular household is the inverse of its household selection probability multiplied by the inverse of the household response rate in the stratum.
The household weight for the men's subsample (hv028) for a particular household is the inverse of its household selection probability for the subsample multiplied by the inverse of the household response rate for the subsample in the stratum.
The individual weight for women (v005) is the household weight (hv005) multiplied by the inverse of the individual response rate for women in the stratum.
The individual weight for men (mv005) is the household weight for the men's subsample (hv028) multiplied by the inverse of the individual response rate for men in the stratum.
There may be additional sampling weights for sample subsets, such as for domestic violence (dv005) and HIV testing (hiv05).
Sample design weights are produced by the DHS sampler using the sample selection probabilities of each household and the response rates for households and for individuals. The initial design weights are then normalized by dividing each weight by the average of the initial weights (the sum of the initial weights divided by the number of cases) so that the sum of the normalized weights equals the number of cases over the entire sample. The normalization is done separately for each weight. Data users do not need to do any of these calculations but should apply weights when tabulating results for proper representation.
Weights can be applied when tabulating data with statistical software, such as Stata, SPSS, or R.
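The weight construction and normalization described above can be sketched in Python. This is an illustration only, not the DHS sampler's actual procedure: the function names and the selection probabilities and response rates below are hypothetical.

```python
def design_weight(selection_prob, response_rate):
    """Inverse selection probability times inverse stratum response rate."""
    return (1.0 / selection_prob) * (1.0 / response_rate)


def normalize(weights):
    """Divide each weight by the mean weight so the weights sum to the number of cases."""
    mean_weight = sum(weights) / len(weights)
    return [w / mean_weight for w in weights]


# Hypothetical households: (selection probability, household response rate in the stratum)
households = [(0.010, 0.95), (0.010, 0.95), (0.002, 0.90), (0.002, 0.90)]

initial = [design_weight(p, r) for p, r in households]
normalized = normalize(initial)

print([round(w, 4) for w in normalized])
print(round(sum(normalized), 6))  # equals the number of cases
```

Households with a lower selection probability receive a larger weight, and after normalization the weights sum to the number of cases.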
Weights are calculated to six decimals but are presented in the standard recode files without the decimal point. They need to be divided by 1,000,000 before use to approximate the number of cases.
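As a minimal sketch of this rescaling, the Python below divides the stored weight by 1,000,000 and uses it to compute a weighted percentage. The records and the indicator variable `uses_method` are made up for illustration; only the weight variable name v005 comes from the text above.

```python
def rescale(raw_weight):
    """Convert a stored weight (no decimal point) to its six-decimal value."""
    return raw_weight / 1_000_000


def weighted_percent(records, indicator, weight_var="v005"):
    """Weighted percentage of records for which indicator(record) is true."""
    total = sum(rescale(r[weight_var]) for r in records)
    hits = sum(rescale(r[weight_var]) for r in records if indicator(r))
    return 100.0 * hits / total


# Hypothetical women's records; "uses_method" is a made-up indicator variable.
women = [
    {"v005": 1500000, "uses_method": 1},
    {"v005": 500000, "uses_method": 0},
    {"v005": 1000000, "uses_method": 1},
    {"v005": 1000000, "uses_method": 0},
]

pct = weighted_percent(women, lambda r: r["uses_method"] == 1)
print(round(pct, 1))  # 62.5 (the unweighted percentage would be 50.0)
```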
Sampling weights can be applied in two main ways:
1) Simple Weighting - A simple application of weights (v005, hv005, or mv005) when only indicator estimates are needed. Point estimates produced with simple weighting are accurate. However, any standard error or confidence interval produced this way assumes a simple random sample and does not take into account the complex sample used in DHS surveys. If standard errors and confidence intervals are needed, you must account for the complex sample design.
2) Accounting for Complex Sample Design - Complex sample parameters are needed to produce accurate standard errors, confidence intervals, or hypothesis tests for the indicator. Three pieces of information are required:
- The primary sampling unit variable, typically v021 (or hv021 or mv021)
- The stratification variable, typically v022 (or hv022 or mv022)
- The weight variable, v005 (or hv005 or mv005) divided by 1,000,000.
For examples of using weights in R, SPSS, or Stata, see Chapter 1 of the Guide to DHS Statistics DHS-7.
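To show how the three pieces of information fit together, here is a rough Python sketch of a weighted mean with a design-based standard error. It uses Taylor linearization with the common with-replacement approximation (variance taken from PSU totals within strata). This is one standard approach, not necessarily the method used for published DHS sampling errors, and in practice survey procedures in Stata, SPSS, or R would normally be used. The function name, the outcome variable `y`, and the records are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def weighted_mean_with_se(records, y="y", weight="v005", psu="v021", stratum="v022"):
    """Weighted mean of y and its Taylor-linearized standard error.

    Assumes PSUs are sampled with replacement within strata, so the variance
    comes from the spread of PSU totals inside each stratum.
    """
    w = [r[weight] / 1_000_000 for r in records]
    sum_w = sum(w)
    mean = sum(wi * r[y] for wi, r in zip(w, records)) / sum_w

    # Linearized contribution of each case, totalled by PSU within stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for wi, r in zip(w, records):
        psu_totals[r[stratum]][r[psu]] += wi * (r[y] - mean) / sum_w

    variance = 0.0
    for totals in psu_totals.values():
        n_psu = len(totals)
        if n_psu < 2:
            raise ValueError("each stratum needs at least two PSUs")
        stratum_mean = sum(totals.values()) / n_psu
        spread = sum((t - stratum_mean) ** 2 for t in totals.values())
        variance += n_psu / (n_psu - 1) * spread
    return mean, sqrt(variance)


# Hypothetical records: two strata with two PSUs each.
sample = [
    {"v005": 1200000, "v021": 1, "v022": 1, "y": 1},
    {"v005": 1200000, "v021": 1, "v022": 1, "y": 0},
    {"v005": 1200000, "v021": 2, "v022": 1, "y": 1},
    {"v005": 800000, "v021": 3, "v022": 2, "y": 0},
    {"v005": 800000, "v021": 4, "v022": 2, "y": 1},
    {"v005": 800000, "v021": 4, "v022": 2, "y": 0},
]
estimate, se = weighted_mean_with_se(sample)
print(round(estimate, 3), round(se, 3))
```

The point estimate is the same as with simple weighting; only the standard error changes when the PSU and stratum variables are taken into account.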
Visit the Guide to DHS Statistics DHS-7, where weights and sample design are explained in depth in Chapter 1 and the appropriate weight variables are included in each indicator explanation in the subsequent chapters.
Read Chapter 1 of the final report for the survey you are interested in for an explanation of the survey design, or read Appendix A of the final report for details of the sampling design. Final reports can be found on The DHS Program website, where you can search by country, survey type, or year.
Go to GitHub.com to use The DHS Program Code Library, where Stata and SPSS code is freely available for any standard DHS indicator. This code uses weights to create accurate estimates, but it does not account for the sampling design, so no confidence intervals are produced.
Didn't find what you were looking for? Search for previously asked questions in the Weighting Data section on The DHS Program User Forum or post your own question where other users of DHS Program data can see your posts and respond.
The DHS Program distributes separate data files for households (HR), household members (PR), women (IR), births (BR), children under five (KR), men (MR), and couples (CR).
Care has been taken to include all variables deemed important for each of these files. For example, variables for household characteristics are included in the women, men, and children’s files. However, there are instances when researchers need to merge or combine different files to obtain the variables that meet their analysis requirements.
The DHS Program makes considerable effort to ensure that files can be matched seamlessly whenever a relationship is possible. To merge the files correctly, it is necessary to know the variables that identify cases in each file. To see a table of all variables that can be used to merge datasets and examples of merging in Stata and SPSS, see Chapter 1 of the Guide to DHS Statistics DHS-7.
When merging data files, it is important to know the type of relationship that exists between the files to be merged as well as the type of output file desired (unit of analysis). There are two main types of relationships: The first is that of many entities related to one entity (m:1) and the second is that of one entity related to just one other entity (1:1). It is possible that some cases in a data file will not have a relationship with a case in the file you want to merge with. These cases will remain unmatched.
One example arises when merging the household members file (PR) with the children’s file (KR). This is a one-to-one relationship: children of interviewed women are in the KR file, and the same children’s records as household members are in the PR file. There will be at most one record in the PR file for a child in the KR file. There may be no PR record at all if the child has died or does not live in the household, and those cases will remain unmatched.
Visit the Guide to DHS Statistics DHS-7, where matching and merging datasets and dataset structure are explained in depth in Chapter 1. There are also examples of merge code for Stata and SPSS.
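The KR-to-PR example can be sketched in Python as a keyed one-to-one lookup. The identifier variables used here (v001, v002, and b16 in the KR file; hv001, hv002, and hvidx in the PR file) are assumptions not stated in this section, so check them against the table of merge variables in the Guide to DHS Statistics before relying on them. The function name and the records are hypothetical.

```python
def merge_kr_to_pr(kr_records, pr_records):
    """Attach the matching PR record to each KR record (1:1), flagging unmatched children."""
    # Index PR records by cluster, household number, and line number (assumed keys).
    pr_index = {(p["hv001"], p["hv002"], p["hvidx"]): p for p in pr_records}
    merged = []
    for child in kr_records:
        key = (child["v001"], child["v002"], child["b16"])
        match = pr_index.get(key)
        row = dict(child)
        if match is not None:
            row.update(match)
        row["matched"] = match is not None
        merged.append(row)
    return merged


# Hypothetical records. b16 is assumed to hold the child's household line number.
kr = [
    {"v001": 1, "v002": 10, "b16": 3},     # living child listed in the household
    {"v001": 1, "v002": 10, "b16": None},  # deceased child: no line number
    {"v001": 2, "v002": 5, "b16": 0},      # child not listed in the household
]
pr = [
    {"hv001": 1, "hv002": 10, "hvidx": 3, "hv105": 2},
    {"hv001": 2, "hv002": 5, "hvidx": 1, "hv105": 30},
]

merged = merge_kr_to_pr(kr, pr)
print([row["matched"] for row in merged])  # [True, False, False]
```

Keeping an explicit `matched` flag makes it easy to count and inspect the cases that remain unmatched rather than silently dropping them.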
Didn't find what you were looking for? Search for previously asked questions in the Merging Data Files section on The DHS Program User Forum, or post your own question where other users of DHS Program data can see your posts and respond.
The DHS calendar is a month-by-month history of certain key events in the life of the respondent for the calendar period preceding the date of interview. It is sometimes known as the reproductive calendar or the contraceptive calendar, as the main information collected in the calendar relates to reproduction and contraception. Visit the DHS Contraceptive Calendar Tutorial for in-depth modules, Stata and SPSS code, and instructional videos designed to help DHS data users understand and use Contraceptive Calendar data.
Dates of key events are often presented as century month codes (CMC) in DHS data files. A CMC is calculated by taking the difference between the year of an event and 1900, multiplying by 12, and adding the month of the event:
CMC = (Year - 1900) * 12 + Month
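The formula can be written as a short Python function. The reverse conversion is not given in the text above; it is derived here from the same formula. Both function names are hypothetical.

```python
def to_cmc(year, month):
    """Century month code for a Gregorian year and month (1-12)."""
    return (year - 1900) * 12 + month


def from_cmc(cmc):
    """Recover (year, month) from a century month code."""
    year = 1900 + (cmc - 1) // 12
    month = cmc - (year - 1900) * 12
    return year, month


print(to_cmc(2015, 6))   # 1386
print(from_cmc(1386))    # (2015, 6)
print(from_cmc(1200))    # (1999, 12)
```

Because CMCs are plain month counts, the interval in months between two events is simply the difference between their CMCs.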
CMC works well for most DHS analyses, but there are three countries with DHS surveys that use non-Gregorian calendars. To date these countries are Ethiopia, Nepal, and Afghanistan.
All calculations with CMCs in these surveys work as they do for any other survey. The exception is any analysis requiring specific Gregorian years, in which case the calculated CMC must be adjusted. These adjustments are approximate: the calendars start in the middle of Gregorian months, so exact adjustments would require dates of events to the day.
Visit Chapter 1 of The Guide to DHS Statistics DHS-7 for details about each of these calendars and how to adjust CMC dates to the Gregorian calendar if necessary.
To learn more about date variables, see Recode Files in Chapter 1 of the Guide to DHS Statistics DHS-7.
In DHS datasets there are several special values that have particular codes. Two of them are especially important: not applicable and missing. The DHS Program treats these two differently, although some software treats them as the same.
- "Not applicable" is defined as when a question is not supposed to be asked due to the flow of the questionnaire. For example, question 227 in the Woman's Questionnaire "How many months pregnant are you?" is not applicable if the answer to the preceding question 226 "Are you pregnant now?" is No or Unsure. Question 227 would be left blank in the questionnaire in this case.
- "Missing" is defined as a variable that should have a response, but because of interview error the question was not asked. For example, question 227 "How many months pregnant are you?" should be answered if a woman responded Yes to question 226 "Are you pregnant now?" If the interviewer incorrectly left the question blank then a code is required to recognize that. The general rule for DHS data processing is that answers should not be made up, and so a "missing value" will be assigned. The data will be kept as missing in the data file and no imputation for this kind of question will be done. Missing values in general are codes 9, 99, 999, 9999, etc. depending on the number of digits used for the variable.
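The distinction can be sketched in Python as follows. This assumes that a not-applicable value is read in as a blank (represented here as `None`) and that the missing code follows the general all-nines rule. Individual variables may deviate from that rule, so check the recode documentation for the variable in question. The function names are hypothetical.

```python
def missing_code(width):
    """General-rule missing code (9, 99, 999, ...) for a variable of the given digit width."""
    return int("9" * width)


def classify(value, width):
    """Label a raw value as 'not applicable', 'missing', or 'valid'."""
    if value is None:  # blank: the question was skipped by questionnaire flow
        return "not applicable"
    if value == missing_code(width):
        return "missing"
    return "valid"


# Months pregnant, treated here as a two-digit variable.
print(classify(None, 2))  # not applicable (woman is not pregnant or unsure)
print(classify(99, 2))    # missing (should have been asked, but was left blank)
print(classify(5, 2))     # valid
```

Keeping the two categories separate matters when building denominators, since not-applicable cases were never eligible for the question while missing cases were.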
To learn more about these values or how The DHS Program approaches missing values in denominators see Chapter 1 of the Guide to DHS Statistics DHS-7.