The DHS Program is authorized to distribute, at no cost, unrestricted survey data files for legitimate academic research. Registration is required for access to data.
Guide to Using Datasets
The basic approach of The DHS Program is to collect data that are comparable across countries. This is achieved primarily through the use of model questionnaires and the subsequent processing of the raw data into standardized data formats known as Recode Files (see the DHS Recode Manual for a full description of the files). Recode files are generated using standard recode definitions that define variable names, locations, and value categories across countries, as well as constructs of commonly used variables such as age in five-year groups. The information in this section provides an overview of the recode definitions.
On This Page
Benefits of Standard Data Definitions
Standardized Data Variables
Mapping Data to Questionnaires
Cardinality of Data
DHS data files are transformed into a standardized recode dataset for several reasons:
DHS surveys collect primary data using several types of questionnaires. A household questionnaire is used to collect information on characteristics of the household's dwelling unit, as well as the height and weight of women and children in the household. It is also used to identify members of the household who are eligible for an individual interview. Eligible respondents are then interviewed using an individual questionnaire.
In a majority of DHS surveys eligible individuals include women of reproductive age (15-49) and men age 15-59, or in some cases 15-54. In some countries only women are interviewed. Individual questionnaires include information on fertility, family planning and maternal and child health. Data are available from DHS for each of these questionnaires.
The DHS Program also collects data using other types of surveys and questionnaires. These include surveys of education, health service providers, communities, household health expenditures, young adults, and others. These data are also available, but there are no recode definitions for them.
DHS strongly suggests that analysts become familiar with the questionnaires used in the surveys they are analyzing. The questionnaires for a survey can be located in the appendix of the final report.
The questionnaires used in one country, while containing essentially the same information, may differ in many ways from those used in another country. In creating the standardized individual recode data files, these differences require special consideration, and total standardization is not possible. The recode data file is structured in two parts: standard sections and country-specific sections. The standard sections contain the same variables in the same positions for all countries. The country-specific sections contain all variables specific to the country and so are not standardized across countries.
In the “standard recode” datasets, the variable names and definitions are, wherever possible, consistent across all surveys. Special care has been taken to include all variables that are deemed important for each datafile. For example, variables for household characteristics are included in the women, men, and children’s files. However, there are instances when researchers will have to merge or combine different data files to obtain the variables that meet their analysis needs.
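As a rough illustration of the merging mentioned above, the sketch below attaches household-level variables to women's records by matching on cluster and household number. The rows are invented, and the variable names (hv001, hv002, hv025, v001, v002, v012) are assumed to follow the usual recode naming; they should be checked against the data dictionary of the survey being analyzed.

```python
# Hypothetical rows for illustration; real recode files hold many more variables.
household_rows = [
    {"hv001": 1, "hv002": 3, "hv025": "urban"},
    {"hv001": 1, "hv002": 7, "hv025": "rural"},
]
women_rows = [
    {"v001": 1, "v002": 3, "v012": 24},
    {"v001": 1, "v002": 3, "v012": 41},
    {"v001": 1, "v002": 7, "v012": 19},
]


def merge_household_onto_women(women, households):
    """Attach household variables to each woman via (cluster, household number)."""
    # Assumed key variables: hv001/hv002 in the household file,
    # v001/v002 in the women's file.
    index = {(h["hv001"], h["hv002"]): h for h in households}
    merged = []
    for woman in women:
        household = index.get((woman["v001"], woman["v002"]), {})
        # Women's variables are kept as-is; household variables are added.
        merged.append({**household, **woman})
    return merged


merged_rows = merge_household_onto_women(women_rows, household_rows)
```

Each woman keeps her own variables and gains those of her household, so two women from the same household share the same household values. A woman with no matching household is returned unchanged.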
Also, each survey is different, with questions that diverge from the standard. There are special records to keep variables that are not part of the model questionnaires but were included in a particular country. These records are known as country-specific records.
It is important to understand that survey questionnaires change frequently over time. For example, the DHS questionnaires have changed significantly since the first phase (DHS I). For this reason, there is a different recode definition for each DHS phase. However, if a variable is present in one or more phases, that variable has the same meaning in every phase where it is present. If a new question is added to the core questionnaire a new variable will be added to the recode definition.
If a question is dropped from a model questionnaire between one phase and the next, the name of the variable used for that question is not reused for any other question. The variable will not be present in the recode definition of the phase where it was dropped. However, if the same question is used again in other surveys in the same or later phases, the reserved variable name can continue to be used, but it will be in a different location and in a country-specific section of the data file.
Two core questionnaires were used in DHS phases I-IV: the Model "A" questionnaire for High Contraceptive Prevalence Countries and the Model "B" questionnaire for Low Contraceptive Prevalence Countries. The two questionnaires contain basically the same information, although the Model "A" questionnaire contains a detailed calendar of events in the five years preceding the interview, whereas the Model "B" questionnaire contains a simpler series of questions. In DHS phases V and VI a single core questionnaire was used.
In the variable description section of the DHS recode manual, the column labeled "Model" indicates in which questionnaire the question is asked. An "A" indicates that the variable refers to a question asked only in countries that used a Model "A" questionnaire, and a "B" indicates that the variable relates to a question asked only in countries that used the Model "B" questionnaire. If the column is blank, then the question is asked in both Model "A" and Model "B" questionnaires. If the column contains an "X", then the question is not included in either of the Model questionnaires, but was used in a sufficient number of surveys to justify its inclusion as a standard variable. If the column contains "MM", then the questions come from the maternal mortality module. If the column contains "FG", then the questions come from the female genital cutting module.
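The meanings of the "Model" column can be kept in a small lookup table when annotating variables programmatically. The following is a minimal sketch; the function name and the wording of the descriptions are illustrative and are not taken from the Recode Manual.

```python
# Meanings of the "Model" column as described in the paragraph above.
MODEL_COLUMN_MEANINGS = {
    "A": "asked only in countries that used the Model A questionnaire",
    "B": "asked only in countries that used the Model B questionnaire",
    "": "asked in both Model A and Model B questionnaires",
    "X": "not in either model questionnaire, but included as a standard variable",
    "MM": "from the maternal mortality module",
    "FG": "from the female genital cutting module",
}


def describe_model_code(code):
    """Return a description for a 'Model' column entry (hypothetical helper)."""
    key = code.strip().upper()
    return MODEL_COLUMN_MEANINGS.get(key, "code not described in this section")


for example_code in ["A", "", "mm"]:
    print(repr(example_code), "->", describe_model_code(example_code))
```

A blank entry is treated as "both questionnaires", matching the convention described above. Any code not listed in this section is reported as unknown, not guessed.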
The data file is broken down into a number of logical sections. These sections translate directly into records in the data structures. The logical sections are designed to map to the sections of the model questionnaires, although some sections of the model questionnaire are split into more than one section in the recode data file. Some of these sections are repeating or multiple occurrence sections, while others are single occurrence sections. Single sections contain simple, single-answer variables.
Multiple sections are used to represent sets of questions that are repeated for a number of events. The birth history is an example of a multiple section, where questions relating to children are asked for each child, and each child has an entry in the birth history. Each entry in the multiple section is known as an occurrence of the section. In hierarchical data files each occurrence of the section occupies a separate record. Multiple sections are used for sets of questions where the number of occurrences may vary.
In contrast, sets of questions for which there is a fixed number of occurrences are held in a group. A group is similar to a multiple section, but is stored on a single record in hierarchical files. In addition, single variables may also be included in a section containing a group. As an example, in the recode file the contraceptive table (REC31) is stored as a group containing 20 entries, one for each contraceptive method. For the flat files there is no difference between groups and multiple sections.
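The difference between hierarchical and flat layouts can be sketched as follows. In this illustration, each occurrence of a multiple section is a separate record, and flattening spreads those records across one row with numbered variable names. The `_01`, `_02` suffix style, the helper name, and the sample values are assumptions made for the example; the actual naming depends on the file format distributed for a given survey.

```python
def flatten_occurrences(occurrences, max_occurrences):
    """Spread repeated records into one flat row with numbered variable names."""
    if len(occurrences) > max_occurrences:
        raise ValueError("more occurrences than the flat layout allows")
    flat = {}
    # Collect every variable name that appears in any occurrence.
    names = sorted({name for occurrence in occurrences for name in occurrence})
    for position in range(1, max_occurrences + 1):
        # Positions beyond the last real occurrence are padded with None.
        if position <= len(occurrences):
            occurrence = occurrences[position - 1]
        else:
            occurrence = {}
        for name in names:
            flat[f"{name}_{position:02d}"] = occurrence.get(name)
    return flat


# Hierarchical view: one record per child in the birth history (invented values).
birth_history = [
    {"bidx": 1, "b4": 2},
    {"bidx": 2, "b4": 1},
]

# Flat view: a fixed number of slots, with unused slots left empty.
flat_row = flatten_occurrences(birth_history, max_occurrences=3)
```

Because the flat row always reserves the same number of slots, a multiple section with a variable number of occurrences and a group with a fixed number end up looking the same in a flat file.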
Total standardization of data is not possible and special consideration must be given where differences exist. Each DHS Program dataset is distributed with an associated data dictionary and other related documentation. The DHS Recode Manual is an important resource to reference when working with datasets.
The DHS Recode Manual consists of two parts. The first part is a general discussion of the recode file, including the rationale for recoding; a description of the physical structure in which the recode file is available; the coding standards used in the data file; the location of identification information; the use of century month codes for dates and the imputation of partial dates; the DHS model questionnaires; and sections and occurrences. The second part provides a description of each variable in the data file, giving additional information that is not available in the dictionary.
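As an illustration of the century month codes mentioned above, the sketch below converts between a calendar month and a century month code (CMC), assuming the usual convention that the code counts months from the start of 1900, so that January 1900 is 1. The function names are illustrative. Surveys fielded under a non-Gregorian calendar may need different handling, so the Recode Manual for the relevant phase should be consulted.

```python
def to_cmc(year, month):
    """Convert a calendar year and month to a century month code."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # Assumed convention: months elapsed since the start of 1900.
    return (year - 1900) * 12 + month


def from_cmc(cmc):
    """Convert a century month code back to (year, month)."""
    year = 1900 + (cmc - 1) // 12
    month = (cmc - 1) % 12 + 1
    return year, month


interview_cmc = to_cmc(2000, 1)
birth_cmc = to_cmc(1985, 12)
# Differences between codes give elapsed time in months.
age_in_months = interview_cmc - birth_cmc
```

Storing dates as a single integer makes intervals such as age or time since an event a simple subtraction, which is the main practical benefit of the encoding.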