Data
Download Datasets

The DHS Program is authorized to distribute, at no cost, unrestricted survey data files for legitimate academic research. Registration is required for access to data.

Guide to Using Datasets
Data Processing

Processing survey data consists of four important steps:
Step 1: Data Entry
Capturing all the information from paper questionnaires and storing it in electronic format

Step 2: Data Editing and Quality Assurance
Editing and ensuring the maintenance of quality data

Step 3: Data Tabulation
Generating tabulations for both a key indicators report and a final report

Step 4: Recoding of Datasets
Producing a clean set of data for use by researchers and policy makers

Step 1: Data Entry

This is the process of converting the information on the paper questionnaires to an electronic format. For the DHS surveys, this is done using CSPro, a software package designed and implemented by ICF Macro, the US Census Bureau, and others specifically to process survey and census data. CSPro is freely available from the US Census Bureau's website.

See Also:
Data Collection
DHS Survey Manuals

Step 2: Data Editing and Quality Assurance

One of the primary goals of The DHS Program is to produce high-quality data and make it available for analysis in a coherent and consistent form. Demographic surveys in developing countries are prone to incomplete or partial reporting of responses. Additionally, complex questionnaires inevitably allow scope for inconsistent responses to be recorded for different questions. For the analyst, this results in a data file containing incomplete or inconsistent data, which complicates the analysis considerably.

To avoid these problems, The DHS Program has adopted a policy of editing and imputation, which results in a data file that accurately reflects the population studied and may be readily used for analysis.

The quality of DHS data is assured by several processes:
  • Questionnaires are checked when they first arrive from the field, for the correct numbers of questionnaires and selection of eligible respondents. Responses that are open-ended (such as 'other' responses) or those that require coding (such as occupation) are also coded at this point.
  • All questionnaires are checked after data entry to ensure that every expected questionnaire was in fact entered. The numbers of questionnaires are also checked against the sample design.
  • All questionnaires are entered twice and verified by comparing the two data sets. All discrepancies are resolved.
  • The entered data are checked for inconsistencies, which are resolved where possible. Some missing data, such as dates of events, are imputed where possible.
  • A set of quality control tables is generated on a regular basis. These tables indicate potential problems in the field. The tables include information on response rates, age displacement, and completeness of data. This information is then relayed to the field teams to help them improve the quality of data in the field.
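The double-entry verification described in the list above can be illustrated with a minimal Python sketch. It is not the CSPro procedure used by The DHS Program; the function name, questionnaire IDs, and field names are hypothetical and chosen only to show the idea of comparing two independent keyings and listing the discrepancies to be resolved.

```python
def find_discrepancies(first_pass, second_pass):
    """Compare two independent keyings of the same questionnaires.

    Each pass maps a questionnaire ID to a dict of field -> keyed value.
    Returns a list of (questionnaire_id, field, first_value, second_value).
    """
    discrepancies = []
    for qid in sorted(set(first_pass) | set(second_pass)):
        a = first_pass.get(qid)
        b = second_pass.get(qid)
        if a is None or b is None:
            # The questionnaire was keyed in only one of the two passes.
            discrepancies.append(
                (qid, "<questionnaire>",
                 "present" if a is not None else "absent",
                 "present" if b is not None else "absent")
            )
            continue
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                discrepancies.append((qid, field, a.get(field), b.get(field)))
    return discrepancies


# Hypothetical example: the age on questionnaire HH002 was keyed differently.
first_pass = {
    "HH001": {"age": 27, "region": 3},
    "HH002": {"age": 34, "region": 1},
}
second_pass = {
    "HH001": {"age": 27, "region": 3},
    "HH002": {"age": 43, "region": 1},
}
mismatches = find_discrepancies(first_pass, second_pass)
print(mismatches)
```

Each reported discrepancy would then be resolved by going back to the paper questionnaire.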

See Also:
Data Quality and Use
Data Tools and Manuals

Step 3: Data Tabulation

Shortly after data entry is completed, the data processing specialist visits the implementing organization in the country again. During this visit, any additional data checking and cleaning is completed, and weights are calculated. This data set is referred to as the "raw" data. Data tabulation is done using CSPro, the same software package used for data entry. The data processing team generates two types of tables.

The first set consists of the Key Indicators Report tables. These tables are generated from the "raw" data set. They are few in number and present the main national findings of the survey.

The second set consists of the Final Report tables. Production of tables for the Final Report can take several months to complete. The Final Report contains many more tables than the Key Indicators Report, and the data are presented at the national level as well as for population subgroups and/or administrative or geographic subdivisions.

When appropriate to a topic, further data disaggregation is shown. The first step towards producing the Final Report tables is to generate a "standard recode" data set, which contains the same data as the raw data set, but in a standardized format. It is standardized in that the variable names and definitions are, wherever possible, consistent across all surveys. The "standard recode" is also important for researchers and policy makers because it provides a clean set of data for use. The second step is generating the actual Final Report tables. If possible, a preliminary set of the Final Report tables is generated in the time remaining during this country visit, while the complete set is generated at ICF Macro.

The Final Report tables are produced according to a set of standard tables, or a Tabulation Plan, which is established beforehand by the country manager and the survey specialists in country. The purpose of the Tabulation Plan is to provide model tables that set forth the major findings of the survey in a manner that will be useful to policy makers and program managers. It also provides guidance on the most important indicators that should be presented in the survey report and on the level of analysis expected, and it helps ensure timely dissemination of survey results.
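As a rough illustration of the national and subgroup tabulations described in this step, the following Python sketch computes a weighted percentage for one indicator, first nationally and then by subgroup. Actual DHS tabulation is done in CSPro; the function, the field names (weight, residence, uses_method), and the records are all hypothetical.

```python
from collections import defaultdict


def weighted_percent(records, indicator, group=None):
    """Weighted percentage of records where `indicator` is true.

    If `group` is given, results are broken down by that field;
    otherwise a single "national" figure is returned.
    """
    totals = defaultdict(float)  # sum of weights per group
    hits = defaultdict(float)    # sum of weights where indicator is true
    for record in records:
        key = record[group] if group else "national"
        totals[key] += record["weight"]
        if record[indicator]:
            hits[key] += record["weight"]
    return {key: 100.0 * hits[key] / totals[key] for key in totals}


# Hypothetical respondent records with sample weights.
records = [
    {"residence": "urban", "weight": 1.5, "uses_method": True},
    {"residence": "urban", "weight": 0.5, "uses_method": False},
    {"residence": "rural", "weight": 1.0, "uses_method": True},
    {"residence": "rural", "weight": 1.0, "uses_method": False},
]

national = weighted_percent(records, "uses_method")
by_residence = weighted_percent(records, "uses_method", group="residence")
print(national)
print(by_residence)
```

The same function serves both kinds of table: a Key Indicators style national figure and a Final Report style breakdown by subgroup.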

See Also:
Data Tabulation Plan

Step 4: Recoding of Datasets

The DHS Program makes the resulting survey datasets freely available to researchers, policy makers, and decision makers. In order for the datasets to be clean and as comparable as possible across all surveys, The DHS Program generates "standard recode" datasets, which contain the same data as the raw datasets, but in a standardized format. In the "standard recode" datasets, the variable names and definitions are, wherever possible, consistent across all surveys. However, each survey is different and includes questions that diverge from the standard. These questions are included in the standard recode datasets, either as computed standard variables or as variables that are specific to that survey. The process of recoding can take several months, and it involves consistency checking and comparisons between the standard recode and raw datasets.
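The idea of a standardized format can be sketched in Python as a mapping from each survey's own variable names to a shared set of standard names, with unmapped questions kept as survey-specific variables. This is only a sketch: the raw names, the standard names, and the "S_" prefix below are invented for the example and are not the actual DHS recode variable names.

```python
def recode(raw_record, standard_map, specific_prefix="S_"):
    """Rename raw variables to standard names.

    Variables without a standard equivalent are kept under a
    survey-specific name (hypothetical prefix) instead of being dropped.
    """
    recoded = {}
    for raw_name, value in raw_record.items():
        if raw_name in standard_map:
            recoded[standard_map[raw_name]] = value
        else:
            recoded[specific_prefix + raw_name] = value
    return recoded


# Two hypothetical surveys that name the same questions differently.
survey_a_map = {"Q102_AGE": "AGE", "Q105_EDUC": "EDUCATION"}
survey_b_map = {"AGE_RESP": "AGE", "SCHOOL_LVL": "EDUCATION"}

row_a = recode({"Q102_AGE": 29, "Q105_EDUC": 2}, survey_a_map)
row_b = recode({"AGE_RESP": 29, "SCHOOL_LVL": 2, "BEDNET": 1}, survey_b_map)
print(row_a)
print(row_b)
```

After recoding, the shared variables carry the same names in both surveys, while the extra question from the second survey survives as a survey-specific variable.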

Recoding of datasets is currently done for the DHS and AIS surveys. Work is under way to recode the SPA datasets.

Recoding of DHS and AIS datasets
There are three core questionnaires in DHS surveys: the Household Questionnaire, the Women's Questionnaire, and the Men's Questionnaire. There are also several standardized modules for countries with interest in other topics, such as malaria, domestic violence, or maternal mortality. All additional modules are incorporated into the Household, Women's, or Men's Questionnaires. There are two core questionnaires in the AIS surveys: the Household Questionnaire and the Individual Questionnaire. The latter applies to both women and men.

Since the survey methodology, sampling and eligibility of the DHS and AIS surveys are consistent, the DHS recode variables have been expanded to include the AIS variables as well.

From the very beginning of DHS, a recode file was designed for the sake of consistency and comparability across surveys. In the first phase of DHS (DHS-I), the recode was defined only for the Women's Questionnaire. The recode file proved to be very useful, and as a result, recode files were introduced for the Household and Men's Questionnaires starting with DHS-II.

Recode files are initially created using a hierarchical model and later exported to flat files. There are two physical recode hierarchical data files. The first one includes the Household and Women's Questionnaires, and the second one is for the Men's Questionnaire. Each hierarchical data file is broken down into a number of records. The records were originally designed to map to different sections of the model questionnaires, but because of changes among phases, that is no longer the case. Some of these records are repeating or multiple-occurrence records, while others are single-occurrence records. Single records contain simple, single-answer variables. Multiple records are used to represent sets of questions that are repeated for a number of events.

There are special records to hold variables that are not part of the model questionnaires but were included in a particular country's survey. These records are known as country-specific records, and they can also be multiple or single, depending on whether the question was added to a single-occurrence or multiple-occurrence section of the questionnaire.
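To make the step from a hierarchical file to a flat file more concrete, the following Python sketch flattens one case that has a single-occurrence record and a multiple-occurrence record into one row. It assumes one possible flattening convention (a numbered suffix per occurrence, padded to a fixed maximum); the record layout, the field names, and the suffix scheme are hypothetical and do not describe the actual DHS export format.

```python
def flatten_case(case, multiple_fields, max_occurrences):
    """Flatten one hierarchical case into a single flat row.

    Single-occurrence variables are copied as they are. Each variable in
    a multiple-occurrence record is repeated `max_occurrences` times with
    a two-digit suffix; unused occurrences are filled with None.
    """
    row = dict(case["single"])
    for record_name, fields in multiple_fields.items():
        occurrences = case["multiple"].get(record_name, [])
        for i in range(max_occurrences):
            occurrence = occurrences[i] if i < len(occurrences) else {}
            for field in fields:
                row[f"{field}_{i + 1:02d}"] = occurrence.get(field)
    return row


# Hypothetical case: one respondent with two entries in a repeating record.
case = {
    "single": {"case_id": "0001", "age": 31},
    "multiple": {
        "births": [
            {"birth_year": 2015, "birth_sex": "F"},
            {"birth_year": 2018, "birth_sex": "M"},
        ]
    },
}
flat = flatten_case(case, {"births": ["birth_year", "birth_sex"]}, max_occurrences=3)
print(flat)
```

A country-specific record could be handled the same way by adding another entry to the hypothetical multiple_fields mapping or more keys to the single record.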

See Also:
Data Tools and Manuals
Using Datasets for Analysis