Guide to Using Datasets
Merging Datasets

For the DHS surveys, The DHS Program distributes separate Household, Household Member, Women, Children under five, Men, and Couples files in flat or hierarchical formats. Care has been taken to include in each file the variables deemed important for it. For example, variables for household characteristics are included in the women's, men's, and children's files.

However, there are instances when researchers have to merge or combine different files to obtain the variables that meet their analysis needs. This section discusses the variables and mechanisms that can be used to accomplish that task.

File Relationship Types

It is important to mention again that matching files is only necessary when variables required for the analysis are not present in the distributed file but are present in another file. When merging data files, it is important to know the type of relationship that exists between the files to be merged, as well as the type of output file desired (the unit of analysis). There are two types of relationship: one entity related to many other entities [1 : 0-N], and one entity related to just one other entity [1 : 0-1]. The lower bound of 0 in both notations indicates that an entity may have no matching record at all.

An example of a one-to-many relationship is the one between households and women or men: there may be zero or several women's or men's questionnaires for each household. An example of a one-to-one relationship is the one between women and men: in a monogamous country, there may be zero or one man's questionnaire for each woman if she is currently married.
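The distinction can be checked empirically by counting how many records in one file share each key of the other. The following is a minimal sketch in plain Python; the cluster, household, and line numbers are invented toy values, and `relationship` is a hypothetical helper, not part of any DHS tool.

```python
from collections import Counter

# Toy records; all numbers are invented for illustration.
households = [(1, 1), (1, 2), (1, 3)]        # (cluster, household)
women = [(1, 1, 2), (1, 1, 4), (1, 3, 2)]    # (cluster, household, line number)

def relationship(parent_keys, child_keys):
    """Return '1 : 0-1' if no parent key has more than one child record, else '1 : 0-N'."""
    counts = Counter(child_keys)
    most = max((counts[key] for key in parent_keys), default=0)
    return "1 : 0-1" if most <= 1 else "1 : 0-N"

# Reduce each woman to the key of the household she belongs to.
women_household_keys = [(cluster, hh) for cluster, hh, _line in women]

# Household (1, 1) has two women, (1, 2) has none, (1, 3) has one.
print(relationship(households, women_household_keys))
```

Because at least one household has more than one woman, the sketch reports a one-to-many relationship; a file in which every key appears at most once would be reported as one-to-one.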

Unique Case Identifiers

One of the advantages of processing complex surveys with CSPro, software capable of handling hierarchical files, is that it allows tight control of the case identifiers. The DHS Program guarantees that its files can be matched seamlessly whenever a relationship is possible. To manipulate the files properly, it is necessary to know which variables or fields identify the cases. The following reference table shows those fields.

Unique Case Identifiers for Data Files

Matching Variables

When merging files, it is generally easier to use the original variables rather than the ID variables. For example, the household and women’s files cannot be merged directly on HHID and CASEID, because CASEID has three extra characters identifying the woman’s line number. The files can be merged more easily by matching HV001 with V001 and HV002 with V002.
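The mismatch between the two ID variables can be illustrated with a small sketch. The identifier strings below are invented toy values, and their widths and padding are assumptions; only the "three extra characters" rule comes from the text above.

```python
# Toy records; the ID strings and their padding are assumptions for illustration.
household = {"HHID": "  1  3", "HV001": 1, "HV002": 3}
woman = {"CASEID": "  1  3  2", "V001": 1, "V002": 3}

# A direct comparison of the ID strings fails: CASEID is three characters longer.
ids_equal = household["HHID"] == woman["CASEID"]

# Dropping the last three characters (the woman's line number) recovers the household part.
prefix_equal = woman["CASEID"][:-3] == household["HHID"]

# Matching on the original variables avoids string handling altogether.
keys_equal = (household["HV001"], household["HV002"]) == (woman["V001"], woman["V002"])

print(ids_equal, prefix_equal, keys_equal)
```

Trimming CASEID works in this toy case, but it depends on the exact string layout of the identifiers, which is why matching on the numeric cluster and household variables is usually the simpler choice.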

The following reference table shows the variables required to match different files. In the rows, the base files are listed. In the columns, the secondary files along with the variables to be used as keys or matching variables are listed. In the cells intersecting the rows and columns, variables from the base files used to match the secondary file are listed.

Matching Variables

This table shows that household variables can be appended to women, men, and children. Women's variables can be appended to their children. They can also be appended to men, to create couples. Notice that there is no relationship between children and men, because children come from the birth history, which is asked of women.

With software that requires the merging variables to have the same name in both files, it will be necessary either to rename the matching variables in one file or to create copies of them under the names used in the other file. For example, to match the household data to the women's data, first rename HV001 to V001 and HV002 to V002, or create a copy of HV001 in V001 and a copy of HV002 in V002 in the household data before merging.
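Both options, renaming and copying, can be sketched in plain Python with records held as dictionaries. The helpers `rename_keys` and `copy_keys` and the field `HH_EXTRA` are hypothetical names introduced only for this illustration; a statistical package would use its own rename or compute commands instead.

```python
# Mapping from household-file names to women's-file names, as described above.
RENAME = {"HV001": "V001", "HV002": "V002"}

def rename_keys(record, mapping):
    """Return a copy of record with keys renamed according to mapping."""
    return {mapping.get(name, name): value for name, value in record.items()}

def copy_keys(record, mapping):
    """Return a copy of record with each mapped variable duplicated under its new name."""
    out = dict(record)
    for old, new in mapping.items():
        out[new] = record[old]
    return out

# Toy household record; HH_EXTRA stands in for any household variable of interest.
household = {"HV001": 1, "HV002": 3, "HH_EXTRA": "a"}

renamed = rename_keys(household, RENAME)  # HV001 and HV002 are replaced by V001 and V002
copied = copy_keys(household, RENAME)     # HV001 and HV002 are kept, V001 and V002 are added
print(renamed)
print(copied)
```

Copying keeps the original household variables available after the merge, at the cost of two extra columns; renaming keeps the file smaller.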

Steps for Merging Datasets

Statistical packages such as SPSS, SAS, and Stata have commands for merging files, but regardless of the package the following steps are necessary:

  1. Determine the common identifiers (identification variables).
  2. Sort both data files by the identification variables.
  3. Determine the base (primary) file. The base file establishes the unit of analysis.
  • Normally, when the relationship is one to many [1:0-N], the base file is the one with the many entities. For example, when merging data from households and women, the base file should be the women’s file, because the aim is to assign to every woman the characteristics of her household. If the match is done the other way around, the program will either stop looking after it matches the first woman in a household or report an error for duplicate cases. Similarly, when matching women and children, the base file should be the children’s file, so that mothers’ characteristics are assigned to their children.
  • If the relationship is one to one [1:0-1], the base file is normally the one with the fewest cases. In DHS surveys, the men's questionnaire is administered only in a sub-sample of households, so not every currently married woman has a matching men's questionnaire. In this case, the base file should be the men's file, and the resulting file (unit of analysis) will be the Couples file.

  4. Finally, using the right commands for the software being used, merge the files.
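The four steps can be sketched for the household and women example as follows. This is a minimal plain-Python illustration of a sorted match-merge, not the implementation used by any particular package; the function `merge_many_to_one`, the fields `LINE` and `HH_EXTRA`, and all data values are hypothetical.

```python
from operator import itemgetter

def merge_many_to_one(base, secondary, base_keys, sec_keys):
    """Append secondary-file variables to every matching base-file record.

    Assumes each key combination occurs at most once in the secondary file.
    Base records without a match are kept unchanged.
    """
    base_key = itemgetter(*base_keys)
    sec_key = itemgetter(*sec_keys)
    # Step 2: sort both files by the identification variables.
    base = sorted(base, key=base_key)
    secondary = sorted(secondary, key=sec_key)
    merged, j = [], 0
    # Step 4: walk through both sorted files in parallel.
    for row in base:
        while j < len(secondary) and sec_key(secondary[j]) < base_key(row):
            j += 1
        out = dict(row)
        if j < len(secondary) and sec_key(secondary[j]) == base_key(row):
            # Copy every secondary variable except the matching keys themselves.
            out.update({k: v for k, v in secondary[j].items() if k not in sec_keys})
        merged.append(out)
    return merged

# Toy data, invented for illustration.
households = [
    {"HV001": 1, "HV002": 2, "HH_EXTRA": "b"},
    {"HV001": 1, "HV002": 1, "HH_EXTRA": "a"},
]
women = [
    {"V001": 1, "V002": 2, "LINE": 2},
    {"V001": 1, "V002": 1, "LINE": 4},
    {"V001": 1, "V002": 1, "LINE": 2},
]

# Step 1: the common identifiers are cluster and household number.
# Step 3: the women's file is the base, so the unit of analysis is the woman.
merged = merge_many_to_one(women, households, ("V001", "V002"), ("HV001", "HV002"))
for record in merged:
    print(record)
```

The pointer into the household file does not advance after a match, so both women in household 1 receive the same household variables. The result has one record per woman, which reflects the choice of the women's file as the base.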