Download Datasets

The DHS Program is authorized to distribute, at no cost, unrestricted survey data files for legitimate academic research. Registration is required for access to data.

Guide to Using Datasets

FAQs

Frequently asked questions

  • About DHS (3)  
  • Sampling and Weighting (7)  
  • Publications (3)  
  • GPS Data (17)  
  • Using Data Files (4)  
  • STATcompiler (2)  

About DHS

  • What is The DHS Program’s relationship to USAID and ICF International?
  • The DHS Program is a USAID-funded project implemented by ICF International. The project was previously implemented by ICF/Macro, Macro International Inc., ORC Macro, and Institute for Resource Development, Inc. The DHS Program is housed at ICF International's office in Rockville, Maryland, USA. Several other organizations are partners on The DHS Program as well. In the current phase of the contract, The DHS Program partners are Johns Hopkins University Bloomberg School of Public Health/Center for Communication Programs (Hopkins CCP), PATH, Futures Institute, Vysnova, Blue Raster, Kimetrica, and EnCompass.

  • How are DHS countries chosen?
  • DHS surveys are generally carried out at the request of the USAID mission or another international donor in a given country. DHS surveys are carried out only in less-developed countries, or countries receiving US foreign aid. ICF International does assist non-USAID funded countries to implement surveys, using funding from the countries themselves or from a non-USAID donor. Such surveys use all standard DHS Program protocols and materials.

  • How does a country which has never had a survey get one started?
  • If you are interested in having a DHS in your country, please contact the local USAID office. If there is no USAID office in the country, consult other major funders, such as UN agencies, or the international aid programs of other foreign countries.

Sampling and Weighting

  • What are sample weights?
  • In many countries, the population is not evenly distributed among different regions. Over-sampling in regions with small populations ensures that they have a large enough sample to be representative. Under-sampling is done in regions with large populations to save costs. Sample weights are mathematical adjustments applied to the data to correct for over-sampling, under-sampling, and different response rates to the survey in different regions.

  • How are the oversampled/undersampled areas corrected in data analysis?
  • The samples for DHS surveys are designed to permit data analysis of regional subsets within the sample population. When the expected number of cases for some of these regions is too small for analysis, it is necessary to oversample those areas. When the expected number of cases for some of these regions is unnecessarily large, those areas may be undersampled to accommodate logistical or budgetary constraints.

    During analysis, it is then necessary to "weight down" the oversampled areas and "weight up" the undersampled areas. The sampling weights were developed to take this into account. Always use the weight variable found in the DHS data set. Even in surveys that come from a self-weighting sample, it is still necessary to use the sampling weights in analysis because response rates may differ between groups of respondents.
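
    As a rough illustration of weighting down and weighting up, the Python sketch below compares an unweighted and a weighted estimate. It is not DHS code: the two regions, their populations, sample sizes, and outcome rates are invented, and the weight is simply the inverse of the sampling fraction, which ignores the non-response adjustment built into real DHS weights.

```python
# Hypothetical two-region example: every number below is invented for illustration.
regions = {
    # name: (population, sample size, share of the sample with the outcome)
    "small_region": (100_000, 1_000, 0.60),  # oversampled: 1 case per 100 people
    "large_region": (900_000, 1_000, 0.20),  # undersampled: 1 case per 900 people
}


def unweighted_estimate(regions):
    """Pool all cases as if each one represented the same number of people."""
    cases = sum(n for _, n, _ in regions.values())
    positives = sum(n * p for _, n, p in regions.values())
    return positives / cases


def weighted_estimate(regions):
    """Weight each case by population / sample size (inverse sampling fraction)."""
    numerator = 0.0
    denominator = 0.0
    for population, n, p in regions.values():
        weight = population / n  # people represented by one sampled case
        numerator += weight * n * p
        denominator += weight * n
    return numerator / denominator


print(round(unweighted_estimate(regions), 3))  # 0.4, pulled toward the oversampled region
print(round(weighted_estimate(regions), 3))    # 0.24, matches the population-level rate
```

    In this made-up case the unweighted figure overstates the outcome because the small region supplies half of the sample but only a tenth of the population.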

  • What does it mean to normalize the weights?
  • After the weights are initially calculated, they are normalized, or standardized, by dividing each weight by the average of the initial weights (the sum of the initial weights divided by the number of cases), so that the sum of the normalized weights equals the number of cases over the entire sample. The standardization is done separately for each weight for the entire sample.

    The entire set of household sample weights is multiplied by a constant so that the total weighted number of households equals the total unweighted number of households at the national level.

    Individual sample weights are normalized separately for women and men. Thus, the total weighted number of women equals the total unweighted number of women, and the total weighted number of men equals the total unweighted number of men. Women and men are normalized separately because all non-HIV calculations are performed on women and men separately. We do not provide survey estimates on the joint population of women and men combined for anything other than HIV prevalence.
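
    The normalization described above can be sketched in a few lines of Python. The initial weights are hypothetical values chosen for illustration; the function only shows the arithmetic (divide by the mean initial weight) and is not the procedure DHS runs in production.

```python
def normalize_weights(initial_weights):
    """Divide each weight by the mean initial weight.

    After this step the weights sum to the number of cases.
    """
    n_cases = len(initial_weights)
    mean_weight = sum(initial_weights) / n_cases
    return [w / mean_weight for w in initial_weights]


# Hypothetical initial weights; women and men are normalized separately.
women_initial = [120.0, 80.0, 200.0, 150.0, 50.0]
men_initial = [90.0, 110.0, 100.0]

women_normalized = normalize_weights(women_initial)
men_normalized = normalize_weights(men_initial)

# Each sum equals the unweighted number of cases, up to floating-point rounding.
print(sum(women_normalized), len(women_initial))
print(sum(men_normalized), len(men_initial))
```

    Dividing by the mean is the same as multiplying the whole set by one constant, so the relative sizes of the weights are unchanged.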

  • Why do we normalize the HIV weights on the total population of women and men combined?
  • HIV data are pooled data from men and women individual surveys. We calculate a joint estimate of HIV prevalence among women and men combined so it is necessary to normalize the HIV weights on the population of women and men combined. The HIV weights are normalized on the total population of women and men combined by multiplying the entire set of weights by a constant.

  • Do I need to weight the data?
  • DHS surveys require the use of sample weights during analysis to ensure sample representativity. In DHS Standard Recode files, the household sample weight variable is HV005, the woman’s individual weight variable is V005, and the man’s individual weight variable is MV005. The weight is an eight-digit variable with six implied decimal places. Always divide the weight variable by 1,000,000 before applying it because the weight variable was multiplied by 1,000,000 in the recoding procedure.

    For example, when V005 = 170722, the weight = 170722/1,000,000 = 0.170722.
    When V005 = 5809147, the weight = 5809147/1,000,000 = 5.809147.

    Here is an example of code for an SPSS program in which the results are to be weighted:
    COMPUTE WTVAR=V005/1000000.
    WEIGHT BY WTVAR.
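
    For readers working outside SPSS, the same two steps (rescale V005, then weight the results) might look like this in Python. The records and the "outcome" field are hypothetical; only the division by 1,000,000 and the two example V005 values come from the text above.

```python
def scale_weight(v005):
    """Convert the stored integer weight to its real value (six implied decimals)."""
    return v005 / 1_000_000


def weighted_percent(records, field):
    """Weighted percentage of records where `field` equals 1."""
    total = sum(scale_weight(r["v005"]) for r in records)
    positive = sum(scale_weight(r["v005"]) for r in records if r[field] == 1)
    return 100.0 * positive / total


# Hypothetical respondents; "outcome" is a made-up 0/1 indicator.
records = [
    {"v005": 170722, "outcome": 1},
    {"v005": 5809147, "outcome": 0},
    {"v005": 1000000, "outcome": 1},
]

print(scale_weight(170722))   # 0.170722
print(scale_weight(5809147))  # 5.809147
print(weighted_percent(records, "outcome"))
```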

  • What sample weight should be used for couples?
  • DHS surveys identify eligible households and individuals within those households. Surveys do not identify eligible couples. Since there is no way of knowing how many couples are in the sample, it is not possible to calculate a separate sample weight for couples. A proxy couples' weight must be selected from either the men's individual sample weight or the women's individual sample weight. The base for both of these weights is the household weight, and where response rates differ little by sex, there is very little difference between these two weights. Response rates to population-based surveys tend to be lower among men, so the practice of the DHS project is to use the men's sample weight for couples.

  • Why is there a special domestic violence weight?
  • The domestic violence module is applied to no more than one woman per household, selected at random. A special weight is calculated to account for the respondent's probability of being selected into the subsample of domestic violence respondents. The weight is a six-digit variable, D005, coded in the same way as the other weight variables in the recode file.

Publications FAQs

  • I cannot view the PDF on the web - what can I do?
  • You need the free Adobe Acrobat Reader installed in order to view PDF documents. If you are viewing the PDF in the web browser and you cannot see it, try downloading the PDF onto your computer or try these solutions.

  • How much do The DHS Program publications cost?
  • All DHS Program publications are free of charge. There is usually a limit of 10 copies for printed publications.

  • How do I get a copy of the publication?
  • First go to Publications Search to find the document you want. You can search by topic, country, keyword, or publication type. There are 10 categories within publication type. Electronic versions of all publications are available for download. If available, hard copies can be requested by clicking the "Order a Hard Copy" link.

GPS Data FAQs

  • [General] Why do we collect GPS data?
  • The location of the survey allows hemoglobin levels to be adjusted according to altitude. This provides a more accurate measure of anemia status. The GPS coordinates also allow you to augment survey data with other data based on location, e.g. malaria parasitemia, rainfall, or land cover.

  • [General] When do the GPS data become available online?
  • The GPS data are available online after the survey data are released unless there are outstanding issues. The GPS data are only released after they have been thoroughly checked and evaluated by the GIS Team. Survey data may be available before GPS data. However, GPS data will never be available before survey data.

  • [General] What information is provided in the GPS data?
  • The geographic datafile contains the DHS cluster ID, DHS country code, FIPS country code (CCFIPS), FIPS region name and code (if applicable), SALB region name and code (if applicable), DHS region name and code, source type (e.g. GPS, MIS, CEN, GAZ), urban or rural classification, latitude and longitude coordinates in decimal degrees, altitude in meters from either the GPS receiver and/or a digital elevation model (DEM), and the geographic datum. For a more detailed description, see the README file included with the GPS data downloads.

  • [Accessing the Data] Are GPS data for SPA available?
  • No, the informed consent agreement assures survey respondents that their facility will not be identified. Geographic location is considered personally identifiable information, and its release would violate the informed consent agreement.

  • [Data Format] In what coordinate reference system are the GPS data?
  • All available GPS data reference the World Geodetic System 1984 (WGS84) and contain latitude and longitude values in decimal degrees.

  • [Data Format] What are the sources of coordinate information?
  • Most surveys utilize a field GPS receiver. During the survey, teams collect the coordinates at the approximate center of the cluster. In the GPS data, the SOURCE will be marked as "GPS". If the country maintains robust geographic information about the primary sampling units, usually a census enumeration area, the GIS Team will match these data with the sample frame. These clusters' SOURCE will be marked as "CEN". As a last resort, external data sources are used to approximate a cluster's location using information about the nearest town. These clusters' SOURCE will be marked as "GAZ".

  • [Data Format] Why are some clusters missing coordinate information?
  • If the GIS Team is not able to satisfactorily verify the coordinates, these clusters will be marked as missing GPS data. Clusters without coordinate information may be missing from the data set or located at latitude/longitude 0, 0. These clusters' SOURCE will be marked as "MIS". Every attempt is made to ensure that at least 95% of clusters have coordinate information.
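
    Before mapping or linking, it can help to drop clusters without usable coordinates. The Python sketch below applies the two rules stated above (SOURCE marked "MIS", or a location of 0, 0). SOURCE and DHSCLUST appear in this FAQ; the coordinate field names LATNUM and LONGNUM and all sample rows are assumptions, so check the README shipped with your download for the actual names.

```python
def has_usable_coordinates(cluster):
    """Return False for clusters flagged as missing or placed at latitude/longitude 0, 0."""
    if cluster["SOURCE"] == "MIS":
        return False
    if cluster["LATNUM"] == 0 and cluster["LONGNUM"] == 0:
        return False
    return True


# Hypothetical rows standing in for records read from the GPS datafile.
clusters = [
    {"DHSCLUST": 1, "SOURCE": "GPS", "LATNUM": -1.95, "LONGNUM": 30.06},
    {"DHSCLUST": 2, "SOURCE": "CEN", "LATNUM": -2.10, "LONGNUM": 29.75},
    {"DHSCLUST": 3, "SOURCE": "MIS", "LATNUM": 0.0, "LONGNUM": 0.0},
    {"DHSCLUST": 4, "SOURCE": "GAZ", "LATNUM": -1.50, "LONGNUM": 30.25},
]

usable = [c for c in clusters if has_usable_coordinates(c)]
coverage = len(usable) / len(clusters)
print(f"{len(usable)} of {len(clusters)} clusters have coordinates ({coverage:.0%})")
```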

  • [Using the Data] How do you map GPS data?
  • Older datasets are provided as a dbf table. Most geographic information system (GIS) software can display the latitude and longitude data as points. More recent GPS data are provided as a shapefile referencing the WGS84 datum. A shapefile is an open standard and can be used with most GIS software.

  • [Using the Data] How do you link the GPS data to the survey data?
  • The GPS data contain a cluster ID (DHSCLUST) that corresponds to the cluster ID in each of the survey datasets (household, individual, child, birth, etc.): hv001 in household files, v001 in women's, children's, and birth files, and mv001 in men's files. You can link these data using DHSCLUST from the GPS data and the cluster ID in the survey dataset. Note that because many households and individuals are assigned the same cluster number, the GPS data correspond to the survey data in a one-to-many relationship.
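
    A minimal Python sketch of this one-to-many link, using only the standard library. DHSCLUST and hv001 are the identifiers named above; the coordinate fields LATNUM and LONGNUM, the household identifier hh_id, and all sample rows are hypothetical.

```python
def link_gps(survey_rows, gps_rows, cluster_key="hv001"):
    """Attach each cluster's coordinates to every survey row in that cluster."""
    gps_by_cluster = {g["DHSCLUST"]: g for g in gps_rows}
    linked = []
    for row in survey_rows:
        gps = gps_by_cluster.get(row[cluster_key])  # None if the cluster has no GPS record
        merged = dict(row)
        merged["LATNUM"] = gps["LATNUM"] if gps else None
        merged["LONGNUM"] = gps["LONGNUM"] if gps else None
        linked.append(merged)
    return linked


# Hypothetical GPS records: one row per cluster.
gps_rows = [
    {"DHSCLUST": 1, "LATNUM": -1.95, "LONGNUM": 30.06},
    {"DHSCLUST": 2, "LATNUM": -2.10, "LONGNUM": 29.75},
]

# Hypothetical household records: several rows per cluster.
household_rows = [
    {"hv001": 1, "hh_id": 10},
    {"hv001": 1, "hh_id": 11},
    {"hv001": 2, "hh_id": 20},
    {"hv001": 3, "hh_id": 30},  # cluster without a GPS record
]

for row in link_gps(household_rows, gps_rows):
    print(row)
```

    Pass a different cluster_key when the survey rows come from a file that names the cluster variable differently.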

  • [Using the Data] Why do GPS data sometimes appear to be located outside of my boundaries or in incorrect regions?
  • Every attempt is made to use the most accurate administrative boundaries when checking the data. However, using a generally accepted and commonly used standard is not always possible. All GPS data, even after geographic displacement, should be located within the survey country and within the assigned region. If your administrative boundaries are not identical to the ones the GIS Team used, there may be some clusters that are located across an administrative boundary.

  • [Geographic Displacement] Why are GPS data geographically displaced?
  • The actual location of the cluster is displaced to protect the confidentiality of the survey respondents. The confidentiality agreement states that information will be kept strictly confidential. Because geographic location is considered personally identifiable information, these data must be treated in a way that ensures that respondents cannot be identified.

  • [Geographic Displacement] How are GPS data geographically displaced?
  • The data are randomly displaced up to 5 kilometres in rural areas and up to 2 kilometres in urban areas. A further 1 percent of rural clusters are displaced up to 10 kilometres. Urban or rural classifications are determined by the country's implementing agency. Geographically displaced data should remain located within the country boundaries and within the assigned DHS Region.
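
    To give a feel for the distances involved, the Python sketch below simulates a displacement that follows the limits stated above. It is an assumption-laden illustration, not the procedure the GIS Team uses: the uniform choice of direction and distance, the "U"/"R" codes, the kilometres-per-degree constant, and the function names are all invented here, and the sketch does not check that points stay inside the country or DHS Region.

```python
import math
import random

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def max_displacement_km(urban_rural, rng):
    """Upper limit: 2 km urban, 5 km rural, 10 km for about 1 percent of rural clusters."""
    if urban_rural == "U":
        return 2.0
    if urban_rural == "R":
        return 10.0 if rng.random() < 0.01 else 5.0
    raise ValueError(f"unknown urban/rural code: {urban_rural!r}")


def displace(lat, lon, urban_rural, rng):
    """Move a point a random distance (within the limit) in a random direction."""
    limit = max_displacement_km(urban_rural, rng)
    distance_km = rng.uniform(0.0, limit)
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    dlat = distance_km * math.cos(bearing) / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = distance_km * math.sin(bearing) / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon, distance_km


rng = random.Random(2024)  # fixed seed so the illustration is repeatable
new_lat, new_lon, moved_km = displace(-1.95, 30.06, "R", rng)
print(f"moved {moved_km:.2f} km to ({new_lat:.4f}, {new_lon:.4f})")
```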

  • [Geographic Displacement] Are non-displaced GPS data available?
  • No, all survey data are geographically displaced in order to protect the privacy of survey respondents. All research, both internal and external, uses the displaced data.

  • [Geographic Displacement] Does the displacement affect spatial analyses?
  • Internal studies have shown that the random geographic displacement does not have an effect on analyses conducted at the proper scale, i.e. DHS Region. Since the displacement is random, any error introduced to the data should not be significant. However, if the effect under study is highly localized, the error could be significant. For example, exact distances from the cluster coordinates should not be calculated.

  • [Geographic Displacement] Can you calculate distance measurements with GPS data?
  • We suggest that people consider catchment areas or buffer zones around the location of interest and consider chunks of distances such as 0-5 km, 6-10 km, etc. This helps reduce some of the error and avoids the pitfall of an exact distance measurement.
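
    One way to follow this advice is to compute an approximate great-circle distance and report only the band it falls in. The Python sketch below assumes hypothetical cluster and facility coordinates and uses contiguous bands (up to 5 km, over 5 and up to 10 km, over 10 km) so that no distance falls between bands; the band edges are a choice to adapt to your analysis.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_band(distance_km):
    """Collapse an exact distance into a coarse band."""
    if distance_km <= 5:
        return "0-5 km"
    if distance_km <= 10:
        return "5-10 km"
    return ">10 km"


# Hypothetical displaced cluster location and point of interest.
cluster = (1.0, 30.0)
facility = (1.0, 30.03)

d = haversine_km(*cluster, *facility)
print(distance_band(d))  # report the band rather than the exact distance
```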

  • [Geographic Displacement] Can I calculate indicator estimates for areas smaller than the DHS Region?
  • The survey design for DHS is not conducive to small-area estimation. Households and respondents were selected in order to produce representative population estimates at the national and regional level only. Any sub-regional estimates are highly unreliable and likely to have large standard errors.

  • [Geographic Displacement] Is it possible to do spatial analysis of DHS at the individual cluster level?
  • No, the sample frame is designed to ensure that the data are representative at the national and sub-national level, i.e. DHS region, only. The GPS data for the cluster can be used to extract additional information based on location but are not representative of the population living at that exact place.

Using Data Files

  • How do I weight the data?
  • See Step 7: Use sample weights.

  • Which weight variable should I use?
  • There are different weights for different sample selections/units of analysis. See Step 7: Use sample weights.

  • Should I use sampling weights for regression analyses?
  • There are divergent opinions on whether or not sampling weights should be used when estimating relationships, such as in regression analyses. Advocates for using sampling weights in regressions believe that analyses aren't nationally representative unless sampling weights are used. Researchers who believe regression analyses should be unweighted believe that sampling weights are inappropriate when estimating relationships at the individual level and should not be used. The use of sampling weights for regression analysis is an analytic decision best made by the researcher.

  • How do I specify the stratification and clustering using SVYSET in Stata or COMPLEX SAMPLES in SPSS in order to account for sample design?
  • Most statistical software packages, like Stata and SPSS, assume that the data you are using come from a simple random sample (SRS) unless told otherwise. DHS data are almost always collected using a two-stage stratified cluster sample, not SRS.

    In Stata and SPSS, you can adjust your sampling errors to account for DHS' sample design with three pieces of information: a) the sampling weight, b) the cluster, and c) the stratification used in sample design.
    a)  Sampling weights are described under Step 7: Use sample weights.
    b)  The cluster, or primary sampling unit (PSU), is v021 in women’s and children’s files (IR, KR, and BR), mv021 in Men’s (MR) files, and hv021 in household (HR or PR) files. (See more about data file types).
    c)  Stratification is a bit more complicated. In some surveys the stratification used to design the sample is captured in the variable v023. However, in other surveys, v023 is blank or is set to 0 “National.” We recommend reading the description of the survey implementation in your survey’s final report (see the Introduction section or the Sample Design and Implementation Appendix) and using the stratification described there as your stratification. In Stata, this requires creating a variable to replicate the sampling stratification used in the specific survey you are working on, and specifying this as your strata variable in Stata (see example code below).

    An example for stratification: Azerbaijan 2006 DHS.
    The Sample Design Appendix says “Stratification was achieved by separating each economic region into urban and rural areas. The 10 regions were stratified into 19 sampling strata because Baku has only urban areas.” This means that the stratification is basically a cross of urban/rural (v025) and economic region (v024). This is a common stratification for DHS samples.

    In Stata, you need to generate a strata variable by generating one variable with unique values for urban and rural areas within each region (only urban areas in Baku).
    Example Stata code:
    *generate weight
    generate weight = v005/1000000
     
    *make unique strata values by region/urban-rural (label option automatically labels the results)
    egen strata = group(v024 v025), label
    *check results
    tab strata
     
    *tell Stata the weight (using pweights for robust standard errors), cluster (psu), and strata:
    svyset [pweight=weight], psu(v021) strata(strata)
     
    In SPSS, you can use the separate variables v024 and v025 without creating a strata variable.  

    *FIRST, compute weight.
    COMPUTE WEIGHT=v005/1000000.
    *Then, go through the Complex Samples drop-down menus:
    *Analyze -> Complex Samples -> Prepare for Analysis.
    *Create a new plan file and press Continue. Give your new sampling plan a name.
    *Fill in the variables for cluster (v021), strata (v024 v025), and weight (weight).
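
    For readers who want to see what the three design inputs do, here is a rough Python sketch that builds strata from v024 and v025 and computes a weighted proportion with a Taylor-linearized standard error based on clusters within strata. It is an illustration under assumptions, not a replacement for Stata's svy commands or SPSS Complex Samples: the data rows and the outcome variable y are invented, the variance formula is the usual with-replacement approximation, and the output has not been checked against either package.

```python
import math
from collections import defaultdict


def make_strata(rows):
    """Assign one integer id per distinct (v024, v025) pair, similar to egen group()."""
    pairs = sorted({(r["v024"], r["v025"]) for r in rows})
    ids = {pair: i + 1 for i, pair in enumerate(pairs)}
    return [ids[(r["v024"], r["v025"])] for r in rows]


def weighted_mean_with_se(rows, outcome):
    """Weighted mean and linearized standard error using strata and clusters (v021)."""
    strata = make_strata(rows)
    weights = [r["v005"] / 1_000_000 for r in rows]
    total_w = sum(weights)
    mean = sum(w * r[outcome] for w, r in zip(weights, rows)) / total_w

    # Sum the linearized residuals within each cluster of each stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for s, w, r in zip(strata, weights, rows):
        psu_totals[s][r["v021"]] += w * (r[outcome] - mean) / total_w

    variance = 0.0
    for clusters in psu_totals.values():
        n_psu = len(clusters)
        if n_psu < 2:
            raise ValueError("each stratum needs at least two clusters")
        zbar = sum(clusters.values()) / n_psu
        variance += n_psu / (n_psu - 1) * sum((z - zbar) ** 2 for z in clusters.values())
    return mean, math.sqrt(variance)


# Hypothetical rows: one region, urban (1) and rural (2), two clusters per stratum.
rows = [
    {"v024": 1, "v025": 1, "v021": 1, "v005": 1000000, "y": 1},
    {"v024": 1, "v025": 1, "v021": 1, "v005": 1000000, "y": 0},
    {"v024": 1, "v025": 1, "v021": 2, "v005": 1000000, "y": 1},
    {"v024": 1, "v025": 1, "v021": 2, "v005": 1000000, "y": 1},
    {"v024": 1, "v025": 2, "v021": 3, "v005": 2000000, "y": 0},
    {"v024": 1, "v025": 2, "v021": 3, "v005": 2000000, "y": 0},
    {"v024": 1, "v025": 2, "v021": 4, "v005": 2000000, "y": 1},
    {"v024": 1, "v025": 2, "v021": 4, "v005": 2000000, "y": 0},
]

mean, se = weighted_mean_with_se(rows, "y")
print(f"weighted proportion = {mean:.4f}, linearized SE = {se:.4f}")
```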

STATcompiler

  • Why are my country’s data not yet in STATcompiler?
  • Each indicator in STATcompiler is calculated according to a standard definition. It takes time to apply the standard definition of each indicator to an individual survey's data files. Once this is done, data are loaded to STATcompiler.

  • Why are the numbers in STATcompiler sometimes different from the numbers in the final report?
  • STATcompiler presents standard indicators calculated the same way, whenever possible, for all surveys. The indicators reported in survey reports may include country-specific adaptations that differ from the standard calculation used in STATcompiler. For example, countries may define "trained health care provider" differently.