"Income Statistics Division, Statistics Canada" must be credited when reproducing or quoting any part of this document.
1. Introduction
1.1 General information
1.1.1 Background
This public-use microdata file presents data from the 2002 Survey of Household Spending (SHS) conducted in January through March 2003. Information about the spending habits, dwelling characteristics and household equipment of Canadian households during 2002 was obtained by asking people in the ten provinces to recall their expenditures for the previous calendar year (spending habits) or as of December 31 (dwelling characteristics and household equipment).
Conducted since 1997, the Survey of Household Spending integrates most of the content found in the Family Expenditure Survey and the Household Facilities and Equipment Survey. Many data from these two surveys are comparable to the Survey of Household Spending data. However, some differences related to methodology, to data quality and to definitions must be considered before comparing these data. See Section 1.1.4 "For further information".
1.1.2 New for 2002
The detailed ages of the reference person and spouse have been discontinued on the public-use file. Age groups, however, continue to be part of the file. The tenure of the previous dwelling of the spouse is no longer asked as part of the survey. The following nineteen new variables were added at the request of Canada Mortgage and Housing Corporation (CMHC):
CONDODEV | Dwelling is part of a condominium development |
OPFARM | Operated a farm |
APTDWG | Apartment in the dwelling |
NUMFLR | Number of floors in the dwelling |
RPPRDWTY | Type of dwelling previously occupied by reference person |
RPPREFLR | Number of floors in dwelling previously occupied by reference person |
RENTOINC | Rent calculated on the basis of income |
LARGEDWG | Moved to larger dwelling |
SMALLDWG | Moved to smaller dwelling |
CHEAPDWG | Moved to cheaper dwelling |
BETTRDWG | Moved to better dwelling |
CLOSEFAC | Moved closer to facilities |
ESTHHLD | Moved to establish own household |
CHNGTEN | Moved – tenure change |
CHNGJOB | Moved – job change |
CLOSWORK | Moved closer to work |
FAMREA | Moved for family reasons |
HEALTHR | Moved for health reasons |
OTHERR | Moved for other reasons |
See the Data Dictionary for more information.
1.1.3 Layout of the document
This document is laid out in the following manner:
- Data Dictionary (variable specifications, code sets and other information).
Note: This information has been collapsed into the separate variable descriptions.
- Technical Information (survey methodology, data quality, and guidelines for tabulation, analysis, and dissemination).
- Record Layout is available in Excel format.
Note: This information is available on request from the Data Resources Library, but should be generally unnecessary.
- Appendices are available in Excel (765 kb) format.
- Appendix A presents the frequency counts for non-dollar variables in the public-use microdata file. They are included to help you verify your tabulations.
- Appendix B presents expenditure data tabulated using the public-use microdata file and also using the internal survey database. They are included to help you verify your tabulations.
- Appendix C contains a table indicating the spending variables included in previous public-use microdata files of the Survey of Household Spending and the Family Expenditure Survey.
- Appendix D shows any changes in variables from the previous year.
- Appendix E presents the coefficients of variation for published data from the 2002 SHS.
1.1.4 For further information
Additional information about the SHS can now be obtained free on the Statistics Canada web site (www.statcan.ca). See especially:
- Note to former users of data from the Family Expenditure Survey (62F0026MIE2000002)
- Note to former users of data from the Household Facilities and Equipment Survey (62F0026MIE2000003)
- User Guide for the Survey of Household Spending, 2002 (62F0026MIE2003002)
- Methodology for the Survey of Household Spending (62F0026MIE2001003)
- 2001 Survey of Household Spending Data Quality Indicators (62F0026MIE2003001)
For more information about the current survey results and related products and services, or to enquire about the concepts, methods or data quality of the Survey of Household Spending, contact Client Services (613-951-7355; 1-888-297-7355; fax 613-951-3012; income@statcan.ca), Income Statistics Division.
1.2 Technical characteristics of the file
Variables are grouped under the following headings:
- Location
- Dwelling
- Characteristics of reference person
- Characteristics of spouse of reference person
- Household description
- Household equipment (at December 31)
- Expenditure items
- Food
- Shelter
- Household operation
- Household furnishings and equipment
- Clothing
- Transportation
- Health care
- Personal care
- Recreation
- Reading materials and other printed matter
- Education
- Tobacco products and alcoholic beverages
- Other expenses
3. Technical information
3.1 Survey methodology
(For more detailed information, see the Methodology of the Survey of Household Spending available free on the Statistics Canada web site at www.statcan.ca).
3.1.1 The survey universe
The 2002 Survey of Household Spending was carried out in private households in Canada's 10 provinces. (Note: In order to reduce response burden for northern households, the SHS is conducted in the north only every second year, starting in 2001.)
The following groups were excluded from the survey:
- those living on Indian reserves and crown lands;
- official representatives of foreign countries living in Canada and their families;
- members of religious and other communal colonies;
- members of the Canadian Armed Forces living in Military Camps;
- people living in residences for senior citizens; and
- people living full time in institutions: for example, inmates of penal institutions and chronic care patients living in hospitals and nursing homes.
The survey covers about 98% of the population in the 10 provinces.
Information was not gathered from persons temporarily living away from their families (for example, students at university), because it would be gathered from their families if selected. In this way, double counting of such individuals was avoided.
Data from part-year households should be excluded from estimates of average household spending. However, these data must be included in the estimates for dwelling characteristics and household equipment and in the calculation of the Survey of Household Spending response rate. Part-year households are composed entirely of persons who were members of other households for part of the reference year. There were 475 part-year households in the sample in 2002.
3.1.2 Survey content and reference period
Detailed information was collected about expenditures for consumer goods and services, changes in assets, mortgages and other loans, and annual income. This information was collected for the calendar year 2002 (the survey reference year). Information was also collected about dwelling characteristics (e.g., type and age of heating equipment) and household equipment (e.g., appliances, communications equipment, and vehicles). This type of information was collected as of December 31st of the reference year.
Because the Survey of Household Spending is designed principally to provide detailed information on non-food expenditures, only an overall estimate of food expenditure is recorded. Detailed information on food expenditure is provided by the Food Expenditure Survey, which is conducted every four to six years. It was last conducted in 2001. In February 2003, the results were published in Food Expenditure in Canada, 2001, Catalogue no. 62-554-XIE.
3.1.3 The sample
The sample size for the 2002 Survey of Household Spending was 20,861 eligible households.
This sample was a stratified, multi-stage sample selected from the Labour Force Survey (LFS) sampling frame. Sample selection comprised two main steps: the selection of clusters (small geographic areas) from the LFS frame and the selection of dwellings within these selected clusters. The LFS sampling frame mainly uses 1991 Census geography and 1991 population counts. (Note: A detailed description of the Labour Force Survey sampling frame can be found in Methodology of the Canadian Labour Force Survey, Statistics Canada, Catalogue no. 71-526-XPB.)
3.1.4 Data collection
The 2002 Survey of Household Spending was conducted from January to March 2003. Data were collected during a personal interview using a paper questionnaire. A copy of this questionnaire is available on request.
3.1.5 Data processing and quality control
Data entry and automated editing for the 2002 Survey of Household Spending took place in the Statistics Canada regional offices. This allowed respondents to be contacted in the event that more information was required to resolve an inconsistency on their questionnaires.
After data entry, an automated physical edit system checked for data entry errors. Data had to pass a two-tier edit system consisting of "must-pass" edits that checked questionnaires for logic and consistency, and "warnings" that indicated that a particular situation was unusual and could require correction. Either type of edit resulted in the intervention of a member of one of the specially trained edit resolution teams. Further editing of the data took place in head office where invalid responses were corrected. Missing responses were imputed using the nearest neighbour method. Statistics Canada's Canadian Census Edit and Imputation System (CANCEIS) was used to insert values from donor records having similar characteristics, chosen specifically to fit the variable. For example, total household income was used for most variables; dwelling type, household size and province were also frequently used.
Tabulation for the 2002 Survey of Household Spending was accomplished using a PC/client server-based system. This system provides tools (database querying, searching, and viewing capabilities) for spotting systematic errors.
3.1.6 Weighting
The estimation of population characteristics from a sample survey is based on the premise that each sampled unit represents a certain number of units in the population. A basic survey weight was attached to each record in the sample to reflect this representation. These basic weights were adjusted for non-response for selected metropolitan areas, additional geographical areas and for high-income strata. The additional geographical areas comprise the remaining metropolitan areas and urban and rural areas; they are based on census definitions but do not necessarily correspond to them exactly. For definitions of these terms, refer to the 1996 Census Dictionary, Catalogue no. 92-351-XPE.
To increase the reliability of the estimates, weights were adjusted to ensure that estimates based on relevant characteristics of the population would respect population totals from sources other than the survey. For the 10 provinces, there are two sets of totals.
The first set of totals, for age/sex groups, household size and household type at the province level, is based on projections at mid-January 2003 using the 1996 Census of Population (adjusted for net undercoverage). Controls for 18 age/sex groups are used. These are combined with totals for one-person households, two-person households and more than two-person households. There are also totals for the number of single-parent families and couples with never-married children. Finally, for the 14 selected metropolitan areas, only two age groups were used: number of persons under 18, and number of persons 18 and over.
The second set of totals is derived from T4 information from Canada Customs and Revenue Agency (CCRA, formerly Revenue Canada) and is intended to ensure that the weighted distribution of income (based on wages and salaries) in the data set matches that of the Canadian population.
The switch from 1991 to 1996 Census-based population totals and the use of T4 information from CCRA were introduced starting with the 1999 SHS. Revised SHS estimates for earlier survey years are available and should be used for year-over-year comparisons.
3.2 Data quality
(For more detailed information, see the Survey of Household Spending Data Quality Indicators, soon to be available free on the Statistics Canada web site at www.statcan.ca.)
3.2.1 Sampling error
Sampling errors occur because inferences about the entire population are based on information obtained from only a sample of the population. The sample design, the variability of the data, and the sample size determine the size of the sampling error. In addition, for a given sample design, different methods of estimation will result in different sampling errors.
The design for the 2002 Survey of Household Spending was a stratified multi-stage sampling scheme. The sampling errors for multi-stage sampling are usually higher than for a simple random sample of the same size. However, the operational advantages outweigh this disadvantage, and the fact that the sample is also stratified improves the precision of estimates.
Data variability is the difference between members of the population with respect to spending on a specific item or the presence of a specific dwelling characteristic or piece of household equipment. In general, the greater these differences are, the larger the sampling error will be. In addition, the larger the sample size, the smaller the sampling error.
3.2.1.1 Standard error and coefficient of variation
A common measure of sampling error is the standard error (SE). Standard error is the degree of variation in the estimates as a result of selecting one particular sample rather than another of the same size and design. It has been shown that the "true" value of the characteristic of interest lies within a range of +/- 1 standard error of the estimate for 68% of all samples, and +/- 2 standard errors for 95% of all samples.
The coefficient of variation (CV) is the standard error expressed as a percentage of the estimate. It is used to indicate the degree of uncertainty associated with an estimate. For example, if the estimate of the number of households having a given dwelling characteristic is 10,000 households, and the corresponding CV is 5%, then the "true" value is between 9,500 and 10,500 households, 68% of the time and between 9,000 and 11,000 households, 95% of the time.
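The arithmetic in this example can be sketched in a few lines of Python. This is an illustration only; the function name cv_interval is hypothetical and is not part of any Statistics Canada tool.

```python
def cv_interval(estimate, cv_percent, n_se=1):
    """Return (low, high) around an estimate, given its CV in percent."""
    se = estimate * cv_percent / 100.0  # standard error implied by the CV
    return estimate - n_se * se, estimate + n_se * se

# 10,000 households with a CV of 5%, as in the example above.
print(cv_interval(10_000, 5))     # range covering about 68% of samples
print(cv_interval(10_000, 5, 2))  # range covering about 95% of samples
```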
Standard errors for the 2002 Survey of Household Spending were estimated using the jackknife technique, which leads to a slight over-estimation and is, thus, conservative. For more information, refer to the Statistics Canada publication, Methodology of the Canadian Labour Force Survey, Catalogue no. 71-526-XPB.
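For readers unfamiliar with the jackknife, the general idea can be sketched as follows. This is a generic delete-one-group jackknife, not the exact procedure used for the SHS (which takes the stratified design into account and is described in the publication cited above). The function names and the data are hypothetical.

```python
import math

def jackknife_se(groups, estimator):
    """Delete-one-group jackknife standard error of estimator(groups)."""
    n = len(groups)
    # Re-compute the estimate n times, leaving out one group each time.
    replicates = [estimator(groups[:i] + groups[i + 1:]) for i in range(n)]
    mean_rep = sum(replicates) / n
    variance = (n - 1) / n * sum((r - mean_rep) ** 2 for r in replicates)
    return math.sqrt(variance)

def weighted_mean(groups):
    """Weighted mean over groups of (weight, value) pairs."""
    pairs = [pair for group in groups for pair in group]
    return sum(w * v for w, v in pairs) / sum(w for w, _ in pairs)

# Hypothetical data: five groups (clusters) of (weight, spending) pairs.
clusters = [[(1, 1.0)], [(1, 2.0)], [(1, 3.0)], [(1, 4.0)], [(1, 5.0)]]
print(jackknife_se(clusters, weighted_mean))
```

With one equally weighted value per group, as here, the result coincides with the usual standard error of a simple mean.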
Coefficients of variation are presented in technical tables 1 and 2 in Appendix E (Excel, 765 kb).
3.2.1.2 Data suppression
For reliability reasons, estimates with CVs greater than 33% should be suppressed. Since CVs are not calculated for all estimates, data suppression for the Survey of Household Spending has been based on a relationship between the CV and the number of households reporting expenditure on an item. Analysis of past survey results indicates that CVs usually reach this level when the number of households reporting an item drops to about 30. Therefore, data have been suppressed for spending on items reported by fewer than 30 households.
However, data for suppressed items do contribute to summary level variables. For example, the expenditure for a particular category of clothing might be suppressed but this amount forms part of the total expenditure estimate for clothing.
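A minimal sketch of this suppression rule follows, assuming the estimates and the counts of reporting households are held in two dictionaries. The item names and values are invented for illustration.

```python
MIN_REPORTING = 30  # threshold stated in the text

def published_values(estimates, reporting):
    """Map each item to its estimate, or None if it must be suppressed."""
    return {
        item: (value if reporting[item] >= MIN_REPORTING else None)
        for item, value in estimates.items()
    }

# Hypothetical clothing items: estimated spending and the number of
# sampled households that reported each item.
estimates = {"coats": 120.0, "gloves": 15.0, "uniforms": 4.0}
reporting = {"coats": 410, "gloves": 95, "uniforms": 12}

table = published_values(estimates, reporting)
clothing_total = sum(estimates.values())  # suppressed items still count

print(table)
print(clothing_total)
```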
3.2.2 Non-sampling error
Non-sampling errors occur because certain factors make it difficult to obtain accurate responses or responses that retain their accuracy throughout processing. Unlike sampling error, non-sampling error is not readily quantified. Four sources of non-sampling error can be identified: coverage error, response error, non-response error, and processing error.
3.2.2.1 Coverage error
Coverage error results from inadequate representation of the intended population. This error may occur during sample design or selection, or during data collection and processing.
3.2.2.2 Response error
Response error may be due to many factors, including faulty design of the questionnaire, interviewers' or respondents' misinterpretation of questions, or respondents' faulty reporting. In the Survey of Household Spending, the difference between receipts and disbursements is calculated as a check on respondents' recall. This important quality control tool involves the balancing of receipts (income and other money received by the household) and disbursements (total expenditure plus the variable Money flows—assets, loans, and other debts) for each questionnaire. If the difference is greater than 10% of the larger of receipts or disbursements, respondents are contacted again for additional information. This ensures that expenditures, at least at the aggregate level, match household income and other sources of funds.
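The balance check described above can be sketched as follows. The function and parameter names are hypothetical; only the 10% threshold and the comparison against the larger of the two amounts come from the text.

```python
def needs_follow_up(receipts, disbursements, tolerance=0.10):
    """True if the two sides differ by more than 10% of the larger one."""
    larger = max(receipts, disbursements)
    if larger == 0:
        return False  # nothing reported on either side
    return abs(receipts - disbursements) > tolerance * larger

print(needs_follow_up(50_000, 44_000))  # gap of 6,000 exceeds 5,000
print(needs_follow_up(50_000, 46_000))  # gap of 4,000 is within 5,000
```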
Several features of the survey help respondents recall their expenditures as accurately as possible. First, the survey period is the calendar year because it is probably more clearly defined in people's minds than any other period of similar length. Second, expenditure on food (about 11% of the average budget in 2002) can be estimated as either weekly or monthly expenses depending on the respondent's purchasing habits. Third, expenses on smaller items purchased at regular intervals are usually estimated on the basis of amount and frequency of purchase. Purchases of large items (automobiles, for example) are recalled fairly easily, as are expenditures on rent, property taxes, and monthly payments on mortgages. However, even with these items, the accuracy of data depends on the respondent's ability to remember and willingness to consult records.
3.2.2.3 Non-response error
Non-response error occurs in sample surveys because not all potential respondents cooperate fully. The extent of non-response varies from partial non-response to total non-response.
Total non-response occurs when the interviewer is unable to contact the respondent, no member of the household is able to provide information, or the respondent refuses to participate in the survey. Total non-response is handled by adjusting the basic survey weight for responding households to compensate for non-responding households. For the 2002 Survey of Household Spending, the overall response rate was 70.5%. See Figure 1 for provincial response rates.
In most cases, partial non-response occurs when the respondent does not understand or misinterprets a question, refuses to answer a question, or is unable to recall the requested information. Imputing missing values compensates for this partial non-response. The importance of the non-response error is unknown but in general this error is significant when a group of people with particular characteristics in common refuse to cooperate and where those characteristics are important determinants of survey results.
Figure 1: Response rates, Canada and provinces, 2002 |
Province | Eligible households (see note 1) | Noncontacts | Refusals | Unusables (see note 2) | Usables | Response rate (see note 3) |
Newfoundland and Labrador | 1,681 | 130 | 224 | 70 | 1,257 | 74.8% |
Prince Edward Island | 799 | 36 | 115 | 11 | 637 | 79.7% |
Nova Scotia | 2,063 | 148 | 429 | 119 | 1,367 | 66.3% |
New Brunswick | 1,766 | 115 | 349 | 63 | 1,239 | 70.2% |
Quebec | 2,760 | 193 | 571 | 7 | 1,989 | 72.1% |
Ontario | 3,159 | 307 | 738 | 128 | 1,986 | 62.9% |
Manitoba | 1,858 | 95 | 296 | 24 | 1,443 | 77.7% |
Saskatchewan | 1,963 | 105 | 338 | 19 | 1,501 | 76.5% |
Alberta | 2,105 | 144 | 417 | 52 | 1,492 | 70.9% |
British Columbia | 2,707 | 219 | 514 | 181 | 1,793 | 66.2% |
Canada | 20,861 | 1,492 | 3,991 | 674 | 14,704 | 70.5% |
Note 1: Part-year households are included in the calculation of response rates. There were 475 part-year households in 2002. |
Note 2: Rejected at the editing stage. |
Note 3: Usable/eligible*100 |
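The calculation in Note 3 can be reproduced from the counts in Figure 1. The sketch below uses three rows of the table; the function name response_rate is hypothetical.

```python
def response_rate(usable, eligible):
    """Usable households as a percentage of eligible households."""
    return round(100.0 * usable / eligible, 1)

# (eligible, non-contacts, refusals, unusables, usables), from Figure 1.
rows = {
    "Quebec": (2760, 193, 571, 7, 1989),
    "Ontario": (3159, 307, 738, 128, 1986),
    "Canada": (20861, 1492, 3991, 674, 14704),
}

for name, (eligible, noncontact, refusal, unusable, usable) in rows.items():
    # The four outcome columns account for every eligible household.
    assert noncontact + refusal + unusable + usable == eligible
    print(name, response_rate(usable, eligible))
```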
3.2.2.4 Processing error
Processing errors may occur in any of the data processing stages, for example, during data entry, editing, weighting, and tabulation. See Section 3.1.5, Data processing and quality control, for a description of the steps taken to reduce processing error.
3.2.3 The effect of large values
For any sample, estimates can be affected by the presence or absence of extreme values from the population. These extreme values are most likely to arise from positively skewed populations. The nature of the subject matter of the SHS lends itself to such extreme values. Estimates of totals, averages and standard errors may be greatly influenced by the presence or absence of these extremes.
3.2.4 Comparability over time
Conducted since 1997, the Survey of Household Spending integrates most of the content found in the Family Expenditure Survey and the Household Facilities and Equipment Survey. Many variables from these two surveys are comparable to those in the Survey of Household Spending. However, some differences related to the methodology, to data quality and to definitions must be considered before making comparisons.
For more information, refer to Note to Former Users of Data from the Family Expenditure Survey, Catalogue no. 62F0026MIE2000002 and Note to Former Users of Data from the Household Facilities and Equipment Survey, Catalogue no. 62F0026MIE2000003. Both documents are available free of charge on the Statistics Canada web site (www.statcan.ca).
Historical data from the 1997 and 1998 surveys of household spending, the 1996 Family Expenditure Survey and the 1996 Household Facilities and Equipment Survey have been re-weighted using the weighting methodology described in the section "Weighting". Historical comparisons between data from those surveys and data from recent years of the Survey of Household Spending should generally be made with re-weighted data, although the differences between survey estimates from the old and new methodologies appear to be minimal at a summary level. Certain populations or variables, however, may be more strongly affected.
3.3 Guidelines for tabulation, analysis and dissemination
This section describes the guidelines that users should follow when tabulating, analysing, publishing or releasing data taken from the public-use microdata file.
3.3.1 Important note to users about full and part-year households
In 1997, the Survey of Family Expenditure (FAMEX) and the Household Facilities and Equipment Survey (HFE) were replaced by the Survey of Household Spending (SHS). FAMEX microdata files included full-year households only, as only such households could give a clear picture of income and expenditures over an entire year. HFE microdata, on the other hand, included all households, since data were collected as of December 31. To meet user needs, all households are listed on the SHS file, along with a variable indicating each household's status (full-year, part-year). (Note: A full-year household has at least one member present throughout the year. A part-year household consists entirely of members present only part of the year. A member present for part of the year is a member of a household who has been present less than 52 weeks. Income and expenditure data for members present just part of the year are collected for only that part of the year they were included in the household.)
To create statistics for average annual expenditures, users should use records for full-year households. To tabulate dwelling characteristics, household equipment or create other types of expenditure statistics such as totals (aggregates) or market share, users should use records for full-year and part-year households.
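The record selection rule above can be sketched as follows, assuming each record carries a boolean full-year indicator. The field name full_year and the estimate-type label are hypothetical; consult the Data Dictionary for the actual status variable on the file.

```python
def records_for(records, estimate_type):
    """Select the households to use for a given kind of estimate."""
    if estimate_type == "average_expenditure":
        # Averages: full-year households only.
        return [r for r in records if r["full_year"]]
    # Dwelling characteristics, equipment, totals, market share: everyone.
    return list(records)

households = [
    {"id": 1, "full_year": True},
    {"id": 2, "full_year": False},
    {"id": 3, "full_year": True},
]
print(len(records_for(households, "average_expenditure")))  # full-year only
print(len(records_for(households, "total_expenditure")))    # all households
```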
3.3.2 Guidelines for rounding
To ensure that estimates from this microdata file intended for publication or any other type of release correspond to estimates that would be obtained by Statistics Canada, we strongly recommend that users comply with the following guidelines for rounding estimates.
- Estimates in the body of a statistical table must be rounded to the nearest hundred using the traditional rounding technique: if the first or only digit to be dropped is between 0 and 4, the last digit retained does not change; if it is between 5 and 9, the last digit retained is increased by 1. For example, when rounding to the nearest hundred, if the last two digits are between 00 and 49, they are replaced by 00 and the preceding digit (denoting hundreds) stays as is. If the last two digits are between 50 and 99, they are replaced by 00 and the preceding digit is increased by 1.
- Sub-totals and totals in statistical tables must be calculated from their corresponding unrounded components, then rounded in turn to the nearest hundred using the traditional rounding technique.
- Means, ratios, rates and percentages must be calculated using unrounded components (i.e., numerators and/or denominators), and then rounded to one decimal place using the traditional rounding technique.
- Sums and differences of aggregates (or ratios) must be calculated using their corresponding unrounded components, then rounded to the nearest hundred (or to one decimal place) using the traditional rounding technique.
- If, due to technical or other limitations, a technique other than traditional rounding is used, with the result that the estimates to be published or released differ in any form from the corresponding estimates that would be obtained by Statistics Canada using this microdata file, we strongly advise users to indicate the reasons for the differences in the documents to be published or released.
- Unrounded estimates cannot under any circumstances be published or released in any way whatsoever by users. Unrounded estimates give the impression that they are much more precise than they actually are.
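The traditional rounding technique can be sketched in Python as shown below, following the worked example in which the last two digits are replaced by 00. Python's built-in round() resolves exact ties differently (to the nearest even digit), so the sketch uses the decimal module instead. The function name round_traditional and the sample figures are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_traditional(value, step):
    """Round value to the nearest multiple of step, halves rounded up."""
    value, step = Decimal(str(value)), Decimal(str(step))
    units = (value / step).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return units * step

print(round_traditional(1249, 100))     # last two digits 49: rounds down
print(round_traditional(1250, 100))     # last two digits 50: rounds up
print(round_traditional(12.25, "0.1"))  # a percentage, to one decimal

# Totals are computed from unrounded components, then rounded.
components = [149, 149]
total = round_traditional(sum(components), 100)
sum_of_rounded = sum(round_traditional(c, 100) for c in components)
print(total, sum_of_rounded)
```

The last line shows why the order matters: rounding the components first and then adding them gives a different total from rounding the unrounded sum.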
3.3.3 Guidelines for the weighting of the sample for totalling purposes
The sample design used for the SHS is not self-weighted, meaning that the households in the sample do not all have the same sampling weight. To produce simple estimates, including standard statistical tables, users must use the appropriate sampling weight. Otherwise, the estimates calculated using the microdata files cannot be considered as representative of the observed population and will not correspond to those that would be obtained by Statistics Canada using this microdata file. See Section 3.1.6, "Weighting." Users should also note that depending on the method they use to process the weight field, some software packages may not produce estimates that correspond exactly to those of Statistics Canada using this microdata file.
3.3.4 Types of estimates: categorical versus quantitative
Before discussing how SHS data can be tabulated and analysed, it is useful to describe the two main types of estimates that may be produced from the microdata file for the Survey of Household Spending.
3.3.4.1 Categorical estimates
Categorical estimates are estimates of the number or percentage of households in the survey's target population that have certain characteristics or belong to a defined category. The number of households reporting a particular expenditure is an example of this type of estimate. The expression 'aggregate estimate' can also be used to refer to an estimate of the number of individuals with a given characteristic.
Examples of categorical questions:
Did you have a cellular phone for personal use? _yes _no
When was this dwelling originally built?
_ 1920 or earlier
_ 1921-1945
_ 1946-1960
_ 1961-1970
_ 1971-1980
_ 1981-1990
_ 1991-2000
_ 2001
_ 2002
On December 31, 2002, was your dwelling:
_ Owned without a mortgage by your household?
_ Owned with (a) mortgage(s) by your household?
_ Rented by your household?
_ Occupied rent-free by your household?
Totalling of categorical estimates
Estimates of the number of households with a given characteristic can be obtained from the microdata file by adding the final weights of all records containing the desired characteristic or characteristics. Percentages and ratios in the X/Y form are obtained as follows:
- by adding the final weights of records containing the desired characteristic for the numerator X;
- by adding the final weights of records containing the desired characteristic for the denominator Y;
- by dividing the estimate for the numerator by the estimate for the denominator.
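The steps above can be sketched as follows, assuming the records have been read into a list of dictionaries. The field names weight and cell_phone and the weights themselves are hypothetical; the actual variable names are given in the Data Dictionary.

```python
def weighted_count(records, has_characteristic):
    """Sum of final weights over records with the characteristic."""
    return sum(r["weight"] for r in records if has_characteristic(r))

# Hypothetical records with invented weights.
records = [
    {"weight": 100, "cell_phone": True},
    {"weight": 150, "cell_phone": False},
    {"weight": 250, "cell_phone": True},
]

x = weighted_count(records, lambda r: r["cell_phone"])  # numerator X
y = weighted_count(records, lambda r: True)             # denominator Y
print(x, y, x / y)
```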
3.3.4.2 Quantitative estimates
Quantitative estimates are estimates of totals or means, medians or other central tendency measurements of quantities based on all members of the observed population or based on some of them. They also explicitly include estimates in the form X/Y where X is an estimate of the total quantity for the observed population and Y is an estimate of the number of individuals in the observed population who contribute to that total quantity. An example of a quantitative estimate is mean annual expenditure for personal and health care per household in the target population. The numerator corresponds to an estimate of total annual expenditure for personal and health care, and the denominator corresponds to an estimate of the number of households in the population.
Example of quantitative question:
In 2002, how much did your household spend for telephone service? ______
Totalling of quantitative estimates
Quantitative estimates can be obtained from the microdata file by multiplying the value of the desired variable by the final weight of each record, and then adding this quantity for all records of interest. For example, to obtain an estimate of total expenditure by households that were owners on December 31 for electricity, the value reported for the question "In 2002, how much did your household spend on electricity?" is multiplied by the final weight of the record, and then that result is summed over all records with a positive response to the question "On December 31, 2002, was your house: 'Owned mortgage-free by your household' or 'Owned with one or more mortgages by your household'."
To obtain a weighted mean expressed by the formula X/Y, the numerator X is calculated as a quantitative estimate and the denominator Y as a categorical estimate. For example, to estimate mean household expenditures for electricity by owners, you must:
- estimate the total expenditure for electricity for households where the residence is owned, using the method described above;
- estimate the number of owned households by adding the final weights for all records with a positive response to the question "As at December 31, 2002, was your house: 'Owned mortgage-free by your household' or 'Owned with one or more mortgages by your household'"; and then,
- divide the estimate obtained in the first step by the estimate obtained in the second step.
Note: Because average expenditures are being estimated, "part-year" households must first be excluded from calculations (for further details, see Section 3.3.1, Important note to users about full and part-year households).
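The weighted mean for owners can be sketched as follows, again assuming a list of dictionaries. The field names (weight, tenure, electricity, full_year), the tenure codes and all values are hypothetical.

```python
OWNER_CODES = {"owned_no_mortgage", "owned_with_mortgage"}  # hypothetical

def mean_spending(records, item, in_domain):
    """Weighted mean of an expenditure item over full-year households."""
    selected = [r for r in records if r["full_year"] and in_domain(r)]
    total = sum(r[item] * r["weight"] for r in selected)  # X: quantitative
    households = sum(r["weight"] for r in selected)       # Y: categorical
    return total / households

records = [
    {"weight": 100, "tenure": "owned_no_mortgage", "electricity": 1200, "full_year": True},
    {"weight": 300, "tenure": "owned_with_mortgage", "electricity": 800, "full_year": True},
    {"weight": 200, "tenure": "rented", "electricity": 600, "full_year": True},
    {"weight": 50, "tenure": "owned_with_mortgage", "electricity": 400, "full_year": False},
]

is_owner = lambda r: r["tenure"] in OWNER_CODES
print(mean_spending(records, "electricity", is_owner))
```

Here X is 100 x 1,200 + 300 x 800 = 360,000 and Y is 400, so the mean is 900. The renter and the part-year household do not contribute.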
3.3.5 Guidelines for statistical analysis
The Survey of Household Spending is based on a complex survey design that includes stratification and multiple stages of selection, as well as unequal respondent selection probabilities. The use of data from such complex surveys poses problems for analysts, because the survey design and the selection probabilities influence the estimation and variance calculation methods to be used.
Although numerous analytical methods in statistical software packages allow for the use of weights, the meaning or definition of weights in those packages differs from that suitable for a sample survey. As a result, although the estimates produced by those packages are in many cases accurate, the variances they calculate are almost meaningless.
For numerous analytical techniques (for example, linear regression, logistic regression and analysis of variance), there is a way to make the results from standard packages more meaningful. If the weights of the records in the file are rescaled so that the mean weight is 1, the results produced by standard packages will be more reasonable and will take into account the unequal selection probabilities, although they still cannot take into account the stratification and clustering of the sample. The rescaling is done by using in the analysis a weight equal to the original weight divided by the mean of the original weights for the sampling units (households) that contribute to the estimator in question. However, because this method still does not take into account the stratification and clustering of the sample design, the variance estimates calculated in this way will very likely underestimate the true values.
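The weight conversion can be sketched as follows. The function name is hypothetical, and the list passed in should contain only the weights of the households that contribute to the estimator in question.

```python
def rescale_weights(weights):
    """Divide each weight by the mean weight, so the new mean is 1."""
    mean_weight = sum(weights) / len(weights)
    return [w / mean_weight for w in weights]

original = [100, 200, 300]  # invented final weights
rescaled = rescale_weights(original)
print(rescaled)
print(sum(rescaled) / len(rescaled))  # mean of the rescaled weights
```

The rescaled weights keep the same relative sizes as the originals, so weighted means and proportions are unchanged; only the apparent sample size seen by the software changes.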
3.3.6 Guidelines for release
Before releasing and/or publishing estimates taken from the microdata file, users must first determine the level of reliability of the estimates. The quality of the data is affected by the sampling error and the non-sampling error as described above. However, the level of reliability of estimates is determined solely on the basis of sampling error, as evaluated using the coefficient of variation (CV) as shown in the table below. In addition to calculating CVs, users should also read the section of this document regarding the characteristics of data quality.
Whatever CV is obtained for an estimate from this microdata file, users should determine the number of sampled respondents who contribute to the calculation of the estimate. If this number is less than 30, the weighted estimate should not be released regardless of the value of the CV for this estimate. For weighted estimates based on sample sizes of 30 or more, users should determine the CV of the rounded estimate following the guidelines below.
Figure 2: Sampling variability guidelines |
Type of Estimate | CV (in %) | Guidelines |
1. Acceptable | 0.0 – 16.5 | Estimates can be considered for general unrestricted release. Requires no special notation. |
2. Marginal | 16.6 – 33.3 | Estimates can be considered for general unrestricted release but should be accompanied by a warning cautioning subsequent users of the high sampling variability associated with the estimates. Such estimates should be identified by the letter M (or in some other similar fashion). |
3. Unacceptable | Greater than 33.3 | Statistics Canada does not recommend the release of estimates of unacceptable quality. However, if the user chooses to do so, the estimates should be flagged with the letter U (or in some other similar fashion) and the following warning should accompany the estimates: "The user is advised that . . . (specify the data) . . . do not meet Statistics Canada's quality standards for this statistical program. Conclusions based on these data will be unreliable and most likely invalid." |
3.3.6.1 Computation of approximate CVs
In order to provide a way of assessing the quality of estimates, Statistics Canada has produced a coefficient of variation table (CV table) which is applicable to estimates of averages, ratios and totals obtained from this public use microdata file for the major variables of the SHS by province and at the Canada level (see Appendix E). The CV of an estimate is defined to be the square root of the variance of the estimate divided by the estimate itself and expressed as a percentage. The numerator of the CV is a measure of the sampling error of the estimate, called the standard error, and is calculated at Statistics Canada with the Jackknife method. This method requires, among other things, information about the strata and the clusters, which cannot be given on the public use microdata file for reasons of confidentiality. So that users may estimate CVs for variables not included in the CV tables, Statistics Canada has produced a set of rules to obtain approximate CVs for a wide variety of estimates. It should be noted that these rules provide approximate and, therefore, unofficial CVs. The quality of the approximation, however, is quite satisfactory, especially for the most reliable estimates. Note that the accuracy of this approximation is reduced when the domains become smaller. Therefore, the CV approximation method must be used prudently when the domains are small. The document on data quality for the 1997 SHS contains the results of the evaluation of the performance of the CV approximation method.
How to obtain approximate CVs
The following rules should enable the user to determine the approximate coefficients of variation for estimates of totals, means or proportions, ratios and differences between such estimates for sub-populations (domains) for which the Jackknife CV is not provided in the CV tables.
Important: If the number of observations on which an estimate is based is less than 30, the weighted estimate should not be released regardless of the value of the CV for this estimate.
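The sample-size rule above and the CV ranges of Figure 2 (Section 3.3.6) can be combined into a single check. The following Python sketch is illustrative only: the function name and the returned labels are hypothetical, and the CV is assumed to have been rounded to one decimal place beforehand, so that values falling between 16.5 and 16.6 do not arise.

```python
def release_category(cv_percent, n_respondents):
    """Classify an estimate using the sample-size rule and the Figure 2 CV ranges.

    cv_percent is assumed to be a CV in percent, already rounded to one decimal.
    n_respondents is the number of sampled respondents behind the estimate.
    """
    if n_respondents < 30:
        # Fewer than 30 respondents: not releasable, whatever the CV.
        return "Do not release"
    if cv_percent <= 16.5:
        return "Acceptable"
    if cv_percent <= 33.3:
        return "Marginal (flag M)"
    return "Unacceptable (flag U)"


print(release_category(12.8, 250))
print(release_category(12.8, 25))
```

The sample-size check is applied first, because an estimate based on fewer than 30 respondents should not be released even when its CV falls in the acceptable range.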
Rule 1: Approximating CVs for estimates of totals (aggregates)
All the steps below must be followed to obtain an approximate CV (ACV) for an estimate of a total (either a number of households possessing a certain characteristic (categorical estimate) or a total of some expense for all households (quantitative estimate)) for a subpopulation (domain) of interest:
1. Create a binary variable for each household, say I, equalling 1 if the household is part of the domain of interest, i.e. possesses the desired characteristic, and 0 otherwise;
2. To estimate a quantitative variable, create a variable Y representing the product of the binary variable I and the variable of interest. To estimate a categorical variable, create a variable Z equal to 1 if the categorical variable is equal to the value of interest, and equal to 0 otherwise. Define variable Y as the product of I and Z;
3. Do step (4) to step (9) for each province separately;
4. Calculate the sum over all the households of the product of the final weight (section Weighting), and Y (this sum represents the estimate of the total for the domain of interest in the province under consideration);
5. Calculate the sum over all the households of the product of the final weight and the household size;
6. Divide the result obtained in step (4) by the result obtained in step (5);
7. For each household, multiply the result obtained in step (6) by the household size;
8. For each household, define a variable, say E, by the subtraction of the result obtained in step (7) from Y;
9. Calculate the sum over all the households of the product of the final weight minus 1, the final weight and E squared (this sum represents the estimated variance of the total estimated at step (4));
10. Add up the result obtained in step (9) for each province;
11. The ACV is defined to be 100 times the square root of the result obtained in step (10), divided by the estimate. The estimate is the sum over all the provinces of the result obtained in step (4).
More formally, steps 1 to 10 above can be obtained with the following formula: see online, page 94 (PDF, 475 kb)
Note: Two household size variables appear in the microdata file. To calculate approximate CVs, the variable used to define household size is "Household size at December 31," rather than "Household size (number of persons a member sometime in reference year)."
Important: When estimating variance for a given domain, do not limit yourself to units belonging to the domain. The entire sample should always be used to estimate variance. Units that do not belong to the domain of interest are not considered when computing the point estimate of the total, but do contribute when estimating the variance.
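The steps of Rule 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not Statistics Canada's implementation: the record keys `province`, `weight` and `hh_size` are hypothetical names standing in for the province code, the final weight and the "Household size at December 31" variable, and the caller supplies the domain indicator I and the variable of interest as functions.

```python
from collections import defaultdict
from math import sqrt


def approximate_cv_total(households, in_domain, value):
    """Approximate CV (in %) of an estimated total, following Rule 1.

    households: the ENTIRE sample, as dicts with the hypothetical keys
                'province', 'weight' (final weight) and 'hh_size'.
    in_domain:  function returning True if a household is in the domain (I = 1).
    value:      function returning the variable of interest, or the 0/1
                indicator Z for a categorical estimate.
    """
    by_province = defaultdict(list)
    for h in households:
        y = value(h) if in_domain(h) else 0.0            # steps 1 and 2: Y
        by_province[h["province"]].append((h["weight"], h["hh_size"], y))

    estimate = 0.0
    variance = 0.0
    for rows in by_province.values():                    # step 3
        total_y = sum(w * y for w, _, y in rows)         # step 4
        total_size = sum(w * s for w, s, _ in rows)      # step 5
        ratio = total_y / total_size                     # step 6
        # Steps 7 to 9: E = Y - ratio * size, then sum of (w - 1) * w * E^2.
        # Step 10: accumulate the provincial variances.
        variance += sum((w - 1) * w * (y - ratio * s) ** 2 for w, s, y in rows)
        estimate += total_y

    if estimate == 0:
        raise ValueError("the estimated total is zero; the CV is undefined")
    return 100 * sqrt(variance) / estimate               # step 11


# Tiny made-up sample: two households in one province, one of them in the domain.
sample = [
    {"province": 46, "weight": 2.0, "hh_size": 1, "spend": 10.0, "in_dom": True},
    {"province": 46, "weight": 2.0, "hh_size": 1, "spend": 7.0, "in_dom": False},
]
acv = approximate_cv_total(sample, lambda h: h["in_dom"], lambda h: h["spend"])
print(acv)
```

Households outside the domain are kept in the calculation with Y = 0, so they still contribute to the variance, as required by the note above.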
Rule 2: Approximating CV for estimates of averages or proportions
An estimated mean or proportion is obtained by the ratio of two estimated totals. For a proportion, the numerator is an estimate that is a sub-set of the denominator, for example the proportion of expenditures for households in Manitoba compared to all Canadian households. The CV of an estimated mean or proportion tends generally to be slightly lower than the corresponding CV of the numerator. The CV of an estimated mean or proportion can thus be approximated with the CV of the numerator and the technique described in rule (1) can be used.
Rule 3: Approximating CV for estimates of ratios
see online, page 94 (PDF, 475 kb)
Rule 4: Approximating CVs for estimates of differences
see online, page 95 (PDF, 475 kb)
Examples: see online, pages 95 to 100 (PDF, 475 kb)
3.3.6.2 How to obtain confidence limits
Although coefficients of variation are widely used, a more intuitively meaningful measure of sampling error is the confidence interval of an estimate. A confidence interval constitutes a statement on the level of confidence that the true value for the population lies within a specified range of values. For example, a 95% confidence interval can be described as follows:
If sampling of a population is repeated many times, each sample leading to a new confidence interval for an estimate, then in 95% of the samples the interval will cover the true population value.
Using the CV of an estimate, its confidence intervals may be obtained assuming that, under repeated sampling of the population, the various estimates obtained for a characteristic are normally distributed around the true population value. Using this assumption, the chances are about 68 out of 100 that the difference between a sample estimate and the true population value would be less than one standard error, about 95 out of 100 that the difference would be less than two standard errors, and about 99 out of 100 that the difference would be less than three standard errors. These different degrees of confidence are referred to as the confidence levels.
Confidence intervals for an estimate, EST, are generally expressed as two numbers, one below the estimate and one above the estimate, as (EST - k, EST + k) where k is determined depending on the level of confidence desired and the sampling error of the estimate.
Confidence intervals for an estimate can be calculated by first determining the ACV of the estimate and then using the following formula to convert to a confidence interval CI:
(EST - z * EST * ACV/100 ; EST + z * EST * ACV/100)
where
z = 1 if a 68% confidence interval is desired,
z = 1.6 if a 90% confidence interval is desired,
z = 2 if a 95% confidence interval is desired,
z = 3 if a 99% confidence interval is desired.
Note: Release guidelines, which apply to the estimate, also apply to the confidence interval. For example, if the estimate is not releasable, then the confidence interval is not releasable either.
Example 4
A 95% confidence interval for the estimated mean of spending on household furnishings and equipment for one-person households in Manitoba would be calculated as follows:
EST = 715.05
z = 2
ACV = 12.80
CI = (715.05 – 2 x 715.05 x 12.80/100 ; 715.05 + 2 x 715.05 x 12.80/100)
= (532.00; 898.10)
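The calculation in Example 4 can be reproduced with a short Python sketch. The function name is hypothetical; the interval is the estimate minus and plus z times the estimate times ACV/100, with z chosen from the values listed above.

```python
def confidence_interval(est, acv_percent, z):
    """Return (lower, upper) limits from an estimate, its ACV in % and z."""
    half_width = z * est * acv_percent / 100
    return (est - half_width, est + half_width)


# Values from Example 4: EST = 715.05, ACV = 12.80, z = 2 (95% level).
low, high = confidence_interval(715.05, 12.80, 2)
print(f"({low:.2f}; {high:.2f})")  # prints (532.00; 898.10)
```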
3.3.6.3 How to do a Z-test
Coefficients of variation may also be used to perform hypothesis testing, a procedure for distinguishing between population parameters using sample estimates. The sample estimates can be totals, averages, ratios, etc. Tests may be performed at various levels of significance, where a level of significance is the probability of concluding that the characteristics are different when, in fact, they are identical.
Let EST1 and EST2 be sample estimates for 2 characteristics of interest. Let the approximate CV of the difference EST1 – EST2 be ACVDIFF.
If z = 100 / ACVDIFF (with ACVDIFF expressed as a percentage) is less than 2, then no conclusion about the difference between the characteristics is justified at the 5% level of significance. If, however, this ratio is larger than 2, the observed difference is significant at the 5% level.
Example 5
Let us suppose we wish to test, at the 5% level of significance, the hypothesis that there is no difference between the total of spending on furnishings and equipment in Alberta and the same total in Manitoba. From example 3, the approximate CV of the difference between these two estimates was found to be 5.90 and z = 16.9. Since this value is greater than 2, it must be concluded that there is a significant difference between the two estimates at the 0.05 level of significance.
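The test in Example 5 can be sketched in Python as follows. The function names are hypothetical, and the approximate CV of the difference is assumed to be expressed as a percentage, which is consistent with Example 5, where a CV of 5.90 gives a z value of about 16.9.

```python
def z_statistic(acv_diff_percent):
    """z for a difference EST1 - EST2, from its approximate CV in percent."""
    return 100 / acv_diff_percent


def is_significant_at_5_percent(acv_diff_percent):
    """True if the observed difference is significant at the 5% level (z > 2)."""
    return z_statistic(acv_diff_percent) > 2


# Approximate CV of the difference quoted in Example 5.
z = z_statistic(5.90)
print(round(z, 2), is_significant_at_5_percent(5.90))
```

Equivalently, a difference is significant at the 5% level only when its approximate CV is below 50%.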
3.4 Confidentiality of the public-use microdata
Microdata files for public use differ in many ways from the master file of the survey held by Statistics Canada. These variations are due to measures taken to preserve the anonymity of respondents to the survey.
The confidentiality of this file is ensured mainly by reducing information, i.e., deleting variables or suppressing or collapsing some of their detail.
To protect confidentiality
- All explicitly identifying information, such as identification numbers, was removed from the file. (Names and addresses are not captured as data.)
- 170 records had their province codes set to 0 due to special characteristics (e.g., exceedingly high or low expenditure values). These records were reweighted.
- There was top-coding and collapsing of code sets for non-spending variables.
- Income values at the household, reference person and spouse of reference person levels were rounded in the following manner:
For income values between $1 and $9,999: round to the nearest $100
For income values between $10,000 and $99,999: round to the nearest $1,000
For income values between $100,000 and $999,999: round to the nearest $10,000
For income values between $1,000,000 and $9,999,999: round to the nearest $100,000
For income values between $10,000,000 and $99,999,999: round to the nearest $1,000,000 (there are no such values on the 2002 file).
- The variables "Purchase price of dwelling" and "Selling price of dwelling" were also rounded.
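The income rounding bands listed above can be illustrated with a short Python sketch. It is not the procedure used by Statistics Canada: the function name is hypothetical, incomes are assumed to be whole dollars, values exactly halfway between two multiples are assumed to round up, and values below $1 or above $99,999,999, which the list does not cover, are respectively left unchanged and rejected.

```python
def round_income(value):
    """Round a whole-dollar income using the bands listed above.

    Assumptions (not stated in the source): ties round up, values below $1
    are returned unchanged, and values above $99,999,999 raise an error.
    """
    if value < 1:
        return value
    if value <= 9_999:
        unit = 100
    elif value <= 99_999:
        unit = 1_000
    elif value <= 999_999:
        unit = 10_000
    elif value <= 9_999_999:
        unit = 100_000
    elif value <= 99_999_999:
        unit = 1_000_000
    else:
        raise ValueError("no rounding band is defined for this value")
    # Integer arithmetic avoids floating-point and banker's-rounding surprises.
    return (value + unit // 2) // unit * unit


for income in (9_949, 54_321, 123_456, 1_250_000):
    print(income, round_income(income))
```

Because the rounding unit grows with the income band, the relative change introduced by rounding stays small across the whole range.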
4. APPENDICES – See Excel file
- APPENDIX A
- Frequency counts – Public-use microdata file – SHS 2002
- APPENDIX B
- Part 1 of 3: Averages, aggregates, minimum and maximum values, Public-use microdata file – SHS 2002, (Full-year and part-year households)
- Part 2 of 3: Averages, aggregates, minimum and maximum values, Public-use microdata file – SHS 2002, (Full-year households)
- Part 3 of 3: Averages and aggregates, Unsuppressed survey file, SHS 2002, (Full-year and part-year households)
- APPENDIX C
- Inclusion of spending variables in past microdata files
- APPENDIX D
- Comparison of variables from the 2001 and the 2002 SHS
- APPENDIX E
- Technical Table 1: Coefficients of variation for average household expenditures, 2002
- Technical Table 2: Coefficients of variation for dwelling characteristics and household equipment, 2002