Regression-based variable clustering for data reduction
| Published in: | Statistics in medicine Vol. 21; no. 6; pp. 921 - 941 |
|---|---|
| Main Authors: | , |
| Format: | Journal Article |
| Language: | English |
| Published: | Chichester, UK: John Wiley & Sons, Ltd (Wiley), 30.03.2002 |
| Subjects: | |
| ISSN: | 0277-6715, 1097-0258 |
| Summary: | In many studies it is of interest to cluster states, counties or other small regions in order to obtain improved estimates of disease rates or other summary measures, and a more parsimonious representation of the country as a whole. This may be the case if there are too many to summarize concisely, and/or many regions with a small number of cases. By merging the regions into larger geographic areas, we obtain more cases within each area (and hence lower standard errors for parameter estimates), as well as fewer areas to summarize in terms of disease rates. The resulting clusters should be such that regions within the same cluster are similar in terms of their disease rates. In this paper we present a clustering algorithm which uses data at the subject‐specific level in order to cluster the original regions into a reduced set of larger areas. The proposed clustering algorithm expresses the clustering goals in terms of a regression framework. This formulation of the problem allows the regions to be clustered in terms of their association with the response, and confounding variables measured at the subject‐specific level may be easily incorporated during the clustering process. Additionally, this framework allows estimation and testing of the association between the areas and the response. The statistical properties and performance of the algorithm were evaluated via simulation studies, and the results are promising. Additional simulations illustrate the importance of controlling for confounding variables during the clustering process, rather than after the clusters are determined. The algorithm is illustrated with data from the Cardiovascular Health Study. Although developed with a specific application in mind, the method is applicable to a wide range of problems. Copyright © 2002 John Wiley & Sons, Ltd. |
| Bibliography: | Georgetown Echo - No. RC-HL 35129; JHU MRI RC-HL 15103; National Heart, Lung and Blood Institute - No. N01-HC-85079; No. N01-HC-85086; istex:B390D1993EB7844F57590899DAD36C1C79276B3F; ark:/67375/WNG-Z8ZZ6FNV-N; ArticleID:SIM1063 |
| DOI: | 10.1002/sim.1063 |
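
The summary describes clustering regions within a regression framework, using subject-level data and adjusting for confounders while the clusters are formed. The record does not give the algorithm's details, so the following is only a minimal sketch of one plausible reading, not the authors' method: a greedy agglomerative search that, at each step, merges the two clusters whose merger increases the residual sum of squares of a linear regression the least. The function names (`rss`, `cluster_regions`), the continuous response, the single confounder `x`, the target cluster count `k`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def rss(y, groups, x):
    """Residual sum of squares from OLS of y on cluster indicators plus confounder x."""
    labels = np.unique(groups)
    # One indicator column per current cluster (no intercept), plus the confounder.
    design = np.column_stack([(groups == g).astype(float) for g in labels] + [x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def cluster_regions(y, region, x, k):
    """Greedily merge regions until k clusters remain (hypothetical stopping rule)."""
    groups = region.copy()
    while len(np.unique(groups)) > k:
        labels = np.unique(groups)
        best = None
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                # Tentatively relabel cluster b as cluster a and refit.
                trial = np.where(groups == b, a, groups)
                score = rss(y, trial, x)
                if best is None or score < best[0]:
                    best = (score, trial)
        groups = best[1]
    # Map each original region to the label of the cluster it ended up in.
    return {int(r): int(groups[region == r][0]) for r in np.unique(region)}

# Simulated subject-level data (illustrative only): six regions, two true clusters.
rng = np.random.default_rng(0)
region = np.repeat(np.arange(6), 100)
x = rng.normal(size=region.size)                       # subject-level confounder
true_effect = np.array([0.0, 0.0, 0.0, 2.0, 2.0, 2.0])  # regions 0-2 vs 3-5
y = true_effect[region] + 1.0 * x + rng.normal(scale=0.5, size=region.size)

assignment = cluster_regions(y, region, x, k=2)
print(assignment)
```

Because the confounder stays in the design matrix at every merge step, the regions are compared on their adjusted association with the response, which is the point the summary makes about controlling for confounding during clustering rather than afterwards. A disease-rate application would more naturally use a binary or count response with a generalized linear model; the linear model here is a simplification.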