A PARTIALLY LINEAR FRAMEWORK FOR MASSIVE HETEROGENEOUS DATA

Bibliographic Details
Published in: The Annals of Statistics, Vol. 44, no. 4, p. 1400
Main Authors: Zhao, Tianqi; Cheng, Guang; Liu, Han
Format: Journal Article
Language: English
Published: United States 01.08.2016
ISSN: 0090-5364
Description
Summary: We consider a partially linear framework for modelling massive heterogeneous data. The main goal is to extract features common to all sub-populations while exploring the heterogeneity of each sub-population. In particular, we propose an aggregation-type estimator for the commonality parameter that attains the (non-asymptotic) minimax optimal bound and has the same asymptotic distribution as if there were no heterogeneity. This oracle result holds when the number of sub-populations does not grow too fast. A plug-in estimator for the heterogeneity parameter is further constructed and shown to have the same asymptotic distribution as if the commonality information were available. We also test for heterogeneity among a large number of sub-populations. All of the above results require regularizing each sub-estimation as though it had the entire sample size. Our general theory applies to the divide-and-conquer approach that is often used to deal with massive homogeneous data. A technical by-product of this paper is statistical inference for general kernel ridge regression. Thorough numerical results are also provided to back up our theory.
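Illustration: The aggregate-then-plug-in idea in the summary can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation. It assumes the model has the form y = x'beta_j + f(z) + noise, where the function f is shared by all sub-populations (the commonality) and the coefficient vector beta_j is specific to sub-population j (the heterogeneity). The Gaussian kernel, its bandwidth, the simulated data, the exponent used for the regularization level, and all function names are hypothetical choices made only for this example.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=0.2):
    # Gaussian kernel matrix between two 1-D point sets (illustrative choice).
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / bandwidth) ** 2)

def fit_subpopulation(X, z, y, lam):
    # Kernel ridge fit of y = X beta + f(z) within one sub-population.
    # Minimizing ||y - X beta - K alpha||^2 + n*lam*alpha'K alpha over alpha
    # gives alpha = (K + n*lam*I)^{-1} (y - X beta); profiling alpha out
    # leaves a weighted least-squares problem in beta.
    n = len(y)
    M = gaussian_kernel(z, z) + n * lam * np.eye(n)
    A_y = np.linalg.solve(M, y)
    A_X = np.linalg.solve(M, X)
    beta = np.linalg.solve(X.T @ A_X, X.T @ A_y)
    alpha = A_y - A_X @ beta
    return beta, alpha

def aggregate_f(z_new, fits):
    # Aggregation step: average the sub-population estimates of f at z_new.
    return np.mean([gaussian_kernel(z_new, z) @ alpha for z, alpha in fits], axis=0)

def plug_in_beta(X, z, y, fits):
    # Plug-in step: subtract the aggregated f, then ordinary least squares.
    beta, *_ = np.linalg.lstsq(X, y - aggregate_f(z, fits), rcond=None)
    return beta

# Simulated data (all values below are assumptions for the demo).
rng = np.random.default_rng(0)
s, n, d = 10, 200, 2          # sub-populations, sub-sample size, dim of x
N = s * n
lam = N ** (-0.8)             # tied to the full sample size N, not to n;
                              # the exponent itself is an arbitrary choice
true_betas = rng.normal(size=(s, d))

data, fits = [], []
for j in range(s):
    X = rng.normal(size=(n, d))
    z = rng.uniform(size=n)
    y = X @ true_betas[j] + np.sin(2 * np.pi * z) + 0.5 * rng.normal(size=n)
    data.append((X, z, y))
    _, alpha = fit_subpopulation(X, z, y, lam)
    fits.append((z, alpha))

plug_in_betas = np.array([plug_in_beta(X, z, y, fits) for X, z, y in data])
print("max abs error in plug-in betas:", np.max(np.abs(plug_in_betas - true_betas)))
```

Setting lam from N rather than n mirrors the summary's remark that each sub-estimation is regularized as though it had the entire sample size. Each sub-fit is then deliberately under-smoothed, and averaging across sub-populations is what reduces the variance.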
DOI: 10.1214/15-AOS1410