Fast Discrete Distribution Clustering Using Wasserstein Barycenter With Sparse Support

Bibliographic Details
Published in: IEEE Transactions on Signal Processing, Vol. 65, no. 9, pp. 2317-2332
Main Authors: Jianbo Ye, Panruo Wu, James Z. Wang, Jia Li
Format: Journal Article
Language: English
Published: IEEE 01.05.2017
ISSN: 1053-587X, 1941-0476
Abstract: In a variety of research areas, the weighted bag of vectors and the histogram are widely used descriptors for complex objects. Both can be expressed as discrete distributions. D2-clustering pursues the minimum total within-cluster variation for a set of discrete distributions subject to the Kantorovich-Wasserstein metric. D2-clustering has a severe scalability issue, the bottleneck being the computation of a centroid distribution, called Wasserstein barycenter, that minimizes its sum of squared distances to the cluster members. In this paper, we develop a modified Bregman ADMM approach for computing the approximate discrete Wasserstein barycenter of large clusters. In the case when the support points of the barycenters are unknown and have low cardinality, our method achieves high accuracy empirically at a much reduced computational cost. The strengths and weaknesses of our method and its alternatives are examined through experiments, and we recommend scenarios for their respective usage. Moreover, we develop both serial and parallelized versions of the algorithm. By experimenting with large-scale data, we demonstrate the computational efficiency of the new methods and investigate their convergence properties and numerical stability. The clustering results obtained on several datasets in different domains are highly competitive in comparison with some widely used methods in the corresponding areas.
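Illustrative note (not part of the published record): the abstract describes a centroid that minimizes the sum of squared Wasserstein distances to the cluster members. The sketch below evaluates that objective in the simplest setting only. It assumes one-dimensional support points and a squared-distance ground cost, in which case the sorted (monotone) coupling is optimal. It is not the modified Bregman ADMM method of the paper, and the function names `w2_squared_1d` and `within_cluster_variation` are hypothetical.

```python
def w2_squared_1d(xs, ws, ys, vs):
    """Squared 2-Wasserstein distance between two 1-D discrete distributions.

    xs, ys are support points; ws, vs are nonnegative weights that each sum to 1.
    Assumption: 1-D supports with squared-distance cost, so matching mass in
    sorted order gives the optimal transport plan.
    """
    a = sorted(zip(xs, ws))
    b = sorted(zip(ys, vs))
    i = j = 0
    wa, wb = a[0][1], b[0][1]   # mass still to be moved from a[i] and into b[j]
    total = 0.0
    while True:
        m = min(wa, wb)          # move as much mass as both sides allow
        total += m * (a[i][0] - b[j][0]) ** 2
        wa -= m
        wb -= m
        if wa <= 1e-12:          # source point exhausted, advance
            i += 1
            if i == len(a):
                break
            wa = a[i][1]
        if wb <= 1e-12:          # target point filled, advance
            j += 1
            if j == len(b):
                break
            wb = b[j][1]
    return total


def within_cluster_variation(centroid, members):
    """Sum of squared distances from a candidate centroid to all cluster members.

    centroid and each member are (support_points, weights) pairs.
    """
    cx, cw = centroid
    return sum(w2_squared_1d(cx, cw, xs, ws) for xs, ws in members)


if __name__ == "__main__":
    members = [([0.0], [1.0]), ([2.0], [1.0])]
    # A centroid between the two members scores lower than one sitting on a member.
    print(within_cluster_variation(([1.0], [1.0]), members))
    print(within_cluster_variation(([0.0], [1.0]), members))
```

A barycenter solver searches over the centroid's support points and weights to make this objective as small as possible. In higher dimensions each distance requires solving a transport problem rather than a sort, which is the scalability bottleneck the abstract refers to.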
Author Panruo Wu
Jia Li
Wang, James Z.
Jianbo Ye
Author_xml – sequence: 1
  givenname: Jianbo
  surname: Ye
  fullname: Jianbo Ye
  email: jxy198@ist.psu.edu
  organization: Coll. of Inf. Sci. & Technol., Pennsylvania State Univ., University Park, PA, USA
– sequence: 2
  givenname: Panruo
  surname: Wu
  fullname: Panruo Wu
  email: pwu011@cs.ucr.edu
  organization: Dept. of Comput. Sci. & Eng., Univ. of California, Riverside, Riverside, CA, USA
– sequence: 3
  givenname: James Z.
  surname: Wang
  fullname: Wang, James Z.
  email: jwang@ist.psu.edu
  organization: Coll. of Inf. Sci. & Technol., Pennsylvania State Univ., University Park, PA, USA
– sequence: 4
  givenname: Jia
  surname: Li
  fullname: Jia Li
  email: jiali@stat.psu.edu
  organization: Dept. of Stat., Pennsylvania State Univ., University Park, PA, USA
CODEN ITPRED
DOI 10.1109/TSP.2017.2659647
Discipline Engineering
EISSN 1941-0476
EndPage 2332
ExternalDocumentID 10_1109_TSP_2017_2659647
7833204
Genre orig-research
GrantInformation_xml – fundername: National Science Foundation (NSF)
  grantid: CCF-0936948; ACI-1027854; DMS-1521092
  funderid: 10.13039/100000001
– fundername: NSF
  grantid: ACI-0821527 (CyberStar); ACI-1053575 (XSEDE)
  funderid: 10.13039/100000001
ISICitedReferencesCount 85
ISSN 1053-587X
IsPeerReviewed true
IsScholarly true
Issue 9
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
PageCount 16
PublicationCentury 2000
PublicationDate 2017-05-01
PublicationDecade 2010
PublicationTitle IEEE transactions on signal processing
PublicationTitleAbbrev TSP
PublicationYear 2017
Publisher IEEE
Publisher_xml – name: IEEE
StartPage 2317
SubjectTerms ADMM
clustering
Clustering algorithms
Discrete distribution
Electronic mail
Histograms
K-means
large-scale learning
Optimization
parallel computing
Prototypes
Signal processing algorithms
Time complexity
Title Fast Discrete Distribution Clustering Using Wasserstein Barycenter With Sparse Support
URI https://ieeexplore.ieee.org/document/7833204
Volume 65