A Statistical Approach for Identifying the Best Combination of Normalization and Imputation Methods for Label-Free Proteomics Expression Data

Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating the development of effective normalization and imputation methods. The selection of appropriate normalization and imputation methods is inherently data-specific, and choosing the optimal app...

Bibliographic details
Published in: Journal of Proteome Research, Volume 24, Issue 1, p. 158
Main authors: Sakthivel, Kabilan; Lal, Shashi Bhushan; Srivastava, Sudhir; Chaturvedi, Krishna Kumar; Khan, Yasin Jeshima; Mishra, Dwijesh Chandra; Madival, Sharanbasappa D; Vaidhyanathan, Ramasubramanian; Jha, Girish Kumar
Format: Journal Article
Language: English
Published: United States, 2025-01-03
ISSN: 1535-3907
Abstract Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating the development of effective normalization and imputation methods. The selection of appropriate normalization and imputation methods is inherently data-specific, and choosing the optimal approach from the available options is critical for ensuring robust downstream analysis. This study aimed to identify the most suitable combination of these methods for quality control and accurate identification of differentially expressed proteins. We developed nine combinations by integrating three normalization methods: locally weighted linear regression (LOESS), variance stabilization normalization (VSN), and robust linear regression (RLR), with three imputation methods: k-nearest neighbors (k-NN), local least-squares (LLS), and singular value decomposition (SVD). We used three statistical measures, the pooled coefficient of variation (PCV), the pooled estimate of variance (PEV), and the pooled median absolute deviation (PMAD), to assess intragroup and intergroup variation. The combinations yielding the lowest value of each statistical measure were chosen as suitable normalization and imputation methods for the data set. The performance of this approach was tested using two spiked-in standard label-free proteomics benchmark data sets. The identified combinations returned a low normalized root-mean-square error (NRMSE) and showed better performance in identifying spiked-in proteins. The approach is available through the R package 'lfproQC' and a user-friendly Shiny web application (https://dabiniasri.shinyapps.io/lfproQC and http://omics.icar.gov.in/lfproQC), making it a valuable resource for researchers looking to apply this method to their data sets.
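The selection rule described in the abstract can be sketched in a few lines. This is only an illustrative Python sketch, not the 'lfproQC' implementation: the function names, the toy intensity values, and the exact pooling scheme (each statistic computed per protein within a group, averaged over proteins, then averaged over groups) are assumptions made here for illustration, and only intragroup variation is covered. The input matrices are assumed to be already normalized and imputed by whatever methods the labels name.

```python
from statistics import mean, median, stdev, variance


def cv(values):
    """Coefficient of variation of one protein's replicate intensities."""
    return stdev(values) / mean(values)


def mad(values):
    """Median absolute deviation from the median."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def pooled(stat, matrix, groups):
    """Assumed pooling: average `stat` over proteins per group, then over groups."""
    per_group = []
    for columns in groups.values():
        per_protein = [stat([row[c] for c in columns]) for row in matrix]
        per_group.append(mean(per_protein))
    return mean(per_group)


def best_combinations(candidates, groups):
    """Return, for each measure, the combination with the lowest pooled value."""
    measures = {"PCV": cv, "PEV": variance, "PMAD": mad}
    best = {}
    for name, stat in measures.items():
        scores = {combo: pooled(stat, matrix, groups)
                  for combo, matrix in candidates.items()}
        best[name] = min(scores, key=scores.get)
    return best


# Hypothetical toy data: rows are proteins, columns are samples.
# Columns 0-2 belong to group "A", columns 3-5 to group "B".
groups = {"A": [0, 1, 2], "B": [3, 4, 5]}
candidates = {
    ("VSN", "k-NN"): [[10.0, 10.1, 9.9, 12.0, 12.1, 11.9],
                      [8.0, 8.1, 7.9, 9.0, 9.1, 8.9]],
    ("LOESS", "SVD"): [[10.0, 11.0, 9.0, 12.0, 13.0, 11.0],
                       [8.0, 9.0, 7.0, 9.0, 10.0, 8.0]],
}

best = best_combinations(candidates, groups)
print(best)
```

In this toy input the first matrix has tighter replicates in both groups, so it is selected under all three measures. On real data the three measures may disagree, which is consistent with the abstract reporting a chosen combination per measure.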
Author Srivastava, Sudhir
Jha, Girish Kumar
Chaturvedi, Krishna Kumar
Madival, Sharanbasappa D
Khan, Yasin Jeshima
Sakthivel, Kabilan
Vaidhyanathan, Ramasubramanian
Mishra, Dwijesh Chandra
Lal, Shashi Bhushan
Author_xml – sequence: 1
  givenname: Kabilan
  surname: Sakthivel
  fullname: Sakthivel, Kabilan
  organization: Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
– sequence: 2
  givenname: Shashi Bhushan
  surname: Lal
  fullname: Lal, Shashi Bhushan
  organization: Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
– sequence: 3
  givenname: Sudhir
  orcidid: 0000-0003-4990-7693
  surname: Srivastava
  fullname: Srivastava, Sudhir
  organization: Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
– sequence: 4
  givenname: Krishna Kumar
  surname: Chaturvedi
  fullname: Chaturvedi, Krishna Kumar
  organization: Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
– sequence: 5
  givenname: Yasin Jeshima
  surname: Khan
  fullname: Khan, Yasin Jeshima
  organization: Division of Genomic Resources, ICAR-National Bureau of Plant Genetic Resources, New Delhi 110012, India
– sequence: 6
  givenname: Dwijesh Chandra
  surname: Mishra
  fullname: Mishra, Dwijesh Chandra
  organization: Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
– sequence: 7
  givenname: Sharanbasappa D
  surname: Madival
  fullname: Madival, Sharanbasappa D
  organization: Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
– sequence: 8
  givenname: Ramasubramanian
  surname: Vaidhyanathan
  fullname: Vaidhyanathan, Ramasubramanian
  organization: Research Systems Management, ICAR-National Academy of Agricultural Research Management, Hyderabad 500030, India
– sequence: 9
  givenname: Girish Kumar
  surname: Jha
  fullname: Jha, Girish Kumar
  organization: Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
BackLink https://www.ncbi.nlm.nih.gov/pubmed/39659155$$D View this record in MEDLINE/PubMed
CitedBy_id crossref_primary_10_1016_j_watres_2025_124438
ContentType Journal Article
DBID CGR
CUY
CVF
ECM
EIF
NPM
7X8
DOI 10.1021/acs.jproteome.4c00552
DatabaseName Medline
MEDLINE
MEDLINE (Ovid)
MEDLINE
MEDLINE
PubMed
MEDLINE - Academic
DatabaseTitle MEDLINE
Medline Complete
MEDLINE with Full Text
PubMed
MEDLINE (Ovid)
MEDLINE - Academic
DatabaseTitleList MEDLINE - Academic
MEDLINE
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: 7X8
  name: MEDLINE - Academic
  url: https://search.proquest.com/medline
  sourceTypes: Aggregation Database
DeliveryMethod no_fulltext_linktorsrc
Discipline Chemistry
EISSN 1535-3907
ExternalDocumentID 39659155
Genre Journal Article
ID FETCH-LOGICAL-a351t-81bdffaa5de4ad4fcc37ca47080a58dfec0b925457ac2ce6a1e634b7eb3e33a12
IEDL.DBID 7X8
ISICitedReferencesCount 1
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001374971900001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 1535-3907
IngestDate Thu Jul 10 23:01:07 EDT 2025
Fri Apr 25 03:24:11 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 1
Keywords quality control
normalization
bottom-up approach
protein
differential expression analysis
label-free proteomics
missing value imputation
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a351t-81bdffaa5de4ad4fcc37ca47080a58dfec0b925457ac2ce6a1e634b7eb3e33a12
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ORCID 0000-0003-4990-7693
PMID 39659155
PQID 3146626928
PQPubID 23479
ParticipantIDs proquest_miscellaneous_3146626928
pubmed_primary_39659155
PublicationCentury 2000
PublicationDate 2025-01-03
PublicationDateYYYYMMDD 2025-01-03
PublicationDate_xml – month: 01
  year: 2025
  text: 2025-01-03
  day: 03
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
PublicationTitle Journal of proteome research
PublicationTitleAlternate J Proteome Res
PublicationYear 2025
SSID ssj0015703
Score 2.459144
Snippet Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating the development of effective normalization and...
SourceID proquest
pubmed
SourceType Aggregation Database
Index Database
StartPage 158
SubjectTerms Algorithms
Data Interpretation, Statistical
Humans
Least-Squares Analysis
Linear Models
Proteomics - methods
Proteomics - standards
Proteomics - statistics & numerical data
Title A Statistical Approach for Identifying the Best Combination of Normalization and Imputation Methods for Label-Free Proteomics Expression Data
URI https://www.ncbi.nlm.nih.gov/pubmed/39659155
https://www.proquest.com/docview/3146626928
Volume 24
WOSCitedRecordID wos001374971900001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText
inHoldings 1
isFullTextHit
isPrint