A Statistical Approach for Identifying the Best Combination of Normalization and Imputation Methods for Label-Free Proteomics Expression Data
Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating the development of effective normalization and imputation methods. The selection of appropriate normalization and imputation methods is inherently data-specific, and choosing the optimal app...
| Published in: | Journal of proteome research Vol. 24; no. 1; p. 158 |
|---|---|
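| Main Authors: | Sakthivel, Kabilan; Lal, Shashi Bhushan; Srivastava, Sudhir; Chaturvedi, Krishna Kumar; Khan, Yasin Jeshima; Mishra, Dwijesh Chandra; Madival, Sharanbasappa D; Vaidhyanathan, Ramasubramanian; Jha, Girish Kumar |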
| Format: | Journal Article |
| Language: | English |
| Published: | United States, 03.01.2025 |
| ISSN: | 1535-3907 |
| Abstract | Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating effective normalization and imputation methods. The choice of normalization and imputation methods is inherently data-specific, and selecting the best approach from the available options is critical for robust downstream analysis. This study aimed to identify the most suitable combination of these methods for quality control and accurate identification of differentially expressed proteins. We developed nine combinations by pairing three normalization methods, namely locally weighted linear regression (LOESS), variance stabilization normalization (VSN), and robust linear regression (RLR), with three imputation methods, namely k-nearest neighbors (k-NN), local least-squares (LLS), and singular value decomposition (SVD). We used three statistical measures, the pooled coefficient of variation (PCV), pooled estimate of variance (PEV), and pooled median absolute deviation (PMAD), to assess intragroup and intergroup variation. For each measure, the combination yielding the lowest value was chosen as a suitable normalization and imputation method for the data set. The approach was tested on two spiked-in standard label-free proteomics benchmark data sets. The identified combinations returned a low normalized root-mean-square error (NRMSE) and performed better at identifying spiked-in proteins. The approach is available through the R package 'lfproQC' and a user-friendly Shiny web application (https://dabiniasri.shinyapps.io/lfproQC and http://omics.icar.gov.in/lfproQC), making it a valuable resource for researchers who want to apply this method to their own data sets. |
|---|---|
| Author | Srivastava, Sudhir; Jha, Girish Kumar; Chaturvedi, Krishna Kumar; Madival, Sharanbasappa D; Khan, Yasin Jeshima; Sakthivel, Kabilan; Vaidhyanathan, Ramasubramanian; Mishra, Dwijesh Chandra; Lal, Shashi Bhushan |
| Author_xml | 1. Sakthivel, Kabilan (Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India); 2. Lal, Shashi Bhushan (Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India); 3. Srivastava, Sudhir, ORCID 0000-0003-4990-7693 (Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India); 4. Chaturvedi, Krishna Kumar (Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India); 5. Khan, Yasin Jeshima (Division of Genomic Resources, ICAR-National Bureau of Plant Genetic Resources, New Delhi 110012, India); 6. Mishra, Dwijesh Chandra (Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India); 7. Madival, Sharanbasappa D (Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India); 8. Vaidhyanathan, Ramasubramanian (Research Systems Management, ICAR-National Academy of Agricultural Research Management, Hyderabad 500030, India); 9. Jha, Girish Kumar (Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India) |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/39659155 (record in MEDLINE/PubMed) |
| CitedBy_id | crossref_primary_10_1016_j_watres_2025_124438 |
| ContentType | Journal Article |
| DOI | 10.1021/acs.jproteome.4c00552 |
| DatabaseName | MEDLINE; MEDLINE (Ovid); PubMed; MEDLINE - Academic |
| DatabaseTitle | MEDLINE; Medline Complete; MEDLINE with Full Text; PubMed; MEDLINE (Ovid); MEDLINE - Academic |
| Discipline | Chemistry |
| EISSN | 1535-3907 |
| ExternalDocumentID | 39659155 |
| Genre | Journal Article |
| ISICitedReferencesCount | 1 |
| ISSN | 1535-3907 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 1 |
| Keywords | quality control; normalization; bottom-up approach; protein differential expression analysis; label-free proteomics; missing value imputation |
| Language | English |
| ORCID | 0000-0003-4990-7693 |
| PMID | 39659155 |
| PQID | 3146626928 |
| PQPubID | 23479 |
| ParticipantIDs | proquest_miscellaneous_3146626928 pubmed_primary_39659155 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-01-03 |
| PublicationPlace | United States |
| PublicationPlace_xml | – name: United States |
| PublicationTitle | Journal of proteome research |
| PublicationTitleAlternate | J Proteome Res |
| PublicationYear | 2025 |
| SourceID | proquest pubmed |
| SourceType | Aggregation Database Index Database |
| StartPage | 158 |
| SubjectTerms | Algorithms; Data Interpretation, Statistical; Humans; Least-Squares Analysis; Linear Models; Proteomics - methods; Proteomics - standards; Proteomics - statistics & numerical data |
| Title | A Statistical Approach for Identifying the Best Combination of Normalization and Imputation Methods for Label-Free Proteomics Expression Data |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/39659155 https://www.proquest.com/docview/3146626928 |
| Volume | 24 |
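
The selection rule summarized in the abstract (score every normalization and imputation combination with PCV, PEV, and PMAD, then keep the combination with the lowest value for each measure) can be sketched in Python. This is a minimal illustration under stated assumptions, not the lfproQC implementation: the record does not give the formulas, so the definitions below (group-wise CV and MAD averaged over proteins and groups, a degrees-of-freedom-weighted pooled variance, and NRMSE scaled by the variance of the true values) are common textbook choices and may differ from the package. The function names, the candidate labels, and the toy matrices are all hypothetical.

```python
import numpy as np


def pooled_measures(x, groups):
    """Return (PCV, PEV, PMAD) for a complete proteins x samples matrix.

    Assumed definitions (not taken from the paper):
      PCV  = mean over groups of the mean per-protein CV (sd / |mean|)
      PEV  = mean over proteins of the pooled within-group variance
      PMAD = mean over groups of the mean per-protein median absolute deviation
    """
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    cvs, mads = [], []
    sum_sq, dof = 0.0, 0
    for g in np.unique(groups):
        sub = x[:, groups == g]                      # samples of one group
        mean = sub.mean(axis=1)
        sd = sub.std(axis=1, ddof=1)
        cvs.append(np.mean(sd / np.abs(mean)))
        med = np.median(sub, axis=1, keepdims=True)
        mads.append(np.mean(np.median(np.abs(sub - med), axis=1)))
        n = sub.shape[1]
        sum_sq = sum_sq + (n - 1) * sub.var(axis=1, ddof=1)
        dof += n - 1
    return float(np.mean(cvs)), float(np.mean(sum_sq / dof)), float(np.mean(mads))


def best_combinations(candidates, groups):
    """Pick, for each measure, the candidate with the lowest value.

    candidates : dict mapping a combination label to its processed matrix
    """
    scores = {name: pooled_measures(m, groups) for name, m in candidates.items()}
    best = {}
    for i, measure in enumerate(("PCV", "PEV", "PMAD")):
        best[measure] = min(scores, key=lambda name: scores[name][i])
    return best, scores


def nrmse(true, imputed, mask):
    """NRMSE over the masked (originally missing) entries.

    Assumed form: sqrt(mean squared error / variance of the true values).
    """
    true = np.asarray(true, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    diff = imputed[mask] - true[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(true[mask])))


# Toy data standing in for two already normalized and imputed matrices.
rng = np.random.default_rng(0)
groups = np.array(["A", "A", "A", "B", "B", "B"])
base = rng.normal(20.0, 2.0, size=(50, 1))           # per-protein abundance
candidates = {
    "hypothetical_combo_1": base + rng.normal(0.0, 0.1, size=(50, 6)),
    "hypothetical_combo_2": base + rng.normal(0.0, 1.0, size=(50, 6)),
}
best, scores = best_combinations(candidates, groups)
print(best)
```

In real use, each entry of `candidates` would be the output of one of the nine normalization and imputation pairings applied to the same raw intensity matrix. The three measures need not agree, which is why the sketch reports one winner per measure rather than a single overall choice.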