R/parallel Parallel Computing for R in non‐dedicated environments
Gespeichert in:
| Titel: | R/parallel Parallel Computing for R in non‐dedicated environments |
|---|---|
| Autoren: | Vera Rodríguez, Gonzalo |
| Weitere Verfasser: | University/Department: Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius |
| Thesis Advisors: | Suppi Boldrito, Remo |
| Quelle: | TDX (Tesis Doctorals en Xarxa) |
| Verlagsinformationen: | Universitat Autònoma de Barcelona, 2010. |
| Publikationsjahr: | 2010 |
| Beschreibung: | 136 p. |
| Original Identifier: | B-23719-2013 |
| Schlagwörter: | Parallel loops, Oportunistic computing, Bioinformatics, Tecnologies |
| Beschreibung: | Traditionally, parallel computing has been associated with special purpose applications designed to run in complex computing clusters, specifically set up with a software stack of dedicated libraries together with advanced administration tools to manage co Traditionally, parallel computing has been associated with special purpose applications designed to run in complex computing clusters, specifically set up with a software stack of dedicated libraries together with advanced administration tools to manage complex IT infrastructures. These High Performance Computing (HPC) solutions, although being the most efficient solutions in terms of performance and scalability, impose technical and practical barriers for most common scientists whom, with reduced IT knowledge, time and resources, are unable to embrace classical HPC solutions without considerable efforts. Moreover, two important technology advances are increasing the need for parallel computing. For example in the bioinformatics field, and similarly in other experimental science disciplines, new high throughput screening devices are generating huge amounts of data within very short time which requires their analysis in equally short time periods to avoid delaying experimental analysis. Another important technological change involves the design of new processor chips. To increase raw performance the current strategy is to increase the number of processing units per chip, so to make use of the new processing capacities parallel applications are required. In both cases we find users that may need to update their current sequential applications and computing resources to achieve the increased processing capacities required for their particular needs. Since parallel computing is becoming a natural option for obtaining increased performance and it is required by new computer systems, solutions adapted for the mainstream should be developed for a seamless adoption. In order to enable the adoption of parallel computing, new methods and technologies are required to remove or mitigate the current barriers and obstacles that prevent many users from evolving their sequential running environments. A particular scenario that specially suffers from these problems and that is considered as a practical case in this work consists of bioinformaticians analyzing molecular data with methods written with the R language. In many cases, with long datasets, they have to wait for days and weeks for their data to be processed or perform the cumbersome task of manually splitting their data, look for available computers to run these subsets and collect back the previously scattered results. Most of these applications written in R are based on parallel loops. A loop is called a parallel loop if there is no data dependency among all its iterations, and therefore any iteration can be processed in any order or even simultaneously, so they are susceptible of being parallelized. Parallel loops are found in a large number of scientific applications. Previous contributions deal with partial aspects of the problems suffered by this kind of users, such as providing access to additional computing resources or enabling the codification of parallel problems, but none takes proper care of providing complete solutions without considering advanced users with access to traditional HPC platforms. Our contribution consists in the design and evaluation of methods to enable the easy parallelization of applications based in parallel loops written in R using non-dedicated environments as a computing platform and considering users without proper experience in parallel computing or system management skills. As a proof of concept, and in order to evaluate the feasibility of our proposal, an extension of R, called R/parallel, has been developed to test our ideas in real environments with real bioinformatics problems. The results show that even in situations with a reduced level of information about the running environment and with a high degree of uncertainty about the quantity and quality of the available resources it is possible to provide a software layer to enable users without previous knowledge and skills adapt their applications with a minimal effort and perform concurrent computations using the available computers. Additionally of proving the feasibility of our proposal, a new self-scheduling scheme, suitable for parallel loops in dynamics environments has been contributed, the results of which show that it is possible to obtain improved performance levels compared to previous contributions in best-effort environments. The main conclusion is that, even in situations with limited information about the environment and the involved technologies, it is possible to provide the mechanisms that will allow users without proper knowledge and time restrictions to conveniently make use and take advantage of parallel computing technologies, so closing the gap between classical HPC solutions and the mainstream of users of common applications, in our case, based in parallel loops with R. mplex IT infrastructures. These High Performance Computing (HPC) solutions, although being the most efficient solutions in terms of performance and scalability, impose technical and practical barriers for most common scientists whom, with reduced IT knowledge, time and resources, are unable to embrace classical HPC solutions without considerable efforts. Moreover, two important technology advances are increasing the need for parallel computing. For example in the bioinformatics field, and similarly in other experimental science disciplines, new high throughput screening devices are generating huge amounts of data within very short time which requires their analysis in equally short time periods to avoid delaying experimental analysis. Another important technological change involves the design of new processor chips. To increase raw performance the current strategy is to increase the number of processing units per chip, so to make use of the new processing capacities parallel applications are required. In both cases we find users that may need to update their current sequential applications and computing resources to achieve the increased processing capacities required for their particular needs. Since parallel computing is becoming a natural option for obtaining increased performance and it is required by new computer systems, solutions adapted for the mainstream should be developed for a seamless adoption. In order to enable the adoption of parallel computing, new methods and technologies are required to remove or mitigate the current barriers and obstacles that prevent many users from evolving their sequential running environments. A particular scenario that specially suffers from these problems and that is considered as a practical case in this work consists of bioinformaticians analyzing molecular data with methods written with the R language. In many cases, with long datasets, they have to wait for days and weeks for their data to be processed or perform the cumbersome task of manually splitting their data, look for available computers to run these subsets and collect back the previously scattered results. Most of these applications written in R are based on parallel loops. A loop is called a parallel loop if there is no data dependency among all its iterations, and therefore any iteration can be processed in any order or even simultaneously, so they are susceptible of being parallelized. Parallel loops are found in a large number of scientific applications. Previous contributions deal with partial aspects of the problems suffered by this kind of users, such as providing access to additional computing resources or enabling the codification of parallel problems, but none takes proper care of providing complete solutions without considering advanced users with access to traditional HPC platforms. Our contribution consists in the design and evaluation of methods to enable the easy parallelization of applications based in parallel loops written in R using non-dedicated environments as a computing platform and considering users without proper experience in parallel computing or system management skills. As a proof of concept, and in order to evaluate the feasibility of our proposal, an extension of R, called R/parallel, has been developed to test our ideas in real environments with real bioinformatics problems. The results show that even in situations with a reduced level of information about the running environment and with a high degree of uncertainty about the quantity and quality of the available resources it is possible to provide a software layer to enable users without previous knowledge and skills adapt their applications with a minimal effort and perform concurrent computations using the available computers. Additionally of proving the feasibility of our proposal, a new self-scheduling scheme, suitable for parallel loops in dynamics environments has been contributed, the results of which show that it is possible to obtain improved performance levels compared to previous contributions in best-effort environments. The main conclusion is that, even in situations with limited information about the environment and the involved technologies, it is possible to provide the mechanisms that will allow users without proper knowledge and time restrictions to conveniently make use and take advantage of parallel computing technologies, so closing the gap between classical HPC solutions and the mainstream of users of common applications, in our case, based in parallel loops with R. |
| Publikationsart: | Dissertation/Thesis |
| Dateibeschreibung: | application/pdf |
| Sprache: | English |
| ISBN: | 978-84-490-3906-5 84-490-3906-1 |
| Zugangs-URL: | http://hdl.handle.net/10803/121248 |
| Rights: | ADVERTIMENT. L'accés als continguts d'aquesta tesi doctoral i la seva utilització ha de respectar els drets de la persona autora. Pot ser utilitzada per a consulta o estudi personal, així com en activitats o materials d'investigació i docència en els termes establerts a l'art. 32 del Text Refós de la Llei de Propietat Intel·lectual (RDL 1/1996). Per altres utilitzacions es requereix l'autorització prèvia i expressa de la persona autora. En qualsevol cas, en la utilització dels seus continguts caldrà indicar de forma clara el nom i cognoms de la persona autora i el títol de la tesi doctoral. No s'autoritza la seva reproducció o altres formes d'explotació efectuades amb finalitats de lucre ni la seva comunicació pública des d'un lloc aliè al servei TDX. Tampoc s'autoritza la presentació del seu contingut en una finestra o marc aliè a TDX (framing). Aquesta reserva de drets afecta tant als continguts de la tesi com als seus resums i índexs. |
| Dokumentencode: | edstdx.10803.121248 |
| Datenbank: | TDX |
| FullText | Text: Availability: 0 CustomLinks: – Url: http://hdl.handle.net/10803/121248# Name: EDS - TDX (s4221598) Category: fullText Text: View record in TDX |
|---|---|
| Header | DbId: edstdx DbLabel: TDX An: edstdx.10803.121248 RelevancyScore: 1296 AccessLevel: 3 PubType: Dissertation/ Thesis PubTypeId: dissertation PreciseRelevancyScore: 1295.77868652344 |
| IllustrationInfo | |
| Items | – Name: Title Label: Title Group: Ti Data: R/parallel Parallel Computing for R in non‐dedicated environments – Name: Author Label: Authors Group: Au Data: <searchLink fieldCode="AR" term="%22Vera+Rodríguez%2C+Gonzalo%22">Vera Rodríguez, Gonzalo</searchLink> – Name: Author Label: Contributors Group: Au Data: University/Department: Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius – Name: Author Label: Thesis Advisors Group: Au Data: Suppi Boldrito, Remo – Name: TitleSource Label: Source Group: Src Data: TDX (Tesis Doctorals en Xarxa) – Name: Publisher Label: Publisher Information Group: PubInfo Data: Universitat Autònoma de Barcelona, 2010. – Name: DatePubCY Label: Publication Year Group: Date Data: 2010 – Name: PhysDesc Label: Physical Description Group: PhysDesc Data: 136 p. – Name: AN Label: Original Identifier Group: ID Data: B-23719-2013 – Name: Subject Label: Subject Terms Group: Su Data: <searchLink fieldCode="DE" term="%22Parallel+loops%22">Parallel loops</searchLink><br /><searchLink fieldCode="DE" term="%22Oportunistic+computing%22">Oportunistic computing</searchLink><br /><searchLink fieldCode="DE" term="%22Bioinformatics%22">Bioinformatics</searchLink><br /><searchLink fieldCode="DE" term="%22Tecnologies%22">Tecnologies</searchLink> – Name: Abstract Label: Description Group: Ab Data: Traditionally, parallel computing has been associated with special purpose applications designed to run in complex computing clusters, specifically set up with a software stack of dedicated libraries together with advanced administration tools to manage co Traditionally, parallel computing has been associated with special purpose applications designed to run in complex computing clusters, specifically set up with a software stack of dedicated libraries together with advanced administration tools to manage complex IT infrastructures. These High Performance Computing (HPC) solutions, although being the most efficient solutions in terms of performance and scalability, impose technical and practical barriers for most common scientists whom, with reduced IT knowledge, time and resources, are unable to embrace classical HPC solutions without considerable efforts. Moreover, two important technology advances are increasing the need for parallel computing. For example in the bioinformatics field, and similarly in other experimental science disciplines, new high throughput screening devices are generating huge amounts of data within very short time which requires their analysis in equally short time periods to avoid delaying experimental analysis. Another important technological change involves the design of new processor chips. To increase raw performance the current strategy is to increase the number of processing units per chip, so to make use of the new processing capacities parallel applications are required. In both cases we find users that may need to update their current sequential applications and computing resources to achieve the increased processing capacities required for their particular needs. Since parallel computing is becoming a natural option for obtaining increased performance and it is required by new computer systems, solutions adapted for the mainstream should be developed for a seamless adoption. In order to enable the adoption of parallel computing, new methods and technologies are required to remove or mitigate the current barriers and obstacles that prevent many users from evolving their sequential running environments. A particular scenario that specially suffers from these problems and that is considered as a practical case in this work consists of bioinformaticians analyzing molecular data with methods written with the R language. In many cases, with long datasets, they have to wait for days and weeks for their data to be processed or perform the cumbersome task of manually splitting their data, look for available computers to run these subsets and collect back the previously scattered results. Most of these applications written in R are based on parallel loops. A loop is called a parallel loop if there is no data dependency among all its iterations, and therefore any iteration can be processed in any order or even simultaneously, so they are susceptible of being parallelized. Parallel loops are found in a large number of scientific applications. Previous contributions deal with partial aspects of the problems suffered by this kind of users, such as providing access to additional computing resources or enabling the codification of parallel problems, but none takes proper care of providing complete solutions without considering advanced users with access to traditional HPC platforms. Our contribution consists in the design and evaluation of methods to enable the easy parallelization of applications based in parallel loops written in R using non-dedicated environments as a computing platform and considering users without proper experience in parallel computing or system management skills. As a proof of concept, and in order to evaluate the feasibility of our proposal, an extension of R, called R/parallel, has been developed to test our ideas in real environments with real bioinformatics problems. The results show that even in situations with a reduced level of information about the running environment and with a high degree of uncertainty about the quantity and quality of the available resources it is possible to provide a software layer to enable users without previous knowledge and skills adapt their applications with a minimal effort and perform concurrent computations using the available computers. Additionally of proving the feasibility of our proposal, a new self-scheduling scheme, suitable for parallel loops in dynamics environments has been contributed, the results of which show that it is possible to obtain improved performance levels compared to previous contributions in best-effort environments. The main conclusion is that, even in situations with limited information about the environment and the involved technologies, it is possible to provide the mechanisms that will allow users without proper knowledge and time restrictions to conveniently make use and take advantage of parallel computing technologies, so closing the gap between classical HPC solutions and the mainstream of users of common applications, in our case, based in parallel loops with R. mplex IT infrastructures. These High Performance Computing (HPC) solutions, although being the most efficient solutions in terms of performance and scalability, impose technical and practical barriers for most common scientists whom, with reduced IT knowledge, time and resources, are unable to embrace classical HPC solutions without considerable efforts. Moreover, two important technology advances are increasing the need for parallel computing. For example in the bioinformatics field, and similarly in other experimental science disciplines, new high throughput screening devices are generating huge amounts of data within very short time which requires their analysis in equally short time periods to avoid delaying experimental analysis. Another important technological change involves the design of new processor chips. To increase raw performance the current strategy is to increase the number of processing units per chip, so to make use of the new processing capacities parallel applications are required. In both cases we find users that may need to update their current sequential applications and computing resources to achieve the increased processing capacities required for their particular needs. Since parallel computing is becoming a natural option for obtaining increased performance and it is required by new computer systems, solutions adapted for the mainstream should be developed for a seamless adoption. In order to enable the adoption of parallel computing, new methods and technologies are required to remove or mitigate the current barriers and obstacles that prevent many users from evolving their sequential running environments. A particular scenario that specially suffers from these problems and that is considered as a practical case in this work consists of bioinformaticians analyzing molecular data with methods written with the R language. In many cases, with long datasets, they have to wait for days and weeks for their data to be processed or perform the cumbersome task of manually splitting their data, look for available computers to run these subsets and collect back the previously scattered results. Most of these applications written in R are based on parallel loops. A loop is called a parallel loop if there is no data dependency among all its iterations, and therefore any iteration can be processed in any order or even simultaneously, so they are susceptible of being parallelized. Parallel loops are found in a large number of scientific applications. Previous contributions deal with partial aspects of the problems suffered by this kind of users, such as providing access to additional computing resources or enabling the codification of parallel problems, but none takes proper care of providing complete solutions without considering advanced users with access to traditional HPC platforms. Our contribution consists in the design and evaluation of methods to enable the easy parallelization of applications based in parallel loops written in R using non-dedicated environments as a computing platform and considering users without proper experience in parallel computing or system management skills. As a proof of concept, and in order to evaluate the feasibility of our proposal, an extension of R, called R/parallel, has been developed to test our ideas in real environments with real bioinformatics problems. The results show that even in situations with a reduced level of information about the running environment and with a high degree of uncertainty about the quantity and quality of the available resources it is possible to provide a software layer to enable users without previous knowledge and skills adapt their applications with a minimal effort and perform concurrent computations using the available computers. Additionally of proving the feasibility of our proposal, a new self-scheduling scheme, suitable for parallel loops in dynamics environments has been contributed, the results of which show that it is possible to obtain improved performance levels compared to previous contributions in best-effort environments. The main conclusion is that, even in situations with limited information about the environment and the involved technologies, it is possible to provide the mechanisms that will allow users without proper knowledge and time restrictions to conveniently make use and take advantage of parallel computing technologies, so closing the gap between classical HPC solutions and the mainstream of users of common applications, in our case, based in parallel loops with R. – Name: TypeDocument Label: Document Type Group: TypDoc Data: Dissertation/Thesis – Name: Format Label: File Description Group: SrcInfo Data: application/pdf – Name: Language Label: Language Group: Lang Data: English – Name: ISBN Label: ISBN Group: ISBN Data: 978-84-490-3906-5<br />84-490-3906-1 – Name: URL Label: Access URL Group: URL Data: <link linkTarget="URL" linkTerm="http://hdl.handle.net/10803/121248" linkWindow="_blank">http://hdl.handle.net/10803/121248</link> – Name: Copyright Label: Rights Group: Cpyrght Data: ADVERTIMENT. L'accés als continguts d'aquesta tesi doctoral i la seva utilització ha de respectar els drets de la persona autora. Pot ser utilitzada per a consulta o estudi personal, així com en activitats o materials d'investigació i docència en els termes establerts a l'art. 32 del Text Refós de la Llei de Propietat Intel·lectual (RDL 1/1996). Per altres utilitzacions es requereix l'autorització prèvia i expressa de la persona autora. En qualsevol cas, en la utilització dels seus continguts caldrà indicar de forma clara el nom i cognoms de la persona autora i el títol de la tesi doctoral. No s'autoritza la seva reproducció o altres formes d'explotació efectuades amb finalitats de lucre ni la seva comunicació pública des d'un lloc aliè al servei TDX. Tampoc s'autoritza la presentació del seu contingut en una finestra o marc aliè a TDX (framing). Aquesta reserva de drets afecta tant als continguts de la tesi com als seus resums i índexs. – Name: AN Label: Accession Number Group: ID Data: edstdx.10803.121248 |
| PLink | https://erproxy.cvtisr.sk/sfx/access?url=https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edstdx&AN=edstdx.10803.121248 |
| RecordInfo | BibRecord: BibEntity: Languages: – Text: English PhysicalDescription: Pagination: PageCount: 136 Subjects: – SubjectFull: Parallel loops Type: general – SubjectFull: Oportunistic computing Type: general – SubjectFull: Bioinformatics Type: general – SubjectFull: Tecnologies Type: general Titles: – TitleFull: R/parallel Parallel Computing for R in non‐dedicated environments Type: main BibRelationships: HasContributorRelationships: – PersonEntity: Name: NameFull: Vera Rodríguez, Gonzalo IsPartOfRelationships: – BibEntity: Dates: – D: 21 M: 07 Type: published Y: 2010 Identifiers: – Type: isbn-print Value: 9788449039065 – Type: isbn-print Value: 8449039061 |
| ResultId | 1 |