Mining tweets of Moroccan users using the framework Hadoop, NLP, K-means and basemap

The information revolution, and in particular the explosion of Web 2.0 platforms such as discussion forums, blogs, and social networks, allows users to share ideas and opinions, express their feelings, and much more. This revolution has led to the accumulation of an enormous amount of data that may contain a lot...

Bibliographic Details
Published in: 2017 Intelligent Systems and Computer Vision (ISCV), pp. 1-7
Main Authors: El Abdouli, Abdeljalil; Hassouni, Larbi; Anoun, Houda
Format: Conference Proceeding
Language: English
Published: IEEE 01.04.2017
Abstract The information revolution, and in particular the explosion of Web 2.0 platforms such as discussion forums, blogs, and social networks, allows users to share ideas and opinions, express their feelings, and much more. This revolution has led to the accumulation of an enormous amount of data that may contain a lot of valuable information. Much work has focused on analyzing these data, in particular those coming from social network platforms such as Twitter. In this paper, our objective is to propose an approach for analyzing the data generated by Moroccan users on Twitter, in order to discover the subjects that interest Moroccan society and then locate on a map of Morocco the areas from which the tweets related to these topics originate. Analyzing the tweets of Moroccan users is a real challenge for two main reasons. First, Moroccan users communicate on Twitter in a variety of languages and dialects, such as Standard Arabic, Moroccan Arabic ("Darija"), the Moroccan Amazigh dialect ("Tamazight"), French, Spanish, and English. Second, Moroccan tweets contain many URLs, #hashtags, spelling mistakes, reduced syntactic structures, and abbreviations. We propose an approach for detecting the subjects relevant to Moroccan users by extracting the data automatically and storing it in HDFS (Hadoop Distributed File System), the distributed file system of the Apache Hadoop framework. We then preprocess and analyze this raw data with a distributed program built on three tools: MapReduce from the Apache Hadoop framework, the Python language, and Natural Language Processing (NLP) techniques. Afterward, we convert the corpus generated by the previous step into numeric features and apply the k-means algorithm to cluster all words into general topics. Finally, we plot the tweets on our map of Morocco using the coordinates extracted from them, in order to get an idea of the geolocation of these subjects.
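The preprocessing and clustering steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It runs on a single machine with only the Python standard library rather than Hadoop MapReduce, it assumes binary word-by-tweet occurrence vectors as the "numeric features", and it uses plain Lloyd's k-means with a deterministic farthest-point seeding. The function names (`clean_tweet`, `word_vectors`, `kmeans`, `cluster_words`) and the sample tweets are hypothetical. Map plotting is not covered.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def clean_tweet(text):
    """Lowercase a tweet, drop URLs and @mentions, keep hashtag words."""
    text = URL_RE.sub(" ", text.lower())
    text = MENTION_RE.sub(" ", text)
    return TOKEN_RE.findall(text.replace("#", " "))


def word_vectors(tweets):
    """Map each word to a binary vector marking the tweets it occurs in."""
    tokenized = [set(clean_tweet(t)) for t in tweets]
    vocab = sorted(set().union(*tokenized))
    return {w: [1.0 if w in toks else 0.0 for toks in tokenized] for w in vocab}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; assumes 1 <= k <= len(points)."""
    # Assumed seeding: start from the first point, then repeatedly add the
    # point farthest from the centroids chosen so far (deterministic).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_words(tweets, k):
    """Return {word: cluster label} for the vocabulary of the given tweets."""
    vectors = word_vectors(tweets)
    words = list(vectors)
    labels = kmeans([vectors[w] for w in words], k)
    return dict(zip(words, labels))


if __name__ == "__main__":
    sample = [
        "Match tonight in Casablanca #football http://t.co/x",
        "Great #football match @friend",
        "Rain expected, weather cold",
        "Cold weather and rain again",
    ]
    for word, label in sorted(cluster_words(sample, 2).items()):
        print(label, word)
```

Words that appear in the same tweets receive similar vectors, so they tend to land in the same cluster. A real multilingual corpus would also need stop-word removal and language-specific normalization, which this sketch leaves out.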
Author Hassouni, Larbi
El Abdouli, Abdeljalil
Anoun, Houda
Author_xml – sequence: 1
  givenname: Abdeljalil
  surname: El Abdouli
  fullname: El Abdouli, Abdeljalil
  email: elabdouli.abdeljalil@gmail.com
  organization: RITM Lab., Hassan II Univ. of Casablanca, Casablanca, Morocco
– sequence: 2
  givenname: Larbi
  surname: Hassouni
  fullname: Hassouni, Larbi
  email: lhassouni@hotmail.com
  organization: RITM Lab., Hassan II Univ. of Casablanca, Casablanca, Morocco
– sequence: 3
  givenname: Houda
  surname: Anoun
  fullname: Anoun, Houda
  email: houda.anoun@gmail.com
  organization: RITM Lab., Hassan II Univ. of Casablanca, Casablanca, Morocco
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/ISACV.2017.8054907
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library Online
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9781509040629
1509040625
EndPage 7
ExternalDocumentID 8054907
Genre orig-research
GroupedDBID 6IE
6IL
CBEJK
RIE
RIL
IsPeerReviewed false
IsScholarly false
Language English
PageCount 7
ParticipantIDs ieee_primary_8054907
PublicationCentury 2000
PublicationDate 2017-April
PublicationDateYYYYMMDD 2017-04-01
PublicationDate_xml – month: 04
  year: 2017
  text: 2017-April
PublicationDecade 2010
PublicationTitle 2017 Intelligent Systems and Computer Vision (ISCV)
PublicationTitleAbbrev ISACV
PublicationYear 2017
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Clustering algorithms
Distributed databases
Distributed program
File systems
Framework Hadoop
HDFS
K-means
MapReduce
Natural language processing
Python Language
Twitter
Title Mining tweets of Moroccan users using the framework Hadoop, NLP, K-means and basemap
URI https://ieeexplore.ieee.org/document/8054907