Universitat Oberta de Catalunya (UOC)
MSc in Data Science, Master Thesis
Area: Deep Learning

Spam detection on digital repositories using deep neural networks
Zenodo's use case

Author: Pablo Panero
Tutor: Anna Bosch Rué
Professor: Jordi Casas Roma

Geneva, December 18, 2020

Credits/Copyright

The content of this thesis is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported license. However, the produced code is under the MIT license, with copyright to Pablo Panero.

THESIS FILE

Title: Spam detection on digital repositories using deep neural networks
Author's name: Pablo Panero
Tutor's name: Anna Bosch Rué
Professor's name: Jordi Casas Roma
Delivery date (mm/yyyy): 01/2021
Degree: MSc Data Science
Thesis' area: Deep learning
Thesis' language: English
Keywords: Spam Detection, Neural Networks, Deep Learning

Citation

"Spam is a waste of the receivers' time, and, a waste of the sender's optimism." — Mokokoma Mokhonoana

Acknowledgments

I want to thank my tutor, Anna Bosch Rué, for supervising and supporting my thesis; CERN for giving me the chance to contribute to Zenodo; and the Zenodo and Invenio team for their help and patience answering my questions, especially Lars Holm Nielsen and Alexandros Ioannidis Pantopikos for their guidance and support.

Abstract

Nobody wants to get something they do not want, and that is spam.
Spam content has become a big problem in our digital era, and it therefore also affects digital repositories. Hosting spam can have a big impact on a service: from the hardware costs of storing it and the skewed usage statistics, through the distribution of material that violates copyright, to, finally and most importantly, serving undesired advertisements to users. Zenodo is a catch-all open digital repository, and as such it is a target for spam. Zenodo's current spam classifier is not performant enough in terms of accuracy, and requires human intervention to classify between 30 and 500 entries per day. Moreover, the workflow set up to both run and train the classifier is not optimal: content is not classified in real time, and the model is retrained in an ad hoc manner. This situation translates into many hours that the support team has to spend manually classifying content. In order to solve these problems, this thesis proposes a classifier based on deep neural networks, along with practical guidelines to set it up in a production environment and improve the workflow.

Keywords: Spam Detection, Classifier, Machine Learning, Deep Learning, Neural Networks, Digital Repository, Zenodo.

Contents

Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Problem's description, relevance and justification
    1.1.1 Zenodo and the consequences of spam
    1.1.2 Zenodo's spam detection mechanism
    1.1.3 Zenodo's spam taxonomy: learnt from experience
    1.1.4 Conclusion
  1.2 Motivation
  1.3 Objectives
  1.4 Methodology
  1.5 Planning
    1.5.1 Task groups description
    1.5.2 Gantt chart
2 State of the art
  2.1 Classical methods
  2.2 Feature extraction
    2.2.1 Content based
    2.2.2 User or behaviour based
    2.2.3 Link based
  2.3 Neural networks and deep learning approaches
    2.3.1 CNN and RNN
  2.4 Summary
3 Design and implementation
  3.1 Descriptive Analysis
    3.1.1 Dataset description
  3.2 Classical methods
    3.2.1 Feature extraction
    3.2.2 Random Forest based models
    3.2.3 Conclusions and future work
  3.3 Neural Networks
    3.3.1 Neural Networks from the literature
    3.3.2 CRNN on English-only content
    3.3.3 Deeper Neural Networks
    3.3.4 Testing the whole dataset
4 Final conclusions and future work
  4.1 Conclusion
  4.2 Future work
5 Deployment in production
Bibliography
A Deep Neural Networks configuration
List of Figures

1.1 Zenodo's user visits since October 2018.
1.2 Zenodo's spam report example file.
1.3 Microsoft TDSP lifecycle.
1.4 PEC 1 Gantt chart.
1.5 PEC 2 Gantt chart.
1.6 PEC 3 Gantt chart.
1.7 PEC 4 Gantt chart.
1.8 PEC 5 Gantt chart.
1.9 Project's Gantt chart.
3.1 Spam vs Ham in the dataset.
3.2 Normalized number of keywords per class.
3.3 Normalized number of files per class.
3.4 Normalized amount of records with an image file.
3.5 Normalized number of communities per class.
3.6 Normalized number of creators per class.
3.7 Normalized records with a creator identified by ORCID.
3.8 Normalized records with a creator with an affiliation.
3.9 Normalized records by main resource type.
3.10 Normalized records by resource type and subtype.
3.11 Normalized records by license.
3.12 Normalized number of title words per class.
3.13 Normalized number of description words per class.
3.14 Normalized access right per class.
3.15 Model A confusion matrix.
3.16 Model B confusion matrix.
3.17 Model C confusion matrix.
3.18 Balanced dataset 2D representation.
3.19 Balanced dataset 3D representation.
3.20 Literature configuration A CNN training accuracy and loss.
3.21 Literature configuration B CNN training accuracy and loss.
3.22 Balanced dataset 2D representation of TF-IDF vectorization.
3.23 Balanced dataset 3D representation of TF-IDF vectorization.
3.24 Literature configuration A with TF-IDF vectors training accuracy and loss.
3.25 Literature configuration B TF-IDF vectors training accuracy and loss.
3.26 Literature configuration A RNN training accuracy and loss.
3.27 Literature configuration B RNN training accuracy and loss.
3.28 Literature configuration A CRNN training accuracy and loss.
3.29 Literature configuration B CRNN training accuracy and loss.
3.30 Language distribution on the full balanced dataset.
3.31 Language distribution on the ham records.
3.32 Language distribution on the spam records.
3.33 English dataset PCA reduction to 2D.
3.34 English dataset PCA reduction to 3D.
3.35 CNN Configuration A on English content.
3.36 CNN Configuration B on English content.
3.37 CRNN Configuration A on English content.
3.38 CRNN Configuration B on English content.
3.39 RNN Configuration A on English content.
3.40 RNN Configuration B on English content.
3.41 Deep NN configuration I metrics.

List of Tables

1.1 PEC 1: Project definition and planning.
1.2 PEC 2: Literature review.
1.3 PEC 3: Design and implementation.
1.4 PEC 4: Thesis writing.
1.5 PEC 5: Project defense and presentation.
1.6 Public defense.
2.1 Classical methods summary.
2.2 Neural Network methods summary.
3.1 Attribute presence in the dataset.
3.2 Number of files ranges.
3.3 Number of creators ranges.
3.4 Number of words in the title ranges.
3.5 Number of words in the description ranges.
3.6 Random Forests comparison.
3.7 Literature CNN Networks configuration.
3.8 Literature CNN Networks training metrics.
3.9 Literature CNN Networks test metrics.
3.10 Literature CNN Networks with TF-IDF vectorization training metrics.
3.11 Literature CNN Networks with TF-IDF vectorization test metrics.
3.12 Literature RNN Networks configuration.
3.13 Literature RNN Networks training metrics.
3.14 Literature RNN Networks test metrics.
3.15 Literature CRNN Networks configuration.
3.16 Literature CRNN Networks training metrics.
3.17 Literature CRNN Networks test metrics.
3.18 Performance metrics of configuration A CRNN.
3.19 Deeper Neural Networks metrics.
3.20 Performance metrics of configuration B CNN on the full dataset.
3.21 Performance metrics of configuration B RNN on the full dataset.
3.22 Performance metrics of configuration A CRNN on the full dataset.
3.23 Performance metrics of deep NN model VI on the full dataset.
3.24 Performance metrics of deep NN model XXVIII on the full dataset.
3.25 Performance metrics of deep NN model XXX on the full dataset.
Chapter 1

Introduction

1.1 Problem's description, relevance and justification

1.1.1 Zenodo and the consequences of spam

Zenodo (https://zenodo.org) is a catch-all open digital repository, which enables researchers to share their work: software, datasets, publications, presentations, or any other type of research artifact. One or more of these artifacts form what is called a record, which can be thought of as the set of all artifacts that relate to a single research publication. Nowadays, Zenodo hosts more than 1.5 million records, including more than 75% of the world's software Digital Object Identifiers (DOIs; https://www.doi.org) and 2.2 million files, and receives 1.4 million visitors per year.

As mentioned, Zenodo is an open repository. This means that any person with an account can publish records, and getting an account only requires filling in three fields: email, username and password. Openness and user experience are two highly valuable features. However, in this case they make Zenodo an easy and rewarding target for spam. In approximate terms, Zenodo receives between 100 and 1000 spam records every day, and as Zenodo's popularity grows, so does the amount of spam.

This unwanted content has an important impact on Zenodo. First of all, it affects the user experience of one of the main features, research sharing: as a researcher you want to find the content you are looking for and not an advertisement, and you want your work to be found by others and not be overtaken on the results page by spam, some of which might even be multimedia content that violates copyright. In addition, hosting this content consumes database space, network bandwidth and other hardware resources, which turn into monetary costs. Finally, it skews the service's usage statistics. This means, for example, that Zenodo cannot say with 100% certainty how many legitimate visitors it receives per year.
This is shown in figure 1.1, where it can be seen that the number of visitors is constantly increasing. However, there are two noticeable spikes at the end of the graph. The second one is legitimate: it was due to a publication related to COVID-19 that gained a lot of popularity. The first spike (around July-August 2020), however, was due to spam content being massively accessed.

Figure 1.1: Zenodo's user visits since October 2018.

1.1.2 Zenodo's spam detection mechanism

First of all, Zenodo has a CAPTCHA on the user registration form. However, there are no restrictions on who can register, and therefore fake users are created to submit the spam. In addition, a classifier has been set in place. Originally it was based on Naive Bayes algorithms; however, at the beginning of 2019, improvement efforts resulted in a new classifier based on Random Forests, which has been used in production ever since. Its code is available on GitHub [3].

This classifier is run every working day (Monday to Friday) on the content generated during the previous three days. This accounts for the weekend and sometimes catches false negatives, if the model was retrained in the meantime. The detected records are written to a markdown file and stored on GitHub; an example of such a file is shown in figure 1.2 (note that parts of it had to be anonymised). A member of the support team then has to process them manually. In addition, users can report spam, a request that is also handled manually by a supporter. When doing so, if the supporter considers that there are obvious terms in the content that should have been caught (e.g. a movie title), a manual search is performed, most likely resulting in more records being detected and marked as spam. Moreover, if the task can be allocated, the model is re-trained and re-deployed. Nonetheless, it has been noted that after each re-train the classifier performs adequately for a span of days and then its accuracy drops quickly.
[3] https://github.com/zenodo/zenodo-classifier/

Figure 1.2: Zenodo's spam report example file.

This means that the spam changes and adapts fast, and the classifier does not generalize enough to catch it. The current classifier does not perform as required: of the 100 to 1000 spam records received every day, approximately half are caught by the classifier. As a result, a supporter has to spend a considerable amount of time manually checking up to 500 records, while another 500 are being served to users as legitimate content. Moreover, these undetected records pose a challenge when processing the dataset, since they mean there is an unknown number of false negatives (spam records) marked as legitimate.

1.1.3 Zenodo's spam taxonomy: learnt from experience

When interviewing Zenodo's team, it was observed that every member has learnt to identify spam based on specific characteristics, which are presented in the list below. Nonetheless, most of these are not used by the current classifier, which only uses the title and the abstract (also called description) of the record.

• The spam is an advertisement, most commonly for online games, live streaming of sport events, movies or TV shows. Therefore, it contains terms such as movie, watch online, free, nfl live, poker. These terms can also be present in other languages (e.g. ver online or gratis in Spanish).
• It contains the title of one of the latest movies (e.g. Tenet).
• Its language can vary a lot; however, the previous terms tend to also appear in English.
• The email of the spam author is, or is related to, one of the previous terms, for example cinema12345@foo.com.
• The keyword field of the publication contains a large number of keywords, without a logical relation between them (e.g. computer security and covid-19).
• It contains a single file in JPG, JPEG, PNG or another image format.
• The publication type does not logically match the file extensions. For example, a publication of type journal article which contains an image (e.g. PNG) but no PDF, which would be the logical extension for an article.
• The user submitting the record has published only that one record.
• The user account belongs to a new user, most likely registered on the same day, and probably not long before the publication time.
• The IP from which the user connects matches that of previously blocked accounts.

Some of these features will be extracted from the dataset and fed to the classifier developed in this project. However, some are user related, and for the moment it is not possible to use them due to privacy implications.

1.1.4 Conclusion

Zenodo's popularity and usage are growing, and with them the amount of spam it receives. The current mechanism and workflow require a significant amount of time from the support team, which is increasing and becoming unmanageable. This thesis aims at improving the model, solving the problems related to catching the spam, and providing guidelines that will enable developers to set up a new workflow that requires less human input, and therefore less support time.

1.2 Motivation

First of all, I believe science should be open. However, I do not think this should come at the cost of quality of service. Therefore, as a core developer of the Invenio framework [4] and the Zenodo service, it is in my interest to keep, if not increase, the quality of the service provided by Zenodo. Regarding spam, this means giving the users the results they are looking for. In addition, lowering the time costs that manual spam classification incurs would have a significant impact on the development team. Mixing my interests in data science, specifically in neural networks, with software development for digital repositories makes this topic a perfect fit for the thesis.
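The spam characteristics listed in Section 1.1.3 translate naturally into simple boolean and numeric features. A minimal sketch follows; the record structure and field names are illustrative assumptions for this example, not Zenodo's actual metadata schema:

```python
# Illustrative heuristic feature extraction based on the spam taxonomy.
# Field names ("title", "files", "resource_type", ...) are hypothetical.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif")
SPAM_TERMS = ("movie", "watch online", "free", "nfl live", "poker")

def heuristic_features(record):
    """Turn one record dict into a feature dict following the taxonomy."""
    text = (record.get("title", "") + " " + record.get("description", "")).lower()
    files = record.get("files", [])
    image_files = [f for f in files if f.lower().endswith(IMAGE_EXTS)]
    return {
        # number of known spam terms in title + description
        "n_spam_terms": sum(term in text for term in SPAM_TERMS),
        "n_keywords": len(record.get("keywords", [])),
        "n_files": len(files),
        # a single image file is a strong spam signal
        "single_image_file": len(files) == 1 and len(image_files) == 1,
        # e.g. a journal article with an image but no PDF is suspicious
        "article_without_pdf": (
            record.get("resource_type") == "publication-article"
            and not any(f.lower().endswith(".pdf") for f in files)
        ),
    }

record = {
    "title": "Watch Online Free NFL Live",
    "description": "Stream the game free",
    "keywords": ["poker", "covid-19", "computer security"],
    "files": ["banner.png"],
    "resource_type": "publication-article",
}
print(heuristic_features(record))
```

In practice such heuristics would complement, not replace, the text-based features extracted from the title and description.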
1.3 Objectives

In terms of the resulting product, the objectives of this thesis are:

• Implement a classifier that improves the accuracy of the current one and generalizes properly.
  – Provide feedback on which data could be valuable for the classification, so service developers can work on making it available for future iterations of the spam classifier.
  – Provide a model, or a set of them, which improves the current classification accuracy and is able to process the content in a feasible amount of time, aiming at pseudo real-time content classification.
• Enable service developers to set the new classifier up in production.
  – Provide a list of the hardware requirements needed to run the classifier efficiently.
  – Provide guidelines or instructions to deploy and run the classifier.

In addition, the learning objectives are:

• Work on a data science project that involves real-world stakeholders.
• Gain experience using mixed architectures (i.e. using more than one neural network).

[4] https://inveniosoftware.org/products/framework/

1.4 Methodology

The chosen methodology to carry out this project is the Microsoft Team Data Science Process [5], from now on abbreviated as TDSP. TDSP focuses on the business needs and the value of the outcome, and establishes a clear structure to follow throughout the data science process. This process is cyclic and does not finish even once the model (the outcome) is deployed. TDSP divides the process into four phases that can provide feedback to previous ones. These phases are described below and illustrated in figure 1.3.

1. Business understanding: In this phase the idea or final output is defined from the business perspective, identifying and evaluating possible scenarios. In addition, the planning to deliver the solution is generated.

2. Data acquisition and understanding: This phase aims at getting familiar with, and finding facts about, the data. Its outputs can feed back to the previous phase (business understanding).
Moreover, in this phase the foundations for the data flow are established: clear connections with the different data sources will be set up, along with means to extract the data from them.

3. Modeling: A model is created, built, and verified against the original business question or problem. The model needs to be able to answer the question or solve the problem, and also add business value (e.g. perform better than the existing solution).

4. Deployment: The proposed solution is set in production, along with means to monitor its performance and a long-term maintenance plan. Note that, due to time constraints, this project will not carry out this phase; however, guidelines and best practices on how to deploy the solution will be provided.

[5] https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/

Figure 1.3: Microsoft TDSP lifecycle.

In terms of technologies and tools, this project will be implemented in Python using well-recognised and supported libraries and frameworks such as NumPy, pandas, Keras, TensorFlow and scikit-learn. Python notebooks will be used at first; however, the final output will be packaged using Python standards (e.g. setuptools).

1.5 Planning

In this section the project planning is presented. The project has been divided into six task groups, one for each PEC plus one for the public defense. Each of these task groups is described along with the tasks it contains. At the end of the section a Gantt chart containing the whole planning is shown; however, for clarity, a Gantt chart for each task group has been added next to its description.

1.5.1 Task groups description

Note that the number of days shown for each task is the total amount assigned to it; it does not specify whether the task overlaps with another in time. This can be seen in the Gantt charts.
Title: PEC 1: Project definition and planning
Begin: 16-09-2020    End: 27-09-2020    Days: 12
Outputs:
• A description of the problem, answering why it is important and why it should be solved.
• A detailed planning of how the problem will be tackled.
Tasks:
• Interview developers and supporters. Note that there is no difference between the two teams; the members are the same (4 days).
• Interview service managers (3 days).
• Write problem description and requirements (3 days).
• Define planning (2 days).

Table 1.1: PEC 1: Project definition and planning.

Figure 1.4: PEC 1 Gantt chart.

Title: PEC 2: Literature review
Begin: 28-09-2020    End: 18-10-2020    Days: 21
Outputs:
• A comprehensive review of the state of the art in the areas targeted by the project.
Tasks:
• Review classical methods for spam detection (outside of the deep learning field) (5 days).
• Review spam detection approaches using neural networks, with emphasis on deep neural networks (10 days).
• Summarize the findings of the previous tasks (6 days).

Table 1.2: PEC 2: Literature review.

Figure 1.5: PEC 2 Gantt chart.

Title: PEC 3: Design and implementation
Begin: 19-10-2020    End: 20-12-2020    Days: 63
Outputs:
• A neural network architecture for the classifier.
• A trained classifier.
• A set of guidelines to deploy the classifier in production.
• A set of guidelines to implement a workflow that keeps the classifier running with the least possible human interaction.
Tasks (note that this task group will be carried out several times, as an iterative process; however, the iterations together will fit in the specified number of days, 63):
• Interview developers, supporters and the service manager to gather design and user experience requirements (e.g. bias the model towards false negatives) (3 days).
• Get comfortable with the dataset; carry out a first iteration of the descriptive analysis (12 days).
• Dataset processing: carry out normalization, feature extraction and other operations needed to ready the data for the modeling phase (12 days).
• Test different models and architectures; train, test and refactor according to the evaluation results; propose a final model/architecture (30 days).
• Gather hardware and technology requirements. This task should allow us to understand the constraints under which the classifier will be run (3 days).
• Based on the obtained architecture and the hardware requirements, make a deployment proposal (5 days).
• Write the design and implementation documentation (5 days).
• Buffer: set aside for possible unexpected delays (3 days).

Table 1.3: PEC 3: Design and implementation.

Figure 1.6: PEC 3 Gantt chart.

Title: PEC 4: Thesis writing
Begin: 21-12-2020    End: 03-01-2021    Days: 14
Outputs:
• Final version of the thesis in the appropriate format (PDF).
Tasks:
• Finish writing the outcome of the previous task group (PEC 3) (3 days).
• Document results and conclusions and add them to the corresponding sections (7 days).
• Review the whole content of the thesis and apply the corresponding changes (3 days).
• Buffer: set aside for possible unexpected delays (1 day).

Table 1.4: PEC 4: Thesis writing.

Figure 1.7: PEC 4 Gantt chart.

Title: PEC 5: Project defense and presentation
Begin: 04-01-2021    End: 10-01-2021    Days: 7
Outputs:
• Set of slides to present the thesis to the jury.
Tasks:
• Prepare the set of slides for the presentation (4 days).
• Add presenter notes to allow a better presentation flow using presenter view (2 days).
• Defend the thesis (1 day).

Table 1.5: PEC 5: Project defense and presentation.

Figure 1.8: PEC 5 Gantt chart.

Title: Public defense
Begin: 11-01-2021    End: 20-01-2021    Days: 10
Outputs:
• Outcome of the thesis.
Tasks:
• Defend the thesis (1 day).

Table 1.6: Public defense.

1.5.2 Gantt chart

The black lines represent the task groups, which map one to one to the PECs.
Both the task groups and their tasks follow the same order as the one used to describe them in the previous section.

Figure 1.9: Project's Gantt chart.

Chapter 2

State of the art

Before going into the details of how the problem of spam has been approached, let us look at its consequences over the past years. On the monetary side, already in 2005 it was estimated that some spammer organizations could earn between US$2M and US$3M per year; in extreme cases, a single spamming botnet could earn close to US$2M per day, as claimed by one IBM representative [14]. A few years later, in 2007, the cost of web spam was estimated at US$100B globally [2], going up to US$130B in 2009 [13]. On the technical side, in the first half of 2013 spam was already growing at a rate of 355%, with 1 in every 21 social messages being unwanted or spam [22], affecting many services, including well-known social networks such as Twitter. An example is the attack Twitter suffered on February 20th, 2010, which forced the social network to disable its trending topics feature until it was able to handle the spam; graphical proof of this is shown by Wang, A. H. [31]. Furthermore, recent studies estimate that between 9% and 15% of Twitter's user accounts are bots, most of them used to distribute undesired content [30].

The literature review is presented in the following sections. First, relevant work using classical algorithms such as Decision Trees or Naive Bayes is presented. Then, work related to spam detection using neural networks, specifically deep neural networks, is reviewed. To conclude, a table with a summary of the presented methods is shown.

2.1 Classical methods

There have been many approaches to detecting spam in a wide variety of content such as web pages, email and social networks, many of them using machine learning algorithms. Yu et al.
[34] compared Naive Bayes (NB), Support Vector Machines (SVM), Relevance Vector Machines (RVM) and a neural network (a standard non-linear feed-forward network with the sigmoid activation function) for email spam classification. Yu et al. [34] claim that a neural network (NN) is unsuitable for the task. They stated that, besides the longer training and parameter selection time, the main problem of the neural network was its tendency to overfit. However, the authors acknowledge that this was most likely due to the small size and unbalanced nature of the dataset (6K emails). The NN obtained an average accuracy of 88%, while the other three methods were above 90%. Both SVM and RVM were close to 94% accuracy, with RVM being the preferred one due to its use of fewer vectors and its much faster testing time. Renuka et al. [24] carried out another comparison on email spam classification, on a dataset containing 4,601 emails. Renuka et al. [24] compared J48 (a Java implementation of C4.5), Naive Bayes using Filtered Bayesian Learning (FBL) to increase its performance, and a multilayer perceptron (MLP). In this case the MLP obtained the highest accuracy with 93%, closely followed by J48 (92%). The NB approach obtained an 89% accuracy. In addition, the MLP obtained 0% false positives, a highly important quality as stated by the authors. Nonetheless, a considerable drawback of the MLP approach was its training time: 9.48 s, compared to 0.06 s and 0.02 s for J48 and NB respectively. In the field of web spam classification, Goh et al. [8] made an extensive comparison between SVM, MLP, NB, Bayesian Networks (BN), Decision Trees (C4.5), Random Forests (RF) and K-Nearest Neighbours (KNN), using the area under the curve (AUC) as the spamicity measure. In addition, Goh et al. [8] used extra techniques such as boosting, bagging, rotation forest and stacking to improve performance.
The authors state that, in spite of the high accuracy of SVMs, they would be greatly affected by contaminated datasets, and propose MLP as an alternative. In this comparison the best performing algorithm was RF with Real AdaBoost, achieving a 93.7% AUC. Nonetheless, the second best was MLP, on average 5.4% AUC behind RF. Another field targeted by spam is social networks. McCord et al. [19] compared NB, SVM, KNN and RF on a set of 100K tweets (distributed evenly across 1K accounts). The authors show that Random Forest was the algorithm with the best precision and F-measure, at 95.7%. McCord et al. believe that one of the reasons it outperformed the others was its ability to deal with unbalanced datasets. One could question whether nearest-neighbour algorithms have a place in the spam detection field. Goh et al. [8] and McCord et al. [19] already added KNN to their comparisons on web and Twitter spam classification. Similar work has been done in the email field by Trudgian et al. [29], using a dataset of 1K emails. The authors compare KNN (using an approximate-neighbour variation) to a DT and a NN (with 5 hidden layers). KNN achieved good results, with an average accuracy of 90%, although still below the other two algorithms. In addition, the NN was the only one to achieve 0% false positives. On the other hand, it was on average 4% less accurate than the DT (C4.5), which reached a 98% accuracy. Nonetheless, the authors believe that with further optimization of the NN parameters and a larger dataset, it would match the accuracy of the DT. From the presented literature it can be understood that at first neural networks were not considered suitable, and algorithms such as SVM or RVM were preferred. However, with time and research, neural networks gained importance and came to be seen as real alternatives for the spam detection problem. Random Forest has been established as the best performing algorithm, closely followed by MLP.
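To make the classical baselines above concrete, a multinomial Naive Bayes spam filter of the kind compared in these studies can be sketched in a few lines of Python. This is a minimal toy illustration with made-up messages, not the implementation used in any of the cited works:

```python
import math
from collections import Counter

def train_nb(messages):
    """Train multinomial Naive Bayes with Laplace smoothing.
    `messages` is a list of (text, label) pairs, label in {"spam", "ham"}."""
    counts = {"spam": Counter(), "ham": Counter()}
    priors = Counter()
    for text, label in messages:
        priors[label] += 1
        counts[label].update(text.lower().split())
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, priors, vocab

def classify_nb(model, text):
    counts, priors, vocab = model
    total = sum(priors.values())
    best_label, best_score = None, float("-inf")
    for label in ("spam", "ham"):
        # log prior + sum of log word likelihoods
        score = math.log(priors[label] / total)
        n = sum(counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing avoids zero probability for unseen words
            score += math.log((counts[label][word] + 1) / (n + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (hypothetical examples)
data = [
    ("buy cheap pills now", "spam"),
    ("win money fast click here", "spam"),
    ("meeting notes attached", "ham"),
    ("draft of the thesis chapter", "ham"),
]
model = train_nb(data)
print(classify_nb(model, "cheap money now"))  # prints "spam" on this toy data
```

The same scoring structure underlies the NB variants benchmarked above; the cited works differ mainly in feature extraction and smoothing choices.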
2.2 Feature extraction

As stated by Trudgian et al. [29], "The feature extraction engine is sometimes considered to be more important than the classification algorithm in the problem of spam filtering", and therefore a variety of approaches are presented below. From the literature it can be observed that there are three main types of features: content based, link based, and user or behaviour based.

2.2.1 Content based

Content-based features are those extracted from the corpus of the entity to classify (an email, a web page, a tweet, etc.). Renuka et al. [24] used only content-based features in their comparison. These features are some of the ones that first come to mind: number of words in the text body, number of words in the title/subject, average word length, independent trigram likelihood, and entropy of trigrams. In the field of web spam, Goh et al. [8] used the fraction of anchor text, visible-text corpus precision and corpus recall, and query precision and query recall. Trudgian et al. [29] used a word-list approach. Their word list contains 150 spam and 50 non-spam words, because this imbalance resulted in higher accuracy than a balanced list (94.3% vs. 88.7%) and fewer false positives. The creation of the word list involves a pseudo-probabilistic method, namely, for each word, the number of times it appears across the messages divided by the total number of messages. A similar approach was used by McCord et al. [19], who used the weighted difference over a custom-made list of words, and stated that the word variance between spam and non-spam messages is significantly different. This is backed up by a similar observation from Spirin et al. [26]. Yu et al. [34] used LINGER [5] for content-based feature extraction. LINGER is a neural network approach that supports the bag-of-words representation, where all unique terms (e.g. numbers, words, special symbols) in the training corpus are identified and each of them is treated as a single feature.
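The bag-of-words representation just described can be sketched as follows. This is a minimal, hypothetical illustration of the idea (simple term-frequency weighting is assumed; LINGER's actual weighting scheme is more elaborate):

```python
from collections import Counter

def bag_of_words(corpus):
    """Build a vocabulary from a training corpus: every unique token
    (words, numbers, symbols) becomes one feature."""
    vocab = sorted({tok for doc in corpus for tok in doc.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    return vocab, index

def vectorize(doc, index):
    """Represent one document as a normalized term-frequency vector."""
    counts = Counter(doc.lower().split())
    total = sum(counts.values()) or 1
    vec = [0.0] * len(index)
    for tok, c in counts.items():
        if tok in index:  # out-of-vocabulary tokens are dropped
            vec[index[tok]] = c / total  # simple frequency weighting (assumed)
    return vec

corpus = ["free pills free money", "project report draft"]
vocab, index = bag_of_words(corpus)
v = vectorize("free money", index)  # v[index["free"]] == 0.5
```

A feature-selection step would then typically reduce the dimensionality of these vectors before they reach a classifier.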
Then, feature selection is applied to choose the most important words and reduce dimensionality. Each document is represented by a vector that contains a normalized weighting for every word according to its importance. Another approach to content-based feature extraction is language modeling. Martinez et al. [18] used it to obtain 42 features on two different web spam datasets. Using said features, the authors obtained 6% and 2% F-measure improvements with a simple decision tree classifier. Spirin et al. [26] used language models and detected spam web pages with a 97.2% accuracy, the underlying idea being that these models are likely to be substantially different for a blog and a spam page due to the random nature of spam comments. In addition, Spirin et al. [26] investigated the possibility of using online services as features, for example: the Online Commercial Intention (OCI) value assigned to a URL by Microsoft AdCenter, the Yahoo! Mindset classification of a page as either commercial or non-commercial, Google AdWords popular keywords, and the number of Google AdSense ads on a page. They report an accuracy increase of 3%. In this section several features and ways of extracting them were presented. Many of them are simple and/or easy to extract, such as the number of words in the text, and have proven to be useful. In a previous section some characteristics of Zenodo spam were introduced, one of which was the frequent appearance of certain terms. Therefore, a word-list feature, like the ones used by Trudgian et al. [29] and McCord et al. [19], could be highly relevant for the Zenodo classifier.

2.2.2 User or behaviour based

McCord et al. [19] found an interesting pattern in spammers' behaviour: they post during the early morning, while legitimate users post during the afternoon. Therefore, the authors added the distribution of tweets over the 24-hour period as a feature, divided into 8 slots of 3 hours. Xie et al.
[33] took the approach of analyzing temporal patterns in order to address spam in web reviews. This problem occurs when reviews are used to lower or raise the rating of a certain place or service. Each of these reviews comes from a unique user, and therefore it is called the "singleton" review problem. By transforming it into a temporal pattern discovery problem, the authors identify three aggregate statistics which are indicative of this type of spam attack, and then construct a multidimensional time series from these statistics. Finally, a multi-scale anomaly detection algorithm on multidimensional time series, based on curve fitting, is designed. The results show that the proposed algorithm is effective in detecting singleton review spam attacks. Nonetheless, note that the validation process had to be done manually, since no pre-classified review dataset was available. Browser tracking features were explored by Spirin et al. [26] but were discarded due to their privacy implications. In this section it was shown how spammers tend to behave in similar ways, be it automated as in the study of Xie et al. [33], or with people behind it as in the case shown by McCord et al. [19]. This could be an interesting feature for the Zenodo spam classifier, where the spam is also manually introduced by humans, in a fashion very similar to the singleton review problem (i.e. one spam publication per user), as mentioned in the Zenodo spam taxonomy in the previous chapter.

2.2.3 Link based

The idea behind link-based features is similar to that of the word list presented in the content-based features section: legitimate web pages or content link to other legitimate ones, while spam content links to other spam. Most of the approaches are based on the concept of trust, derived from the incoming or outgoing links and their category (spam or not). Benczur et al.
[3] obtained almost 100% accuracy by propagating anti-trust levels on the graph representing the web pages; however, the dataset contained only 0.28% spam pages. Similar work was carried out by Krishnan et al. [15]. Castillo et al. [4] obtained an 88.4% F-measure with 6.3% false positives, using similar link-based features along with content-based ones. On the other hand, McCord et al. [19] pointed out that, while features like node degree and frequency can be very useful, they might heavily penalize new and young users. Related to link-based features, but applicable together with content- or user-based ones, is the work of Svore et al. [28], where the authors use rank-time features (those that helped a search engine decide that it had to return an item as a result), improving recall by up to 25% while obtaining a precision similar to the same SVM without rank-time features. Literature on link-based features has not been explored in depth, since it is not applicable to the Zenodo use case due to the lack of links between publications. Citations could be used as an approximation; however, the number of citations as a feature is already enough, since spam neither cites nor is cited by any other publication.

2.3 Neural networks and deep learning approaches

In the previous sections, classical classification methods were presented, along with three types of features that can be used to feed those methods. However, this approach requires in-depth field knowledge, is highly time consuming, and its features can be bypassed once learnt by the spammers. Gao et al. [7] developed a deep learning approach to bypass spam detection methods, managing to reduce classification accuracy from around 90% to almost 40%. As mentioned before, Random Forest seems to be the algorithm that performed best across datasets. Profiting from that, Sumathi et al. [27] propose a combined architecture, which adds a deep neural network.
This helps extract knowledge from the text without the need for human input to identify features. Nonetheless, the approach achieved an 88.59% accuracy on an unbalanced dataset containing 4,597 emails, of which only 1,813 are spam. Another approach that could fall into the deep learning category was implemented by He et al. [9], using a linguistic attribute hierarchy (LAH) embedded with linguistic Decision Trees and achieving a 91% AUC on SMS spam detection. It is worth mentioning that this approach helps tackle the curse of dimensionality, which is the main reason Decision Trees are often avoided. Barushka et al. [1] also state that spam detection algorithms cannot effectively deal with high-dimensional data and suffer from overfitting issues. To address these issues, the authors use n-gram TF-IDF feature selection to feed a modified distribution-based balancing algorithm, whose output is then processed by a regularized deep MLP with rectified linear units. Another advantage of this approach is that no additional dimensionality reduction is necessary, and spam dataset imbalance is addressed by the modified distribution-based algorithm. This method achieved a 96% AUC and was tested on SMS and Twitter data. Wu et al. [32] achieved similar results using an MLP and applying Word2Vec [20] to pre-process the input tweets; the resulting accuracy was between 92% and 99% depending on the dataset.

2.3.1 CNN and RNN

Both Convolutional (CNN) and Recurrent (RNN) neural networks have been applied very successfully in the fields of computer vision and speech recognition. Recent studies apply these techniques to the NLP (Natural Language Processing) field. CNNs are helpful to capture local and temporal features, including n-grams, from texts, while RNNs help handle the long-term dependencies of word sequences. Roy et al.
[25] made a comparison between a CNN and an RNN using long short-term memory (LSTM) units, on an unbalanced SMS spam dataset (4,827 messages, of which 747 were spam). The authors used GloVe [23] to vectorize the words. The best results were achieved with a 3-level CNN using dropout, which obtained an accuracy of 98.63%; the same model using 10-fold cross-validation obtained 99.44% accuracy. As for the LSTM, the best results were also obtained when using dropout, achieving a 96.76% accuracy. Following the state of the art in computer vision, Conneau et al. [6] studied the possibility of using deeper CNNs. Following the philosophy of VGG and ResNets, and combining convolutions of 3, 5 and 7 tokens, the authors tested networks of 9, 17, 29 and 49 convolutional layers on classification tasks of 2 to 14 classes (considerably fewer than those of computer vision). The aim of the study was to perform the processing at character level, in the same fashion that computer vision tasks operate at pixel level. The best results were obtained with 29 layers. This approach proved successful with an average of 98% accuracy for big datasets (3M), however, it performed poorly (around 60%) for small unbalanced datasets (100K to 700K). Madisetty et al. [17] proposed an approach that combines 5 CNNs and one feature-based Random Forest to predict spam on Twitter datasets. Each CNN uses different word embeddings (GloVe or Word2Vec) of different sizes to train the model. The Random Forest uses content-based, user-based, and n-gram features. In order to combine the results from the 6 different models, the authors used a meta-classifier based on a multilayer neural network. The method was tested on both balanced and imbalanced datasets and obtained an average of 99.06% AUC.
Nonetheless, it was shown that for big and balanced datasets the ensemble does not provide a significant improvement over the individual models, but it is capable of obtaining good results for small imbalanced datasets. Jain et al. [10][11] studied the possibility of adding semantic information to the data before training the models. The words are converted into word vectors using Word2Vec vectors of dimension 300. If a word is not present in Word2Vec, similar words are found using the WordNet [21] and ConceptNet [16] dictionaries, which can then be converted into word vectors. The authors applied this technique both to a CNN [10] and to an RNN with LSTM [11]. In addition, Jain et al. proposed an approach that uses both CNN and LSTM in a combined model [12] called Sequential Stacked CNN-LSTM (SSCL). The SSCL model extracts local region features and n-gram information using the CNN, and long-term dependency information using the LSTM network. The accuracy of the SSCL model is increased by 0.36% in the case of the SMS spam dataset and by 1.08% in the case of the Twitter dataset. When comparing SSCL with SLSTM, there is a significant increase in spam detection performance on the Twitter dataset (0.39%); in the case of the SMS spam dataset, there is an increase in precision, recall and F1 score. The SSCL model achieved a 99.01% accuracy on the SMS dataset and a 95.48% accuracy on the Twitter dataset.

2.4 Summary

In this section a brief summary of the presented methods is shown. Many surveys and comparisons were presented, but only the most relevant or best performing methods of each case are mentioned in the tables below. Note that features and their extraction techniques are not included, since they are usually the input of the classical methods. In addition, for readability, the content is divided into two tables: one for classical methods and one for neural-network-based methods.
Model | Ref | Score | Dataset | Details
RVM | [34] | 94% Acc | 6K emails | Low number of vectors; fast testing time
J48 | [24] | 92% Acc | 4.6K emails | Fast training time (0.06 s)
MLP | [24] | 93% Acc | 4.6K emails | 0% false positives; slow training time (9.48 s)
RF | [8] | 93.7% AUC | 5.7K web pages | Using Real AdaBoost
RF | [19] | 95.7% Acc | 100K tweets | Works well with imbalanced datasets
KNN | [29] | 90% Acc | 1K emails | Using an approximate-neighbour variation
Table 2.1: Classical methods summary.

Model | Ref | Score | Dataset | Details
RF+NN | [27] | 88.59% Acc | 4.6K emails | NN to extract features from text
LAH | [9] | 91% AUC | 5.5K SMS | Addresses the curse of dimensionality on DTs
MLP | [1] | 96% AUC | 5K SMS, 700 tweets | N-gram TF-IDF features; distribution-based balancing algorithm; rectified linear units
MLP+Word2Vec | [32] | 92% / 99% Acc | 10K / 100K tweets | Vector representation of the words and sentences
CNN+GloVe | [25] | 98.63% Acc | 4.8K SMS | Using dropout; 10-fold cross-validation increased accuracy to 99.44%
LSTM+GloVe | [25] | 96.76% Acc | 4.8K SMS | Using dropout
CNN, 29 layers | [6] | 98% Acc | 3M reviews | Accuracy drops significantly with small datasets
5 CNN+RF | [17] | 98% AUC | 14M tweets | Different vectors (GloVe, Word2Vec, etc.) and vector sizes; meta-classifier; balanced and imbalanced datasets; for small datasets single classifiers beat the ensemble
SCNN | [10] | 98.65% / 94.40% Acc | 5.5K SMS / 5K tweets | Semantic information using Word2Vec; adds WordNet and ConceptNet
SLSTM | [11] | 99.01% / 95.09% Acc | 5.5K SMS / 5K tweets | Semantic information using Word2Vec; adds WordNet and ConceptNet
SSCL | [12] | 99.01% / 95.48% Acc | 5.5K SMS / 5K tweets | Combines SCNN and SLSTM; improves accuracy over SCNN; improves precision, recall and F1 score over SLSTM
Table 2.2: Neural Network methods summary.

Chapter 3 Design and implementation

The first step to create the classifier is to understand the data.
This is done through descriptive analysis. As its outcome, several features were extracted, and a new Random Forest based model was trained and tested. Its results were compared with those of the existing model, which is based on TF-IDF vectorization of the word corpus of each record. All the code is available on GitHub (https://github.com/ppanero/zenodo-spam-classifier/tree/v1.0.0) in different Jupyter notebooks. Instructions on how to run and reproduce the results are available in the README.md file of the repository. In addition, all models, produced datasets and some intermediate vectors can be found in the Zenodo record where this work was published (https://doi.org/10.5281/zenodo.4283435).

3.1 Descriptive Analysis

The dataset used for this master thesis is published in Zenodo (https://zenodo.org/record/4114093) and can be resolved using the digital object identifier (DOI) 10.5281/zenodo.4114093.

3.1.1 Dataset description

The dataset comes in JSON Lines format and contains 1,722,305 (approximately 1.7M) rows and 25 columns. The columns correspond to the following attributes:
• alternate identifiers: A list of objects representing alternative identifiers for the record.
• imprint: Publication imprint (e.g. ISBN, place and publisher name).
• references: A list of raw textual references when the identifier is not known.
• thesis: Extra information for thesis records (e.g. supervisor).
• keywords: A list of free-text words.
• contributors: A list of composed objects representing each of the contributors and their role.
• title: Free-text string.
• subjects: A list of subjects, whose values belong to specific sets of controlled vocabularies.
• meeting: Extra information for meeting records (e.g. dates, place).
• access right: Access policy of the record; its value belongs to a controlled vocabulary (e.g.
Open Access, Restricted).
• files: Extra metadata belonging to the associated files.
• part of: Reference to a master record if it is part of another.
• description: Free-text description of the record.
• journal: Information on the journal in which the record content was published.
• communities: A list of strings referencing one community each.
• publication date: A YYYY-MM-dd formatted string.
• owners: A list of integers corresponding to the users' IDs.
• doi: A string following the DOI format.
• license: A string whose value belongs to a controlled vocabulary enumeration.
• notes: Free-text internal notes for administrators of the system.
• spam: A boolean flag stating whether the record is spam (True) or ham (False).
• recid: The primary ID of the record in the system, as an integer.
• creators: A list of composed objects representing each creator.
• resource type: A type and, if existing, subtype of the record; these values come from a controlled vocabulary.
• related identifiers: A list of identifiers of related work.
For those attributes that are composed objects or lists thereof, the exact structure and fields can be found in the JSONSchema definition 5. The next step is to verify which attributes can be used, meaning which ones have a value and whether missing values could be inferred. Table 3.1 shows the percentage of null values for each attribute, divided per target class. It can be observed that alternate identifiers, imprint, references, thesis, contributors, subjects, meeting, part of, journal, notes, and related identifiers will not be useful, since a significant amount of data is missing for both classes. In addition, publication date is not useful: it represents a date introduced by the user, not the record creation date. Since it can be chosen at random or simply set to the current day, it could bias the classifier.
A similar situation happens with owners: it is an incremental integer that represents the user. Since it is not possible to enrich the current dataset with user data, the attribute does not provide any useful information and will be removed. In addition, since Zenodo assigns DOIs to every record, the doi attribute will not be useful either, as only a small minority of the ham data has an externally managed DOI. Therefore, the resulting dataset will consist of the following fields: keywords, title, access right, files, description, communities, publication date, license, creators and resource type. In addition, spam will be kept as the target class, and recid as the index value, in order to be able to identify the records. In the following section more features will be extracted from these attributes. To conclude this section, the spam attribute is examined. It represents the classification label, or target class. The dataset contains 1,684,521 ham and 37,784 spam records, which makes the dataset highly imbalanced. This can be seen graphically in figure 3.1.
Figure 3.1: Spam vs Ham in the dataset.
5 https://zenodo.org/schemas/records/record-v1.0.0.json

Attribute | Spam (% null) | Ham (% null)
alternate identifiers | 99.97 | 79.51
imprint | 98.98 | 81.8
references | 99.14 | 92.97
thesis | 99.88 | 99.7
keywords | 49.1 | 43.81
contributors | 98.83 | 98.36
title | 0 | 0
subjects | 99.01 | 99.06
meeting | 98.50 | 96.76
access right | 0 | 0
files | 0 | 0
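The per-class null percentages shown in Table 3.1 can be computed with a short routine along these lines. This is a pure-Python sketch over toy records shaped like the dump (the accompanying notebooks perform the actual analysis over the full 1.7M-row dataset):

```python
def null_percentages(records, attributes):
    """Percentage of missing values per attribute, split by the
    `spam` target class (True = spam, False = ham)."""
    result = {}
    for attr in attributes:
        stats = {}
        for is_spam in (True, False):
            group = [r for r in records if r["spam"] == is_spam]
            # treat None, empty list and empty string as missing
            missing = sum(1 for r in group if r.get(attr) in (None, [], ""))
            stats["spam" if is_spam else "ham"] = round(100 * missing / len(group), 2)
        result[attr] = stats
    return result

# Toy records (hypothetical values, same shape as the Zenodo dump)
records = [
    {"spam": True, "title": "Buy now", "references": None},
    {"spam": True, "title": "Cheap pills", "references": None},
    {"spam": False, "title": "Thesis", "references": ["doi:10/abc"]},
    {"spam": False, "title": "Dataset", "references": None},
]
print(null_percentages(records, ["references"]))
# {'references': {'spam': 100.0, 'ham': 50.0}}
```

Attributes where both classes show near-total missingness, as in Table 3.1, carry little discriminative signal and can be dropped before feature extraction.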