Universitat Oberta de Catalunya (UOC)
MSc in Data Science, Master Thesis
Area: Deep Learning

Spam detection on digital repositories using deep neural networks
Zenodo's use case

Author: Pablo Panero
Tutor: Anna Bosch Rué
Professor: Jordi Casas Roma

Geneva, December 18, 2020

Credits/Copyright

The content of this thesis is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported license. However, the produced code is under the MIT license, with copyright to Pablo Panero.

THESIS FILE

Title: Spam detection on digital repositories using deep neural networks
Author's name: Pablo Panero
Tutor's name: Anna Bosch Rué
Professor's name: Jordi Casas Roma
Delivery date (mm/yyyy): 01/2021
Degree: MSc Data Science
Thesis' area: Deep learning
Thesis' language: English
Keywords: Spam Detection, Neural Networks, Deep Learning

Citation

"Spam is a waste of the receivers' time, and, a waste of the sender's optimism." — Mokokoma Mokhonoana

Acknowledgments

I want to thank my tutor, Anna Bosch Rué, for supervising and supporting my thesis; CERN for giving me the chance to contribute to Zenodo; and the Zenodo and Invenio team for their help and patience answering my questions, especially Lars Holm Nielsen and Alexandros Ioannidis Pantopikos for their guidance and support.

Abstract

Nobody wants to get something they do not want, and that is spam.
Spam content has become a big problem in our digital era, and it therefore also affects digital repositories. Hosting spam can have a big impact on a service: from the hardware costs of storing it and the skewed usage statistics, through the distribution of material that violates copyright, to, finally and most importantly, serving undesired advertisements to users. Zenodo is a catch-all open digital repository, and as such it is a target for spam. Zenodo's current spam classifier is not performant enough in terms of accuracy, and requires human intervention to classify between 30 and 500 entries per day. Moreover, the workflow set up to both run and train the classifier is not optimal: content is not classified in real time, and the model is retrained in an ad hoc manner. This situation translates into many hours that the support team has to spend manually classifying content. In order to solve these problems, this thesis proposes a classifier based on deep neural networks, along with practical guidelines to set it up in a production environment and improve the workflow.

Keywords: Spam Detection, Classifier, Machine Learning, Deep Learning, Neural Networks, Digital Repository, Zenodo.

Contents

Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Problem's description, relevance and justification
    1.1.1 Zenodo and the consequences of spam
    1.1.2 Zenodo's spam detection mechanism
    1.1.3 Zenodo's spam taxonomy: learnt from experience
    1.1.4 Conclusion
  1.2 Motivation
  1.3 Objectives
  1.4 Methodology
  1.5 Planning
    1.5.1 Task groups description
    1.5.2 Gantt chart
2 State of the art
  2.1 Classical methods
  2.2 Feature extraction
    2.2.1 Content based
    2.2.2 User or behaviour based
    2.2.3 Link based
  2.3 Neural networks and deep learning approaches
    2.3.1 CNN and RNN
  2.4 Summary
3 Design and implementation
  3.1 Descriptive Analysis
    3.1.1 Dataset description
  3.2 Classical methods
    3.2.1 Feature extraction
    3.2.2 Random Forest based models
    3.2.3 Conclusions and future work
  3.3 Neural Networks
    3.3.1 Neural Networks from the literature
    3.3.2 CRNN on English-only content
    3.3.3 Deeper Neural Networks
    3.3.4 Testing the whole dataset
4 Final conclusions and future work
  4.1 Conclusion
  4.2 Future work
5 Deployment in production
Bibliography
A Deep Neural Networks configuration
List of Figures

1.1 Zenodo's user visits since October 2018.
1.2 Zenodo's spam report example file.
1.3 Microsoft TDSP lifecycle.
1.4 PEC 1 Gantt chart.
1.5 PEC 2 Gantt chart.
1.6 PEC 3 Gantt chart.
1.7 PEC 4 Gantt chart.
1.8 PEC 5 Gantt chart.
1.9 Project's Gantt chart.
3.1 Spam vs Ham in the dataset.
3.2 Normalized number of keywords per class.
3.3 Normalized number of files per class.
3.4 Normalized amount of records with an image file.
3.5 Normalized number of communities per class.
3.6 Normalized number of creators per class.
3.7 Normalized records with a creator identified by ORCID.
3.8 Normalized records with a creator with an affiliation.
3.9 Normalized records by main resource type.
3.10 Normalized records by resource type and subtype.
3.11 Normalized records by license.
3.12 Normalized number of title words per class.
3.13 Normalized number of description words per class.
3.14 Normalized access right per class.
3.15 Model A confusion matrix.
3.16 Model B confusion matrix.
3.17 Model C confusion matrix.
3.18 Balanced dataset 2D representation.
3.19 Balanced dataset 3D representation.
3.20 Literature configuration A CNN training accuracy and loss.
3.21 Literature configuration B CNN training accuracy and loss.
3.22 Balanced dataset 2D representation of TF-IDF vectorization.
3.23 Balanced dataset 3D representation of TF-IDF vectorization.
3.24 Literature configuration A with TF-IDF vectors training accuracy and loss.
3.25 Literature configuration B TF-IDF vectors training accuracy and loss.
3.26 Literature configuration A RNN training accuracy and loss.
3.27 Literature configuration B RNN training accuracy and loss.
3.28 Literature configuration A CRNN training accuracy and loss.
3.29 Literature configuration B CRNN training accuracy and loss.
3.30 Language distribution on the full balanced dataset.
3.31 Language distribution on the ham records.
3.32 Language distribution on the spam records.
3.33 English dataset PCA reduction to 2D.
3.34 English dataset PCA reduction to 3D.
3.35 CNN Configuration A on English content.
3.36 CNN Configuration B on English content.
3.37 CRNN Configuration A on English content.
3.38 CRNN Configuration B on English content.
3.39 RNN Configuration A on English content.
3.40 RNN Configuration B on English content.
3.41 Deep NN configuration I metrics.

List of Tables

1.1 PEC 1: Project definition and planning.
1.2 PEC 2: Literature review.
1.3 PEC 3: Design and implementation.
1.4 PEC 4: Thesis writing.
1.5 PEC 5: Project defense and presentation.
1.6 Public defense.
2.1 Classical methods summary.
2.2 Neural Network methods summary.
3.1 Attribute presence in the dataset.
3.2 Number of files ranges.
3.3 Number of creators ranges.
3.4 Number of words in the title ranges.
3.5 Number of words in the description ranges.
3.6 Random Forests comparison.
3.7 Literature CNN Networks configuration.
3.8 Literature CNN Networks training metrics.
3.9 Literature CNN Networks test metrics.
3.10 Literature CNN Networks with TF-IDF vectorization training metrics.
3.11 Literature CNN Networks with TF-IDF vectorization test metrics.
3.12 Literature RNN Networks configuration.
3.13 Literature RNN Networks training metrics.
3.14 Literature RNN Networks test metrics.
3.15 Literature CRNN Networks configuration.
3.16 Literature CRNN Networks training metrics.
3.17 Literature CRNN Networks test metrics.
3.18 Performance metrics of configuration A CRNN.
3.19 Deeper Neural Networks metrics.
3.20 Performance metrics of configuration B CNN on the full dataset.
3.21 Performance metrics of configuration B RNN on the full dataset.
3.22 Performance metrics of configuration A CRNN on the full dataset.
3.23 Performance metrics of deep NN model VI on the full dataset.
3.24 Performance metrics of deep NN model XXVIII on the full dataset.
3.25 Performance metrics of deep NN model XXX on the full dataset.
Chapter 1

Introduction

1.1 Problem's description, relevance and justification

1.1.1 Zenodo and the consequences of spam

Zenodo (https://zenodo.org) is a catch-all open digital repository, which enables researchers to share their work: software, datasets, publications, presentations, or any other type of research artifact. One or more of these artifacts form what is called a record, which can be thought of as the set of all artifacts that relate to a single research publication. Nowadays, Zenodo hosts more than 1.5 million records, including more than 75% of the world's software Digital Object Identifiers (DOIs; https://www.doi.org) and 2.2 million files, and receives 1.4 million visitors per year.

As mentioned, Zenodo is an open repository. This means that any person with an account can publish records, and getting an account only requires filling in three fields: email, username and password. Openness and user experience are two highly valuable features. However, in this case they make Zenodo an easy and rewarding target for spam. In approximate terms, Zenodo receives between 100 and 1000 spam records every day, and as Zenodo's popularity grows, so does the amount of spam.

This unwanted content has an important impact on Zenodo. First of all, it affects the user experience of one of the main features, research sharing: as a researcher you want to find the content you are looking for and not an advertisement, and you want your work to be found by others and not be overtaken on the results page by spam, some of which might even be multimedia content that violates copyright. In addition, hosting this content consumes database space, network bandwidth and other hardware resources, which turn into monetary costs. Finally, it skews the service's usage statistics. This means, for example, that Zenodo cannot say with 100% certainty how many legitimate visitors it receives per year.
This is shown in figure 1.1, where it can be seen that the number of visitors is constantly increasing. However, there are two noticeable spikes at the end of the graph. The second one is legitimate: it was due to a publication related to COVID-19 that gained a lot of popularity. The first spike (around July-August 2020), however, was due to spam content being massively accessed.

Figure 1.1: Zenodo's user visits since October 2018.

1.1.2 Zenodo's spam detection mechanism

First of all, Zenodo has a CAPTCHA on the user registration form. However, there are no restrictions on who can register, and therefore fake users are created to submit the spam. In addition, a classifier has been set in place. Originally it was based on Naive Bayes algorithms; however, at the beginning of 2019, improvement efforts resulted in a new classifier based on Random Forests, which has been used in production ever since. Its code is available on GitHub [3].

This classifier is run every working day (Monday to Friday) on the content generated during the previous three days. This accounts for the weekend and sometimes catches false negatives, if the model was retrained in the meantime. The detected records are written to a markdown file and stored on GitHub; an example of such a file is shown in figure 1.2 (note that parts of it had to be anonymised). A member of the support team then has to process them manually. In addition, users can report spam, a request that is also handled manually by a supporter. When doing so, if the supporter considers that there are obvious terms in the content that should have been caught (e.g. a movie title), a manual search is performed, most likely resulting in more records being detected and marked as spam. Moreover, if the task can be allocated, the model is re-trained and re-deployed. Nonetheless, it has been noted that after each re-train the classifier performs adequately for a span of days and then its accuracy drops quickly.
[3] https://github.com/zenodo/zenodo-classifier/

Figure 1.2: Zenodo's spam report example file.

This means that the spam changes and adapts fast, and the classifier does not generalize enough to catch it. The current classifier does not perform as required: of the 100 to 1000 spam records received every day, approximately half are caught by the classifier. As a result, a supporter has to spend a considerable amount of time manually checking up to 500 records, while another 500 are being served to users as legitimate content. Moreover, these undetected records pose a challenge when processing the dataset, since they mean there is an unknown number of false negatives (spam records) marked as legitimate.

1.1.3 Zenodo's spam taxonomy: learnt from experience

When interviewing Zenodo's team, it was observed that every member has learnt to identify spam based on specific characteristics, which are presented in the list below. Nonetheless, most of these are not used by the current classifier, which only uses the title and the abstract (also called description) of the record.

• The spam is an advertisement, most commonly for online games, live streaming of sport events, movies or TV shows. Therefore, it contains terms such as movie, watch online, free, nfl live, poker. These terms can also be present in other languages (e.g. ver online or gratis in Spanish).
• It contains the title of one of the latest movies (e.g. Tenet).
• Its language can vary a lot; however, the previous terms tend to also appear in English.
• The email of the spam author is, or is related to, one of the previous terms, for example cinema12345@foo.com.
• The keyword field of the publication contains a large number of keywords, without a logical relation between them (e.g. computer security and covid-19).
• It contains a single file in JPG, JPEG, PNG or another image format.
• The publication type does not logically match the file extensions. For example, a publication of type journal article which contains an image (e.g. PNG) but no PDF, which would be the logical extension for an article.
• The user submitting the record has published only that one record.
• The user account belongs to a new user, most likely registered on the same day, and probably not long before the publication time.
• The IP from which the user connects matches that of previously blocked accounts.

Some of these features will be extracted from the dataset and fed to the classifier developed in this project. However, some are user related, and for the moment it is not possible to use them due to privacy implications.

1.1.4 Conclusion

Zenodo's popularity and usage are growing, and with them the amount of spam it receives. The current mechanism and workflow require a significant amount of time from the support team, which is increasing and becoming unmanageable. This thesis aims at improving the model, solving the problems related to catching the spam, and providing guidelines that will enable developers to set up a new workflow that requires less human input, and therefore less support time.

1.2 Motivation

First of all, I believe science should be open. However, I do not think this should come at the cost of quality of service. Therefore, as a core developer of the Invenio framework [4] and the Zenodo service, it is in my interest to keep, if not increase, the quality of the service provided by Zenodo. Regarding spam, this means giving the users the results they are looking for. In addition, lowering the time costs that manual spam classification incurs would have a significant impact on the development team. Mixing my interests in data science, specifically in neural networks, with software development for digital repositories makes this topic a perfect fit for the thesis.
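The spam characteristics listed in Section 1.1.3 translate naturally into simple boolean and numeric features. A minimal sketch follows; the record structure and field names are illustrative assumptions for this example, not Zenodo's actual metadata schema:

```python
# Illustrative heuristic feature extraction based on the spam taxonomy.
# Field names ("title", "files", "resource_type", ...) are hypothetical.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif")
SPAM_TERMS = ("movie", "watch online", "free", "nfl live", "poker")

def heuristic_features(record):
    """Turn one record dict into a feature dict following the taxonomy."""
    text = (record.get("title", "") + " " + record.get("description", "")).lower()
    files = record.get("files", [])
    image_files = [f for f in files if f.lower().endswith(IMAGE_EXTS)]
    return {
        # number of known spam terms in title + description
        "n_spam_terms": sum(term in text for term in SPAM_TERMS),
        "n_keywords": len(record.get("keywords", [])),
        "n_files": len(files),
        # a single image file is a strong spam signal
        "single_image_file": len(files) == 1 and len(image_files) == 1,
        # e.g. a journal article with an image but no PDF is suspicious
        "article_without_pdf": (
            record.get("resource_type") == "publication-article"
            and not any(f.lower().endswith(".pdf") for f in files)
        ),
    }

record = {
    "title": "Watch Online Free NFL Live",
    "description": "Stream the game free",
    "keywords": ["poker", "covid-19", "computer security"],
    "files": ["banner.png"],
    "resource_type": "publication-article",
}
print(heuristic_features(record))
```

In practice such heuristics would complement, not replace, the text-based features extracted from the title and description.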
1.3 Objectives

In terms of the resulting product, the objectives of this thesis are:

• Implement a classifier that improves the accuracy of the current one and generalizes properly.
  – Provide feedback on which data could be valuable for the classification, so service developers can work on making it available for future iterations of the spam classifier.
  – Provide a model, or a set of them, which improves the current classification accuracy and is able to process the content in a feasible amount of time, aiming at pseudo real-time content classification.
• Enable service developers to set the new classifier up in production.
  – Provide a list of the hardware requirements needed to run the classifier efficiently.
  – Provide guidelines or instructions to deploy and run the classifier.

In addition, the learning objectives are:

• Work on a data science project that involves real-world stakeholders.
• Gain experience using mixed architectures (i.e. using more than one neural network).

[4] https://inveniosoftware.org/products/framework/

1.4 Methodology

The chosen methodology to carry out this project is the Microsoft Team Data Science Process [5], from now on abbreviated as TDSP. TDSP focuses on the business needs and the value of the outcome, and establishes a clear structure to follow throughout the data science process. This process is cyclic and does not finish even once the model (the outcome) is deployed. TDSP divides the process into four phases that can provide feedback to previous ones. These phases are described below and illustrated in figure 1.3.

1. Business understanding: In this phase the idea or final output is defined from the business perspective, identifying and evaluating possible scenarios. In addition, the planning to deliver the solution is generated.

2. Data acquisition and understanding: This phase aims at getting familiar with, and finding facts about, the data. Its outputs can feed back to the previous phase (business understanding).
Moreover, in this phase the foundations for the data flow are established: clear connections with the different data sources will be set up, along with means to extract the data from them.

3. Modeling: A model is created, built, and verified against the original business question or problem. The model needs to be able to answer the question or solve the problem, and also add business value (e.g. perform better than the existing solution).

4. Deployment: The proposed solution is set in production, along with means to monitor its performance and a long-term maintenance plan. Note that, due to time constraints, this project will not carry out this phase; however, guidelines and best practices on how to deploy the solution will be provided.

[5] https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/

Figure 1.3: Microsoft TDSP lifecycle.

In terms of technologies and tools, this project will be implemented in Python using well-recognised and supported libraries and frameworks such as NumPy, pandas, Keras, TensorFlow and scikit-learn. Python notebooks will be used at first; however, the final output will be packaged using Python standards (e.g. setuptools).

1.5 Planning

In this section the project planning is presented. The project has been divided into six task groups, one for each PEC plus one for the public defense. Each of these task groups is described along with the tasks it contains. At the end of the section a Gantt chart containing the whole planning is shown; however, for clarity, a Gantt chart for each task group has been added next to its description.

1.5.1 Task groups description

Note that the number of days shown for each task is the total amount assigned to it; it does not specify whether the task overlaps with another in time. This can be seen in the Gantt charts.
Title: PEC 1: Project definition and planning
Begin: 16-09-2020    End: 27-09-2020    Days: 12
Outputs:
• A description of the problem, answering why it is important and why it should be solved.
• A detailed planning of how the problem will be tackled.
Tasks:
• Interview developers and supporters. Note that there is no difference between the two teams; the members are the same (4 days).
• Interview service managers (3 days).
• Write problem description and requirements (3 days).
• Define planning (2 days).

Table 1.1: PEC 1: Project definition and planning.

Figure 1.4: PEC 1 Gantt chart.

Title: PEC 2: Literature review
Begin: 28-09-2020    End: 18-10-2020    Days: 21
Outputs:
• A comprehensive review of the state of the art in the areas targeted by the project.
Tasks:
• Review classical methods for spam detection (outside of the deep learning field) (5 days).
• Review spam detection approaches using neural networks, with emphasis on deep neural networks (10 days).
• Summarize the findings of the previous tasks (6 days).

Table 1.2: PEC 2: Literature review.

Figure 1.5: PEC 2 Gantt chart.

Title: PEC 3: Design and implementation
Begin: 19-10-2020    End: 20-12-2020    Days: 63
Outputs:
• A neural network architecture for the classifier.
• A trained classifier.
• A set of guidelines to deploy the classifier in production.
• A set of guidelines to implement a workflow that keeps the classifier running with the least possible human interaction.
Tasks (note that this task group will be carried out several times, as an iterative process; however, the iterations together will fit in the specified number of days, 63):
• Interview developers, supporters and the service manager to gather design and user experience requirements (e.g. bias the model towards false negatives) (3 days).
• Get comfortable with the dataset; carry out a first iteration of the descriptive analysis (12 days).
• Dataset processing: carry out normalization, feature extraction and other operations needed to ready the data for the modeling phase (12 days).
• Test different models and architectures; train, test and refactor according to the evaluation results; propose a final model/architecture (30 days).
• Gather hardware and technology requirements. This task should allow us to understand the constraints under which the classifier will be run (3 days).
• Based on the obtained architecture and the hardware requirements, make a deployment proposal (5 days).
• Write the design and implementation documentation (5 days).
• Buffer: set aside for possible unexpected delays (3 days).

Table 1.3: PEC 3: Design and implementation.

Figure 1.6: PEC 3 Gantt chart.

Title: PEC 4: Thesis writing
Begin: 21-12-2020    End: 03-01-2021    Days: 14
Outputs:
• Final version of the thesis in the appropriate format (PDF).
Tasks:
• Finish writing the outcome of the previous task group (PEC 3) (3 days).
• Document results and conclusions and add them to the corresponding sections (7 days).
• Review the whole content of the thesis and apply the corresponding changes (3 days).
• Buffer: set aside for possible unexpected delays (1 day).

Table 1.4: PEC 4: Thesis writing.

Figure 1.7: PEC 4 Gantt chart.

Title: PEC 5: Project defense and presentation
Begin: 04-01-2021    End: 10-01-2021    Days: 7
Outputs:
• Set of slides to present the thesis to the jury.
Tasks:
• Prepare the set of slides for the presentation (4 days).
• Add presenter notes to allow a better presentation flow using presenter view (2 days).
• Defend the thesis (1 day).

Table 1.5: PEC 5: Project defense and presentation.

Figure 1.8: PEC 5 Gantt chart.

Title: Public defense
Begin: 11-01-2021    End: 20-01-2021    Days: 10
Outputs:
• Outcome of the thesis.
Tasks:
• Defend the thesis (1 day).

Table 1.6: Public defense.

1.5.2 Gantt chart

The black lines represent the task groups, which map one to one to the PECs.
Both the task groups and their tasks follow the same order as the one used to describe them in the previous section.

Figure 1.9: Project's Gantt chart.

Chapter 2

State of the art

Before going into the details of how the problem of spam has been approached, let us look at its consequences over the past years. On the monetary side, already in 2005 it was estimated that some spammer organizations could earn between US$2M and US$3M per year; in extreme cases, a single spamming botnet could earn close to US$2M per day, as claimed by one IBM representative [14]. A few years later, in 2007, the cost of web spam was estimated at US$100B globally [2], going up to US$130B in 2009 [13]. On the technical side, in the first half of 2013 spam was already growing at a rate of 355%, with 1 in every 21 social messages being unwanted or spam [22], affecting many services, including well-known social networks such as Twitter. An example is the attack Twitter suffered on February 20th, 2010, which forced the social network to disable its trending topics feature until it was able to handle the spam; graphical proof of this is shown by Wang, A. H. [31]. Furthermore, recent studies estimate that between 9% and 15% of Twitter's user accounts are bots, most of them used to distribute undesired content [30].

The literature review is presented in the following sections. First, relevant work using classical algorithms such as Decision Trees or Naive Bayes is presented. Then, work related to spam detection using neural networks, specifically deep neural networks, is reviewed. To conclude, a table with a summary of the presented methods is shown.

2.1 Classical methods

There have been many approaches to detecting spam in a wide variety of content such as web pages, email and social networks, many of them using machine learning algorithms. Yu et al.
[34] compared Naive Bayes (NB), Support Vector Machines (SVM), Relevance Vector Machines (RVM) and a neural network (a standard non-linear feed-forward network with the sigmoid activation function) for email spam classification. Yu et al. [34] claim that a neural network (NN) is unsuitable for the task. They stated that, besides the longer training and parameter selection time, the main problem of the neural network was its tendency to overfit. However, the authors acknowledge that this was most likely due to the small size and unbalanced nature of the dataset (6K emails). The NN obtained an average accuracy of 88%, while the other three methods were above 90%. Both SVM and RVM were close to 94% accuracy, with RVM being the preferred one due to its use of fewer vectors and its much faster testing time. Renuka et al. [24] carried out another comparison on email spam classification, on a dataset containing 4,601 emails. Renuka et al. [24] compared J48 (a Java implementation of C4.5), Naive Bayes using Filtered Bayesian Learning (FBL) to increase its performance, and a multilayer perceptron (MLP). In this case the MLP obtained the highest accuracy with 93%, closely followed by J48 (92%). The NB approach obtained an 89% accuracy. In addition, the MLP obtained 0% false positives, a highly important quality as stated by the authors. Nonetheless, a considerable drawback of the MLP approach was its training time: 9.48 s, compared to 0.06 s and 0.02 s for J48 and NB respectively. In the field of web spam classification, Goh et al. [8] made an extensive comparison between SVM, MLP, NB, Bayesian Networks (BN), Decision Trees (C4.5), Random Forests (RF) and K-Nearest Neighbours (KNN), using the area under the curve (AUC) as the spamicity measure. In addition, Goh et al. [8] used extra techniques such as boosting, bagging, rotation forest and stacking to improve performance.
The authors state that, in spite of the high accuracy of SVMs, they would be greatly affected by contaminated datasets, and propose MLP as an alternative. In this comparison the best performing algorithm was RF with Real AdaBoost, achieving a 93.7% AUC. Nonetheless, the second best was MLP, on average 5.4% AUC behind RF. Another field targeted by spam is social networks. McCord et al. [19] compared NB, SVM, KNN and RF on a set of 100K tweets (distributed evenly across 1K accounts). The authors show that Random Forest was the algorithm with the best precision and F-measure, at 95.7%. McCord et al. believe that one of the reasons it outperformed the others was its ability to deal with unbalanced datasets. One could question whether nearest-neighbour algorithms have a place in the spam detection field. Goh et al. [8] and McCord et al. [19] already added KNN to their comparisons on web and Twitter spam classification. Similar work has been done in the email field by Trudgian et al. [29], using a dataset of 1K emails. The authors compare KNN (using an approximate-neighbour variation) to a DT and a NN (with 5 hidden layers). KNN achieved good results, with an average accuracy of 90%, although still below the other two algorithms. In addition, the NN was the only one to achieve 0% false positives. On the other hand, it was on average 4% less accurate than the DT (C4.5), which reached a 98% accuracy. Nonetheless, the authors believe that with further optimization of the NN parameters and a larger dataset, it would match the accuracy of the DT. From the presented literature it can be understood that at first neural networks were not considered suitable, and algorithms such as SVM or RVM were preferred. However, with time and research, neural networks gained importance and came to be seen as real alternatives for the spam detection problem. Random Forest has been established as the best performing algorithm, closely followed by MLP.
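To make the classical baselines above concrete, a multinomial Naive Bayes spam filter of the kind compared in these studies can be sketched in a few lines of Python. This is a minimal toy illustration with made-up messages, not the implementation used in any of the cited works:

```python
import math
from collections import Counter

def train_nb(messages):
    """Train multinomial Naive Bayes with Laplace smoothing.
    `messages` is a list of (text, label) pairs, label in {"spam", "ham"}."""
    counts = {"spam": Counter(), "ham": Counter()}
    priors = Counter()
    for text, label in messages:
        priors[label] += 1
        counts[label].update(text.lower().split())
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, priors, vocab

def classify_nb(model, text):
    counts, priors, vocab = model
    total = sum(priors.values())
    best_label, best_score = None, float("-inf")
    for label in ("spam", "ham"):
        # log prior + sum of log word likelihoods
        score = math.log(priors[label] / total)
        n = sum(counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing avoids zero probability for unseen words
            score += math.log((counts[label][word] + 1) / (n + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (hypothetical examples)
data = [
    ("buy cheap pills now", "spam"),
    ("win money fast click here", "spam"),
    ("meeting notes attached", "ham"),
    ("draft of the thesis chapter", "ham"),
]
model = train_nb(data)
print(classify_nb(model, "cheap money now"))  # prints "spam" on this toy data
```

The same scoring structure underlies the NB variants benchmarked above; the cited works differ mainly in feature extraction and smoothing choices.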
2.2 Feature extraction

As stated by Trudgian et al. [29], "The feature extraction engine is sometimes considered to be more important than the classification algorithm in the problem of spam filtering", and therefore a variety of approaches are presented below. From the literature it can be observed that there are three main types of features: content based, link based, and user or behaviour based.

2.2.1 Content based

Content-based features are those extracted from the corpus of the entity to classify (an email, a web page, a tweet, etc.). Renuka et al. [24] used only content-based features in their comparison. These features are some of the ones that first come to mind: number of words in the text body, number of words in the title/subject, average word length, independent trigram likelihood, and entropy of trigrams. In the field of web spam, Goh et al. [8] used the fraction of anchor text, visible-text corpus precision and corpus recall, and query precision and query recall. Trudgian et al. [29] used a word-list approach. Their word list contains 150 spam and 50 non-spam words, because this imbalance resulted in higher accuracy than a balanced list (94.3% vs. 88.7%) and fewer false positives. The creation of the word list involves a pseudo-probabilistic method, namely, for each word, the number of times it appears across the messages divided by the total number of messages. A similar approach was used by McCord et al. [19], who used the weighted difference over a custom-made list of words, and stated that the word variance between spam and non-spam messages is significantly different. This is backed up by a similar observation from Spirin et al. [26]. Yu et al. [34] used LINGER [5] for content-based feature extraction. LINGER is a neural network approach that supports the bag-of-words representation, where all unique terms (e.g. numbers, words, special symbols) in the training corpus are identified and each of them is treated as a single feature.
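The bag-of-words representation just described can be sketched as follows. This is a minimal, hypothetical illustration of the idea (simple term-frequency weighting is assumed; LINGER's actual weighting scheme is more elaborate):

```python
from collections import Counter

def bag_of_words(corpus):
    """Build a vocabulary from a training corpus: every unique token
    (words, numbers, symbols) becomes one feature."""
    vocab = sorted({tok for doc in corpus for tok in doc.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    return vocab, index

def vectorize(doc, index):
    """Represent one document as a normalized term-frequency vector."""
    counts = Counter(doc.lower().split())
    total = sum(counts.values()) or 1
    vec = [0.0] * len(index)
    for tok, c in counts.items():
        if tok in index:  # out-of-vocabulary tokens are dropped
            vec[index[tok]] = c / total  # simple frequency weighting (assumed)
    return vec

corpus = ["free pills free money", "project report draft"]
vocab, index = bag_of_words(corpus)
v = vectorize("free money", index)  # v[index["free"]] == 0.5
```

A feature-selection step would then typically reduce the dimensionality of these vectors before they reach a classifier.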
Then, feature selection is applied to choose the most important words and reduce dimensionality. Each document is represented by a vector that contains a normalized weighting for every word according to its importance. Another approach to content-based feature extraction is language modeling. Martinez et al. [18] used it to obtain 42 features on two different web spam datasets. Using said features, the authors obtained 6% and 2% F-measure improvements with a simple decision tree classifier. Spirin et al. [26] used language models and detected spam web pages with a 97.2% accuracy, the underlying idea being that these models are likely to be substantially different for a blog and a spam page due to the random nature of spam comments. In addition, Spirin et al. [26] investigated the possibility of using online services as features, for example: the Online Commercial Intention (OCI) value assigned to a URL by Microsoft AdCenter, the Yahoo! Mindset classification of a page as either commercial or non-commercial, Google AdWords popular keywords, and the number of Google AdSense ads on a page. They report an accuracy increase of 3%. In this section several features and ways of extracting them were presented. Many of them are simple and/or easy to extract, such as the number of words in the text, and have proven to be useful. In a previous section some characteristics of Zenodo spam were introduced, one of which was the frequent appearance of certain terms. Therefore, a word-list feature, like the ones used by Trudgian et al. [29] and McCord et al. [19], could be highly relevant for the Zenodo classifier.

2.2.2 User or behaviour based

McCord et al. [19] found an interesting pattern in spammers' behaviour: they post during the early morning, while legitimate users post during the afternoon. Therefore, the authors added the distribution of tweets over the 24-hour period as a feature, divided into 8 slots of 3 hours. Xie et al.
[33] took the approach of analyzing temporal patterns in order to address spam in web reviews. This problem occurs when reviews are used to lower or raise the rating of a certain place or service. Each of these reviews comes from a unique user, and therefore it is called the "singleton" review problem. By transforming it into a temporal pattern discovery problem, the authors identify three aggregate statistics which are indicative of this type of spam attack, and then construct a multidimensional time series from these statistics. Finally, a multi-scale anomaly detection algorithm on multidimensional time series, based on curve fitting, is designed. The results show that the proposed algorithm is effective in detecting singleton review spam attacks. Nonetheless, note that the validation process had to be done manually, since no pre-classified review dataset was available. Browser tracking features were explored by Spirin et al. [26] but were discarded due to their privacy implications. In this section it was shown how spammers tend to behave in similar ways, be it automated as in the study of Xie et al. [33], or with people behind it as in the case shown by McCord et al. [19]. This could be an interesting feature for the Zenodo spam classifier, where the spam is also manually introduced by humans, in a fashion very similar to the singleton review problem (i.e. one spam publication per user), as mentioned in the Zenodo spam taxonomy in the previous chapter.

2.2.3 Link based

The idea behind link-based features is similar to that of the word list presented in the content-based features section: legitimate web pages or content link to other legitimate ones, while spam content links to other spam. Most of the approaches are based on the concept of trust, derived from the incoming or outgoing links and their category (spam or not). Benczur et al.
[3] obtained almost 100% accuracy by propagating anti-trust levels on the graph representing the web pages; however, the dataset contained only 0.28% spam pages. Similar work was carried out by Krishnan et al. [15]. Castillo et al. [4] obtained an 88.4% F-measure with 6.3% false positives, using similar link-based features along with content-based ones. On the other hand, McCord et al. [19] pointed out that, while features like node degree and frequency can be very useful, they might heavily penalize new and young users. Related to link-based features, but applicable together with content- or user-based ones, is the work of Svore et al. [28], where the authors use rank-time features (those that helped a search engine decide that it had to return an item as a result), improving recall by up to 25% while obtaining a precision similar to the same SVM without rank-time features. Literature on link-based features has not been explored in depth, since it is not applicable to the Zenodo use case due to the lack of links between publications. Citations could be used as an approximation; however, the number of citations as a feature is already enough, since spam neither cites nor is cited by any other publication.

2.3 Neural networks and deep learning approaches

In the previous sections, classical classification methods were presented, along with three types of features that can be used to feed those methods. However, this approach requires in-depth field knowledge, is highly time consuming, and its features can be bypassed once learnt by the spammers. Gao et al. [7] developed a deep learning approach to bypass spam detection methods, managing to reduce classification accuracy from around 90% to almost 40%. As mentioned before, Random Forest seems to be the algorithm that performed best across datasets. Profiting from that, Sumathi et al. [27] propose a combined architecture, which adds a deep neural network.
This helps extract knowledge from the text without the need for human input to identify features. Nonetheless, the approach achieved an 88.59% accuracy on an unbalanced dataset containing 4,597 emails, of which only 1,813 are spam. Another approach that could fall into the deep learning category was implemented by He et al. [9], using a linguistic attribute hierarchy (LAH) embedded with linguistic Decision Trees and achieving a 91% AUC on SMS spam detection. It is worth mentioning that this approach helps tackle the curse of dimensionality, which is the main reason Decision Trees are often avoided. Barushka et al. [1] also state that spam detection algorithms cannot effectively deal with high-dimensional data and suffer from overfitting issues. To address these issues, the authors use n-gram TF-IDF feature selection to feed a modified distribution-based balancing algorithm, whose output is then processed by a regularized deep MLP with rectified linear units. Another advantage of this approach is that no additional dimensionality reduction is necessary, and spam dataset imbalance is addressed by the modified distribution-based algorithm. This method achieved a 96% AUC and was tested on SMS and Twitter data. Wu et al. [32] achieved similar results using an MLP and applying Word2Vec [20] to pre-process the input tweets; the resulting accuracy was between 92% and 99% depending on the dataset.

2.3.1 CNN and RNN

Both Convolutional (CNN) and Recurrent (RNN) neural networks have been applied very successfully in the fields of computer vision and speech recognition. Recent studies apply these techniques to the NLP (Natural Language Processing) field. CNNs are helpful to capture local and temporal features, including n-grams, from texts, while RNNs help handle the long-term dependencies of word sequences. Roy et al.
[25] made a comparison between a CNN and an RNN using long short-term memory (LSTM) units, on an unbalanced SMS spam dataset (4,827 messages, of which 747 were spam). The authors used GloVe [23] to vectorize the words. The best results were achieved with a 3-level CNN using dropout, which obtained an accuracy of 98.63%; the same model using 10-fold cross-validation obtained 99.44% accuracy. As for the LSTM, the best results were also obtained when using dropout, achieving a 96.76% accuracy. Following the state of the art in computer vision, Conneau et al. [6] studied the possibility of using deeper CNNs. Following the philosophy of VGG and ResNets, and combining convolutions of 3, 5 and 7 tokens, the authors tested networks of 9, 17, 29 and 49 convolutional layers on classification tasks of 2 to 14 classes (considerably fewer than those of computer vision). The aim of the study was to perform the processing at character level, in the same fashion that computer vision tasks operate at pixel level. The best results were obtained with 29 layers. This approach proved successful with an average of 98% accuracy for big datasets (3M), however, it performed poorly (around 60%) for small unbalanced datasets (100K to 700K). Madisetty et al. [17] proposed an approach that combines 5 CNNs and one feature-based Random Forest to predict spam on Twitter datasets. Each CNN uses different word embeddings (GloVe or Word2Vec) of different sizes to train the model. The Random Forest uses content-based, user-based, and n-gram features. In order to combine the results from the 6 different models, the authors used a meta-classifier based on a multilayer neural network. The method was tested on both balanced and imbalanced datasets and obtained an average of 99.06% AUC.
Nonetheless, it was shown that for big and balanced datasets the ensemble does not provide a significant improvement over the individual models, but it is capable of obtaining good results for small imbalanced datasets. Jain et al. [10][11] studied the possibility of adding semantic information to the data before training the models. The words are converted into word vectors using Word2Vec vectors of dimension 300. If a word is not present in Word2Vec, similar words are found using the WordNet [21] and ConceptNet [16] dictionaries, which can then be converted into word vectors. The authors applied this technique both to a CNN [10] and to an RNN with LSTM [11]. In addition, Jain et al. proposed an approach that uses both CNN and LSTM in a combined model [12] called Sequential Stacked CNN-LSTM (SSCL). The SSCL model extracts local region features and n-gram information using the CNN, and long-term dependency information using the LSTM network. The accuracy of the SSCL model is increased by 0.36% in the case of the SMS spam dataset and by 1.08% in the case of the Twitter dataset. When comparing SSCL with SLSTM, there is a significant increase in spam detection performance on the Twitter dataset (0.39%); in the case of the SMS spam dataset, there is an increase in precision, recall and F1 score. The SSCL model achieved a 99.01% accuracy on the SMS dataset and a 95.48% accuracy on the Twitter dataset.

2.4 Summary

In this section a brief summary of the presented methods is shown. Many surveys and comparisons were presented, but only the most relevant or best performing methods of each case are mentioned in the tables below. Note that features and their extraction techniques are not included, since they are usually the input of the classical methods. In addition, for readability, the content is divided into two tables: one for classical methods and one for neural-network-based methods.
Model | Ref | Score | Dataset | Details
RVM | [34] | 94% Acc | 6K emails | Low number of vectors; fast testing time
J48 | [24] | 92% Acc | 4.6K emails | Fast training time (0.06 s)
MLP | [24] | 93% Acc | 4.6K emails | 0% false positives; slow training time (9.48 s)
RF | [8] | 93.7% AUC | 5.7K web pages | Using Real AdaBoost
RF | [19] | 95.7% Acc | 100K tweets | Works well with imbalanced datasets
KNN | [29] | 90% Acc | 1K emails | Using an approximate-neighbour variation
Table 2.1: Classical methods summary.

Model | Ref | Score | Dataset | Details
RF+NN | [27] | 88.59% Acc | 4.6K emails | NN to extract features from text
LAH | [9] | 91% AUC | 5.5K SMS | Addresses the curse of dimensionality on DTs
MLP | [1] | 96% AUC | 5K SMS, 700 tweets | N-gram TF-IDF features; distribution-based balancing algorithm; rectified linear units
MLP+Word2Vec | [32] | 92% / 99% Acc | 10K / 100K tweets | Vector representation of the words and sentences
CNN+GloVe | [25] | 98.63% Acc | 4.8K SMS | Using dropout; 10-fold cross-validation increased accuracy to 99.44%
LSTM+GloVe | [25] | 96.76% Acc | 4.8K SMS | Using dropout
CNN, 29 layers | [6] | 98% Acc | 3M reviews | Accuracy drops significantly with small datasets
5 CNN+RF | [17] | 98% AUC | 14M tweets | Different vectors (GloVe, Word2Vec, etc.) and vector sizes; meta-classifier; balanced and imbalanced datasets; for small datasets single classifiers beat the ensemble
SCNN | [10] | 98.65% / 94.40% Acc | 5.5K SMS / 5K tweets | Semantic information using Word2Vec; adds WordNet and ConceptNet
SLSTM | [11] | 99.01% / 95.09% Acc | 5.5K SMS / 5K tweets | Semantic information using Word2Vec; adds WordNet and ConceptNet
SSCL | [12] | 99.01% / 95.48% Acc | 5.5K SMS / 5K tweets | Combines SCNN and SLSTM; improves accuracy over SCNN; improves precision, recall and F1 score over SLSTM
Table 2.2: Neural Network methods summary.

Chapter 3 Design and implementation

The first step to create the classifier is to understand the data.
This is done through descriptive analysis. As its outcome, several features were extracted, and a new Random Forest based model was trained and tested. Its results were compared with those of the existing model, which is based on TF-IDF vectorization of the word corpus of each record. All the code is available on GitHub (https://github.com/ppanero/zenodo-spam-classifier/tree/v1.0.0) in different Jupyter notebooks. Instructions on how to run and reproduce the results are available in the README.md file of the repository. In addition, all models, produced datasets and some intermediate vectors can be found in the Zenodo record where this work was published (https://doi.org/10.5281/zenodo.4283435).

3.1 Descriptive Analysis

The dataset used for this master thesis is published in Zenodo (https://zenodo.org/record/4114093) and can be resolved using the digital object identifier (DOI) 10.5281/zenodo.4114093.

3.1.1 Dataset description

The dataset comes in JSON Lines format and contains 1,722,305 (approximately 1.7M) rows and 25 columns. The columns correspond to the following attributes:
• alternate identifiers: A list of objects representing alternative identifiers for the record.
• imprint: Publication imprint (e.g. ISBN, place and publisher name).
• references: A list of raw textual references when the identifier is not known.
• thesis: Extra information for thesis records (e.g. supervisor).
• keywords: A list of free-text words.
• contributors: A list of composed objects representing each of the contributors and their role.
• title: Free-text string.
• subjects: A list of subjects, whose values belong to specific sets of controlled vocabularies.
• meeting: Extra information for meeting records (e.g. dates, place).
• access right: Access policy of the record; its value belongs to a controlled vocabulary (e.g.
Open Access, Restricted).
• files: Extra metadata belonging to the associated files.
• part of: Reference to a master record if it is part of another.
• description: Free-text description of the record.
• journal: Information on the journal in which the record content was published.
• communities: A list of strings referencing one community each.
• publication date: A YYYY-MM-dd formatted string.
• owners: A list of integers corresponding to the users' IDs.
• doi: A string following the DOI format.
• license: A string whose value belongs to a controlled vocabulary enumeration.
• notes: Free-text internal notes for administrators of the system.
• spam: A boolean flag stating whether the record is spam (True) or ham (False).
• recid: The primary ID of the record in the system, as an integer.
• creators: A list of composed objects representing each creator.
• resource type: A type and, if existing, subtype of the record; these values come from a controlled vocabulary.
• related identifiers: A list of identifiers of related work.
For those attributes that are composed objects or lists thereof, the exact structure and fields can be found in the JSONSchema definition 5. The next step is to verify which attributes can be used, meaning which ones have a value and whether missing values could be inferred. Table 3.1 shows the percentage of null values for each attribute, divided per target class. It can be observed that alternate identifiers, imprint, references, thesis, contributors, subjects, meeting, part of, journal, notes, and related identifiers will not be useful, since a significant amount of data is missing for both classes. In addition, publication date is not useful: it represents a date introduced by the user, not the record creation date. Since it can be chosen at random or simply set to the current day, it could bias the classifier.
A similar situation happens with owners: it is an incremental integer that represents the user. Since it is not possible to enrich the current dataset with user data, the attribute does not provide any useful information and will be removed. In addition, since Zenodo assigns DOIs to every record, the doi attribute will not be useful either, as only a small minority of the ham data has an externally managed DOI. Therefore, the resulting dataset will consist of the following fields: keywords, title, access right, files, description, communities, publication date, license, creators and resource type. In addition, spam will be kept as the target class, and recid as the index value, in order to be able to identify the records. In the following section more features will be extracted from these attributes. To conclude this section, the spam attribute is examined. It represents the classification label, or target class. The dataset contains 1,684,521 ham and 37,784 spam records, which makes the dataset highly imbalanced. This can be seen graphically in figure 3.1.
Figure 3.1: Spam vs Ham in the dataset.
5 https://zenodo.org/schemas/records/record-v1.0.0.json

Attribute | Spam (% null) | Ham (% null)
alternate identifiers | 99.97 | 79.51
imprint | 98.98 | 81.8
references | 99.14 | 92.97
thesis | 99.88 | 99.7
keywords | 49.1 | 43.81
contributors | 98.83 | 98.36
title | 0 | 0
subjects | 99.01 | 99.06
meeting | 98.50 | 96.76
access right | 0 | 0
files | 0 | 0
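The per-class null percentages shown in Table 3.1 can be computed with a short routine along these lines. This is a pure-Python sketch over toy records shaped like the dump (the accompanying notebooks perform the actual analysis over the full 1.7M-row dataset):

```python
def null_percentages(records, attributes):
    """Percentage of missing values per attribute, split by the
    `spam` target class (True = spam, False = ham)."""
    result = {}
    for attr in attributes:
        stats = {}
        for is_spam in (True, False):
            group = [r for r in records if r["spam"] == is_spam]
            # treat None, empty list and empty string as missing
            missing = sum(1 for r in group if r.get(attr) in (None, [], ""))
            stats["spam" if is_spam else "ham"] = round(100 * missing / len(group), 2)
        result[attr] = stats
    return result

# Toy records (hypothetical values, same shape as the Zenodo dump)
records = [
    {"spam": True, "title": "Buy now", "references": None},
    {"spam": True, "title": "Cheap pills", "references": None},
    {"spam": False, "title": "Thesis", "references": ["doi:10/abc"]},
    {"spam": False, "title": "Dataset", "references": None},
]
print(null_percentages(records, ["references"]))
# {'references': {'spam': 100.0, 'ham': 50.0}}
```

Attributes where both classes show near-total missingness, as in Table 3.1, carry little discriminative signal and can be dropped before feature extraction.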