Table 3.1: Attribute presence in the dataset (%).

part of               98.99   98.91
description            0       0
journal               97.58   76.99
communities           98.2    52.53
publication date       0       0
owners                 0       0
doi                    0       0.02
license                1.21    2.23
notes                 79.83   93.1
spam                   0       0
recid                  0       0
creators               0       0
resource type          -       -
related identifiers   96.26   71.39

3.2 Classical methods

In this section, Random Forest based algorithms are tested on TF-IDF encoded text data. In addition, features are extracted from the previously selected attributes, and a new dataset containing only categorical data is generated. This new dataset is tested on a Random Forest classifier.

3.2.1 Feature extraction

As shown in the dataset description, many attributes are complex objects that cannot be processed directly by the classifier. Based on expert knowledge gathered from the Zenodo developers and supporters, several features are extracted, and bar plots comparing the spam class of each new attribute are shown. Due to the imbalanced nature of the dataset, the displayed values are normalized so that the two classes can be compared.

num keywords: The keywords attribute is a list of words; this feature represents the number of words in that list. The distribution was analysed using the Python library Seaborn (https://seaborn.pydata.org/), and the values were grouped into buckets to avoid high-dimensional vectors in later stages. There were records with up to 250 keywords; however, the number of records with more than 10 keywords was minimal (a total of 75K), so they were all placed in the same bucket (10). The feature can therefore take one of 11 classes, from 0 to 10 (both included). The distribution per target class is shown in Figure 3.2.

Figure 3.2: Normalized number of keywords per class.

num files: The files attribute is a list of objects containing very detailed information about each file (e.g. its checksum). However, only the number of files and their types have proven useful to the experts. The num files feature represents the number of files associated with the record. The dataset contains records whose number of files ranges from 0 to 144402, but the number of records diminishes greatly as the number of files increases. As in the previous case, buckets were created, resulting in 8 classes, from 0 to 7. The first four represent the exact number of files, while the last four represent the ranges shown in Table 3.2.

Table 3.2: Number of files ranges.

class   range
4       [4, 10)
5       [10, 30)
6       [30, 50)
7       [50, max]

The values shown in Figure 3.3 confirm what was stated by the Zenodo supporters: most spam records contain only one file. However, approximately the same proportion of ham records also contain a single file. On the other hand, records with 4 or 5 files seem to predominate among spam records (note that 4 and 5 are range classes, and therefore mean between 4 and 30 files).

Figure 3.3: Normalized number of files per class.

has image: Using the filetype value of each file object, we can check whether the record contains at least one file of one of the following types: jpg, jpeg, png, bmp, gif, tiff, exif, ppm, pgm, pbm, pnm, webp, svg. Note that Zenodo does not provide a list of what is understood as an image format; this list was created in an ad-hoc manner according to the author's knowledge. Figure 3.4 confirms the Zenodo staff's experience: the proportion of spam records with an image file is almost double that of spam records without one.

Figure 3.4: Normalized amount of records with an image file.
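As an illustration, the bucketing used for num files (Table 3.2) and the has image check can be sketched as follows. This is a minimal sketch; the record layout (a list of file dictionaries with a filetype key) is an assumption based on the description above.

    # Ad-hoc image extension list from the text above.
    IMAGE_EXTENSIONS = {
        "jpg", "jpeg", "png", "bmp", "gif", "tiff", "exif",
        "ppm", "pgm", "pbm", "pnm", "webp", "svg",
    }

    def num_files_bucket(n):
        """Map a raw file count to one of the 8 classes of Table 3.2."""
        if n < 4:
            return n      # classes 0-3 hold the exact count
        if n < 10:
            return 4      # [4, 10)
        if n < 30:
            return 5      # [10, 30)
        if n < 50:
            return 6      # [30, 50)
        return 7          # [50, max]

    def has_image(files):
        """True if at least one file looks like an image (ad-hoc list above)."""
        return any(f.get("filetype", "").lower() in IMAGE_EXTENSIONS for f in files)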
num communities: This feature represents the number of communities the record belongs to. Records have been shown to belong to between 0 and 8 communities; however, the number of records belonging to more than 2 communities is negligible, so they were all placed in the same bucket. It can be seen in Figure 3.5 that almost all spam records do not belong to any community.

Figure 3.5: Normalized number of communities per class.

num creators: From the creators field we can extract their number, as well as some values that might increase the trust in the person publishing the record: identifiers (i.e. ORCID, https://orcid.org) and affiliations (e.g. an associated institution). This feature represents the number of creators of the record. As for the previously extracted features, buckets were created, leaving a total of 8 possible classes (from 1 to 8), where 6, 7 and 8 are the ranges shown in Table 3.3.

Table 3.3: Number of creators ranges.

class   range
6       [5, 7)
7       [7, 10)
8       [10, max)

In Figure 3.6 it can be seen that a significant portion of the spam records contain only one creator.

Figure 3.6: Normalized number of creators per class.

creator has orcid: This feature denotes that at least one creator has an ORCID identifier. Figure 3.7 shows that there is no significant difference between ham and spam data with respect to ORCID identifiers. It seems unrealistic that an entity such as ORCID would verify an author submitting spam; this was checked with the Zenodo staff, and it turns out that the identifiers are not verified against ORCID.

Figure 3.7: Normalized records with a creator identified by ORCID.

creator has affiliation: This feature denotes that at least one creator has an affiliation. As with the previous feature, there is no significant difference between ham and spam data with respect to the creators' affiliations, as shown in Figure 3.8.

Figure 3.8: Normalized records with a creator with an affiliation.

type: The resource type attribute has a main type and, in some cases, a subtype. Zenodo's default value in the upload form is type publication with subtype article. Both Figure 3.9 and, more specifically, Figure 3.10 confirm that spammers do not put effort into changing this value. In addition, it can be observed that an important amount of content belongs to the resource type image, which might conflict with the belief of Zenodo's supporters that "most spam only contains an image". While it may be true that spam records contain an image, this does not by itself distinguish them from ham records: the has image feature showed that about 35% of ham records also contain an image.

Figure 3.9: Normalized records by main resource type.

Figure 3.10: Normalized records by resource type and subtype.

type full: The resource type expanded with its subtype, using the format type-subtype.

license: In the original attribute, five license values together cover up to 85% of the records: notspecified, cc-zero, CC-BY-4.0, CC-BY-SA-4.0 and cc-by. The rest were therefore set to other. Null values form a separate category (no-license), which leaves a total of 7 classes. The distribution is shown in Figure 3.11.
Zenodo's upload form default value is CC-BY-4.0, which shows once again that spammers do not put effort into changing defaults.

Figure 3.11: Normalized records by license.

num words title: The literature states that the number of words used in the text can distinguish spam from ham. Therefore, the word counts of both title and description were extracted. Note that the text was first cleaned of HTML tags. After observing the distribution, several buckets were created, resulting in a total of 31 possible classes: from 0 to 24, plus the ranges shown in Table 3.4.

Table 3.4: Number of words in the title ranges.

class   range
25      [25, 30)
30      [30, 35)
35      [35, 40)
40      [40, 45)
45      [45, 50)
50      [50, max)

Figure 3.12 shows that the distribution of the number of words in the title differs slightly between classes. Spam records tend to contain between 5 and 15 words, while ham has its peak at 4 and decreases significantly afterwards. In addition, the number of spam records with more than 20 words in the title is negligible.

Figure 3.12: Normalized number of title words per class.

num words description: In the same fashion as the title, the number of words in the description was extracted; the text was likewise cleaned of HTML tags. After observing the distribution, buckets were created, resulting in 26 classes: from 0 to 9, plus the 16 ranges shown in Table 3.5.

Table 3.5: Number of words in the description ranges.

class   range
10      [10, 15)
15      [15, 20)
20      [20, 30)
30      [30, 40)
40      [40, 50)
50      [50, 75)
75      [75, 100)
100     [100, 150)
150     [150, 200)
200     [200, 300)
300     [300, 400)
400     [400, 500)
500     [500, 1000)
1000    [1000, 2000)
2000    [2000, 3000)
3000    [3000, max)

It can be observed in Figure 3.13 that spam records tend to contain a large amount of text in the description. However, there is a similar proportion of ham records in the ranges from 50 to 200.

Figure 3.13: Normalized number of description words per class.

access right: This attribute was not modified, since it is already divided into classes, and being a required field it obliges both spam and ham records to have a non-null value. Zenodo is oriented towards open science, so the default value for this attribute is open. It can be seen in Figure 3.14 that ham and spam are almost identical with respect to access right.

Figure 3.14: Normalized access right per class.

text: This attribute contains the full text of the keywords, title and description. All punctuation marks and HTML tags have been removed.

text 4000: As can be seen in Figure 3.13, the number of records with more than 3000 words is small. The same happens for the title after 50 words (see Figure 3.12) and for the keywords after 10 (see Figure 3.2). In addition, according to the literature, the valuable information is contained at the beginning of the text corpus. Since no specific quantities were given, sensible values were chosen: 3500 words for the description, 400 for the title, and 100 for the keywords. Moreover, very long texts become a problem at processing time, since they turn into even larger vectors; full-text cases of more than 25000 words were seen in the dataset. To tackle this problem, the text in this feature has been reduced to a maximum of 4000 words using the previously mentioned limits. Note that these limits leave a small margin over what was observed in the figures (3000, 50 and 10 words respectively).
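A minimal sketch of this truncation follows, assuming records are dictionaries with keywords, title and description fields already cleaned of HTML; the field names are an assumption.

    # Per-field word limits chosen above (3500 + 400 + 100 <= 4000).
    LIMITS = {"description": 3500, "title": 400, "keywords": 100}

    def truncate_words(text, limit):
        """Keep only the first `limit` whitespace-separated words."""
        return " ".join(text.split()[:limit])

    def text_4000(record):
        """Concatenate keywords, title and description, capped at ~4000 words."""
        parts = [
            truncate_words(" ".join(record.get("keywords", [])), LIMITS["keywords"]),
            truncate_words(record.get("title", ""), LIMITS["title"]),
            truncate_words(record.get("description", ""), LIMITS["description"]),
        ]
        return " ".join(p for p in parts if p)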
3.2.2 Random Forest based models

Three models were trained: one using the extracted features, one using the combined text corpus (keywords, title, description) at full length, and one using the corpus reduced to 4000 words. In all cases the hyperparameters of the existing Zenodo classifier were used: 100 estimators and 4 parallel jobs. For the Random Forest with extracted features as input, the variables were encoded using one-hot encoding, while the text based content was vectorised using TF-IDF in the same manner as the existing classifier: a total of 8000 features and an ngram range of (1, 1). English stop words were removed when creating the TF-IDF vectorization. For the sake of simplicity, we will refer to the model using the extracted features as model A, to the full-text one as model B, and to the 4000-words text one as model C.

The difference in accuracy for the ham class is small: model A obtained an accuracy of 99.89%, while models B and C obtained 99.98%. However, for the spam class models B and C perform significantly better, with accuracies of 91.90% and 91.89% against 88.53%. The confusion matrices are shown in Figures 3.15, 3.16 and 3.17.

Figure 3.15: Model A confusion matrix.

Figure 3.16: Model B confusion matrix.

Figure 3.17: Model C confusion matrix.

On the other hand, the training and prediction times of model A are significantly better than those of models B and C. Model A took approximately 6 minutes to train and 2 seconds to predict one third of the dataset, while models B and C took more than 45 minutes for training and 24 seconds for prediction. Note that the training time for models B and C includes the TF-IDF vectorization, which took 14 and 13 minutes respectively. Table 3.6 summarises these three models.

Table 3.6: Random Forests comparison.

Model                    Ham      Spam     Training time   Prediction time
Feature extraction (A)   0.9989   0.8853   6.1 min         2 s
Full text (B)            0.9998   0.9190   47.4 min        25.3 s
Text 4000 (C)            0.9998   0.9189   45.2 min        24.3 s

In order to verify the level of contamination, the records originally classified as ham but detected as spam by all three models were manually checked. The intersection consisted of 13 records, which turned out to be legitimate content (i.e. a classification mistake made by all three models).
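For reference, the training setup of models B and C described above can be sketched in scikit-learn as follows. This is a minimal sketch: the texts and labels variables are assumed to be loaded beforehand, and the split reflects the "one third of the dataset" used for prediction.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    # texts: list of record texts; labels: 0 (ham) / 1 (spam). Assumed loaded.
    vectorizer = TfidfVectorizer(max_features=8000, ngram_range=(1, 1),
                                 stop_words="english")
    X = vectorizer.fit_transform(texts)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=1/3)

    # Hyperparameters of the existing Zenodo classifier.
    model_c = RandomForestClassifier(n_estimators=100, n_jobs=4)
    model_c.fit(X_train, y_train)
    print(model_c.score(X_test, y_test))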
3.2.3 Conclusions and future work

The descriptive analysis showed that some of the knowledge gathered from the supporters' and developers' experience was correct (e.g. spam records contain a file). On the other hand, other attributes were shown not to separate the classes (e.g. access right). However, they might do so when combined with others, and so they were kept in the produced dataset. Some of the features that proved useful in the literature, like the number of words, were extracted. Nonetheless, many others could not be extracted because the dataset lacks the required attributes. Such is the case of the creation time, which could enable a time series based analysis, and of user related features.

This led to the creation of a new Random Forest classifier, which used the same hyperparameters. This new model (model A) obtained an accuracy similar to the full-text models (models B and C) for the ham class, but is 3.37% less accurate for the spam class. On the other hand, if training and prediction speed is an important aspect, model A is roughly an order of magnitude faster than models B and C. In consequence, a first spamicity check in pseudo real-time, as well as fast re-training, could be carried out using this new model. It is important to note that extracted features are more easily learnt, and faked, than the text of the record, so spammers could circumvent the new classifier more easily.

To conclude, model C (4000-words text Random Forest with TF-IDF vectorization) does not present significant differences compared to model B (full-text Random Forest with TF-IDF vectorization). However, it consumes less memory when stored and results in smaller vectors, an important fact for the Neural Network based classifiers. Therefore, it is the one that will be used from now on in this work.

3.3 Neural Networks

The length of the text did not have a great impact on the accuracy of the Random Forest models. However, it significantly affects the memory and time performance of the neural networks. Therefore, the 4000-words text will be used in this section. In addition, the dataset is highly imbalanced, with approximately 37 thousand spam records and 1.6 million ham records (45 times the amount of spam).

A simple NN with 2 dense layers and a small scale VGG network (2x8 plus 2x16 1D convolutional layers with kernel size 3, using max pooling with the default pool size and a dropout of 0.1) were trained on the whole dataset. Both obtained between 97% and 98% accuracy. However, when checking the accuracy per class, it could be seen that the spam class had almost 0% accuracy, meaning that the models had learned only the ham class. This makes sense, since 37K records are approximately 2% of the full dataset (1.7M).

To deal with this problem, a new balanced dataset was generated, containing 37K records per class (a total of roughly 75K). The new balanced dataset was created by under-sampling the majority class (ham). To be certain that there was no loss of information, the Condensed Nearest Neighbors (CNN) under-sampling technique was tested first. It is a technique that seeks a subset of a collection of samples that results in no loss in model performance, referred to as a minimal consistent set. However, this process is very slow and not easily parallelizable, and it was stopped after 4 days of run time. Therefore, the dataset was balanced using random under-sampling.

To make use of word embeddings, the resulting dataset was encoded using a hashing technique (called one hot encoding by Keras), which assigns an integer to each distinct word in the vocabulary. In addition, the vectors were padded so that they all have the same length. To visualize the data points in 2D and 3D, these vectors were reduced to two and three components using PCA. The results can be seen in Figures 3.18 and 3.19; they show that both classes of records seem to be highly similar.

Figure 3.18: Balanced dataset 2D representation.

Figure 3.19: Balanced dataset 3D representation.

Finally, the dataset was split into training, validation and test sets, so that all the networks use the same data and the results are comparable. The resulting sets contain 47607, 5290 and 22671 records respectively.
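The preprocessing just described (random under-sampling, Keras one_hot hashing and padding) can be sketched as follows. This is a minimal sketch: texts and labels are assumed to be loaded, and the vocabulary size is illustrative since the thesis does not state its exact value.

    import numpy as np
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.preprocessing.text import one_hot

    vocab_size = 50000                               # illustrative assumption
    rng = np.random.default_rng(42)

    spam_idx = np.flatnonzero(labels == 1)
    ham_idx = rng.choice(np.flatnonzero(labels == 0),
                         size=len(spam_idx), replace=False)  # under-sample ham
    idx = np.concatenate([spam_idx, ham_idx])

    encoded = [one_hot(texts[i], n=vocab_size) for i in idx]  # Keras hashing trick
    max_len = max(len(seq) for seq in encoded)
    X = pad_sequences(encoded, maxlen=max_len, padding="post")  # equal-length vectors
    y = labels[idx]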
3.3.1 Neural networks from the literature

Several network configurations based on the work of Jain et al. are tested on the balanced dataset. This includes convolutional neural networks [10], recurrent neural networks [11] and a combination of both [12]. Due to the multi-language nature of the dataset, an embedding layer of 200 dimensions was added to the networks instead of using pre-trained embeddings. In addition, the learning rate of the optimizer (Adagrad) was left at its default value (0.1), as was the batch size (32). All models were trained for 10 epochs, as stated in the literature.

3.3.1.1 CNN using word embeddings

The configurations used by Jain et al. [10] are presented in Table 3.7. These networks are built using 1-dimensional convolutional (Conv1D) filters.

Table 3.7: Literature CNN networks configuration.

                      A                  B
Number of filters     128                54
Filter length         5                  4
Dropout               0.1                0.2
Activation function   ReLU               ReLU
Optimizer             Adagrad (lr 0.1)   Adagrad (lr 0.1)
Epochs                10                 10

Both models were trained for 10 epochs. However, they have margin to improve, since both validation and training accuracy keep increasing and there is no hard sign of overfitting. This is shown in Figures 3.20 and 3.21.

Figure 3.20: Literature configuration A CNN training accuracy and loss.

Figure 3.21: Literature configuration B CNN training accuracy and loss.

In terms of performance compared to the Random Forests, both networks are trained in approximately half the time (16 min vs 30 min). However, these networks are trained on a much smaller dataset, and their ability to generalize to the full dataset still needs to be tested.

Table 3.8: Literature CNN networks training metrics.

                                A       B
Training time (avg per epoch)   116 s   99 s
Training accuracy               95.1%   94.8%
Test accuracy                   95%     94.6%

Against the 22K-records test dataset mentioned at the beginning of the section, both models performed better than the feature extraction Random Forest by about 2%. It is worth noticing that, contrary to the Random Forest models, these neural networks do a better job detecting spam records, by 4 to 5%, losing approximately the same amount of accuracy on ham records. Since one of the requirements is to avoid false positives, the better performing model of the two is model B, whose precision for the spam class is 1% higher.

Table 3.9: Literature CNN networks test metrics.

              Ham    Spam
A  Precision  0.93   0.97
   Recall     0.97   0.93
   F1 Score   0.95   0.95
B  Precision  0.92   0.98
   Recall     0.98   0.91
   F1 Score   0.95   0.94
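For reference, configuration A from Table 3.7 can be written in Keras as follows. This is a minimal sketch: the pooling and output layers are not fixed by the table and are therefore assumptions, and vocab_size, max_len and the data splits come from the preprocessing sketch above.

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                         GlobalMaxPooling1D)
    from tensorflow.keras.optimizers import Adagrad

    cnn_a = Sequential([
        Embedding(input_dim=vocab_size, output_dim=200, input_length=max_len),
        Conv1D(filters=128, kernel_size=5, activation="relu"),
        GlobalMaxPooling1D(),          # assumed; not specified by Table 3.7
        Dropout(0.1),
        Dense(2, activation="softmax"),
    ])
    cnn_a.compile(optimizer=Adagrad(learning_rate=0.1),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    cnn_a.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=10, batch_size=32)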
3.3.1.2 CNN using TF-IDF vectorization

In order to assess the usefulness of word embeddings, the same two CNN configurations were tested on TF-IDF vectors, both with the full and the 4000-words text, resulting in four models. The results on the two datasets were very similar, or even slightly better with the 4000-words one. For that reason, the information presented below corresponds to the two models trained on that dataset.

Contrary to the one hot encoded vectors, the TF-IDF ones show a certain distinction between classes when reduced to two and three components using PCA. This is shown in Figures 3.22 and 3.23.

Figure 3.22: Balanced dataset 2D representation of TF-IDF vectorization.

Figure 3.23: Balanced dataset 3D representation of TF-IDF vectorization.

Both models were trained for 10 epochs. Configuration A seems to have reached its best possible performance at epoch 3, while configuration B seems to always overfit. This is shown in Figures 3.24 and 3.25.

Figure 3.24: Literature configuration A with TF-IDF vectors training accuracy and loss.

Figure 3.25: Literature configuration B with TF-IDF vectors training accuracy and loss.

In terms of performance, both networks are trained faster than any of the other methods (including the Random Forests), taking between 3 and 5 minutes. However, the accuracy is significantly lower, by more than 10%, as shown in Table 3.10.

Table 3.10: Literature CNN networks with TF-IDF vectorization training metrics.

                                A       B
Training time (avg per epoch)   27 s    20 s
Training accuracy               85.1%   85.2%
Test accuracy                   80.9%   78.4%

It is worth mentioning that both networks manage to detect spam with 100% precision, which could be due to overfitting. On the other hand, they lose more than 10% of accuracy on the ham class.

Table 3.11: Literature CNN networks with TF-IDF vectorization test metrics.

              Ham    Spam
A  Precision  0.73   1.00
   Recall     1.00   0.70
   F1 Score   0.87   0.83
B  Precision  0.70   1.00
   Recall     1.00   0.56
   F1 Score   0.82   0.72

3.3.1.3 RNN using word embeddings

The configurations used by Jain et al. [11] are presented in Table 3.12. These networks are built using Long Short Term Memory (LSTM) units.

Table 3.12: Literature RNN networks configuration.

                      A                  B
Number of units       100                100
Dropout               0.1                0.2
Activation function   Sigmoid            Sigmoid
Optimizer             Adagrad (lr 0.1)   Adagrad (lr 0.1)
Epochs                10                 10

Both models were trained for 10 epochs. As can be seen in Figures 3.26 and 3.27, the validation accuracy drastically decreases (and the loss increases) around epoch 5 for configuration A and epoch 7 for configuration B. However, these seem to be local minima, since afterwards the validation values rise above the training ones; the models might therefore obtain better results if trained for more epochs.

Figure 3.26: Literature configuration A RNN training accuracy and loss.

Figure 3.27: Literature configuration B RNN training accuracy and loss.

In terms of performance compared to the Random Forests, both networks are trained in approximately double the time (60 min vs 30 min).

Against the 22K-records test dataset mentioned at the beginning of the section, and compared to the feature extraction Random Forest, configuration A does not give significant improvements, while configuration B performs better by about 2%. Focusing on configuration B, it is worth noticing that, contrary to the Random Forest models, this network does a better job detecting spam records, by 6%, losing approximately the same amount of accuracy on ham records. In this case, configuration B is chosen.

Table 3.13: Literature RNN networks training metrics.

                                A       B
Training time (avg per epoch)   369 s   365 s
Training accuracy               92.5%   96%
Test accuracy                   92.3%   95.7%

Table 3.14: Literature RNN networks test metrics.

              Ham    Spam
A  Precision  0.87   0.99
   Recall     0.99   0.85
   F1 Score   0.93   0.93
B  Precision  0.93   0.99
   Recall     0.99   0.92
   F1 Score   0.96   0.96

3.3.1.4 Combining CNN, RNN and word embeddings

The configurations used by Jain et al. [12] are presented in Table 3.15. For short, this mixed configuration of convolutional and recurrent networks will from now on be referred to as CRNN.

Table 3.15: Literature CRNN networks configuration.

                             A                  B
Number of filters (Conv1D)   128                54
Filter length (Conv1D)       5                  4
Activation function          ReLU               ReLU
Number of units (LSTM)       100                100
Dropout                      0.1                0.2
Optimizer                    Adagrad (lr 0.1)   Adagrad (lr 0.1)
Epochs                       10                 10

Both models were trained for 10 epochs.
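A minimal Keras sketch of CRNN configuration A from Table 3.15 follows: a Conv1D block feeding an LSTM. The pooling and output layers are assumptions, as the table does not fix them.

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                         LSTM, MaxPooling1D)
    from tensorflow.keras.optimizers import Adagrad

    crnn_a = Sequential([
        Embedding(input_dim=vocab_size, output_dim=200, input_length=max_len),
        Conv1D(filters=128, kernel_size=5, activation="relu"),
        MaxPooling1D(),                # default pool size 2
        LSTM(100),
        Dropout(0.1),
        Dense(2, activation="softmax"),
    ])
    crnn_a.compile(optimizer=Adagrad(learning_rate=0.1),
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])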
In terms of training behaviour, configuration A might have reached its best performance at epoch 8, since afterwards the validation accuracy starts decreasing, although this might be a local minimum, as happens at the 7th epoch of configuration B. Both models might achieve better results if trained for more epochs. This is shown in Figures 3.28 and 3.29.

Figure 3.28: Literature configuration A CRNN training accuracy and loss.

Figure 3.29: Literature configuration B CRNN training accuracy and loss.

In terms of performance compared to the Random Forests, both networks are trained in approximately one third more time (40 min vs 30 min). However, these networks are trained on a much smaller dataset.

Table 3.16: Literature CRNN networks training metrics.

                                A       B
Training time (avg per epoch)   250 s   220 s
Training accuracy               97%     96.4%
Test accuracy                   96.5%   96%

Against the 22K-records test dataset mentioned at the beginning of the section, both models performed better than the feature extraction Random Forest by about 3%. It is worth noticing that, contrary to the Random Forest models, these neural networks do a better job detecting spam records, by 8%, reaching 99%, while losing only 4 to 6% of accuracy on ham records. Since the precision for spam records is the same in both models, but configuration A performs better on the ham class, configuration A is the one chosen in this section.

Table 3.17: Literature CRNN networks test metrics.

              Ham    Spam
A  Precision  0.95   0.99
   Recall     0.99   0.94
   F1 Score   0.97   0.96
B  Precision  0.93   0.99
   Recall     0.99   0.93
   F1 Score   0.96   0.96

3.3.1.5 Literature networks conclusions and future work

In terms of vectorization, and even though it is not perceptible to the human eye in the 2D and 3D representations, word embeddings manage to extract meaning from the one hot encoded vectors, resulting in an improvement of around 10% accuracy. Of all the tested configurations, the CRNN network with configuration A was the best performing one, with 97% accuracy on training and 96.5% on testing, and a high 99% precision on the spam class, which means a small number of false positives. Therefore, this network configuration will be further investigated in the following sections.

On almost all the networks using word embeddings, more training epochs might be needed to reach their full potential. In addition, all networks had better precision and recall on the spam class; this could be due to the different languages present in the content, and will be tested in Section 3.3.2 (CRNN on English-only content).

3.3.2 CRNN on English-only content

The models presented in the previous section achieve a high accuracy for the spam records, but seem to be incapable of properly generalizing the ham class. One possible reason is the multi-language nature of the data: the original dataset was highly imbalanced, and it is possible that, when creating a balanced dataset by random selection on the majority class, a few ham records were taken in a language that is not represented among the spam ones.

3.3.2.1 Language analysis

The first step is to look at the language distribution of the balanced dataset. This is done using the langdetect Python library. As can be seen in Figure 3.30, the large majority of the content is in English, with approximately 55K records, followed by French and German, both with fewer than 5K records.

Figure 3.30: Language distribution on the full balanced dataset.

Figure 3.31: Language distribution on the ham records.
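The distributions behind these figures can be obtained with a few lines of langdetect; a minimal sketch, reusing the texts and labels variables from before (the counting structure itself is an illustrative choice):

    from collections import Counter
    from langdetect import detect, DetectorFactory
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0                 # make langdetect deterministic
    counts = {"ham": Counter(), "spam": Counter()}
    for text, label in zip(texts, labels):
        try:
            lang = detect(text)
        except LangDetectException:          # e.g. empty or non-textual content
            lang = "unknown"
        counts["spam" if label == 1 else "ham"][lang] += 1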
However, Figures 3.31 and 3.32 show that the distribution of the non-English languages varies significantly between the two target classes (e.g. German records are mostly ham, while French records are mostly spam). Therefore, a new dataset with only English records was created. It contains 31018 ham and 25172 spam records (a total of 56190).

Figure 3.32: Language distribution on the spam records.

Once again, a PCA reduction was performed to try to verify visually the difference between the classes. Figures 3.33 and 3.34 show that, as with the multi-language content, there are no evident differences.

Figure 3.33: English dataset PCA reduction to 2D.

Figure 3.34: English dataset PCA reduction to 3D.

3.3.2.2 Running literature models on English content

In this case, while the CNN networks seem to need more training epochs (Figures 3.35 and 3.36), the CRNN (Figures 3.39 and 3.40) seems to reach its limit at the 6th epoch. The RNN might still need more epochs (Figure 3.37); however, its configuration B behaved erratically, as shown in Figure 3.38, most likely due to the increased dropout of this configuration or a too-fast learning rate (the default, 0.1). Note that this model was run several times with similar results.

Figure 3.35: CNN configuration A on English content.

Figure 3.36: CNN configuration B on English content.

Figure 3.37: CRNN configuration A on English content.

Figure 3.38: CRNN configuration B on English content.

Figure 3.39: RNN configuration A on English content.

Figure 3.40: RNN configuration B on English content.

In terms of precision, no model behaved better than the one chosen in the previous section (CRNN with configuration A). The CNNs are between 3% and 4% less accurate. The RNNs, although still not better than the previous models, improve their accuracy by approximately 10%. The closest model is the CRNN with configuration B, which obtained the same results except for a 0.01 lower recall on the spam class.

It is worth mentioning that, since the content was English only, stop words were removed. This reduced the size of the vectors to a third (from 4.3K to 1.5K in length). The final impact was a speed-up on all models, which were trained in between a half and a third of the time of the multi-language ones.

Overall, these models decreased in accuracy. One could therefore question whether the stop words themselves make a difference between ham and spam. As a consequence, the 6 models were trained over English-only content including the stop words. As with the previous models, the per-epoch metrics show that more training could improve the accuracy. However, using the 10 epochs stated in the original papers, the accuracy on English content with stop words only differs for configuration A of the RNN and the CRNN, which obtained a 6% and a 1% increase on the ham class respectively.

3.3.2.3 Conclusions on English-only content

Since no model obtained significantly better results than in the previous section, it can be concluded that language is not what prevents the models from correctly classifying the missing 3% or 4%, mostly in the ham class. It is likely simply due to the lack of training data for such cases.

Concerning a per-language classification, it would be interesting to look further into the usage of stop words. Classification removing them, and keeping them along with the text, has been done; however, what happens when only stop words are used is still unclear.
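A first step in that direction could be a stop-word ratio per record; a minimal sketch, assuming scikit-learn's built-in English stop word list:

    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    def stopword_ratio(text):
        """Fraction of a record's words that are English stop words."""
        words = text.lower().split()
        if not words:
            return 0.0
        return sum(w in ENGLISH_STOP_WORDS for w in words) / len(words)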
Such a measure could, for example, become a feature to extract and use in a Random Forest.

3.3.3 Deeper Neural Networks

The previous sections showed the results of two predefined configurations over the three types of networks (CNN, RNN and CNN+RNN). In many of those cases the training results indicated that more epochs could improve the classification. In addition, the two configurations proposed in the papers use a large number of LSTM units and convolutional filters, which results in long training times. In this section, network configurations with fewer units and filters, but more layers and epochs, different optimizers (Adam or Adagrad) and different embedding sizes (200 and 300) are trained. For example, several VGG-like configurations starting with 8, 16 or 32 filters, and networks as deep as 12 layers with 4 or 8 filters, were tested. In most cases LSTM layers were added, but tests without them were also conducted. The full list of configurations is available in Annex A.

In general terms, deeper networks tend to reach their maximum accuracy in fewer epochs, usually around the 5th, as shown in Figure 3.41. However, they were trained for between 10 and 20 epochs, to discard the possibility of local minima.

Figure 3.41: Deep NN configuration I metrics.

Table 3.18 shows the results obtained by the best performing configuration of Section 3.3.1, and Table 3.19 those obtained by the six best performing deeper models (their configurations are listed in Annex A, along with the other 24 tested models).

Table 3.18: Performance metrics of configuration A CRNN.

            Ham    Spam
Precision   0.95   0.99
Recall      0.99   0.94
F1 Score    0.97   0.96

Table 3.19: Deeper Neural Networks metrics.

Model        Precision   Recall   F1 Score   Epochs   Training time (per epoch)
I    Ham     0.93        0.99     0.96       20       215 s
     Spam    0.99        0.93     0.96
II   Ham     0.95        0.98     0.97       20       140 s
     Spam    0.98        0.95     0.96
III  Ham     0.95        0.99     0.97       20       170 s
     Spam    0.98        0.95     0.97
IV   Ham     0.95        0.98     0.97       20       220 s
     Spam    0.98        0.95     0.97
V    Ham     0.94        0.99     0.96       20       203 s
     Spam    0.97        0.94     0.96
VI   Ham     0.95        0.99     0.97       20       210 s
     Spam    0.99        0.95     0.97

As can be seen, the differences occur mostly in the ham class. All models except the last one (Model VI) perform worse, by between 1% and 3%. Model VI, however, obtains the same results and improves the F1 Score of the spam class by 1%. In addition, this network trains significantly faster, with an average of 210 s per epoch against the 250 s of the network proposed in the literature. The configuration of this model is similar to the VGG network, and is as follows (a Keras sketch is given after the list):

• Embedding layer of 300 dimensions
• Convolutional layer with 16 filters of size 5
• Convolutional layer with 16 filters of size 5
• Max pooling layer of size 2
• Convolutional layer with 32 filters of size 5
• Convolutional layer with 32 filters of size 5
• Max pooling layer of size 2
• Long Short Term Memory layer with 100 units
• Dropout layer with 0.1 rate
• Flatten layer
• Dense layer with 2 units (one per target class)
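A minimal Keras sketch of this configuration follows; layer arguments not fixed by the list, such as return_sequences on the LSTM, are assumptions.

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                         Flatten, LSTM, MaxPooling1D)

    model_vi = Sequential([
        Embedding(input_dim=vocab_size, output_dim=300, input_length=max_len),
        Conv1D(16, 5, activation="relu"),
        Conv1D(16, 5, activation="relu"),
        MaxPooling1D(pool_size=2),
        Conv1D(32, 5, activation="relu"),
        Conv1D(32, 5, activation="relu"),
        MaxPooling1D(pool_size=2),
        # return_sequences=True is an assumption, so the Flatten layer below
        # has a full sequence to flatten; the list does not specify it.
        LSTM(100, return_sequences=True),
        Dropout(0.1),
        Flatten(),
        Dense(2, activation="softmax"),
    ])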
3.3.4 Testing the whole dataset

The original dataset is highly imbalanced, and this had several negative effects on the training of the neural networks (these problems are explained at the beginning of Section 3.3). Therefore, the ability of the obtained models to generalize needs to be tested on the full dataset. For the spam records no big difference should appear, since all of them were used in the balanced dataset. The ham class, however, contains more than 1.6 million records that were not used in training.

Firstly, we test the best models from the literature: the CNN with configuration B, the RNN with configuration B and the CRNN with configuration A. Their results are shown in Tables 3.20, 3.21 and 3.22 respectively. In addition, the metrics for the best performing model of the deeper neural networks (Section 3.3.3) are shown in Table 3.23.

Table 3.20: Performance metrics of configuration B CNN on the full dataset.

            Ham    Spam
Precision   1.00   0.06
Recall      0.71   0.87
F1 Score    0.83   0.12

Table 3.21: Performance metrics of configuration B RNN on the full dataset.

            Ham    Spam
Precision   0.99   0.06
Recall      0.73   0.83
F1 Score    0.84   0.12

Table 3.22: Performance metrics of configuration A CRNN on the full dataset.

            Ham    Spam
Precision   1.00   0.06
Recall      0.70   0.85
F1 Score    0.82   0.11

Table 3.23: Performance metrics of deep NN model VI on the full dataset.

            Ham    Spam
Precision   0.99   0.06
Recall      0.72   0.77
F1 Score    0.84   0.11

Finally, other models which had obtained almost perfect results (100% precision), but which were suspected of overfitting, were tested to observe their ability to generalize. These are shown in Tables 3.24 and 3.25; the corresponding model number is given in each caption, and its configuration can be found in Annex A.

Table 3.24: Performance metrics of deep NN model XXVIII on the full dataset.

            Ham    Spam
Precision   0.99   0.11
Recall      0.92   0.47
F1 Score    0.95   0.18

Table 3.25: Performance metrics of deep NN model XXX on the full dataset.

            Ham    Spam
Precision   0.98   0.07
Recall      0.96   0.15
F1 Score    0.97   0.09

It can be seen that in all cases the precision of the spam class is very low, reaching a maximum of 11%. The recall is also very low in the last two tested models. It can be concluded that, while the models predict the ham class successfully, they produce a very high number of false positives.

Chapter 4: Final conclusions and future work

The work done in this master thesis studies the suitability of neural networks to tackle the spam problem in general purpose institutional repositories, specifically Zenodo.

As a first step, the usage of Random Forests based on the already existing spam classifier was tested, achieving an almost perfect score of 99.98% accuracy for ham content, but a lower 91.90% for spam content, and taking around 45 minutes to train and 25 seconds to predict one third (533K) of the records. Apart from setting the baseline in accuracy and time performance for the neural networks, these classifiers were also trained using a reduced text corpus, which proved that 4000 words are enough to obtain an accurate classification. It is nonetheless possible that this number can be reduced further; work on this aspect would result in smaller vectors and faster training and prediction times, especially for the neural networks. Finally, features based on the Zenodo supporters' experience and on the literature were extracted, generating a new dataset of categorical data, and another Random Forest classifier was trained on it. Although its time performance was 7.5 times faster and its accuracy on the ham class remained the same, its accuracy for spam records decreased by almost 3%.
Concerning the neural networks, the state of the art showed that methods similar to those used successfully in computer vision can also achieve good results in natural language processing tasks such as text classification. Many studies managed to predict spam in SMS, Twitter and other social network and news content with high accuracy. In particular, Jain et al. studied the use of convolutional neural networks [10], recurrent neural networks [11] and a combination of both [12], achieving 99.01% and 95.48% accuracy on SMS and Twitter datasets respectively.

The models proposed by Jain et al. were tested on a balanced subset of Zenodo's data, obtaining high results (99% precision) for the spam class and slightly lower ones (93 to 95% precision) for the ham class. In addition, the influence of language (e.g. English vs other languages) and of stop words was tested, and proved not to have a significant effect on the prediction performance.

It could be observed that the model configurations proposed by Jain et al. contain a few layers with a large number of units or filters. Therefore, models with fewer units or filters but deeper in layers were trained, testing also different architectures (e.g. VGG-like), optimizers (e.g. Adam, Adagrad) and embedding sizes (e.g. 200, 300). As a result, 30 configurations were trained, obtaining results similar to those of Jain et al., the only difference being 1% in the F1 Score of the spam class. However, the best of these networks trains significantly faster, with an average of 210 s per epoch against the 250 s of the network proposed in the literature.

Finally, the literature models and the best deeper models were tested against the whole dataset, and all of them showed poor performance. While the ham class achieved high precision, and in some cases also high recall, the spam class precision was very low, in some cases close to 0; the number of false positives is therefore very high. Once again this is a consequence of the imbalanced nature of the dataset and of the technique used to balance it: it is suspected that not all record clusters were represented in the under-sampled set. There are several potential solutions. One would be to perform a clustering of the ham records and sample uniformly from the clusters (this was the preferred option, but due to time constraints it was not possible). Another option would be to train the models with more subsets of the ham records until a suitable result is obtained upon prediction. Yet another option would be to over-sample the spam class to reach a balanced 1.6M records; however, this might result in an overfit of that class, since the number of original spam records is much smaller (by a factor of 45).

4.1 Conclusion

Neural networks have proven to achieve good results (around 99% accuracy and precision) on a small subset of the dataset. However, they need to be trained with more data in order to be appropriate for a production environment. This comes with the challenge of obtaining a sensible balanced dataset with all types of records represented. Therefore, the production setup should still be the Random Forest, although certain improvements could be made, such as reducing the number of words or refining their treatment (removal of punctuation marks, etc.). These are described in more detail in Section 4.2.

4.2 Future work

Several lines of work follow from this thesis.
In terms of data processing and feature extraction, it would be interesting to correlate the records with user data: How old is the user in the system? How many records have they published? How many communities do they belong to? Where do the spammers come from (e.g. via IP geolocation)? Moreover, the creation time of the record would allow a time series analysis, which could profile the time, and from that the location, of the spammers. Finally, the data could be enriched with checks of the veracity of external identifiers, such as ORCID, although this might be more useful to the repository itself and should arguably be done by it.

Concerning the multi-language nature of the data, it would be interesting to analyze the presence of stop words in the data itself, answering the question: do spammers use more (or fewer) stop words than legitimate publishers? This could become a new extracted feature (e.g. the number of stop words present in the record).

Regarding the suitability of the model for a production environment, as stated before the model is not ready, since it produces a large number of false positives. However, this effect could be minimized by using it as part of an ensemble, with a meta classifier taking the decision from the output of several classifiers, periodically feeding the correct results back into the NN model's training. Another alternative would be a pipeline: the NN model runs first, and the records it flags are passed to the RF, considering them spam if both models give a result above a certain threshold (to be defined), again feeding the correct results back into training. Finally, external tools such as Google BERT (https://github.com/google-research/bert) could be tested on the data, and even added to the meta ensemble or the pipeline.

Chapter 5: Deployment in production

Industry algorithms are often deployed in the cloud, using technologies such as Amazon SageMaker (https://aws.amazon.com/es/sagemaker/), Google Cloud AI (https://cloud.google.com/ai-platform), IBM Watson (https://www.ibm.com/cloud/machine-learning) or similar products. Independently of the hardware or software stack used by these products, they all share common characteristics:

• They expose a RESTful API to submit requests, such as the prediction of a record.
• They scale along with the demand, in order to provide almost real-time responses, or return a promise and respond as fast as possible.

Zenodo is deployed in CERN's Data Center (https://about.zenodo.org/infrastructure/), and therefore the classifier must be deployed using CERN's infrastructure. As a consequence, the available technology is reduced to GPU Virtual Machines on OpenStack (https://clouddocs.web.cern.ch/gpu/README.html), with up to 32GB of memory, and containerized application deployments on OpenShift (https://information-technology.web.cern.ch/services/paas-web-app).

Due to the restrictions presented by CERN's infrastructure, the deployment will be done ad-hoc and the RESTful API will need to be developed. Keras provides a tutorial (https://blog.keras.io/building-a-simple-keras-deep-learning-rest-api.html) on how to develop a RESTful API using Flask, the same technology on which Invenio and Zenodo are built, so it would be easy to integrate into the code base. This could be deployed as a containerized application on OpenShift, making use of the built-in AutoScaler component (https://docs.openshift.com/container-platform/3.9/dev_guide/pod_autoscaling.html) to scale on demand.
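A minimal sketch of such an endpoint, following the pattern of the Keras tutorial referenced above: the route, model file name and preprocessing parameters are illustrative assumptions, not Zenodo's actual integration, and the vocabulary size and padding length must match whatever was used at training time.

    from flask import Flask, jsonify, request
    from tensorflow.keras.models import load_model
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.preprocessing.text import one_hot

    app = Flask(__name__)
    model = load_model("spam_classifier.h5")   # hypothetical model file
    VOCAB_SIZE, MAX_LEN = 50000, 4000          # must match the training setup

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expects a JSON body like {"text": "<record text>"}.
        text = request.get_json(force=True).get("text", "")
        seq = pad_sequences([one_hot(text, n=VOCAB_SIZE)], maxlen=MAX_LEN)
        ham_prob, spam_prob = model.predict(seq)[0]
        return jsonify({"spam_probability": float(spam_prob)})

    if __name__ == "__main__":
        app.run()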
The re-training of the model would be carried out on a GPU Virtual Machine on OpenStack, to be able to perform it in reasonable time. TensorBoard (https://www.tensorflow.org/tensorboard) could be used to monitor the training process.

Bibliography

[1] Aliaksandr Barushka and Petr Hajek. Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks. Applied Intelligence, 48(10):3538–3556, 2018.

[2] Johannes M Bauer, Michel JG Van Eeten, Tithi Chattopadhyay, and Yuehua Wu. ITU study on the financial aspects of network security: Malware and spam. ICT Applications and Cybersecurity Division, International Telecommunication Union, Final Report, July, 2008.

[3] Andras A Benczur, Karoly Csalogany, Tamas Sarlos, and Mate Uher. SpamRank: fully automatic link spam detection, work in progress. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, pages 1–14, 2005.

[4] Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, and Fabrizio Silvestri. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 423–430, 2007.

[5] James Clark, Irena Koprinska, and Josiah Poon. A neural network based approach to automated e-mail classification. In Proceedings IEEE/WIC International Conference on Web Intelligence (WI 2003), pages 702–705. IEEE, 2003.

[6] Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781, 2016.

[7] Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pages 50–56. IEEE, 2018.

[8] Kwang Leng Goh and Ashutosh Kumar Singh. Comprehensive literature review on machine learning structures for web spam classification. Procedia Computer Science, 70:434–441, 2015.

[9] Hongmei He, Tim Watson, Carsten Maple, Jörn Mehnen, and Ashutosh Tiwari. A new semantic attribute deep learning with a linguistic attribute hierarchy for spam detection. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 3862–3869. IEEE, 2017.

[10] Gauri Jain, Manisha Sharma, and Basant Agarwal. Spam detection on social media using semantic convolutional neural network. International Journal of Knowledge Discovery in Bioinformatics (IJKDB), 8(1):12–26, 2018.

[11] Gauri Jain, Manisha Sharma, and Basant Agarwal. Optimizing semantic LSTM for spam detection. International Journal of Information Technology, 11(2):239–250, 2019.

[12] Gauri Jain, Manisha Sharma, and Basant Agarwal. Spam detection in social media using convolutional and long short term memory neural network. Annals of Mathematics and Artificial Intelligence, 85(1):21–44, 2019.

[13] R Jennings. Cost of spam is flattening, 2010.
[14] Chris Kanich, Nicholas Weaver, Damon McCoy, Tristan Halvorson, Christian Kreibich, Kirill Levchenko, Vern Paxson, Geoffrey M Voelker, and Stefan Savage. Show me the money: Characterizing spam-advertised revenue. In USENIX Security Symposium, volume 35, 2011.

[15] Vijay Krishnan and Rashmi Raj. Web spam detection with anti-trust rank. In AIRWeb, volume 6, pages 37–40, 2006.

[16] Hugo Liu and Push Singh. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, 2004.

[17] Sreekanth Madisetty and Maunendra Sankar Desarkar. A neural network-based ensemble approach for spam detection in Twitter. IEEE Transactions on Computational Social Systems, 5(4):973–984, 2018.

[18] Juan Martinez-Romo and Lourdes Araujo. Web spam identification through language model analysis. In Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, pages 21–28, 2009.

[19] Michael Mccord and M Chuah. Spam detection on Twitter using traditional classifiers. In International Conference on Autonomic and Trusted Computing, pages 175–186. Springer, 2011.

[20] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[21] George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[22] Harold Nguyen. 2013 state of social media spam. Publication of NexGate, USA, from http://nexgate.com/wpcontent/uploads/2013/09/Nexgate-2013-State-of-Social-Media-Spam-Research-Report.pdf, 2013.

[23] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[24] D Karthika Renuka, T Hamsapriya, M Raja Chakkaravarthi, and P Lakshmi Surya. Spam classification based on supervised learning using machine learning techniques. In 2011 International Conference on Process Automation, Control and Computing, pages 1–7. IEEE, 2011.

[25] Pradeep Kumar Roy, Jyoti Prakash Singh, and Snehasish Banerjee. Deep learning to filter SMS spam. Future Generation Computer Systems, 102:524–533, 2020.

[26] Nikita Spirin and Jiawei Han. Survey on web spam detection: principles and algorithms. ACM SIGKDD Explorations Newsletter, 13(2):50–64, 2012.

[27] S Sumathi and Ganesh Kumar Pugalendhi. Cognition based spam mail text analysis using combined approach of deep neural network classifier and random forest. Journal of Ambient Intelligence and Humanized Computing, 2020.

[28] Krysta M Svore, Qiang Wu, Chris JC Burges, and Aaswath Raman. Improving web spam classification using rank-time features. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, pages 9–16, 2007.

[29] Dave C Trudgian. Spam classification using nearest neighbour techniques. In International Conference on Intelligent Data Engineering and Automated Learning, pages 578–585. Springer, 2004.

[30] Onur Varol, Emilio Ferrara, Clayton A Davis, Filippo Menczer, and Alessandro Flammini. Online human-bot interactions: Detection, estimation, and characterization. arXiv preprint arXiv:1703.03107, 2017.

[31] Alex Hai Wang. Don't follow me: Spam detection in Twitter. In 2010 International Conference on Security and Cryptography (SECRYPT), pages 1–10. IEEE, 2010.
[32] Tingmin Wu, Shigang Liu, Jun Zhang, and Yang Xiang. Twitter spam detection based on deep learning. In Proceedings of the Australasian Computer Science Week Multiconference, pages 1–8, 2017.

[33] Sihong Xie, Guan Wang, Shuyang Lin, and Philip S Yu. Review spam detection via temporal pattern discovery. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 823–831, 2012.

[34] Bo Yu and Zong-ben Xu. A comparative study for content-based dynamic spam classification using four machine learning algorithms. Knowledge-Based Systems, 21(4):355–362, 2008.

Appendix A: Deep Neural Networks configuration

Model I
• Embedding, 300 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.1
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.1
• Conv1D, 64 filters, size 5, activation ReLU
• Conv1D, 64 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.1
• LSTM, 100 units
• LSTM, 100 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax

Model II
• Embedding, 200 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 64 filters, size 5, activation ReLU
• Conv1D, 64 filters, size 5, activation ReLU
• Conv1D, 64 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 50 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax

Model III
• Embedding, 200 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 100 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax

Model IV
• Embedding, 200 dimensions
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 100 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax

Model V
• Embedding, 200 dimensions
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 50 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax
Model VI
• Embedding, 300 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 100 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax

Model VII
• Embedding, 300 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model VIII
• Embedding, 300 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model IX
• Embedding, 300 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model X
• Embedding, 250 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XI
• Embedding, 250 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XII
• Embedding, 250 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 50 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XIII
• Embedding, 300 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 50 units
• Dropout, rate 0.15
• Flatten
• Dense, 2 units, activation softmax
Model XIV
• Embedding, 200 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 30 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XV
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 30 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XVI
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 50 units
• Dropout, rate 0.15
• Flatten
• Dense, 2 units, activation softmax

Model XVII
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 10 units
• LSTM, 10 units
• Dropout, rate 0.15
• Flatten
• Dense, 2 units, activation softmax

Model XVIII
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 25 units
• LSTM, 25 units
• Dropout, rate 0.15
• Flatten
• Dense, 2 units, activation softmax

Model XIX
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 25 units
• Dropout, rate 0.15
• Flatten
• Dense, 2 units, activation softmax

Model XX
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• MaxPooling1D
• Dropout, rate 0.1
• Conv1D, 8 filters, size 3, activation ReLU
• Conv1D, 8 filters, size 3, activation ReLU
• MaxPooling1D
• Dropout, rate 0.1
• LSTM, 25 units
• Dropout, rate 0.15
• Flatten
• Dense, 2 units, activation softmax
Model XXI
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• LSTM, 50 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXII
• Embedding, 200 dimensions
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• LSTM, 50 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXIII
• Embedding, 200 dimensions
• Conv1D, 8 filters, size 3, activation ReLU
• Conv1D, 8 filters, size 3, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 16 filters, size 3, activation ReLU
• Conv1D, 16 filters, size 3, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• LSTM, 50 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXIV
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXV
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXVI
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• MaxPooling1D
• Dropout, rate 0.1
• LSTM, 20 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXVII
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXVIII
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.25
• Flatten
• Dense, 2 units, activation softmax

Model XXIX
• Embedding, 300 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.25
• Flatten
• Dense, 2 units, activation softmax

Model XXX
• Embedding, 300 dimensions
• Conv1D, 8 filters, size 8, activation ReLU
• Conv1D, 8 filters, size 8, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• LSTM, 10 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax
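Since all thirty configurations are drawn from the same small set of layer types, they could equally be expressed as data and assembled by a single helper. The sketch below is a hypothetical illustration of that idea, not part of the original work; the layer codes, the helper name build_from_spec, and the vocabulary size and sequence length are all assumptions introduced here.

# Hypothetical helper that builds any of the appendix configurations from
# a compact spec. Layer codes: E=Embedding, C=Conv1D, P=MaxPooling1D,
# D=Dropout, L=LSTM, F=Flatten, O=output Dense. All names and defaults
# here are illustrative; the thesis does not describe such a builder.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_from_spec(spec, vocab_size=20_000, seq_len=400):
    model = models.Sequential()
    model.add(layers.Input(shape=(seq_len,)))
    for i, (kind, *args) in enumerate(spec):
        if kind == "E":
            model.add(layers.Embedding(vocab_size, args[0]))
        elif kind == "C":
            filters, size = args
            model.add(layers.Conv1D(filters, size, activation="relu"))
        elif kind == "P":
            model.add(layers.MaxPooling1D())
        elif kind == "D":
            model.add(layers.Dropout(args[0]))
        elif kind == "L":
            # Look ahead: a stacked LSTM needs sequence output
            # from the LSTM that precedes it.
            next_is_lstm = i + 1 < len(spec) and spec[i + 1][0] == "L"
            model.add(layers.LSTM(args[0], return_sequences=next_is_lstm))
        elif kind == "F":
            model.add(layers.Flatten())
        elif kind == "O":
            model.add(layers.Dense(args[0], activation="softmax"))
    return model

# Model XXX expressed as a spec.
model_xxx = build_from_spec([
    ("E", 300),
    ("C", 8, 8), ("C", 8, 8), ("P",),
    ("D", 0.2),
    ("L", 10),
    ("D", 0.1),
    ("F",), ("O", 2),
])
model_xxx.summary()

Expressing the configurations as data in this way would make it straightforward to sweep over all thirty variants in a single training loop.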