Table 3.1: Attribute presence in the dataset (%).

part of               98.99   98.91
description            0       0
journal               97.58   76.99
communities           98.2    52.53
publication date       0       0
owners                 0       0
doi                    0       0.02
license                1.21    2.23
notes                 79.83   93.1
spam                   0       0
recid                  0       0
creators               0       0
resource type          -       -
related identifiers   96.26   71.39

3.2 Classical methods

In this section, Random Forest based algorithms are tested on TF-IDF encoded text data. In addition, features are extracted from the previously selected attributes, and a new dataset containing only categorical data is generated. This new dataset is tested on a Random Forest classifier.

3.2.1 Feature extraction

As shown in the dataset description, many attributes are complex objects that cannot be processed directly by the classifier. Based on expert knowledge gathered from the Zenodo developers and supporters, several features are extracted, and bar plots comparing the spam class of each new attribute are shown. Due to the imbalanced nature of the dataset, the displayed values are normalized so that the two classes can be compared.

num keywords: The keywords attribute is a list of words; this feature represents the number of words in that list. The distribution was analysed using the Python library Seaborn (https://seaborn.pydata.org/), and the values were grouped into buckets to avoid high-dimensional vectors in later stages. There were records with up to 250 keywords; however, the number of records with more than 10 keywords was minimal (a total of 75K), so they were all placed in the same bucket (10). The feature can therefore take one of 11 classes, from 0 to 10 (both included). The distribution per target class is shown in Figure 3.2.

Figure 3.2: Normalized number of keywords per class.

num files: The files attribute is a list of objects containing very detailed information about each file (e.g. its checksum). However, only the number of files and their types have proven useful to the experts. The num files feature represents the number of files associated with the record. The dataset contains records whose number of files ranges from 0 to 144402, but the number of records diminishes greatly as the number of files increases. As in the previous case, buckets were created, resulting in 8 classes, from 0 to 7. The first four represent the exact number of files, while the last four represent the ranges shown in Table 3.2.

Table 3.2: Number of files ranges.

class   range
4       [4, 10)
5       [10, 30)
6       [30, 50)
7       [50, max]

The values shown in Figure 3.3 confirm what was stated by the Zenodo supporters: most spam records contain only one file. However, approximately the same proportion of ham records also contain a single file. On the other hand, records with 4 or 5 files seem to predominate among spam records (note that 4 and 5 are range classes, and therefore mean between 4 and 30 files).

Figure 3.3: Normalized number of files per class.

has image: Using the filetype value of each file object, we can check whether the record contains at least one file of one of the following types: jpg, jpeg, png, bmp, gif, tiff, exif, ppm, pgm, pbm, pnm, webp, svg. Note that Zenodo does not provide a list of what is understood as an image format; this list was created in an ad-hoc manner according to the author's knowledge. Figure 3.4 confirms the Zenodo staff's experience: the proportion of spam records with an image file is almost double that of spam records without one.

Figure 3.4: Normalized amount of records with an image file.
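As an illustration, the bucketing used for num files (Table 3.2) and the has image check can be sketched as follows. This is a minimal sketch; the record layout (a list of file dictionaries with a filetype key) is an assumption based on the description above.

    # Ad-hoc image extension list from the text above.
    IMAGE_EXTENSIONS = {
        "jpg", "jpeg", "png", "bmp", "gif", "tiff", "exif",
        "ppm", "pgm", "pbm", "pnm", "webp", "svg",
    }

    def num_files_bucket(n):
        """Map a raw file count to one of the 8 classes of Table 3.2."""
        if n < 4:
            return n      # classes 0-3 hold the exact count
        if n < 10:
            return 4      # [4, 10)
        if n < 30:
            return 5      # [10, 30)
        if n < 50:
            return 6      # [30, 50)
        return 7          # [50, max]

    def has_image(files):
        """True if at least one file looks like an image (ad-hoc list above)."""
        return any(f.get("filetype", "").lower() in IMAGE_EXTENSIONS for f in files)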
num communities: This feature represents the number of communities the record belongs to. Records have been shown to belong to between 0 and 8 communities; however, the number of records belonging to more than 2 communities is negligible, so they were all placed in the same bucket. It can be seen in Figure 3.5 that almost all spam records do not belong to any community.

Figure 3.5: Normalized number of communities per class.

num creators: From the creators field we can extract their number, as well as some values that might increase the trust in the person publishing the record: identifiers (i.e. ORCID, https://orcid.org) and affiliations (e.g. an associated institution). This feature represents the number of creators of the record. As for the previously extracted features, buckets were created, leaving a total of 8 possible classes (from 1 to 8), where 6, 7 and 8 are the ranges shown in Table 3.3.

Table 3.3: Number of creators ranges.

class   range
6       [5, 7)
7       [7, 10)
8       [10, max)

In Figure 3.6 it can be seen that a significant portion of the spam records contain only one creator.

Figure 3.6: Normalized number of creators per class.

creator has orcid: This feature denotes that at least one creator has an ORCID identifier. Figure 3.7 shows that there is no significant difference between ham and spam data with respect to ORCID identifiers. It seems unrealistic that an entity such as ORCID would verify an author submitting spam; this was checked with the Zenodo staff, and it turns out that the identifiers are not verified against ORCID.

Figure 3.7: Normalized records with a creator identified by ORCID.

creator has affiliation: This feature denotes that at least one creator has an affiliation. As with the previous feature, there is no significant difference between ham and spam data with respect to the creators' affiliations, as shown in Figure 3.8.

Figure 3.8: Normalized records with a creator with an affiliation.

type: The resource type attribute has a main type and, in some cases, a subtype. Zenodo's default value in the upload form is type publication with subtype article. Both Figure 3.9 and, more specifically, Figure 3.10 confirm that spammers do not put effort into changing this value. In addition, it can be observed that an important amount of content belongs to the resource type image, which might conflict with the belief of Zenodo's supporters that "most spam only contains an image". While it may be true that spam records contain an image, this does not by itself distinguish them from ham records: the has image feature showed that about 35% of ham records also contain an image.

Figure 3.9: Normalized records by main resource type.

Figure 3.10: Normalized records by resource type and subtype.

type full: The resource type expanded with its subtype, using the format type-subtype.

license: In the original attribute, five license values together cover up to 85% of the records: notspecified, cc-zero, CC-BY-4.0, CC-BY-SA-4.0 and cc-by. The rest were therefore set to other. Null values form a separate category (no-license), which leaves a total of 7 classes. The distribution is shown in Figure 3.11.
Zenodo's upload form default value is CC-BY-4.0, which shows once again that spammers do not put effort into changing defaults.

Figure 3.11: Normalized records by license.

num words title: The literature states that the number of words used in the text can distinguish spam from ham. Therefore, the word counts of both title and description were extracted. Note that the text was first cleaned of HTML tags. After observing the distribution, several buckets were created, resulting in a total of 31 possible classes: from 0 to 24, plus the ranges shown in Table 3.4.

Table 3.4: Number of words in the title ranges.

class   range
25      [25, 30)
30      [30, 35)
35      [35, 40)
40      [40, 45)
45      [45, 50)
50      [50, max)

Figure 3.12 shows that the distribution of the number of words in the title differs slightly between classes. Spam records tend to contain between 5 and 15 words, while ham has its peak at 4 and decreases significantly afterwards. In addition, the number of spam records with more than 20 words in the title is negligible.

Figure 3.12: Normalized number of title words per class.

num words description: In the same fashion as the title, the number of words in the description was extracted; the text was likewise cleaned of HTML tags. After observing the distribution, buckets were created, resulting in 26 classes: from 0 to 9, plus the 16 ranges shown in Table 3.5.

Table 3.5: Number of words in the description ranges.

class   range
10      [10, 15)
15      [15, 20)
20      [20, 30)
30      [30, 40)
40      [40, 50)
50      [50, 75)
75      [75, 100)
100     [100, 150)
150     [150, 200)
200     [200, 300)
300     [300, 400)
400     [400, 500)
500     [500, 1000)
1000    [1000, 2000)
2000    [2000, 3000)
3000    [3000, max)

It can be observed in Figure 3.13 that spam records tend to contain a large amount of text in the description. However, there is a similar proportion of ham records in the ranges from 50 to 200.

Figure 3.13: Normalized number of description words per class.

access right: This attribute was not modified, since it is already divided into classes, and being a required field it obliges both spam and ham records to have a non-null value. Zenodo is oriented towards open science, so the default value for this attribute is open. It can be seen in Figure 3.14 that ham and spam are almost identical with respect to access right.

Figure 3.14: Normalized access right per class.

text: This attribute contains the full text of the keywords, title and description. All punctuation marks and HTML tags have been removed.

text 4000: As can be seen in Figure 3.13, the number of records with more than 3000 words is small. The same happens for the title after 50 words (see Figure 3.12) and for the keywords after 10 (see Figure 3.2). In addition, according to the literature, the valuable information is contained at the beginning of the text corpus. Since no specific quantities were given, sensible values were chosen: 3500 words for the description, 400 for the title, and 100 for the keywords. Moreover, very long texts become a problem at processing time, since they turn into even larger vectors; full-text cases of more than 25000 words were seen in the dataset. To tackle this problem, the text in this feature has been reduced to a maximum of 4000 words using the previously mentioned limits. Note that these limits leave a small margin over what was observed in the figures (3000, 50 and 10 words respectively).
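A minimal sketch of this truncation follows, assuming records are dictionaries with keywords, title and description fields already cleaned of HTML; the field names are an assumption.

    # Per-field word limits chosen above (3500 + 400 + 100 <= 4000).
    LIMITS = {"description": 3500, "title": 400, "keywords": 100}

    def truncate_words(text, limit):
        """Keep only the first `limit` whitespace-separated words."""
        return " ".join(text.split()[:limit])

    def text_4000(record):
        """Concatenate keywords, title and description, capped at ~4000 words."""
        parts = [
            truncate_words(" ".join(record.get("keywords", [])), LIMITS["keywords"]),
            truncate_words(record.get("title", ""), LIMITS["title"]),
            truncate_words(record.get("description", ""), LIMITS["description"]),
        ]
        return " ".join(p for p in parts if p)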
3.2.2 Random Forest based models

Three models were trained: one using the extracted features, one using the combined text corpus (keywords, title, description) at full length, and one using the corpus reduced to 4000 words. In all cases the hyperparameters of the existing Zenodo classifier were used: 100 estimators and 4 parallel jobs. For the Random Forest with extracted features as input, the variables were encoded using one-hot encoding, while the text based content was vectorised using TF-IDF in the same manner as the existing classifier: a total of 8000 features and an ngram range of (1, 1). English stop words were removed when creating the TF-IDF vectorization. For the sake of simplicity, we will refer to the model using the extracted features as model A, to the full-text one as model B, and to the 4000-words text one as model C.

The difference in accuracy for the ham class is small: model A obtained an accuracy of 99.89%, while models B and C obtained 99.98%. However, for the spam class models B and C perform significantly better, with accuracies of 91.90% and 91.89% against 88.53%. The confusion matrices are shown in Figures 3.15, 3.16 and 3.17.

Figure 3.15: Model A confusion matrix.

Figure 3.16: Model B confusion matrix.

Figure 3.17: Model C confusion matrix.

On the other hand, the training and prediction times of model A are significantly better than those of models B and C. Model A took approximately 6 minutes to train and 2 seconds to predict one third of the dataset, while models B and C took more than 45 minutes for training and 24 seconds for prediction. Note that the training time for models B and C includes the TF-IDF vectorization, which took 14 and 13 minutes respectively. Table 3.6 summarises these three models.

Table 3.6: Random Forests comparison.

Model                    Ham      Spam     Training time   Prediction time
Feature extraction (A)   0.9989   0.8853   6.1 min         2 s
Full text (B)            0.9998   0.9190   47.4 min        25.3 s
Text 4000 (C)            0.9998   0.9189   45.2 min        24.3 s

In order to verify the level of contamination, the records originally classified as ham but detected as spam by all three models were manually checked. The intersection consisted of 13 records, which turned out to be legitimate content (i.e. a classification mistake made by all three models).
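For reference, the training setup of models B and C described above can be sketched in scikit-learn as follows. This is a minimal sketch: the texts and labels variables are assumed to be loaded beforehand, and the split reflects the "one third of the dataset" used for prediction.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    # texts: list of record texts; labels: 0 (ham) / 1 (spam). Assumed loaded.
    vectorizer = TfidfVectorizer(max_features=8000, ngram_range=(1, 1),
                                 stop_words="english")
    X = vectorizer.fit_transform(texts)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=1/3)

    # Hyperparameters of the existing Zenodo classifier.
    model_c = RandomForestClassifier(n_estimators=100, n_jobs=4)
    model_c.fit(X_train, y_train)
    print(model_c.score(X_test, y_test))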
3.2.3 Conclusions and future work

The descriptive analysis showed that some of the knowledge gathered from the supporters' and developers' experience was correct (e.g. spam records contain a file). On the other hand, other attributes were shown not to separate the classes (e.g. access right). However, they might do so when combined with others, and so they were kept in the produced dataset. Some of the features that proved useful in the literature, like the number of words, were extracted. Nonetheless, many others could not be extracted because the dataset lacks the required attributes. Such is the case of the creation time, which could enable a time series based analysis, and of user related features.

This led to the creation of a new Random Forest classifier, which used the same hyperparameters. This new model (model A) obtained an accuracy similar to the full-text models (models B and C) for the ham class, but is 3.37% less accurate for the spam class. On the other hand, if training and prediction speed is an important aspect, model A is roughly an order of magnitude faster than models B and C. In consequence, a first spamicity check in pseudo real-time, as well as fast re-training, could be carried out using this new model. It is important to note that extracted features are more easily learnt, and faked, than the text of the record, so spammers could circumvent the new classifier more easily.

To conclude, model C (4000-words text Random Forest with TF-IDF vectorization) does not present significant differences compared to model B (full-text Random Forest with TF-IDF vectorization). However, it consumes less memory when stored and results in smaller vectors, an important fact for the Neural Network based classifiers. Therefore, it is the one that will be used from now on in this work.

3.3 Neural Networks

The length of the text did not have a great impact on the accuracy of the Random Forest models. However, it significantly affects the memory and time performance of the neural networks. Therefore, the 4000-words text will be used in this section. In addition, the dataset is highly imbalanced, with approximately 37 thousand spam records and 1.6 million ham records (45 times the amount of spam).

A simple NN with 2 dense layers and a small scale VGG network (2x8 plus 2x16 1D convolutional layers with kernel size 3, using max pooling with the default pool size and a dropout of 0.1) were trained on the whole dataset. Both obtained between 97% and 98% accuracy. However, when checking the accuracy per class, it could be seen that the spam class had almost 0% accuracy, meaning that the models had learned only the ham class. This makes sense, since 37K records are approximately 2% of the full dataset (1.7M).

To deal with this problem, a new balanced dataset was generated, containing 37K records per class (a total of roughly 75K). The new balanced dataset was created by under-sampling the majority class (ham). To be certain that there was no loss of information, the Condensed Nearest Neighbors (CNN) under-sampling technique was tested first. It is a technique that seeks a subset of a collection of samples that results in no loss in model performance, referred to as a minimal consistent set. However, this process is very slow and not easily parallelizable, and it was stopped after 4 days of run time. Therefore, the dataset was balanced using random under-sampling.

To make use of word embeddings, the resulting dataset was encoded using a hashing technique (called one hot encoding by Keras), which assigns an integer to each distinct word in the vocabulary. In addition, the vectors were padded so that they all have the same length. To visualize the data points in 2D and 3D, these vectors were reduced to two and three components using PCA. The results can be seen in Figures 3.18 and 3.19; they show that both classes of records seem to be highly similar.

Figure 3.18: Balanced dataset 2D representation.

Figure 3.19: Balanced dataset 3D representation.

Finally, the dataset was split into training, validation and test sets, so that all the networks use the same data and the results are comparable. The resulting sets contain 47607, 5290 and 22671 records respectively.
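The preprocessing just described (random under-sampling, Keras one_hot hashing and padding) can be sketched as follows. This is a minimal sketch: texts and labels are assumed to be loaded, and the vocabulary size is illustrative since the thesis does not state its exact value.

    import numpy as np
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.preprocessing.text import one_hot

    vocab_size = 50000                               # illustrative assumption
    rng = np.random.default_rng(42)

    spam_idx = np.flatnonzero(labels == 1)
    ham_idx = rng.choice(np.flatnonzero(labels == 0),
                         size=len(spam_idx), replace=False)  # under-sample ham
    idx = np.concatenate([spam_idx, ham_idx])

    encoded = [one_hot(texts[i], n=vocab_size) for i in idx]  # Keras hashing trick
    max_len = max(len(seq) for seq in encoded)
    X = pad_sequences(encoded, maxlen=max_len, padding="post")  # equal-length vectors
    y = labels[idx]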
3.3.1 Neural networks from the literature

Several network configurations based on the work of Jain et al. are tested on the balanced dataset. This includes convolutional neural networks [10], recurrent neural networks [11] and a combination of both [12]. Due to the multi-language nature of the dataset, an embedding layer of 200 dimensions was added to the networks instead of using pre-trained embeddings. In addition, the learning rate of the optimizer (Adagrad) was left at its default value (0.1), as was the batch size (32). All models were trained for 10 epochs, as stated in the literature.

3.3.1.1 CNN using word embeddings

The configurations used by Jain et al. [10] are presented in Table 3.7. These networks are built using 1-dimensional convolutional (Conv1D) filters.

Table 3.7: Literature CNN networks configuration.

                      A                  B
Number of filters     128                54
Filter length         5                  4
Dropout               0.1                0.2
Activation function   ReLU               ReLU
Optimizer             Adagrad (lr 0.1)   Adagrad (lr 0.1)
Epochs                10                 10

Both models were trained for 10 epochs. However, they have margin to improve, since both validation and training accuracy keep increasing and there is no hard sign of overfitting. This is shown in Figures 3.20 and 3.21.

Figure 3.20: Literature configuration A CNN training accuracy and loss.

Figure 3.21: Literature configuration B CNN training accuracy and loss.

In terms of performance compared to the Random Forests, both networks are trained in approximately half the time (16 min vs 30 min). However, these networks are trained on a much smaller dataset, and their ability to generalize to the full dataset still needs to be tested.

Table 3.8: Literature CNN networks training metrics.

                                A       B
Training time (avg per epoch)   116 s   99 s
Training accuracy               95.1%   94.8%
Test accuracy                   95%     94.6%

Against the 22K-records test dataset mentioned at the beginning of the section, both models performed better than the feature extraction Random Forest by about 2%. It is worth noticing that, contrary to the Random Forest models, these neural networks do a better job detecting spam records, by 4 to 5%, losing approximately the same amount of accuracy on ham records. Since one of the requirements is to avoid false positives, the better performing model of the two is model B, whose precision for the spam class is 1% higher.

Table 3.9: Literature CNN networks test metrics.

              Ham    Spam
A  Precision  0.93   0.97
   Recall     0.97   0.93
   F1 Score   0.95   0.95
B  Precision  0.92   0.98
   Recall     0.98   0.91
   F1 Score   0.95   0.94
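For reference, configuration A from Table 3.7 can be written in Keras as follows. This is a minimal sketch: the pooling and output layers are not fixed by the table and are therefore assumptions, and vocab_size, max_len and the data splits come from the preprocessing sketch above.

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                         GlobalMaxPooling1D)
    from tensorflow.keras.optimizers import Adagrad

    cnn_a = Sequential([
        Embedding(input_dim=vocab_size, output_dim=200, input_length=max_len),
        Conv1D(filters=128, kernel_size=5, activation="relu"),
        GlobalMaxPooling1D(),          # assumed; not specified by Table 3.7
        Dropout(0.1),
        Dense(2, activation="softmax"),
    ])
    cnn_a.compile(optimizer=Adagrad(learning_rate=0.1),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    cnn_a.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=10, batch_size=32)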
3.3.1.2 CNN using TF-IDF vectorization

In order to assess the usefulness of word embeddings, the same two CNN configurations were tested on TF-IDF vectors, both with the full and the 4000-words text, resulting in four models. The results on the two datasets were very similar, or even slightly better with the 4000-words one. For that reason, the information presented below corresponds to the two models trained on that dataset.

Contrary to the one hot encoded vectors, the TF-IDF ones show a certain distinction between classes when reduced to two and three components using PCA. This is shown in Figures 3.22 and 3.23.

Figure 3.22: Balanced dataset 2D representation of TF-IDF vectorization.

Figure 3.23: Balanced dataset 3D representation of TF-IDF vectorization.

Both models were trained for 10 epochs. Configuration A seems to have reached its best possible performance at epoch 3, while configuration B seems to always overfit. This is shown in Figures 3.24 and 3.25.

Figure 3.24: Literature configuration A with TF-IDF vectors training accuracy and loss.

Figure 3.25: Literature configuration B with TF-IDF vectors training accuracy and loss.

In terms of performance, both networks are trained faster than any of the other methods (including the Random Forests), taking between 3 and 5 minutes. However, the accuracy is significantly lower, by more than 10%, as shown in Table 3.10.

Table 3.10: Literature CNN networks with TF-IDF vectorization training metrics.

                                A       B
Training time (avg per epoch)   27 s    20 s
Training accuracy               85.1%   85.2%
Test accuracy                   80.9%   78.4%

It is worth mentioning that both networks manage to detect spam with 100% precision, which could be due to overfitting. On the other hand, they lose more than 10% of accuracy on the ham class.

Table 3.11: Literature CNN networks with TF-IDF vectorization test metrics.

              Ham    Spam
A  Precision  0.73   1.00
   Recall     1.00   0.70
   F1 Score   0.87   0.83
B  Precision  0.70   1.00
   Recall     1.00   0.56
   F1 Score   0.82   0.72

3.3.1.3 RNN using word embeddings

The configurations used by Jain et al. [11] are presented in Table 3.12. These networks are built using Long Short Term Memory (LSTM) units.

Table 3.12: Literature RNN networks configuration.

                      A                  B
Number of units       100                100
Dropout               0.1                0.2
Activation function   Sigmoid            Sigmoid
Optimizer             Adagrad (lr 0.1)   Adagrad (lr 0.1)
Epochs                10                 10

Both models were trained for 10 epochs. As can be seen in Figures 3.26 and 3.27, the validation accuracy drastically decreases (and the loss increases) around epoch 5 for configuration A and epoch 7 for configuration B. However, these seem to be local minima, since afterwards the validation values rise above the training ones; the models might therefore obtain better results if trained for more epochs.

Figure 3.26: Literature configuration A RNN training accuracy and loss.

Figure 3.27: Literature configuration B RNN training accuracy and loss.

In terms of performance compared to the Random Forests, both networks are trained in approximately double the time (60 min vs 30 min).

Against the 22K-records test dataset mentioned at the beginning of the section, and compared to the feature extraction Random Forest, configuration A does not give significant improvements, while configuration B performs better by about 2%. Focusing on configuration B, it is worth noticing that, contrary to the Random Forest models, this network does a better job detecting spam records, by 6%, losing approximately the same amount of accuracy on ham records. In this case, configuration B is chosen.

Table 3.13: Literature RNN networks training metrics.

                                A       B
Training time (avg per epoch)   369 s   365 s
Training accuracy               92.5%   96%
Test accuracy                   92.3%   95.7%

Table 3.14: Literature RNN networks test metrics.

              Ham    Spam
A  Precision  0.87   0.99
   Recall     0.99   0.85
   F1 Score   0.93   0.93
B  Precision  0.93   0.99
   Recall     0.99   0.92
   F1 Score   0.96   0.96

3.3.1.4 Combining CNN, RNN and word embeddings

The configurations used by Jain et al. [12] are presented in Table 3.15. For short, this mixed configuration of convolutional and recurrent networks will from now on be referred to as CRNN.

Table 3.15: Literature CRNN networks configuration.

                             A                  B
Number of filters (Conv1D)   128                54
Filter length (Conv1D)       5                  4
Activation function          ReLU               ReLU
Number of units (LSTM)       100                100
Dropout                      0.1                0.2
Optimizer                    Adagrad (lr 0.1)   Adagrad (lr 0.1)
Epochs                       10                 10

Both models were trained for 10 epochs.
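A minimal Keras sketch of CRNN configuration A from Table 3.15 follows: a Conv1D block feeding an LSTM. The pooling and output layers are assumptions, as the table does not fix them.

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                         LSTM, MaxPooling1D)
    from tensorflow.keras.optimizers import Adagrad

    crnn_a = Sequential([
        Embedding(input_dim=vocab_size, output_dim=200, input_length=max_len),
        Conv1D(filters=128, kernel_size=5, activation="relu"),
        MaxPooling1D(),                # default pool size 2
        LSTM(100),
        Dropout(0.1),
        Dense(2, activation="softmax"),
    ])
    crnn_a.compile(optimizer=Adagrad(learning_rate=0.1),
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])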
In terms of training behaviour, configuration A might have reached its best performance at epoch 8, since afterwards the validation accuracy starts decreasing, although this might be a local minimum, as happens at the 7th epoch of configuration B. Both models might achieve better results if trained for more epochs. This is shown in Figures 3.28 and 3.29.

Figure 3.28: Literature configuration A CRNN training accuracy and loss.

Figure 3.29: Literature configuration B CRNN training accuracy and loss.

In terms of performance compared to the Random Forests, both networks are trained in approximately one third more time (40 min vs 30 min). However, these networks are trained on a much smaller dataset.

Table 3.16: Literature CRNN networks training metrics.

                                A       B
Training time (avg per epoch)   250 s   220 s
Training accuracy               97%     96.4%
Test accuracy                   96.5%   96%

Against the 22K-records test dataset mentioned at the beginning of the section, both models performed better than the feature extraction Random Forest by about 3%. It is worth noticing that, contrary to the Random Forest models, these neural networks do a better job detecting spam records, by 8%, reaching 99%, while losing only 4 to 6% of accuracy on ham records. Since the precision for spam records is the same in both models, but configuration A performs better on the ham class, configuration A is the one chosen in this section.

Table 3.17: Literature CRNN networks test metrics.

              Ham    Spam
A  Precision  0.95   0.99
   Recall     0.99   0.94
   F1 Score   0.97   0.96
B  Precision  0.93   0.99
   Recall     0.99   0.93
   F1 Score   0.96   0.96

3.3.1.5 Literature networks conclusions and future work

In terms of vectorization, and even though it is not perceptible to the human eye in the 2D and 3D representations, word embeddings manage to extract meaning from the one hot encoded vectors, resulting in an improvement of around 10% accuracy. Of all the tested configurations, the CRNN network with configuration A was the best performing one, with 97% accuracy on training and 96.5% on testing, and a high 99% precision on the spam class, which means a small number of false positives. Therefore, this network configuration will be further investigated in the following sections.

On almost all the networks using word embeddings, more training epochs might be needed to reach their full potential. In addition, all networks had better precision and recall on the spam class; this could be due to the different languages present in the content, and will be tested in Section 3.3.2 (CRNN on English-only content).

3.3.2 CRNN on English-only content

The models presented in the previous section achieve a high accuracy for the spam records, but seem to be incapable of properly generalizing the ham class. One possible reason is the multi-language nature of the data: the original dataset was highly imbalanced, and it is possible that, when creating a balanced dataset by random selection on the majority class, a few ham records were taken in a language that is not represented among the spam ones.

3.3.2.1 Language analysis

The first step is to look at the language distribution of the balanced dataset. This is done using the langdetect Python library. As can be seen in Figure 3.30, the large majority of the content is in English, with approximately 55K records, followed by French and German, both with fewer than 5K records.

Figure 3.30: Language distribution on the full balanced dataset.

Figure 3.31: Language distribution on the ham records.
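The distributions behind these figures can be obtained with a few lines of langdetect; a minimal sketch, reusing the texts and labels variables from before (the counting structure itself is an illustrative choice):

    from collections import Counter
    from langdetect import detect, DetectorFactory
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0                 # make langdetect deterministic
    counts = {"ham": Counter(), "spam": Counter()}
    for text, label in zip(texts, labels):
        try:
            lang = detect(text)
        except LangDetectException:          # e.g. empty or non-textual content
            lang = "unknown"
        counts["spam" if label == 1 else "ham"][lang] += 1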
However, Figures 3.31 and 3.32 show that the distribution of the non-English languages varies significantly between the two target classes (e.g. German records are mostly ham, while French records are mostly spam). Therefore, a new dataset with only English records was created. It contains 31018 ham and 25172 spam records (a total of 56190).

Figure 3.32: Language distribution on the spam records.

Once again, a PCA reduction was performed to try to verify visually the difference between the classes. Figures 3.33 and 3.34 show that, as with the multi-language content, there are no evident differences.

Figure 3.33: English dataset PCA reduction to 2D.

Figure 3.34: English dataset PCA reduction to 3D.

3.3.2.2 Running literature models on English content

In this case, while the CNN networks seem to need more training epochs (Figures 3.35 and 3.36), the CRNN (Figures 3.39 and 3.40) seems to reach its limit at the 6th epoch. The RNN might still need more epochs (Figure 3.37); however, its configuration B behaved erratically, as shown in Figure 3.38, most likely due to the increased dropout of this configuration or a too-fast learning rate (the default, 0.1). Note that this model was run several times with similar results.

Figure 3.35: CNN configuration A on English content.

Figure 3.36: CNN configuration B on English content.

Figure 3.37: CRNN configuration A on English content.

Figure 3.38: CRNN configuration B on English content.

Figure 3.39: RNN configuration A on English content.

Figure 3.40: RNN configuration B on English content.

In terms of precision, no model behaved better than the one chosen in the previous section (CRNN with configuration A). The CNNs are between 3% and 4% less accurate. The RNNs, although still not better than the previous models, improve their accuracy by approximately 10%. The closest model is the CRNN with configuration B, which obtained the same results except for a 0.01 lower recall on the spam class.

It is worth mentioning that, since the content was English only, stop words were removed. This reduced the size of the vectors to a third (from 4.3K to 1.5K in length). The final impact was a speed-up on all models, which were trained in between a half and a third of the time of the multi-language ones.

Overall, these models decreased in accuracy. One could therefore question whether the stop words themselves make a difference between ham and spam. As a consequence, the 6 models were trained over English-only content including the stop words. As with the previous models, the per-epoch metrics show that more training could improve the accuracy. However, using the 10 epochs stated in the original papers, the accuracy on English content with stop words only differs for configuration A of the RNN and the CRNN, which obtained a 6% and a 1% increase on the ham class respectively.

3.3.2.3 Conclusions on English-only content

Since no model obtained significantly better results than in the previous section, it can be concluded that language is not what prevents the models from correctly classifying the missing 3% or 4%, mostly in the ham class. It is likely simply due to the lack of training data for such cases.

Concerning a per-language classification, it would be interesting to look further into the usage of stop words. Classification removing them, and keeping them along with the text, has been done; however, what happens when only stop words are used is still unclear.
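A first step in that direction could be a stop-word ratio per record; a minimal sketch, assuming scikit-learn's built-in English stop word list:

    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    def stopword_ratio(text):
        """Fraction of a record's words that are English stop words."""
        words = text.lower().split()
        if not words:
            return 0.0
        return sum(w in ENGLISH_STOP_WORDS for w in words) / len(words)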
Such a measure could, for example, become a feature to extract and use in a Random Forest.

3.3.3 Deeper Neural Networks

The previous sections showed the results of two predefined configurations over the three types of networks (CNN, RNN and CNN+RNN). In many of those cases the training results indicated that more epochs could improve the classification. In addition, the two configurations proposed in the papers use a large number of LSTM units and convolutional filters, which results in long training times. In this section, network configurations with fewer units and filters, but more layers and epochs, different optimizers (Adam or Adagrad) and different embedding sizes (200 and 300) are trained. For example, several VGG-like configurations starting with 8, 16 or 32 filters, and networks as deep as 12 layers with 4 or 8 filters, were tested. In most cases LSTM layers were added, but tests without them were also conducted. The full list of configurations is available in Annex A.

In general terms, deeper networks tend to reach their maximum accuracy in fewer epochs, usually around the 5th, as shown in Figure 3.41. However, they were trained for between 10 and 20 epochs, to discard the possibility of local minima.

Figure 3.41: Deep NN configuration I metrics.

Table 3.18 shows the results obtained by the best performing configuration of Section 3.3.1, and Table 3.19 those obtained by the six best performing deeper models (their configurations are listed in Annex A, along with the other 24 tested models).

Table 3.18: Performance metrics of configuration A CRNN.

            Ham    Spam
Precision   0.95   0.99
Recall      0.99   0.94
F1 Score    0.97   0.96

Table 3.19: Deeper Neural Networks metrics.

Model        Precision   Recall   F1 Score   Epochs   Training time (per epoch)
I    Ham     0.93        0.99     0.96       20       215 s
     Spam    0.99        0.93     0.96
II   Ham     0.95        0.98     0.97       20       140 s
     Spam    0.98        0.95     0.96
III  Ham     0.95        0.99     0.97       20       170 s
     Spam    0.98        0.95     0.97
IV   Ham     0.95        0.98     0.97       20       220 s
     Spam    0.98        0.95     0.97
V    Ham     0.94        0.99     0.96       20       203 s
     Spam    0.97        0.94     0.96
VI   Ham     0.95        0.99     0.97       20       210 s
     Spam    0.99        0.95     0.97

As can be seen, the differences occur mostly in the ham class. All models except the last one (Model VI) perform worse, by between 1% and 3%. Model VI, however, obtains the same results and improves the F1 Score of the spam class by 1%. In addition, this network trains significantly faster, with an average of 210 s per epoch against the 250 s of the network proposed in the literature. The configuration of this model is similar to the VGG network, and is as follows (a Keras sketch is given after the list):

• Embedding layer of 300 dimensions
• Convolutional layer with 16 filters of size 5
• Convolutional layer with 16 filters of size 5
• Max pooling layer of size 2
• Convolutional layer with 32 filters of size 5
• Convolutional layer with 32 filters of size 5
• Max pooling layer of size 2
• Long Short Term Memory layer with 100 units
• Dropout layer with 0.1 rate
• Flatten layer
• Dense layer with 2 units (one per target class)
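A minimal Keras sketch of this configuration follows; layer arguments not fixed by the list, such as return_sequences on the LSTM, are assumptions.

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                         Flatten, LSTM, MaxPooling1D)

    model_vi = Sequential([
        Embedding(input_dim=vocab_size, output_dim=300, input_length=max_len),
        Conv1D(16, 5, activation="relu"),
        Conv1D(16, 5, activation="relu"),
        MaxPooling1D(pool_size=2),
        Conv1D(32, 5, activation="relu"),
        Conv1D(32, 5, activation="relu"),
        MaxPooling1D(pool_size=2),
        # return_sequences=True is an assumption, so the Flatten layer below
        # has a full sequence to flatten; the list does not specify it.
        LSTM(100, return_sequences=True),
        Dropout(0.1),
        Flatten(),
        Dense(2, activation="softmax"),
    ])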
3.3.4 Testing the whole dataset

The original dataset is highly imbalanced, and this had several negative effects on the training of the neural networks (these problems are explained at the beginning of Section 3.3). Therefore, the ability of the obtained models to generalize needs to be tested on the full dataset. For the spam records no big difference should appear, since all of them were used in the balanced dataset. The ham class, however, contains more than 1.6 million records that were not used in training.

Firstly, we test the best models from the literature: the CNN with configuration B, the RNN with configuration B and the CRNN with configuration A. Their results are shown in Tables 3.20, 3.21 and 3.22 respectively. In addition, the metrics for the best performing model of the deeper neural networks (Section 3.3.3) are shown in Table 3.23.

Table 3.20: Performance metrics of configuration B CNN on the full dataset.

            Ham    Spam
Precision   1.00   0.06
Recall      0.71   0.87
F1 Score    0.83   0.12

Table 3.21: Performance metrics of configuration B RNN on the full dataset.

            Ham    Spam
Precision   0.99   0.06
Recall      0.73   0.83
F1 Score    0.84   0.12

Table 3.22: Performance metrics of configuration A CRNN on the full dataset.

            Ham    Spam
Precision   1.00   0.06
Recall      0.70   0.85
F1 Score    0.82   0.11

Table 3.23: Performance metrics of deep NN model VI on the full dataset.

            Ham    Spam
Precision   0.99   0.06
Recall      0.72   0.77
F1 Score    0.84   0.11

Finally, other models which had obtained almost perfect results (100% precision), but which were suspected of overfitting, were tested to observe their ability to generalize. These are shown in Tables 3.24 and 3.25; the corresponding model number is given in each caption, and its configuration can be found in Annex A.

Table 3.24: Performance metrics of deep NN model XXVIII on the full dataset.

            Ham    Spam
Precision   0.99   0.11
Recall      0.92   0.47
F1 Score    0.95   0.18

Table 3.25: Performance metrics of deep NN model XXX on the full dataset.

            Ham    Spam
Precision   0.98   0.07
Recall      0.96   0.15
F1 Score    0.97   0.09

It can be seen that in all cases the precision of the spam class is very low, reaching a maximum of 11%. The recall is also very low in the last two tested models. It can be concluded that, while the models predict the ham class successfully, they produce a very high number of false positives.

Chapter 4: Final conclusions and future work

The work done in this master thesis studies the suitability of neural networks to tackle the spam problem in general purpose institutional repositories, specifically Zenodo.

As a first step, the usage of Random Forests based on the already existing spam classifier was tested, achieving an almost perfect score of 99.98% accuracy for ham content, but a lower 91.90% for spam content, and taking around 45 minutes to train and 25 seconds to predict one third (533K) of the records. Apart from setting the baseline in accuracy and time performance for the neural networks, these classifiers were also trained using a reduced text corpus, which proved that 4000 words are enough to obtain an accurate classification. It is nonetheless possible that this number can be reduced further; work on this aspect would result in smaller vectors and faster training and prediction times, especially for the neural networks. Finally, features based on the Zenodo supporters' experience and on the literature were extracted, generating a new dataset of categorical data, and another Random Forest classifier was trained on it. Although its time performance was 7.5 times faster and its accuracy on the ham class remained the same, its accuracy for spam records decreased by almost 3%.
Concerning the neural networks, the state of the art showed that methods similar to those used successfully in computer vision can also achieve good results in natural language processing tasks such as text classification. Many studies managed to predict spam in SMS, Twitter and other social network and news content with high accuracy. In particular, Jain et al. studied the use of convolutional neural networks [10], recurrent neural networks [11] and a combination of both [12], achieving 99.01% and 95.48% accuracy on SMS and Twitter datasets respectively.

The models proposed by Jain et al. were tested on a balanced subset of Zenodo's data, obtaining high results (99% precision) for the spam class and slightly lower ones (93 to 95% precision) for the ham class. In addition, the influence of language (e.g. English vs other languages) and of stop words was tested, and proved not to have a significant effect on the prediction performance.

It could be observed that the model configurations proposed by Jain et al. contain a few layers with a large number of units or filters. Therefore, models with fewer units or filters but deeper in layers were trained, testing also different architectures (e.g. VGG-like), optimizers (e.g. Adam, Adagrad) and embedding sizes (e.g. 200, 300). As a result, 30 configurations were trained, obtaining results similar to those of Jain et al., the only difference being 1% in the F1 Score of the spam class. However, the best of these networks trains significantly faster, with an average of 210 s per epoch against the 250 s of the network proposed in the literature.

Finally, the literature models and the best deeper models were tested against the whole dataset, and all of them showed poor performance. While the ham class achieved high precision, and in some cases also high recall, the spam class precision was very low, in some cases close to 0; the number of false positives is therefore very high. Once again this is a consequence of the imbalanced nature of the dataset and of the technique used to balance it: it is suspected that not all record clusters were represented in the under-sampled set. There are several potential solutions. One would be to perform a clustering of the ham records and sample uniformly from the clusters (this was the preferred option, but due to time constraints it was not possible). Another option would be to train the models with more subsets of the ham records until a suitable result is obtained upon prediction. Yet another option would be to over-sample the spam class to reach a balanced 1.6M records; however, this might result in an overfit of that class, since the number of original spam records is much smaller (by a factor of 45).

4.1 Conclusion

Neural networks have proven to achieve good results (around 99% accuracy and precision) on a small subset of the dataset. However, they need to be trained with more data in order to be appropriate for a production environment. This comes with the challenge of obtaining a sensible balanced dataset with all types of records represented. Therefore, the production setup should still be the Random Forest, although certain improvements could be made, such as reducing the number of words or refining their treatment (removal of punctuation marks, etc.). These are described in more detail in Section 4.2.

4.2 Future work

Several lines of work follow from this thesis.
In terms of data processing and feature extraction, it would be interesting to correlate the records with user data: How old is the user in the system? How many records have they published? How many communities do they belong to? Where do the spammers come from (e.g. via IP geolocation)? Moreover, the creation time of the record would allow a time series analysis, which could profile the time, and from that the location, of the spammers. Finally, the data could be enriched with checks of the veracity of external identifiers, such as ORCID, although this might be more useful to the repository itself and should arguably be done by it.

Concerning the multi-language nature of the data, it would be interesting to analyze the presence of stop words in the data itself, answering the question: do spammers use more (or fewer) stop words than legitimate publishers? This could become a new extracted feature (e.g. the number of stop words present in the record).

Regarding the suitability of the model for a production environment, as stated before the model is not ready, since it produces a large number of false positives. However, this effect could be minimized by using it as part of an ensemble, with a meta classifier taking the decision from the output of several classifiers, periodically feeding the correct results back into the NN model's training. Another alternative would be a pipeline: the NN model runs first, and the records it flags are passed to the RF, considering them spam if both models give a result above a certain threshold (to be defined), again feeding the correct results back into training. Finally, external tools such as Google BERT (https://github.com/google-research/bert) could be tested on the data, and even added to the meta ensemble or the pipeline.

Chapter 5: Deployment in production

Industry algorithms are often deployed in the cloud, using technologies such as Amazon SageMaker (https://aws.amazon.com/es/sagemaker/), Google Cloud AI (https://cloud.google.com/ai-platform), IBM Watson (https://www.ibm.com/cloud/machine-learning) or similar products. Independently of the hardware or software stack used by these products, they all share common characteristics:

• They expose a RESTful API to submit requests, such as the prediction of a record.
• They scale along with the demand, in order to provide almost real-time responses, or return a promise and respond as fast as possible.

Zenodo is deployed in CERN's Data Center (https://about.zenodo.org/infrastructure/), and therefore the classifier must be deployed using CERN's infrastructure. As a consequence, the available technology is reduced to GPU Virtual Machines on OpenStack (https://clouddocs.web.cern.ch/gpu/README.html), with up to 32GB of memory, and containerized application deployments on OpenShift (https://information-technology.web.cern.ch/services/paas-web-app).

Due to the restrictions presented by CERN's infrastructure, the deployment will be done ad-hoc and the RESTful API will need to be developed. Keras provides a tutorial (https://blog.keras.io/building-a-simple-keras-deep-learning-rest-api.html) on how to develop a RESTful API using Flask, the same technology on which Invenio and Zenodo are built, so it would be easy to integrate into the code base. This could be deployed as a containerized application on OpenShift, making use of the built-in AutoScaler component (https://docs.openshift.com/container-platform/3.9/dev_guide/pod_autoscaling.html) to scale on demand.
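A minimal sketch of such an endpoint, following the pattern of the Keras tutorial referenced above: the route, model file name and preprocessing parameters are illustrative assumptions, not Zenodo's actual integration, and the vocabulary size and padding length must match whatever was used at training time.

    from flask import Flask, jsonify, request
    from tensorflow.keras.models import load_model
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.preprocessing.text import one_hot

    app = Flask(__name__)
    model = load_model("spam_classifier.h5")   # hypothetical model file
    VOCAB_SIZE, MAX_LEN = 50000, 4000          # must match the training setup

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expects a JSON body like {"text": "<record text>"}.
        text = request.get_json(force=True).get("text", "")
        seq = pad_sequences([one_hot(text, n=VOCAB_SIZE)], maxlen=MAX_LEN)
        ham_prob, spam_prob = model.predict(seq)[0]
        return jsonify({"spam_probability": float(spam_prob)})

    if __name__ == "__main__":
        app.run()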
The re-training of the model would be carried out on a GPU Virtual Machine on OpenStack, to be able to perform it in reasonable time. TensorBoard (https://www.tensorflow.org/tensorboard) could be used to monitor the training process.

Bibliography

[1] Aliaksandr Barushka and Petr Hajek. Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks. Applied Intelligence, 48(10):3538–3556, 2018.

[2] Johannes M Bauer, Michel JG Van Eeten, Tithi Chattopadhyay, and Yuehua Wu. ITU study on the financial aspects of network security: Malware and spam. ICT Applications and Cybersecurity Division, International Telecommunication Union, Final Report, July, 2008.

[3] Andras A Benczur, Karoly Csalogany, Tamas Sarlos, and Mate Uher. SpamRank: fully automatic link spam detection, work in progress. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, pages 1–14, 2005.

[4] Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, and Fabrizio Silvestri. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 423–430, 2007.

[5] James Clark, Irena Koprinska, and Josiah Poon. A neural network based approach to automated e-mail classification. In Proceedings IEEE/WIC International Conference on Web Intelligence (WI 2003), pages 702–705. IEEE, 2003.

[6] Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781, 2016.

[7] Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pages 50–56. IEEE, 2018.

[8] Kwang Leng Goh and Ashutosh Kumar Singh. Comprehensive literature review on machine learning structures for web spam classification. Procedia Computer Science, 70:434–441, 2015.

[9] Hongmei He, Tim Watson, Carsten Maple, Jörn Mehnen, and Ashutosh Tiwari. A new semantic attribute deep learning with a linguistic attribute hierarchy for spam detection. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 3862–3869. IEEE, 2017.

[10] Gauri Jain, Manisha Sharma, and Basant Agarwal. Spam detection on social media using semantic convolutional neural network. International Journal of Knowledge Discovery in Bioinformatics (IJKDB), 8(1):12–26, 2018.

[11] Gauri Jain, Manisha Sharma, and Basant Agarwal. Optimizing semantic LSTM for spam detection. International Journal of Information Technology, 11(2):239–250, 2019.

[12] Gauri Jain, Manisha Sharma, and Basant Agarwal. Spam detection in social media using convolutional and long short term memory neural network. Annals of Mathematics and Artificial Intelligence, 85(1):21–44, 2019.

[13] R Jennings. Cost of spam is flattening, 2010.
[14] Chris Kanich, Nicholas Weaver, Damon McCoy, Tristan Halvorson, Christian Kreibich, Kirill Levchenko, Vern Paxson, Geoffrey M Voelker, and Stefan Savage. Show me the money: Characterizing spam-advertised revenue. In USENIX Security Symposium, volume 35, 2011.

[15] Vijay Krishnan and Rashmi Raj. Web spam detection with anti-trust rank. In AIRWeb, volume 6, pages 37–40, 2006.

[16] Hugo Liu and Push Singh. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, 2004.

[17] Sreekanth Madisetty and Maunendra Sankar Desarkar. A neural network-based ensemble approach for spam detection in Twitter. IEEE Transactions on Computational Social Systems, 5(4):973–984, 2018.

[18] Juan Martinez-Romo and Lourdes Araujo. Web spam identification through language model analysis. In Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, pages 21–28, 2009.

[19] Michael Mccord and M Chuah. Spam detection on Twitter using traditional classifiers. In International Conference on Autonomic and Trusted Computing, pages 175–186. Springer, 2011.

[20] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[21] George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[22] Harold Nguyen. 2013 state of social media spam. Publication of NexGate, USA, from http://nexgate.com/wpcontent/uploads/2013/09/Nexgate-2013-State-of-Social-Media-Spam-Research-Report.pdf, 2013.

[23] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[24] D Karthika Renuka, T Hamsapriya, M Raja Chakkaravarthi, and P Lakshmi Surya. Spam classification based on supervised learning using machine learning techniques. In 2011 International Conference on Process Automation, Control and Computing, pages 1–7. IEEE, 2011.

[25] Pradeep Kumar Roy, Jyoti Prakash Singh, and Snehasish Banerjee. Deep learning to filter SMS spam. Future Generation Computer Systems, 102:524–533, 2020.

[26] Nikita Spirin and Jiawei Han. Survey on web spam detection: principles and algorithms. ACM SIGKDD Explorations Newsletter, 13(2):50–64, 2012.

[27] S Sumathi and Ganesh Kumar Pugalendhi. Cognition based spam mail text analysis using combined approach of deep neural network classifier and random forest. Journal of Ambient Intelligence and Humanized Computing, 2020.

[28] Krysta M Svore, Qiang Wu, Chris JC Burges, and Aaswath Raman. Improving web spam classification using rank-time features. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, pages 9–16, 2007.

[29] Dave C Trudgian. Spam classification using nearest neighbour techniques. In International Conference on Intelligent Data Engineering and Automated Learning, pages 578–585. Springer, 2004.

[30] Onur Varol, Emilio Ferrara, Clayton A Davis, Filippo Menczer, and Alessandro Flammini. Online human-bot interactions: Detection, estimation, and characterization. arXiv preprint arXiv:1703.03107, 2017.

[31] Alex Hai Wang. Don't follow me: Spam detection in Twitter. In 2010 International Conference on Security and Cryptography (SECRYPT), pages 1–10. IEEE, 2010.
[32] Tingmin Wu, Shigang Liu, Jun Zhang, and Yang Xiang. Twitter spam detection based on deep learning. In Proceedings of the Australasian Computer Science Week Multiconference, pages 1–8, 2017.

[33] Sihong Xie, Guan Wang, Shuyang Lin, and Philip S Yu. Review spam detection via temporal pattern discovery. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 823–831, 2012.

[34] Bo Yu and Zong-ben Xu. A comparative study for content-based dynamic spam classification using four machine learning algorithms. Knowledge-Based Systems, 21(4):355–362, 2008.

Appendix A: Deep Neural Networks configuration

Model I
• Embedding, 300 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.1
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.1
• Conv1D, 64 filters, size 5, activation ReLU
• Conv1D, 64 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.1
• LSTM, 100 units
• LSTM, 100 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax

Model II
• Embedding, 200 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 64 filters, size 5, activation ReLU
• Conv1D, 64 filters, size 5, activation ReLU
• Conv1D, 64 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 50 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax

Model III
• Embedding, 200 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 100 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax

Model IV
• Embedding, 200 dimensions
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 100 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax

Model V
• Embedding, 200 dimensions
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 50 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax
Model VI
• Embedding, 300 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 100 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax

Model VII
• Embedding, 300 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model VIII
• Embedding, 300 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model IX
• Embedding, 300 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model X
• Embedding, 250 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XI
• Embedding, 250 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XII
• Embedding, 250 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 50 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XIII
• Embedding, 300 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 50 units
• Dropout, rate 0.15
• Flatten
• Dense, 2 units, activation softmax
Model XIV
• Embedding, 200 dimensions
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 32 filters, size 5, activation ReLU
• Conv1D, 32 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 30 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XV
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 30 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XVI
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 50 units
• Dropout, rate 0.15
• Flatten
• Dense, 2 units, activation softmax

Model XVII
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 10 units
• LSTM, 10 units
• Dropout, rate 0.15
• Flatten
• Dense, 2 units, activation softmax

Model XVIII
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 25 units
• LSTM, 25 units
• Dropout, rate 0.15
• Flatten
• Dense, 2 units, activation softmax

Model XIX
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• LSTM, 25 units
• Dropout, rate 0.15
• Flatten
• Dense, 2 units, activation softmax

Model XX
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• MaxPooling1D
• Dropout, rate 0.1
• Conv1D, 8 filters, size 3, activation ReLU
• Conv1D, 8 filters, size 3, activation ReLU
• MaxPooling1D
• Dropout, rate 0.1
• LSTM, 25 units
• Dropout, rate 0.15
• Flatten
• Dense, 2 units, activation softmax
Model XXI
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• LSTM, 50 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXII
• Embedding, 200 dimensions
• Conv1D, 8 filters, size 5, activation ReLU
• Conv1D, 8 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 16 filters, size 5, activation ReLU
• Conv1D, 16 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• LSTM, 50 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXIII
• Embedding, 200 dimensions
• Conv1D, 8 filters, size 3, activation ReLU
• Conv1D, 8 filters, size 3, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 16 filters, size 3, activation ReLU
• Conv1D, 16 filters, size 3, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• LSTM, 50 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXIV
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXV
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• Conv1D, 4 filters, size 5, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXVI
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• Conv1D, 4 filters, size 3, activation ReLU
• MaxPooling1D
• Dropout, rate 0.1
• LSTM, 20 units
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXVII
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• Flatten
• Dense, 2 units, activation softmax

Model XXVIII
• Embedding, 200 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.25
• Flatten
• Dense, 2 units, activation softmax

Model XXIX
• Embedding, 300 dimensions
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• Conv1D, 4 filters, size 10, activation ReLU
• MaxPooling1D
• Dropout, rate 0.25
• Flatten
• Dense, 2 units, activation softmax

Model XXX
• Embedding, 300 dimensions
• Conv1D, 8 filters, size 8, activation ReLU
• Conv1D, 8 filters, size 8, activation ReLU
• MaxPooling1D
• Dropout, rate 0.2
• LSTM, 10 units
• Dropout, rate 0.1
• Flatten
• Dense, 2 units, activation softmax
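Since all thirty configurations are drawn from the same small set of layer types, they could equally be expressed as data and assembled by a single helper. The sketch below is a hypothetical illustration of that idea, not part of the original work; the layer codes, the helper name build_from_spec, and the vocabulary size and sequence length are all assumptions introduced here.

# Hypothetical helper that builds any of the appendix configurations from
# a compact spec. Layer codes: E=Embedding, C=Conv1D, P=MaxPooling1D,
# D=Dropout, L=LSTM, F=Flatten, O=output Dense. All names and defaults
# here are illustrative; the thesis does not describe such a builder.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_from_spec(spec, vocab_size=20_000, seq_len=400):
    model = models.Sequential()
    model.add(layers.Input(shape=(seq_len,)))
    for i, (kind, *args) in enumerate(spec):
        if kind == "E":
            model.add(layers.Embedding(vocab_size, args[0]))
        elif kind == "C":
            filters, size = args
            model.add(layers.Conv1D(filters, size, activation="relu"))
        elif kind == "P":
            model.add(layers.MaxPooling1D())
        elif kind == "D":
            model.add(layers.Dropout(args[0]))
        elif kind == "L":
            # Look ahead: a stacked LSTM needs sequence output
            # from the LSTM that precedes it.
            next_is_lstm = i + 1 < len(spec) and spec[i + 1][0] == "L"
            model.add(layers.LSTM(args[0], return_sequences=next_is_lstm))
        elif kind == "F":
            model.add(layers.Flatten())
        elif kind == "O":
            model.add(layers.Dense(args[0], activation="softmax"))
    return model

# Model XXX expressed as a spec.
model_xxx = build_from_spec([
    ("E", 300),
    ("C", 8, 8), ("C", 8, 8), ("P",),
    ("D", 0.2),
    ("L", 10),
    ("D", 0.1),
    ("F",), ("O", 2),
])
model_xxx.summary()

Expressing the configurations as data in this way would make it straightforward to sweep over all thirty variants in a single training loop.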