Conference material: "Proceedings of the International Conference on Computer Graphics and Vision “Graphicon” (23-26 September 2019, Bryansk)"
Authors: Krivosheev N.A., Spitsyn V.G.
Machine Learning Methods for Classification of Textual Information
Abstract:
A method for classifying textual information based on convolutional neural networks is considered. A text preprocessing algorithm is presented; it includes lemmatizing words, removing stop words, processing text characters, and other steps. The text is then converted word by word into dense vectors. Testing is carried out on the 'The 20 Newsgroups' text dataset, a collection of approximately 20,000 English-language newsgroup documents divided approximately evenly among 20 categories. The best convolutional neural network used in this work achieved an accuracy of approximately 74% on the test set; its topology is given. Voting among the neural networks using the Bagging algorithm achieved an accuracy of approximately 81.5%. Based on a review of similar solutions, a comparison is made with the following text classification algorithms: the support vector machine (SVM, 82.84%), the naive Bayes classifier (81%), the k-nearest neighbors algorithm (75.93%), and the bag-of-words model.
Keywords:
neural networks, Bagging, text classification, database “The 20 Newsgroups”
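Illustrative sketch (not taken from the paper): the abstract reports that voting among several networks with the Bagging algorithm improves on the best single network. The general idea can be outlined as follows, assuming each network is trained on a bootstrap resample of the training set and the final label is chosen by simple majority vote. The function names `bootstrap_indices` and `majority_vote`, and the tie-breaking rule, are hypothetical choices made for this example; the authors' actual voting scheme may differ.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples training indices with replacement (one bootstrap resample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(predictions):
    """Combine per-model predictions into one label per sample.

    predictions: list of label lists, one list per model, all of equal length.
    Ties are broken in favor of the smallest label (an assumption of this sketch).
    """
    combined = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined


# Three hypothetical models voting on three documents.
model_predictions = [
    [0, 1, 2],
    [0, 1, 1],
    [2, 1, 1],
]
print(majority_vote(model_predictions))  # [0, 1, 1]

# Indices of one bootstrap resample for a training set of 10 documents.
print(bootstrap_indices(10, random.Random(0)))
```

In a full pipeline, each model would be trained on the documents selected by its own call to `bootstrap_indices`, and `majority_vote` would be applied to the models' predicted categories on the test set.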