Machine Learning Methods for Classification of Textual Information

Krivosheev N.A.; Spitsyn V.G.

Abstract:

A method for classifying textual information with convolutional neural networks is considered. The text preprocessing algorithm is presented; it includes lemmatizing words, removing stop words, processing text characters, and related steps. The text is then converted word by word into dense vectors. Testing is carried out on the 'The 20 Newsgroups' dataset, a collection of approximately 20,000 newsgroup posts in English divided approximately evenly among 20 categories. The best convolutional neural network used in this work reached an accuracy of ~74% on the test set, and its topology is given. Voting over an ensemble of neural networks built with the Bagging algorithm reached an accuracy of ~81.5%. Based on a review of similar solutions, a comparison is made with the following text classification algorithms: the support vector machine (SVM, 82.84%), the naive Bayes classifier (81%), the k-nearest neighbors algorithm (75.93%), and the bag-of-words model.
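The preprocessing and voting steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the `preprocess` and `majority_vote` function names, the order of the preprocessing steps, and the identity lemmatizer used by default are all assumptions made for illustration. Only the voting step of Bagging is shown; training each network on a bootstrap sample of the training set is omitted.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the paper's list is not given).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def preprocess(text, lemmatize=lambda word: word):
    """Lowercase the text, keep only alphabetic tokens, drop stop words,
    and pass each remaining token through a lemmatizer.

    `lemmatize` is a hypothetical hook: by default it returns the word
    unchanged, and a real lemmatizer could be plugged in instead.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatize(t) for t in tokens if t not in STOP_WORDS]


def majority_vote(predictions):
    """Combine class labels predicted by several models.

    `predictions` is a list with one entry per model; each entry is the
    list of labels that model predicted for the same sequence of documents.
    Ties are resolved in favour of the label that appears first among the
    models' votes.
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


if __name__ == "__main__":
    print(preprocess("The cats are sitting on the mat!"))
    # Three hypothetical models voting on three documents.
    print(majority_vote([[1, 2, 3], [1, 2, 0], [0, 2, 3]]))
```

In this sketch each model casts one equally weighted vote per document; weighting votes by validation accuracy would be a different design and is not implied by the abstract.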

Keywords:

neural networks, Bagging, text classification, database “The 20 Newsgroups”

Publication language: Russian, pages: 4 (pp. 266-269)

About authors:

Krivosheev Nikolay Anatolyevich, National Research Tomsk Polytechnic University

Spitsyn Vladimir Grigorievich, National Research Tomsk Polytechnic University