StructScraper – a tool for dynamically incorporating semantic data of external web resources into web page content
Abstract:
Web data extraction (web scraping) is a popular but difficult task because documents published on the Web are poorly structured. Semantic markup simplifies extraction; however, the available tools require programming to include the extracted data in the content of a web page, and they have drawbacks that make them inconvenient when data must be included from several sources. The StructScraper tool described in this work adds data from various sources to a web page while the page is loading. It extracts the data from popular types of semantic markup, namely microdata and JSON-LD, as well as from metadata contained in the tags of HTML documents and in the properties of Word documents and PDF files. Using the tool does not require programming; only knowledge of HTML and CSS is needed. The tool can be useful for creating pages with contact details of organizations or with prices for the same product in different online stores, for adding meta-information to hyperlinks, and so on.
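To make the kind of extraction mentioned above concrete, the following is a minimal sketch of reading JSON-LD blocks from an HTML document using only the Python standard library. It is an illustration, not StructScraper's implementation: the class `JsonLdExtractor`, the function `extract_json_ld`, and the sample markup are all hypothetical names and data introduced for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed contents of <script type="application/ld+json"> blocks.

    Hypothetical helper for illustration; not part of StructScraper.
    """

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # A valueless attribute is reported as None, hence the "or ''".
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD blocks


def extract_json_ld(html):
    """Return a list of JSON-LD objects found in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up sample page describing an organization's contact details.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization",
 "name": "Example Org", "telephone": "+1-555-0100"}
</script>
</head><body></body></html>
"""

organizations = extract_json_ld(SAMPLE_HTML)
print(organizations[0]["name"], organizations[0]["telephone"])
```

A script like this is exactly the programming step that existing tools require of a page author; the tool described here aims to replace it with declarative HTML and CSS.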