Detection methods for web resources automated data collection
Menshchikov Alexander Alexeevich

graduate student, Saint Petersburg State University of Information Technologies

197101, Russia, Saint Petersburg, Kronverkskii Prospekt, 49

menshikov@corp.ifmo.ru
 
Gatchin Yurii

Doctor of Technical Science

Professor, Saint Petersburg State University of Information Technologies

197101, Russia, Saint Petersburg, Kronverkskii Prospekt, 49

gatchin@mail.ifmo.ru

Abstract.

This article addresses the problem of automated data collection from web resources. The authors present a classification of detection methods that takes modern approaches into account, analyze existing methods for detecting and countering web robots, and examine the possibilities and limitations of combining these methods. To date, there is no open web robot detection system suitable for use in real-world conditions, so developing an integrated system that combines a variety of methods, techniques, and approaches is a pressing task. To address this problem, the authors developed a software prototype of such a detection system and tested it on real data. The theoretical significance of the study lies in advancing this line of research in the domestic segment by building a web robot detection system based on the latest methods and improving on global best practices. Its applied significance lies in creating a base for developing in-demand, promising software.

Keywords: web-robots, information gathering, parsing, web robot detection, web security, information security, information protection, intrusion detection, intrusion prevention, weblogs analysis

DOI: 10.7256/2306-4196.2015.5.16589

Article was received: 08-10-2015

Review date: 09-10-2015

Publish date: 27-11-2015


This article is written in Russian; the full text is available in Russian only.
