There are a huge number of web pages. What is the purpose of crawling them? Search indexing, copyright-violation detection, plagiarism detection?
CAP theorem (CA, AP, CP): partition tolerance (P) is a must, so the real choice is AP vs. CP. Suggest CP, since crawling is a background process and can tolerate reduced availability.
NoSQL storage would be expensive at this volume of page data. Choose HDFS for the downloaded content instead.
Use the URL as the key.
1st set of crawlers: fetch a URL and extract the URLs it links to.
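A minimal sketch of this link-discovery step, assuming pages are fetched over HTTP; the `LinkExtractor` and `discover_links` names are illustrative, and handing new URLs to a frontier queue is left out.

```python
# Link discovery: fetch one page and collect the absolute URLs it links to.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))


def discover_links(url):
    """Fetch one page and return the set of URLs it links to."""
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    extractor = LinkExtractor(url)
    extractor.feed(html)
    return extractor.links
```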
2nd set of crawlers: go to the URL and download it. Before downloading, check whether its hash has already been stored. Mark the key as "download started" so a different crawler does not start on the same URL, and record the start time so a stalled or failed download can be reclaimed (see the claim sketch below).
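A minimal sketch of claiming a URL before download, assuming a key-value metadata store. The in-memory dict, the `claim_url` name, and the 15-minute stale-claim timeout are illustrative assumptions, not part of the notes.

```python
# Claiming a URL so only one crawler downloads it; the stored start time lets
# a later crawler reclaim the URL if the first attempt stalled or failed.
import time

CLAIM_TIMEOUT_SECONDS = 15 * 60  # assumed reclaim window for failed downloads

metadata_store = {}  # url -> {"status": ..., "started_at": ...} (stand-in for a real store)


def claim_url(url):
    """Return True if this crawler may download the URL, False otherwise."""
    entry = metadata_store.get(url)
    if entry is None:
        metadata_store[url] = {"status": "downloading", "started_at": time.time()}
        return True
    if entry["status"] == "done":
        return False  # already downloaded
    if entry["status"] == "downloading":
        # Reclaim the URL only if the earlier attempt appears to have failed.
        if time.time() - entry["started_at"] > CLAIM_TIMEOUT_SECONDS:
            metadata_store[url] = {"status": "downloading", "started_at": time.time()}
            return True
    return False
```

In a real distributed setup this check-and-set would have to be atomic (e.g. a conditional write in the metadata store), otherwise two crawlers can still claim the same URL.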
Store the page content (compressed) and its hash under the URL key.
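A minimal sketch of that store step: compress the content, hash it, and record both under the URL key. The `store_page` name and the in-memory dicts are illustrative; per the notes above, the content itself would live in HDFS.

```python
# Store the downloaded page: compressed content plus its full content hash,
# both keyed by URL.
import hashlib
import zlib

content_store = {}   # url -> compressed bytes (stand-in for HDFS)
hash_store = {}      # url -> full content hash


def store_page(url, content: bytes):
    compressed = zlib.compress(content)
    full_hash = hashlib.sha256(content).hexdigest()
    content_store[url] = compressed
    hash_store[url] = full_hash
    return full_hash
```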
Large web pages: hash only a portion of the page. Strip any ad code with regexes, hash a slice of the remaining content, and store it. Use these partial hashes to find duplicates and avoid crawling loops. If a partial-hash match is found, generate full hashes of both documents, compare them, and store the result.
Index both the small (partial) and large (full) hashes for fast duplicate lookup; a combined sketch follows.
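A minimal sketch of the partial-hash / full-hash duplicate check from the two points above. The ad-stripping regex, the 4 KB slice size, and the index structures are illustrative assumptions, not specified in the notes.

```python
# Duplicate detection: cheap partial hash on ad-stripped content first,
# full-hash comparison only when the partial hashes collide.
import hashlib
import re

AD_PATTERN = re.compile(rb"<script.*?</script>|<iframe.*?</iframe>",
                        re.DOTALL | re.IGNORECASE)
PARTIAL_SLICE = 4096  # assumed: hash only the first 4 KB of stripped content

partial_index = {}  # partial hash -> url of the first page seen with that hash
full_index = {}     # full hash -> url


def partial_hash(content: bytes) -> str:
    stripped = AD_PATTERN.sub(b"", content)
    return hashlib.sha256(stripped[:PARTIAL_SLICE]).hexdigest()


def is_duplicate(url, content: bytes, fetch_content) -> bool:
    """Return True if this page duplicates one already crawled.

    fetch_content(url) retrieves the stored content of the earlier candidate.
    """
    p_hash = partial_hash(content)
    candidate_url = partial_index.get(p_hash)
    if candidate_url is None:
        partial_index[p_hash] = url
        return False
    # Partial hashes collide: confirm with full hashes of both documents.
    full_new = hashlib.sha256(content).hexdigest()
    full_old = hashlib.sha256(fetch_content(candidate_url)).hexdigest()
    full_index[full_new] = url
    full_index[full_old] = candidate_url
    return full_new == full_old
```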