Sunday, December 6, 2020

Web crawler

There are a huge number of web pages. What's the purpose of the crawl? Search indexing, copyright-violation detection, plagiarism detection?

CAP theorem: of CA, AP, and CP, partition tolerance (P) is a must in a distributed system, so the real choice is between AP and CP. Suggest CP, since crawling is a background process and can tolerate reduced availability.

A NoSQL store will be expensive at this scale.

Choose HDFS for storage instead.
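A minimal sketch of writing a crawled page to HDFS, assuming the third-party HdfsCLI package (pip install hdfs) and a hypothetical namenode at http://namenode:9870; the path layout is also an assumption:

```python
# Store one compressed page per URL key in HDFS. The namenode address,
# user, and /crawl/pages layout are illustrative assumptions.
from hdfs import InsecureClient

client = InsecureClient('http://namenode:9870', user='crawler')

def store_page(url_key: str, compressed_body: bytes) -> None:
    # One file per URL key; overwrite=True makes retries idempotent.
    client.write(f'/crawl/pages/{url_key}.gz',
                 data=compressed_body, overwrite=True)
```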

The URL is the key for every record.
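A sketch of turning a URL into a stable key; the normalization rules (lowercase scheme and host, drop the fragment, hash the result) are my assumptions, not from the post:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def url_key(url: str) -> str:
    # Normalize so trivially different spellings of the same URL
    # map to the same key, then hash to a fixed-length identifier.
    parts = urlsplit(url)
    normalized = urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                             parts.path or '/', parts.query, ''))
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()
```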

1st set of crawlers: fetch a URL and extract the URLs it links to.
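A stdlib-only sketch of this first stage, fetching one page and collecting the absolute URLs it links to:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href=...> tags."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.add(urljoin(self.base_url, value))

def find_links(url: str) -> set[str]:
    with urlopen(url, timeout=10) as resp:
        html = resp.read().decode('utf-8', errors='replace')
    parser = LinkExtractor(url)
    parser.feed(html)
    return parser.links
```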

2nd set of crawlers: go to a URL and download it. Before downloading, check whether the page's hash has already been stored. Set a key marking that the download has started, so a different crawler doesn't start on the same URL. Store the start time with that key so the claim can expire if the download fails.
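A sketch of that claim-before-download step. A plain dict stands in for the shared key-value store, and the 10-minute expiry is my assumption; in a real store this would be an atomic set-if-absent with a TTL:

```python
import time

CLAIM_TTL_SECONDS = 600
claims: dict[str, float] = {}   # url_key -> time the claim was made
downloaded: set[str] = set()    # url_keys whose content is already stored

def try_claim(url_key: str) -> bool:
    """Return True if this crawler may download the URL."""
    if url_key in downloaded:
        return False            # content already stored
    started = claims.get(url_key)
    if started is not None and time.time() - started < CLAIM_TTL_SECONDS:
        return False            # another crawler is already on it
    # Claim is stale or absent: record our start time and proceed.
    claims[url_key] = time.time()
    return True
```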

Store the URL's content (compressed) and its hash under the key.
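A sketch of that storage step, with a dict standing in for the document store:

```python
import hashlib
import zlib

store: dict[str, dict] = {}

def store_content(url_key: str, body: bytes) -> str:
    # Hash the raw body, compress it, and keep both under the URL key.
    content_hash = hashlib.sha256(body).hexdigest()
    store[url_key] = {
        'hash': content_hash,
        'content': zlib.compress(body),
    }
    return content_hash
```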

Large web pages: hash only a portion of the page. Strip ad code with regexes and hash a part of the remaining content, then store it. Use these hashes to detect duplicates and avoid infinite crawl loops. If a duplicate is found, generate a full hash of both documents and store those as well.
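A sketch of the partial hash for large pages. The ad-stripping regexes and the 64 KiB sample size are illustrative assumptions:

```python
import hashlib
import re

AD_PATTERNS = [
    re.compile(rb'<script\b.*?</script>', re.DOTALL | re.IGNORECASE),
    re.compile(rb'<iframe\b.*?</iframe>', re.DOTALL | re.IGNORECASE),
]
SAMPLE_BYTES = 64 * 1024

def partial_hash(body: bytes) -> str:
    # Strip likely ad/injected markup, then hash only a leading slice
    # of the cleaned page.
    for pattern in AD_PATTERNS:
        body = pattern.sub(b'', body)
    return hashlib.sha256(body[:SAMPLE_BYTES]).hexdigest()
```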

Index both the small (partial) hash and the full hash.
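A sketch of the two-level duplicate check those indexes enable: the partial hash narrows the candidates, and a full hash is generated only on a partial collision, as described above. Dicts stand in for the indexes, and per the post's scheme the full hashes of the earlier colliding documents would also be generated and indexed at this point:

```python
import hashlib

partial_index: dict[str, set[str]] = {}   # partial hash -> url_keys
full_index: dict[str, str] = {}           # full hash -> first url_key

def is_duplicate(url_key: str, p_hash: str, body: bytes) -> bool:
    candidates = partial_index.setdefault(p_hash, set())
    duplicate = False
    if candidates:
        # Partial collision: fall back to the full hash to confirm.
        f_hash = hashlib.sha256(body).hexdigest()
        if f_hash in full_index and full_index[f_hash] != url_key:
            duplicate = True
        full_index.setdefault(f_hash, url_key)
    candidates.add(url_key)
    return duplicate
```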
