There are a huge number of web pages. What is the purpose of crawling them? Search indexing, copyright-violation detection, plagiarism detection?
CAP theorem (CA, AP, CP): partition tolerance (P) is a must, so the real choice is AP vs. CP. Suggest CP, since crawling is a background process and can tolerate reduced availability.
NoSQL storage would be expensive at this volume of page data. Choose HDFS for the downloaded content instead.
Use the URL as the key.
1st set of crawlers: fetch a URL and extract the URLs it links to.
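A minimal sketch of this link-discovery step, assuming pages are fetched over HTTP; the `LinkExtractor` and `discover_links` names are illustrative, and handing new URLs to a frontier queue is left out.

```python
# Link discovery: fetch one page and collect the absolute URLs it links to.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))


def discover_links(url):
    """Fetch one page and return the set of URLs it links to."""
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    extractor = LinkExtractor(url)
    extractor.feed(html)
    return extractor.links
```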
2nd set of crawlers: go to the URL and download it. Before downloading, check whether its hash has already been stored. Mark the key as "download started" so a different crawler does not start on the same URL, and record the start time so a stalled or failed download can be reclaimed (see the claim sketch below).
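A minimal sketch of claiming a URL before download, assuming a key-value metadata store. The in-memory dict, the `claim_url` name, and the 15-minute stale-claim timeout are illustrative assumptions, not part of the notes.

```python
# Claiming a URL so only one crawler downloads it; the stored start time lets
# a later crawler reclaim the URL if the first attempt stalled or failed.
import time

CLAIM_TIMEOUT_SECONDS = 15 * 60  # assumed reclaim window for failed downloads

metadata_store = {}  # url -> {"status": ..., "started_at": ...} (stand-in for a real store)


def claim_url(url):
    """Return True if this crawler may download the URL, False otherwise."""
    entry = metadata_store.get(url)
    if entry is None:
        metadata_store[url] = {"status": "downloading", "started_at": time.time()}
        return True
    if entry["status"] == "done":
        return False  # already downloaded
    if entry["status"] == "downloading":
        # Reclaim the URL only if the earlier attempt appears to have failed.
        if time.time() - entry["started_at"] > CLAIM_TIMEOUT_SECONDS:
            metadata_store[url] = {"status": "downloading", "started_at": time.time()}
            return True
    return False
```

In a real distributed setup this check-and-set would have to be atomic (e.g. a conditional write in the metadata store), otherwise two crawlers can still claim the same URL.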
Store the page content (compressed) and its hash under the URL key.
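A minimal sketch of that store step: compress the content, hash it, and record both under the URL key. The `store_page` name and the in-memory dicts are illustrative; per the notes above, the content itself would live in HDFS.

```python
# Store the downloaded page: compressed content plus its full content hash,
# both keyed by URL.
import hashlib
import zlib

content_store = {}   # url -> compressed bytes (stand-in for HDFS)
hash_store = {}      # url -> full content hash


def store_page(url, content: bytes):
    compressed = zlib.compress(content)
    full_hash = hashlib.sha256(content).hexdigest()
    content_store[url] = compressed
    hash_store[url] = full_hash
    return full_hash
```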
Large web pages: hash only a portion of the page. Strip any ad code with regexes, hash a slice of the remaining content, and store it. Use these partial hashes to find duplicates and avoid crawling loops. If a partial-hash match is found, generate full hashes of both documents, compare them, and store the result.
Index both the small (partial) and large (full) hashes for fast duplicate lookup; a combined sketch follows.
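A minimal sketch of the partial-hash / full-hash duplicate check from the two points above. The ad-stripping regex, the 4 KB slice size, and the index structures are illustrative assumptions, not specified in the notes.

```python
# Duplicate detection: cheap partial hash on ad-stripped content first,
# full-hash comparison only when the partial hashes collide.
import hashlib
import re

AD_PATTERN = re.compile(rb"<script.*?</script>|<iframe.*?</iframe>",
                        re.DOTALL | re.IGNORECASE)
PARTIAL_SLICE = 4096  # assumed: hash only the first 4 KB of stripped content

partial_index = {}  # partial hash -> url of the first page seen with that hash
full_index = {}     # full hash -> url


def partial_hash(content: bytes) -> str:
    stripped = AD_PATTERN.sub(b"", content)
    return hashlib.sha256(stripped[:PARTIAL_SLICE]).hexdigest()


def is_duplicate(url, content: bytes, fetch_content) -> bool:
    """Return True if this page duplicates one already crawled.

    fetch_content(url) retrieves the stored content of the earlier candidate.
    """
    p_hash = partial_hash(content)
    candidate_url = partial_index.get(p_hash)
    if candidate_url is None:
        partial_index[p_hash] = url
        return False
    # Partial hashes collide: confirm with full hashes of both documents.
    full_new = hashlib.sha256(content).hexdigest()
    full_old = hashlib.sha256(fetch_content(candidate_url)).hexdigest()
    full_index[full_new] = url
    full_index[full_old] = candidate_url
    return full_new == full_old
```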