Data can be URLs.
Use a HashMap keyed by URL.
URLs with different query strings may yield the same result, so comparing URLs alone is not enough. Need to compare a hash of the partial content or the full content.
For larger data, store in HDFS. URL is the key. The indexed field is the partial content hash. Similar to a web crawler.
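A minimal in-memory sketch of the idea, with plain dicts standing in for the HDFS-backed store. The names `DedupIndex`, `partial_hash`, `full_hash`, and the 1024-byte prefix length are assumptions made for illustration, not part of the original notes. The partial hash acts as the indexed field that narrows the candidates; the full hash confirms a real duplicate.

```python
import hashlib

PARTIAL_BYTES = 1024  # assumed prefix length for the partial hash


def partial_hash(content: bytes) -> str:
    """Hash only the first PARTIAL_BYTES bytes of the content."""
    return hashlib.sha256(content[:PARTIAL_BYTES]).hexdigest()


def full_hash(content: bytes) -> str:
    """Hash the entire content."""
    return hashlib.sha256(content).hexdigest()


class DedupIndex:
    """Hypothetical stand-in for the store: URL is the key,
    partial content hash is the secondary (indexed) field."""

    def __init__(self):
        self.by_url = {}      # url -> (partial hash, full hash)
        self.by_partial = {}  # partial hash -> list of urls

    def add(self, url: str, content: bytes):
        """Return the URL of an existing duplicate, or None if new."""
        if url in self.by_url:
            return url  # exact URL already seen
        p = partial_hash(content)
        f = full_hash(content)
        # Only records sharing the partial hash are candidates;
        # the full hash decides whether they are true duplicates.
        for other in self.by_partial.get(p, []):
            if self.by_url[other][1] == f:
                return other
        self.by_url[url] = (p, f)
        self.by_partial.setdefault(p, []).append(url)
        return None
```

With this sketch, two URLs that differ only in query string but return the same bytes resolve to the first URL stored, while contents that share a prefix but differ later are kept as separate entries.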
Data can be docs. Again, similar to a web crawler.