Sunday, December 6, 2020

Detect duplicates among billions of records

Data can be URLs.

Use a HashMap keyed on the URL for exact-match lookups.
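A minimal sketch of the hash map idea in Python, assuming the URLs arrive as an iterable of strings. The function name `find_duplicates` and the choice of SHA-256 for the key digest are illustrative assumptions, not part of the original notes. At billions of entries a single in-memory map would likely have to be partitioned across machines; this only shows the lookup logic.

```python
import hashlib


def find_duplicates(urls):
    """Return (duplicate, first_seen) pairs for URLs seen more than once.

    Hypothetical helper: the map is keyed on a fixed-size digest of the URL
    so that every key costs the same amount of memory.
    """
    seen = {}        # digest -> first URL that produced it
    duplicates = []  # (duplicate URL, URL first seen with the same digest)
    for url in urls:
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        if digest in seen:
            duplicates.append((url, seen[digest]))
        else:
            seen[digest] = url
    return duplicates
```

Calling `find_duplicates` on a list in which one URL appears twice returns a single pair for that URL.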

Different query strings can yield the same result, so comparing URLs alone is not enough. Need to compare with a hash of the partial content or of the full content.
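One possible way to combine the two hashes, sketched in Python: bucket pages by a cheap hash of the first bytes, then confirm with a hash of the full content only inside buckets that hold more than one page. The names `PREFIX_BYTES`, `partial_hash`, `full_hash`, and `group_duplicate_urls`, the 1024-byte prefix, and the in-memory `pages` dict are all assumptions made for illustration.

```python
import hashlib

PREFIX_BYTES = 1024  # assumed prefix length for the partial hash


def partial_hash(content: bytes) -> bytes:
    """Hash only the first PREFIX_BYTES bytes of the content."""
    return hashlib.sha256(content[:PREFIX_BYTES]).digest()


def full_hash(content: bytes) -> bytes:
    """Hash the complete content."""
    return hashlib.sha256(content).digest()


def group_duplicate_urls(pages):
    """pages maps URL -> content bytes; returns lists of URLs with equal content."""
    # Stage 1: bucket URLs by the partial content hash.
    by_partial = {}
    for url, content in pages.items():
        by_partial.setdefault(partial_hash(content), []).append(url)

    # Stage 2: inside each bucket with several URLs, confirm with the full hash.
    groups = []
    for candidates in by_partial.values():
        if len(candidates) < 2:
            continue
        by_full = {}
        for url in candidates:
            by_full.setdefault(full_hash(pages[url]), []).append(url)
        groups.extend(sorted(g) for g in by_full.values() if len(g) > 1)
    return groups
```

The partial hash keeps most comparisons cheap, and the full hash is computed only for pages whose prefixes already match.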

Store the data in HDFS. The URL is the key, and the indexed field is the partial content hash. This is similar to a Web crawler.

Data can also be docs. Again, this is similar to a web crawler.
