Software design & coding-notes: Detect duplicates among billions of data

Sunday, December 6, 2020

Detect duplicates among billions of data

Data can be URLs.

HashMap: keep every URL seen so far as a key. A URL that is already in the map is a duplicate.
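A minimal sketch of the HashMap idea, using a Python dict in place of a HashMap. The function name `find_duplicate_urls` is hypothetical, and the sketch assumes the URLs fit in the memory of one machine, which will not hold for billions of entries without partitioning.

```python
def find_duplicate_urls(urls):
    """Return each URL that occurs more than once, in order of first repeat."""
    seen = {}        # URL -> number of times seen so far
    duplicates = []
    for url in urls:
        count = seen.get(url, 0)
        if count == 1:
            # Second sighting: report the URL once, ignore later repeats.
            duplicates.append(url)
        seen[url] = count + 1
    return duplicates
```

Lookups and inserts are constant time on average, so one pass over the input is enough.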

URLs with different query strings can yield the same result, so comparing URLs alone is not enough. Also compare a hash of the partial content or the full content.
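A rough sketch of the content comparison, assuming SHA-256 as the hash and the first 4096 bytes as the "partial content". Both choices, and all names here, are illustrative assumptions rather than anything the notes prescribe.

```python
import hashlib

PREFIX_BYTES = 4096  # assumed size of the "partial content" prefix


def partial_hash(content: bytes) -> str:
    """Hash only the first PREFIX_BYTES bytes: a cheap first filter."""
    return hashlib.sha256(content[:PREFIX_BYTES]).hexdigest()


def full_hash(content: bytes) -> str:
    """Hash the whole content: more expensive, used to confirm a match."""
    return hashlib.sha256(content).hexdigest()


def same_content(a: bytes, b: bytes) -> bool:
    """Compare partial hashes first, then full hashes only if they agree."""
    if partial_hash(a) != partial_hash(b):
        return False
    return full_hash(a) == full_hash(b)
```

The partial hash rules out most non-duplicates cheaply. The full hash catches contents that share a prefix but differ later.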

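At this scale, store the records in HDFS. The URL is the key, and the indexed field is the partial content hash. This is similar to how a web crawler detects pages it has already seen.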
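The record layout can be sketched in memory as follows. This is a stand-in for illustration only: it does not use HDFS, and the class `ContentIndex`, its fields, and the 4096-byte prefix are all hypothetical. It keeps one map keyed by URL and a secondary index from partial content hash to the URLs that share it.

```python
import hashlib
from collections import defaultdict


class ContentIndex:
    """In-memory stand-in: URL is the key, partial content hash is indexed."""

    def __init__(self, prefix_bytes=4096):
        self.prefix_bytes = prefix_bytes
        self.records = {}                    # URL -> full content hash
        self.by_partial = defaultdict(list)  # partial hash -> list of URLs

    def add(self, url, content):
        """Store a record and return the stored URLs it duplicates."""
        if url in self.records:
            return [url]  # same key already stored
        partial = hashlib.sha256(content[:self.prefix_bytes]).hexdigest()
        full = hashlib.sha256(content).hexdigest()
        # Candidates share the partial hash; confirm with the full hash.
        matches = [u for u in self.by_partial[partial]
                   if self.records[u] == full]
        self.records[url] = full
        self.by_partial[partial].append(url)
        return matches


index = ContentIndex()
index.add("http://example.com/a?x=1", b"same page")
print(index.add("http://example.com/a?x=2", b"same page"))
```

The final call reports the first URL as a duplicate even though the two query strings differ, because both the partial and full hashes match.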

Data can also be documents. This is again similar to a web crawler.
