Software design & coding-notes: December 2020

Wednesday, December 16, 2020

API Gateway-how to choose-notes

https://biztechswami.blogspot.com/2020/12/microservices-api-gateway.html

Every team can have different microservices. Every product can be different.

How to make it consistent & uniform? Rate limiting, throttling, manage APIs, set limits.

Auth: API gateway.
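
Rate limiting at a gateway is often done with a token bucket. A minimal sketch, assuming one bucket per API key; the class name, parameters and the injectable clock are illustrative, not any gateway's actual API.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would typically answer HTTP 429
```

A gateway would keep one such bucket per client or per route and reject the request when allow() returns False.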

Kong: open source, good features, but needs an expert. The operations team needs expertise for troubleshooting, or buy support. With a premium contract, almost as expensive as AWS API Gateway.

If customer requirements are different, portability required, choose FOSS. Run anywhere.

For most APIs that don't need an API gateway: use AWS ALB (Application Load Balancer).

Thursday, December 10, 2020

Kafka vs Flume

Kafka: pub-sub mechanism, durable distributed store, producer consumer.

Messages published to topics & read with offsets.

Publishers: post to topics. Consumers: not overwhelmed.

Replication factor (RF): 2 is also supported.

HDFS consumer possible. No processing in the Kafka brokers themselves; transforms need a consumer or a stream-processing layer such as Kafka Streams.
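
Topics and offsets can be pictured as an append-only list that each consumer reads at its own pace. This is a toy in-memory model, not the Kafka client API; Topic, Consumer and poll are hypothetical names.

```python
class Topic:
    """Toy append-only log standing in for one topic partition."""

    def __init__(self):
        self.log = []

    def publish(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the new message

    def read(self, offset, max_messages):
        return self.log[offset:offset + max_messages]


class Consumer:
    """Pulls messages and remembers its own offset."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self, max_messages=2):
        # The consumer decides how much to take, so it is never overwhelmed.
        batch = self.topic.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch
```

Because every consumer tracks its own offset, a slow consumer does not hold back a fast one, and a new consumer can replay the log from offset 0.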

Flume: push mechanism. Memory & file channels. Memory not durable but fast.

Source & sink. Push to channel which pushes to sinks.

Many consumers: Flume topology: add channels. Memory to file possible. File channels to HDFS or HBase or Cassandra.

Adv: optimized for HDFS. HDFS sink part of same ecosystem. Data processing possible in topology, such as conversion to Parquet columnar format or in-stream transforms.

Flafka: Flume to Kafka

Both can coexist.

Monday, December 7, 2020

CAP & Graph-notes

CA: Ensured with replicas. SQL, Neo4J.

CP: Data is sacrosanct. Even with 1 node down, return an error.

AP: Write to one node is fine, even when a node is down. Quorum in Cassandra increases C.
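
Why quorum increases C: with N replicas, a write acknowledged by W replicas and a read from R replicas must share at least one replica whenever R + W > N. A small sketch; the function names are illustrative, not Cassandra's API.

```python
def quorum(n):
    """Majority of n replicas, e.g. 2 of 3."""
    return n // 2 + 1


def reads_see_latest_write(n, w, r):
    """True when every read set overlaps every write set (R + W > N)."""
    return r + w > n
```

With N = 3, quorum reads and quorum writes (2 + 2 > 3) overlap, while reading and writing a single replica (1 + 1) does not.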

CA NoSQL: Berkeley DB, Redis (in-memory): High performance owing to no schema. Immediately consistent & available: only 1 copy. Async replica & failover.

NoSQL relationships: Usually no.

Graph db: relationships query faster.

NoSQL is fundamentally sharding.

Availability: Want to write? If node not available, error. Such as with CP.

NoSQL relationships?

How to do relationships in NoSQL?

Composite key. Secondary indexes if available.
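
A composite key models a one-to-many relationship by putting the parent in the partition key and the child in the sort key. A minimal in-memory sketch, assuming string sort keys and values; CompositeKeyTable and the "user#"/"order#" key layout are made up for illustration.

```python
import bisect
from collections import defaultdict


class CompositeKeyTable:
    def __init__(self):
        # partition key -> list of (sort_key, value), kept sorted by sort_key
        self.partitions = defaultdict(list)

    def put(self, partition_key, sort_key, value):
        bisect.insort(self.partitions[partition_key], (sort_key, value))

    def query(self, partition_key, sort_prefix=""):
        """All rows in one partition whose sort key starts with sort_prefix."""
        rows = self.partitions.get(partition_key, [])
        start = bisect.bisect_left(rows, (sort_prefix,))
        result = []
        for sort_key, value in rows[start:]:
            if not sort_key.startswith(sort_prefix):
                break
            result.append((sort_key, value))
        return result
```

Fetching "all orders of user 1" is then a single-partition range read rather than a join.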

CA: When to use SQL vs NoSQL?

NoSQL: Berkeley DB, Riak, Redis. When high performance is needed.

Vs CA SQL: when relations & the SQL ecosystem are more important. SQL: drops consistency with read replicas when performance is important.

How to implement a smart graph?

Do a full traversal. DFS/BFS are both ok.

If performance not an issue, maintain data in node for max number of connections to other machines.

Create a HashMap. Key is maximum number of node connections to other machines. Value is the node.

Find the node with the highest number of connections to another machine. Verify the connections. Move the node to the machine where it has the most connections. Update the HashMap for this node. Repeat.

Handle conditions of one machine getting full.

For Supernodes, max # of nodes.

Optimization:

Maintain times of access of node connection. Instead of # of nodes, have weighted average of nodes being accessed on another machine. Move those nodes.
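
The unweighted version of the steps above can be sketched as a greedy loop. This is one possible reading of the notes, not ArangoDB's implementation: it recomputes connection counts each round instead of maintaining the HashMap, and rebalance, placement and capacity are hypothetical names.

```python
from collections import Counter


def rebalance(edges, placement, capacity):
    """Greedily move nodes to the machine holding most of their neighbours.

    edges: iterable of (u, v); placement: dict node -> machine (mutated in place);
    capacity: max nodes per machine. Returns the number of moves made.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    load = Counter(placement.values())
    moves = 0
    while True:
        best = None  # (gain, node, target machine)
        for node, adj in neighbours.items():
            per_machine = Counter(placement[n] for n in adj)
            home = placement[node]
            for machine, count in per_machine.items():
                # gain = cross-machine edges removed by moving the node
                gain = count - per_machine[home]
                if machine != home and gain > 0 and load[machine] < capacity:
                    if best is None or gain > best[0]:
                        best = (gain, node, machine)
        if best is None:
            return moves  # no move reduces cross-machine edges
        _, node, target = best
        load[placement[node]] -= 1
        load[target] += 1
        placement[node] = target
        moves += 1
```

Each move strictly lowers the number of cross-machine edges, so the loop terminates; the capacity check covers the case of one machine getting full.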

Hashing function

1:1 if input range is small.

many:1 for large input range.

i%(size of hash table) for example.

Collisions:

Open hashing:

Chaining: Store as a linked list at the location; lookups are O(n) in the worst case.

Or store as a balanced tree (e.g. B-Tree) at the location for O(log(n)) lookups.
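
A minimal sketch of chaining, using Python lists as the per-slot chains in place of linked lists; ChainedHashTable is an illustrative name.

```python
class ChainedHashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        # many:1 mapping from keys to slots, as in i % (size of hash table)
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing key
                return
        bucket.append((key, value))       # collision: extend the chain

    def get(self, key, default=None):
        for k, v in self._bucket(key):    # linear scan of one chain
            if k == key:
                return v
        return default
```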

Closed hashing (within space allocated):

Linear: Keep increasing index until you find an empty spot. Can cause crowding.

Quadratic: Keep increasing the offset quadratically (1, 4, 9, ...) until you find an empty spot. Reduces crowding.
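
The two probing schemes differ only in the step added to the home slot. A small sketch for integer keys; probe_sequence and insert are illustrative names.

```python
def probe_sequence(key, size, quadratic=False):
    """Yield the slots tried for an integer key in a table of `size` slots."""
    home = key % size
    for attempt in range(size):
        step = attempt * attempt if quadratic else attempt
        yield (home + step) % size


def insert(table, key, quadratic=False):
    """Place key in the first empty slot (None) on its probe sequence."""
    for slot in probe_sequence(key, len(table), quadratic):
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found on the probe sequence")
```

With a table of 7 slots, keys 0, 7 and 14 land in slots 0, 1, 2 under linear probing and 0, 1, 4 under quadratic probing. Quadratic probing is not guaranteed to visit every slot, hence the error path.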

Sunday, December 6, 2020

Learning Software Design & Architecture-notes

https://pjaylives.wordpress.com/2021/04/02/eng-prep/

https://www.youtube.com/channel/UCn1XnDWhsLS5URXTi5wtFTA

https://www.youtube.com/playlist?list=PLGG3jh_5Rqx4wbX6oILwAItHdG7QP-_xq

https://youtu.be/PE4gwstWhmc

https://www.youtube.com/user/tusharroy2525

https://www.educative.io/courses/grokking-the-coding-interview?affiliate_id=5082902844932096&utm_source=CPC&utm_medium=Levels%20FYI&utm_campaign=Grokking%20the%20Coding%20Interview%20US

https://www.algoexpert.io/systems/product

https://www.youtube.com/watch?v=q0KGYwNbf-0

Coursera system design course

https://blog.pragmaticengineer.com/preparing-for-the-systems-design-and-coding-interviews/

YouTube:

Gaurav Sen

Tushar Roy

Huge ecommerce website: show sales rank by category

Medium e-commerce: Use SQL normalized tables. Index by product id & category.

Huge: Move to NoSQL. Composite index (partition key & sort key) by product id & category. Or secondary indexes. Or denormalized.

Among CA, AP & CP, choose AP, e.g. Cassandra or DynamoDB.

For lower cost, use Hadoop with indexes on product id & category.

Design cache for an expensive API

Figure out an API request hash. Cache the response under that hash. When to expire?

If possible, store in memory. CA KV store like Redis. API request hash + time is the key.
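
A minimal in-process sketch of the idea. As a variation on keying by hash + time, it stores an expiry time next to each value; with Redis one would instead set a TTL on the key. request_key and TTLCache are hypothetical names.

```python
import hashlib
import json
import time


def request_key(method, path, params):
    """Stable hash of a request; sorted keys so parameter order does not matter."""
    canonical = json.dumps([method.upper(), path, params], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                      # fresh entry: skip the expensive call
        value = compute()                      # miss or expired: call the API
        self.entries[key] = (now + self.ttl, value)
        return value
```

The expiry question from the notes becomes the choice of ttl_seconds: how stale a response the callers can tolerate.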

Detect duplicates among billions of data

Data can be URLs.

HashMap.

If different query strings yield the same result, comparing URLs is not enough: compare a hash of partial content/full content.

HDFS. URL is key. Indexed field is partial content hash. Similar to Web crawler.

Data can be docs. Again similar to web crawler.
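
A single-machine sketch of content-hash deduplication, assuming everything fits in one HashMap; at billions of items the same lookup would be partitioned by hash across machines. find_duplicates is an illustrative name.

```python
import hashlib


def find_duplicates(pages):
    """pages: iterable of (url, content bytes). Returns {duplicate_url: original_url}."""
    seen = {}        # content hash -> first URL seen with that content
    duplicates = {}
    for url, content in pages:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            duplicates[url] = seen[digest]
        else:
            seen[digest] = url
    return duplicates
```

Two URLs that differ only in query string but return identical bytes map to the same digest, so the second is reported as a duplicate of the first.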

Web crawler

Web pages are many. Purpose? Search, copyright violation, plagiarism?

CA, AP, CP: Need P. AP or CP better. Suggest CP, since background process.

NoSQL will be expensive.

Choose HDFS.

URL is key.

1st set of crawlers: find URL & linked URLs.

2nd set of crawlers: go to URL & download. Before downloading, check whether the URL is already downloaded. Mark the key as "download started" to prevent a different crawler from starting. Store the start time in case the download fails.

Store content of URL (compressed) & hash with the key.

Large web pages: hash a portion of the web page. Strip any ad code through regexes & hash a part of the remaining page. Store. Find duplicates to avoid infinite loops. If duplicate is found, generate a full hash of both docs & store.

Index small & large hash.
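
The small-hash-then-full-hash check can be sketched as below. The ad markers in AD_BLOCK and the 1024-character prefix are assumptions for illustration; real ad stripping would need site-specific rules.

```python
import hashlib
import re

# Hypothetical ad markers; real pages would need their own patterns.
AD_BLOCK = re.compile(r"<!--ad-->.*?<!--/ad-->", re.DOTALL)


def small_hash(html, prefix_chars=1024):
    """Hash only a portion of the ad-stripped page."""
    cleaned = AD_BLOCK.sub("", html)
    return hashlib.sha256(cleaned[:prefix_chars].encode("utf-8")).hexdigest()


def full_hash(html):
    return hashlib.sha256(AD_BLOCK.sub("", html).encode("utf-8")).hexdigest()


class PageIndex:
    def __init__(self):
        self.by_small = {}  # small hash -> list of (url, html)

    def add(self, url, html):
        """Store the page; return the URL of an existing duplicate, or None."""
        candidates = self.by_small.setdefault(small_hash(html), [])
        for other_url, other_html in candidates:
            # Small hashes matched, so confirm with the full hash of both docs.
            if full_hash(other_html) == full_hash(html):
                return other_url
        candidates.append((url, html))
        return None
```

The full hash is only computed when two pages share a small hash, which keeps the common case cheap.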

Social network: shortest path & degrees of separation

2 way BFS search on graph.
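
A minimal sketch of the 2-way BFS on an in-memory adjacency map, assuming an undirected friendship graph; degrees_of_separation is an illustrative name.

```python
def degrees_of_separation(graph, source, target):
    """Shortest path length between two people, or None if unconnected.

    graph: dict person -> set of friends (undirected).
    """
    if source == target:
        return 0
    dist_a, dist_b = {source: 0}, {target: 0}
    frontier_a, frontier_b = {source}, {target}
    while frontier_a and frontier_b:
        # Always expand the smaller frontier to keep the search balanced.
        if len(frontier_a) > len(frontier_b):
            frontier_a, frontier_b = frontier_b, frontier_a
            dist_a, dist_b = dist_b, dist_a
        next_frontier = set()
        for person in frontier_a:
            for friend in graph.get(person, ()):
                if friend in dist_b:
                    # The two searches met: join the two half-paths.
                    return dist_a[person] + 1 + dist_b[friend]
                if friend not in dist_a:
                    dist_a[friend] = dist_a[person] + 1
                    next_frontier.add(friend)
        frontier_a = next_frontier
    return None
```

Searching from both ends explores roughly two balls of half the radius instead of one ball of the full radius, which matters when each person has hundreds of friends.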

How to scale?

Graph DB

Neo4J: Sharding. Network hops between machines costly.

Randomized vs domain based.

Domain: Geographical regions for people. Tags for blogs. Category for products.

ArangoDB: Smart graphs: GC implemented to move edges together on the same machine.

Titan DB: Allows CA (Berkeley DB), AP (Cassandra) & CP (HBase).

Supernodes: split up followers based off some attribute. Add metadata on attributes to filter search.

End of day stock prices

Need stock history? How many stocks? Exchanges across the world? Just stocks or MFs, bonds?

Time of market close: assume query.

How many users? Any scalability requirements?

Any future analysis?

CAP theorem.

CA: normalized SQL with REST API for security.

Or NoSQL with composite key of stock symbol, user id & date. (CA, CP or AP depending on requirements).

Or HDFS with indexed keys as user id & stock symbol.

If needing stocks by category to suggest & want to know what friends are following, graph db linking users and/or stocks.
