Speaker: FILL THIS IN LATER
(came into this one 5 minutes late, but)
-
Data normalization at edges
- pass stuff in as either class objects or dictionaries
- if you know all the keys in the dictionary, IT SHOULD BE A CLASS
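A minimal sketch of the "known keys means it should be a class" point, assuming a hypothetical User record with made-up fields (user_id, email); none of these names come from the talk. The raw dictionary is converted and validated once, at the edge, so the rest of the code only sees a typed object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    # Hypothetical record; the fields are illustrative only.
    user_id: int
    email: str

    @classmethod
    def from_dict(cls, raw: dict) -> "User":
        # Normalize once at the edge. A missing key raises KeyError and a
        # non-numeric id raises ValueError, so bad input fails here rather
        # than deep inside the application.
        return cls(
            user_id=int(raw["id"]),
            email=str(raw["email"]).strip().lower(),
        )


user = User.from_dict({"id": "42", "email": " Ada@Example.com "})
```

After this point, code can rely on user.user_id being an int and never has to guess which keys are present.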
-
Expect failure. How to plan to operate gracefully with failure?
- SILENT FAILURES ARE BAD. You need to have monitoring.
- Two main types of failures:
- Local (raises exceptions. Nice!)
- Remote (timeouts)
- Circuit breaker pattern:
- A local proxy b/w your local system and the remote
- closed --> open: the connection trips open on failure, then periodic 'test' calls check whether the remote is back up yet
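A rough sketch of the circuit breaker described above, not what the speaker showed. The class name, the max_failures and reset_after parameters, and the injectable clock are all assumptions made for illustration. The breaker sits between local code and the remote call: it fails fast while open, and once reset_after seconds have passed it lets one test call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the remote."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a test call
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("remote presumed down; failing fast")
            # Reset window has elapsed: fall through and allow one test call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage would look like breaker.call(fetch_remote, url), where fetch_remote is whatever function actually talks to the remote. A production version would also need locking for concurrent callers, which this sketch leaves out.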
-
Redundancy
Have 2 of everything, unless it's an important system. Then, have 3.
Watch for hidden single points of failure, e.g. everything hitting the same DNS provider, or putting all your data in the same availability zone/server farm/etc.
BACKUPS! You also MUST test your backups.
-
Docs
Write down how you do stuff, so other people can do it, too.
Also: emergency runbooks/checklists are things you should have.
-
Deal with it (when all else fails)
- Don't make it any worse (cascading failures)
- e.g. retries: use exponential backoff with jitter (don't DDoS the endpoint)
- combinatorial request explosion, which is a real threat w/ microservices systems w/ many layers of dependencies
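A small sketch of retries with exponential backoff and jitter, as mentioned above. The function names, the base and cap values, and the choice of "full jitter" (a uniformly random delay up to the exponential ceiling) are assumptions for illustration; the talk did not specify a variant.

```python
import random
import time


def backoff_delays(retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one jittered delay per retry.

    The ceiling doubles each attempt (base * 2**attempt) up to cap, and the
    actual delay is a random fraction of that ceiling, so many clients that
    failed together do not all retry at the same instant.
    """
    for attempt in range(retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(func, retries=5, sleep=time.sleep, **backoff):
    """Call func; after each failure, wait a jittered delay and try again."""
    for delay in backoff_delays(retries, **backoff):
        try:
            return func()
        except Exception:
            sleep(delay)
    return func()  # final attempt: let the exception propagate to the caller
```

Capping both the number of retries and the delay matters here: unbounded retries at every layer are exactly how one failing dependency turns into the request explosion noted above.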
-
Don't swallow errors
-
Errors should never pass silently.
-
A good pattern, if you have app-specific exceptions: (roughly, I didn't quite get it all)
try:
    foo...
except Exception as e:
    raise AppException() from e
- Don't try too hard to muddle on in the face of failure.
sys.exit(1) is a fine response to a failure: fail fast & loudly.
- Recovery: make it fast. MTTR = mean time to recovery.
Keep grubby human hands completely OFF of service restorations.
- Build fault-tolerant systems
- Turn off your phone.
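To tie together the "raise AppException() from e" pattern and the "sys.exit(1) is fine" advice from earlier in these notes, here is a hedged sketch. load_config and main are hypothetical names invented for illustration; only AppException and sys.exit(1) appear in the notes.

```python
import logging
import sys


class AppException(Exception):
    """App-specific error, wrapping whatever low-level error caused it."""


def load_config(path):
    # Hypothetical step that can fail; stands in for any risky operation.
    try:
        with open(path) as fh:
            return fh.read()
    except OSError as e:
        # "from e" keeps the original error attached as __cause__, so the
        # traceback shows both the app-level and the low-level failure.
        raise AppException(f"cannot load config {path!r}") from e


def main(path):
    try:
        load_config(path)
    except AppException:
        # Loud: log the full chained traceback instead of swallowing it.
        logging.exception("fatal error, exiting")
        # Fast: exit non-zero so a supervisor or monitor notices and restarts.
        sys.exit(1)
```

The point of exiting rather than muddling on is that automated restarts and monitoring can react to a non-zero exit status, whereas a half-working process tends to fail silently.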