Product Review

With this practical book, you’ll discover how to catch complications in your distributed system before they develop into costly problems. Based on his extensive experience in systems ops at large technology companies, author Slawek Ligus describes an effective data-driven approach for monitoring and alerting that enables you to maintain high availability and deliver a high quality of service.

Learn methods for measuring state changes and data flow in your system, and set up alerts to help you recover quickly from problems when they do arise. If you’re a system operator waging the daily battle to provide the best performance at the lowest cost, this book is for you.

  • Monitor every component of your application stack, from the network to user experience
  • Learn how to draw the right conclusions from the metrics you obtain
  • Develop a robust alerting system that can identify problematic anomalies—without raising false alarms
  • Address system failures by their impact on resource utilization and user experience
  • Plan an alerting configuration that scales with your expanding network
  • Learn how to choose appropriate maintenance times automatically
  • Develop a work environment that fosters flexibility and adaptability



Similar Products

  • Practical Monitoring: Effective Strategies for the Real World
  • Site Reliability Engineering: How Google Runs Production Systems
  • Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services
  • Kubernetes: Up and Running: Dive into the Future of Infrastructure
  • The Site Reliability Workbook: Practical Ways to Implement SRE
  • Database Reliability Engineering: Designing and Operating Resilient Database Systems
  • The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations
  • Prometheus: Up & Running: Infrastructure and Application Performance Monitoring
  • Clean Architecture: A Craftsman's Guide to Software Structure and Design (Robert C. Martin Series)