Enhanced user experience is one of the top goals for every SaaS provider. How do you ensure that your customers are experiencing the value you intended your applications to deliver?
“Customer experience is the next competitive battleground.”
– Jerry Gregoire (the founder and Chairman of Redbird Flight Simulations, Inc.)
Having real-time, non-real-time and partially real-time metrics at your fingertips may help you to stay ahead of expected and unexpected challenges regarding your service performance, business, and operations.
Why do startups need to monitor their well-tested applications?
Earlier, the software was delivered (or “shipped”) to the customers through a CD and that was the extent of it. The job of the SaaS Provider was completed. However, since the advent of the SaaS and the cloud services, the software developers have visual access to more information about the delivered application, for instance, how customers are using it, which feature is popular with the user, which feature is unpopular among the users, and much more.
This allows the SaaS providers to leave the hardware mindset behind, they no longer have to ‘build it to never break’. Instead, the application monitoring allows SaaS providers to make customer satisfaction an iterative process for a product. Why? Because application monitoring allows you to not only track the performance of your product in the customer environment but also the oversight of operations and some valuable business insights that you can leverage to enhance your revenues.
What are the different categories of metrics used to monitor such applications?
Broadly, the application monitoring metrics are divided into three categories.
The first and most important one is operational metrics. Generally, it covers the health of your services, all the related underlying microservices, and the interaction of the service in its environment. Operational metrics are usually real-time metrics that notify SaaS providers as soon as an application or any feature in the application ceases to function.
The second category of metrics is business metrics which determine if your service is offering the value that you designed it to offer. Business metrics can be real-time metrics but they usually are not. They are measured over weeks and months to identify a trend.
The third type of metric is performance metrics. An application may be running and providing the value it was meant to, but it may be taking a long time to load, or TCP handshake is taking longer than expected, such measurements are defined by the performance metrics of the application. Performance metrics are semi-real-time metrics.
What are some good examples of operational metrics?
The basic metrics that SaaS providers can use for monitoring applications like operationality, CPU memory, or I/O usage are offered by the cloud for free. These are also metrics that don’t require any instrumentation in your code.
There are instances where metrics are measured through the use of instrumentation installed in your application. If the code crashes, these instrumentations will generate an alert for the undesired event across multiple microservices.
Metrics that are threshold-based and can generate a notification and/or when the threshold is breached are also used in operations depending on your services.
Operational metrics are layered in that order that you start small and add sophisticated metrics over that foundation.
What are some top-tier performance metrics for application monitoring?
Page load time is a good example of how you measure the performance of your service. Similarly the time it takes the customer to submit a form hosted on your website is another. The time it takes for my ride-hailing app to enter an address and then the time difference to when the cab driver receives the notification are all performance measurements that vary depending on your service.
For more sophisticated monitoring, metrics such as upon notification of an error, observing performance at a more granular level like TCP handshake time and SSL time, etc. are incorporated and you gradually build on top of simpler metrics.
Does microservices architecture pose more complexity compared to a monolithic architecture?
Microservices may be harder to build but are easier to monitor because the metric that is experiencing error will have the issue limited to that microservice alone.
Monolith applications are easier to build but are harder to monitor, simply because if something is rendered defective, debugging and addressing the problem can be very time-consuming and will affect every aspect of your business.
What are your recommendations regarding an alert system?
The Cloudwatch monitoring and observability solution for application and infrastructure by Amazon is a very powerful system. There are multiple alerting tools like email notifications, SNS, and SQL notifications that are built into it. It also offers multiple ticketing as well as communication system integrations, on-call tools, PagerDuty, and much more.
How do monitoring solutions help operations teams with correlation, triage, and root cause diagnosis?
As a SaaS service provider, you ought to know about a problem in your service or application before or when the customer complains. One application is supported by multiple microservices and identification of the performance issue and knowing where to find that issue before you can resolve it, is only possible through metrics for monitoring applications.
Once you have identified the problem and its source of a performance bottleneck, you can start fixing that problem to bring the application back to its original functional state and that’s how triage, correlation, and root cause diagnosis help you with identification, mediating the problem, and keeping your service functional.
Are there monitoring solutions where you are proactively looking and might foresee an event?
As a SaaS-based startup, your software stack is missing a key monitoring aspect if your system was unable to alert you hours ahead of your customer. Your customer informing you about an error should be the last resort. Usually, this last resort is seen in instances where SaaS providers missed or misjudged some metric and deemed it irrelevant once it broke in the customer environment and you weren’t notified. SaaS providers need to carefully choose their monitoring deck to avoid this oversight and monitoring needs to be done proactively rather than reactively.
How much should the cost be for any application monitoring solution?
Cloudwatch, on-call rotation systems, and other metrics are not super expensive but can get expensive as your application scales and requires more and more components. The human resources may be expensive, however, the actual dollar cost of setting up monitoring systems is generally pretty low.
How do we monitor API endpoints?
For APIs, I always suggest canaries, simply because it is the easiest, fastest, and painless way to monitor your API endpoints. Every public-facing API should have multiple canaries that are continually testing it. There are numerous tools available, in the market, by cloud providers and third-party providers for testing API endpoints, in terms of security testing, function, and performance testing.
Cloudwatch uses the concept of synthetics that allows you to set up a synthetic stack in another region from where you can continuously test your public API endpoints.
How does a startup founder go about setting up a monitoring system and which tools are the best ones for it?
As a startup founder, it is easy to fall for the idea that more data means more results. I would caution against it and instead suggest starting small. For instance, start with CPU memory, focus on mechanisms and processes, and make a trial of when things might break. Install solutions for those points. Start with a few metrics preferably that can fill in one single screen and then focus on your mechanisms, your backend system, your paging, and the culture of operational excellence rather than the metrics themselves. Once you have achieved that successfully without oversights, evolve from there.
What is the difference between agent-based and agentless monitoring?
In agentless monitoring, an agent is not required to be installed in the system for you to be able to monitor it. Systems that already exist emit those metrics automatically. Agent-based monitoring requires an agent to be installed when you integrate with an SDK and that agent collects all the desired metrics and emits them to your application backend. I recommend agentless monitoring but it can only take you so far. The agent-based approach provides much deeper insights. Take application performance monitoring as an example. Here, agent-based monitoring can allow you to passively monitor dom objects and measure the page load time via a short script that allows you to observe which part of the TCP handshake is taking longer, and so forth.
This depends on how deep you want to observe the operations or performance and also on the feasibility of inserting an agent-based solution.
What good vs. bad monitoring practices do you see in the market and what does the future look like for Monitoring Solutions?
For now, there are two strategies in the market: the new way of shipping software with monitoring and observability services vs. the old way of shipping systems that never break. For me, technology is easy, the people are the hardest part of the puzzle. But I have noticed that startups are stuck in legacy practices of not having visibility to their customer environment and their service performance.
And as far as the future is concerned, the practice of monitoring your application is the future, where you can measure performance metrics, supervise the operational metrics of your application and observe business metrics for what features are providing value and which are not. All these metrics provide you with significant data to make agile decisions for your startup. All successful companies today, as well as in the future, will have these metrics and will be able to look each other in the eye and say we made a mistake and we’re going to change it because the data tells us otherwise. I think such courageous teams are winning today and they will continue to win in the future as well.
To Watch the Complete Episode, Please Click on This YouTube Link: https://www.youtube.com/watch?v=jXJsaHr3lAM