Chapter 4 of IoT Infrastructure
IoT applications are often designed for scale. A single organization may be responsible for the reliability of hundreds or even thousands of IoT devices. To ensure reliability at this kind of scale, IoT monitoring is a must. Without a monitoring solution to provide visibility into IoT device health and availability, outages last longer and teams are forced to be reactive instead of proactive.
Effective IoT monitoring — even for smaller-scale deployments — enables teams to reduce mean time to recovery (MTTR), proactively respond to issues before they become outages, and improve overall system observability.
However, monitoring IoT infrastructure often requires different protocols, tools, and techniques than traditional IT infrastructure. IoT networks generate a wealth of data from various smart devices. Additionally, IoT devices tend to be embedded systems with limited resources, which may require more lightweight protocols and limit the tooling IT can use for monitoring.
In this article, we’ll take a closer look at what the different aspects of IoT monitoring are and how you can effectively monitor your IoT infrastructure. To help you avoid common mistakes, we’ll also review five IoT monitoring best practices.
IoT Monitoring: Executive Summary
What to Monitor
|Thing Monitoring||Includes hardware, firmware, and the application. Examples include voltage levels, software reboots, etc.|
|Network Monitoring||Network metrics like latency, checksum mismatches, etc.|
|Cloud Monitoring||Server and database metrics like CPU utilization, number of active connections, etc.|
|User Interface Monitoring||App metrics like crashes and ANRs (App Not Responding errors)|
How to Monitor
|Set up pipelines||Set up resources to gather, process and store the data required for monitoring|
|Set up alerts||Notify the right set of people as soon as something goes wrong|
|Set up dashboards||Form an aggregate overview of the performance of your system and define directions for the work going ahead|
|Set up automation||Mitigate some damage caused by an error by automatically taking some remedial measures|
IoT Monitoring: What to Monitor
An IoT system consists of these major components:
- The thing: the IoT controller and sensor
- The network: the medium used for transmitting data
- The cloud: the remote server collecting and processing data
- The user interface: the mobile or web app for the end-users
In the sections below, we’ll look at the key metrics you should monitor for each of these components.
IoT monitoring at a glance
The Monitoring of Things
Thing monitoring includes hardware, firmware, and application monitoring. For hardware, you can monitor parameters like voltage, current levels, temperature, and humidity. Monitoring enables you to detect fluctuations and configure alert thresholds. If these levels cross a threshold, you can issue an alert and take corrective action (like replacing batteries).
Another hardware monitoring use case is monitoring the connectivity of individual sensors. For example, you may periodically scan an I2C bus to ensure I2C sensors are active and raise an alarm if the sensor doesn’t respond (which may indicate a physical connection issue). You may also check the range of measurements transmitted by a sensor. If they are far out of range (continuously 0 or a max value), it may indicate a sensor malfunction.
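The range check described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `sensor_looks_stuck`, the window size of 10 samples, and the 12-bit ADC range in the example are all assumptions chosen for the sketch.

```python
from typing import Sequence


def sensor_looks_stuck(
    readings: Sequence[float],
    min_value: float,
    max_value: float,
    window: int = 10,
) -> bool:
    """Return True if the last `window` readings all sit at the minimum or all at the maximum."""
    if len(readings) < window:
        return False  # not enough samples to judge yet
    recent = readings[-window:]
    all_at_min = all(r <= min_value for r in recent)
    all_at_max = all(r >= max_value for r in recent)
    return all_at_min or all_at_max


# Hypothetical 12-bit ADC channel (0..4095) that has flat-lined at zero.
flat_lined = sensor_looks_stuck([0] * 10, min_value=0, max_value=4095)

# One healthy reading inside the window is enough to clear the flag.
healthy = sensor_looks_stuck([0, 0, 812, 0, 0, 0, 0, 0, 0, 0], 0, 4095)
```

Requiring a full window of identical extreme readings avoids raising an alarm on a single legitimate zero or full-scale sample.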
For IoT firmware monitoring, you may monitor things like the number of reboots (both hardware and software reboots), memory usage, and driver-level error codes. For example, if the firmware is experiencing excessive software reboots, then perhaps there is a bug in the code that you may need to fix using an Over-The-Air (OTA) update. If the memory usage crosses a threshold, you may need to disable some low priority tasks until memory usage is back below the threshold.
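As a rough sketch of how "excessive software reboots" might be detected, the snippet below counts reboots in a sliding time window. The class name `RebootMonitor` and the limits used in the example (3 reboots per hour) are illustrative assumptions; the right values depend on your device.

```python
from collections import deque


class RebootMonitor:
    """Flags a device when it reboots more than `max_reboots` times within `window_seconds`."""

    def __init__(self, max_reboots: int, window_seconds: float):
        self.max_reboots = max_reboots
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_reboot(self, timestamp: float) -> bool:
        """Record a reboot at `timestamp` (seconds); return True if the limit is exceeded."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop reboots that have fallen out of the sliding window.
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_reboots


# Hypothetical limit: more than 3 reboots in one hour is considered excessive.
monitor = RebootMonitor(max_reboots=3, window_seconds=3600)
flags = [monitor.record_reboot(t) for t in (0, 100, 200, 300)]
```

In this example, only the fourth reboot crosses the limit, which is the point at which an alert (and possibly an OTA fix) would be considered.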
Depending on your use case, there may be other metrics you should monitor. For example, consider an application where your device is in sleep mode when idle and switches to active mode on user interaction. The time taken to transition from sleep mode to active mode is an application-specific metric.
IoT Monitoring: Network Monitoring
IoT data flows need to be monitored for errors. For network monitoring, you should track metrics like latency, packet errors, number of connection timeouts, jitter, and retransmissions.
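To make two of these metrics concrete, here is a small sketch that summarizes a list of latency samples. It assumes one simple definition of jitter (the mean absolute change between consecutive latency samples); other definitions exist, and the function name `summarize_latency` is hypothetical.

```python
from statistics import mean


def summarize_latency(latencies_ms):
    """Return mean latency and jitter for a list of latency samples in milliseconds.

    Jitter is computed here as the mean absolute difference between
    consecutive samples, which is an assumption made for this sketch.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    deltas = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return {
        "mean_latency_ms": mean(latencies_ms),
        "jitter_ms": mean(deltas) if deltas else 0.0,
    }


summary = summarize_latency([100, 120, 110, 130])
```

For the four samples above, the mean latency is 115 ms and the consecutive differences are 20, 10, and 20 ms.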
For example, suppose your IoT devices use cellular communication. Each device has a SIM card from a cellular provider (Vodafone, Verizon, T-Mobile, etc.). If you observe that devices in a particular region are experiencing poor connectivity and high latency, it may suggest you need to change the network provider for the devices shipped to that region.
Alternatively, if you observe a high number of checksum mismatches, it may indicate that your device is in a very noisy environment. In a vehicle tracking application, for example, you may consider moving the device away from the engine.
IoT Monitoring: Cloud Monitoring
Whenever an application involves cloud connectivity, monitoring the cloud is critical. Cloud downtime can affect multiple — or even all — of your IoT devices. Important metrics for IoT cloud monitoring include:
- CPU utilization
- Memory utilization
- Input/Output Operations per Second (IOPS)
- Number of active and idle connections
- Number of failed requests (requests returning 4xx or 5xx errors in case of HTTP)
- Number of authentication failures
For example, if you have 1000 devices in the field, and you see 50000 active connections and a large percentage of failed requests, it may indicate that your server is under a Distributed Denial of Service (DDoS) attack. If you see 1000 active connections and 500 idle connections, it may indicate that your devices are not terminating past connections when initiating new ones. To address the issue, you may need to upgrade IoT device firmware. If you see a larger IOPS number than expected, perhaps there are redundant requests that you can eliminate or optimize.
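The reasoning in this example can be expressed as a couple of rough heuristics. The sketch below is only an illustration: the function name `diagnose_connections` and every threshold (`surge_factor`, `failed_threshold`, `idle_fraction`) are assumptions that you would need to tune for your own fleet.

```python
def diagnose_connections(
    device_count,
    active,
    idle,
    failed_ratio,
    surge_factor=10.0,      # assumed: active connections far above fleet size
    failed_threshold=0.3,   # assumed: share of failed requests considered "large"
    idle_fraction=0.25,     # assumed: idle connections relative to fleet size
):
    """Return a list of hypotheses suggested by connection metrics."""
    findings = []
    if active > surge_factor * device_count and failed_ratio > failed_threshold:
        findings.append("possible_ddos")
    elif idle > idle_fraction * device_count:
        findings.append("possible_connection_leak")
    return findings


# The two scenarios from the text, with 1000 devices in the field.
ddos_case = diagnose_connections(1000, active=50000, idle=0, failed_ratio=0.6)
leak_case = diagnose_connections(1000, active=1000, idle=500, failed_ratio=0.01)
```

These checks only surface hypotheses. Confirming a DDoS attack or a firmware connection leak still requires looking at logs and traffic sources.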
If, as the number of devices increases, your CPU utilization and memory utilization start crossing comfortable levels, it may suggest that it is time to either shift to higher capacity resources (vertical scaling) or increase the number of resources with the same capacity (horizontal scaling).
IoT Monitoring: User Interface Monitoring
User interface monitoring covers end-user application metrics and usage. For example, you may monitor metrics like the number of app crashes, the number of “application not responding” (ANR) errors, and API failures.
If you notice all devices running Android 10 or lower are experiencing a large number of crashes, you may need to test out your app specifically on these Android versions. Similarly, if an API fails after an app update, it may suggest a gap in test coverage.
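Spotting a pattern like "crashes are concentrated on one OS version" requires grouping crash data. A minimal sketch, assuming each app session is available as a hypothetical `(os_version, crashed)` tuple:

```python
from collections import Counter


def crash_rate_by_os(sessions):
    """Return {os_version: crash rate} from an iterable of (os_version, crashed) tuples."""
    totals = Counter()
    crashes = Counter()
    for os_version, crashed in sessions:
        totals[os_version] += 1
        if crashed:
            crashes[os_version] += 1
    return {version: crashes[version] / totals[version] for version in totals}


# Made-up sample data for illustration only.
sessions = [
    ("Android 10", True),
    ("Android 10", True),
    ("Android 10", False),
    ("Android 13", False),
    ("Android 13", False),
]
rates = crash_rate_by_os(sessions)
```

Comparing rates rather than raw crash counts keeps a popular OS version from looking worse simply because it has more sessions.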
What IoT Metrics Should You Monitor?
The above sections provide you with direction for selecting monitoring attributes, but the specific metrics you choose ultimately depend on your IoT use case and requirements. To help you understand where to start, here is a list of common IoT metrics:
Common Metrics For IoT Monitoring
|Component||Category||Metrics|
|Thing||Hardware||Connectivity of wired sensors, voltage and current levels, critical temperature levels, unplug detection, peripheral errors like SD card mount failure|
|Thing||Firmware||Count of hardware and software reboots, flash data corruption (through checksums), number of flash/EEPROM writes, OTA failures, rollback to factory settings, memory usage, driver errors|
|Thing||Application||Specific to your application. You should define metrics that directly impact overall user experience. Example: GPS lock percentage in a vehicle tracking application|
|Network||Quality||Latency, signal strength, number of connection timeouts, ratio of delayed packets to real-time packets|
|Network||Environment||Signal-to-noise ratio, number of checksum mismatches in exchanged packets, number of abruptly dropped connections|
|Cloud||Utilization||CPU utilization, memory utilization, IOPS, data in and data out per second (in bytes), packet queue length|
|Cloud||Connectivity||Number of active connections with the compute server, number of active connections with the database, ratio of active to idle connections|
|Cloud||API Health||Number of APIs with 4xx and 5xx responses, API queue length, number of authentication failures|
|User Interface||Mobile App||Number of crashes (grouped by OS, brand, and ROM), number of ANRs, number of API failures, average API response time|
|User Interface||Web App||Number of screen loading timeouts (similar to ANRs), average page load time, utilization and connectivity metrics of the web server hosting the web app|
IoT Monitoring: How to Monitor
Now that we’ve covered the what of IoT monitoring, let’s explore the how.
Set Up Pipelines
An IoT monitoring system requires data pipelines so alerts can quickly reach the right audience. At a high level, there are 3 common approaches to setting up IoT monitoring pipelines:
- Continuous: With continuous monitoring, data is continuously sent to your cloud in every data packet.
- Alert-based: With alert-based monitoring, only priority packets are sent to your cloud.
- Poll-based: With poll-based monitoring, the cloud or an agent polls the devices for metrics.
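The difference between the three approaches comes down to when a device transmits monitoring data. The sketch below captures that decision on the device side. The names `Mode` and `should_transmit` are hypothetical, and a real device would also need to handle buffering and retries.

```python
from enum import Enum


class Mode(Enum):
    CONTINUOUS = "continuous"
    ALERT_BASED = "alert_based"
    POLL_BASED = "poll_based"


def should_transmit(mode: Mode, is_priority: bool, poll_requested: bool) -> bool:
    """Decide whether a monitoring packet should be sent right now."""
    if mode is Mode.CONTINUOUS:
        return True            # every packet carries monitoring data
    if mode is Mode.ALERT_BASED:
        return is_priority     # only priority packets are sent
    return poll_requested      # poll-based: send only when the cloud or an agent asks


# A routine (non-priority) reading with no pending poll request:
routine = {mode: should_transmit(mode, is_priority=False, poll_requested=False) for mode in Mode}
```

For a routine reading, only the continuous mode transmits, which illustrates the bandwidth and power trade-off between the approaches.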
Monitoring API calls can follow a similar strategy. Network and cloud monitoring metrics can be defined and continuously monitored in the cloud. Several cloud service providers have their own services for doing this. For example, AWS has CloudWatch.
You can provision a separate set of resources (virtual machines and databases) in the cloud for collecting, processing, and storing the data from these pipelines. For apps, you can establish pipelines from the app backend (for example, Google Firebase or AWS Amplify) to your monitoring resources.
Set Up Alerts
Once pipelines are set up, you can move on to informing the right people when something goes wrong. Start with simple tasks (like setting up basic CloudWatch alarms, if you are using AWS). Then, add email alerts and mobile notifications for subject matter experts. Depending on how your team works, you can even configure less conventional alerting channels, like Slack notifications.
Make sure to define an escalation matrix for each issue, and program your monitoring system to escalate issues whenever they remain unsolved for more than a threshold duration.
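An escalation matrix can be as simple as a list of time thresholds and contacts. The following sketch is illustrative only: the roles, the time thresholds, and the function name `current_escalation_contact` are assumptions, not a recommended policy.

```python
# Hypothetical escalation matrix: (minutes unresolved, who should be notified).
# Entries are assumed to be sorted by ascending threshold.
ESCALATION_MATRIX = [
    (0, "on-call engineer"),
    (30, "team lead"),
    (120, "engineering manager"),
]


def current_escalation_contact(minutes_unresolved, matrix=ESCALATION_MATRIX):
    """Return the highest-level contact whose threshold has been reached."""
    contact = matrix[0][1]
    for threshold, who in matrix:
        if minutes_unresolved >= threshold:
            contact = who
    return contact


# An issue that has been open for 45 minutes has passed the 30-minute threshold.
contact_after_45_min = current_escalation_contact(45)
```

A monitoring system could evaluate this on a timer for each open issue and notify the returned contact whenever it changes.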
Set Up Dashboards
While alerts draw attention to the immediate issue, dashboards provide the broader picture and help set the direction of the team's work. Again, you can start with simpler tasks, like using the standard dashboard provided by your cloud service provider. Later on, you can move to tools like Metabase or Grafana for more sophisticated and customizable dashboards.
A sample CloudWatch dashboard
Set Up Automation
With time, your monitoring system can mature to automate issue response and remediation. For example, when a server goes down, the system can trigger an SMS message or a notification conveying to the users that your team is looking into the issue.
Similarly, if the temperature of a critical component is seen increasing beyond a threshold, the system can be shut down for a short duration. This can protect the component from permanent damage.
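The temperature example can be sketched as a small guard that powers a component down for a fixed cooldown period after it crosses a limit. The class name `ThermalGuard`, the 85 °C limit, and the 60-second cooldown are assumptions made for illustration.

```python
class ThermalGuard:
    """Keeps a component off for `cooldown_s` seconds after it exceeds `limit_c`."""

    def __init__(self, limit_c: float, cooldown_s: float):
        self.limit_c = limit_c
        self.cooldown_s = cooldown_s
        self._resume_at = None  # time at which the component may be powered again

    def update(self, temperature_c: float, now_s: float) -> bool:
        """Return True if the component may run, False if it should be shut down."""
        if self._resume_at is not None and now_s < self._resume_at:
            return False  # still inside the cooldown period
        self._resume_at = None
        if temperature_c > self.limit_c:
            self._resume_at = now_s + self.cooldown_s
            return False
        return True


# Hypothetical limits: shut down above 85 °C for 60 seconds.
guard = ThermalGuard(limit_c=85.0, cooldown_s=60.0)
states = [
    guard.update(70.0, now_s=0),    # normal operation
    guard.update(90.0, now_s=10),   # over the limit: shut down
    guard.update(60.0, now_s=30),   # cooled, but cooldown has not elapsed
    guard.update(60.0, now_s=70),   # cooldown elapsed: resume
]
```

Holding the component off for the full cooldown, even if the temperature drops quickly, avoids rapid on/off cycling around the threshold.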
Note that monitoring will help you understand the symptoms of an issue, but may not highlight the issue’s root cause. For example, a high load on your cloud server is only a symptom. The root cause may be a DDoS attack (requiring changes in security infrastructure) or a fault in your device that causes it to open multiple connections with your cloud (requiring changes in your firmware). What matters is a holistic understanding of the system, spending time to understand the problem before attempting to solve it, and learning from past mistakes.
IoT Monitoring: Best Practices
Now that you know the what and the how of IoT monitoring, let’s take a look at some best practices. By following these best practices you can avoid common IoT monitoring pitfalls and improve overall system uptime and observability.
IoT Monitoring Best Practice #1: Don’t Confuse Monitoring With Logging And Analytics
Monitoring, logging, and analytics are all important and all three require data. In simple terms, monitoring is about alerting and capturing data from metrics, logging is about recording state changes and errors, and analytics is about analyzing the data you capture and record.
To understand the differences, let’s consider an example of a battery-operated smart door lock application.
- Monitoring alerts you and the end-user if there is a break-in attempt, if the battery is about to drain, or if the server is down. Monitoring would also give you an overview of how many errors your system encountered, segregated by error codes.
- Logging collects and stores info and debug logs from the system. Examples include firmware events like BLE connection and disconnection, interrupt triggers or state changes; app events like lifecycle changes, OS interactions, user interactions, and so on. These can provide more context to each error detected with monitoring. Logging and monitoring can work together to help you find the root cause of errors.
- Analytics enables you to process data to improve your system and features. For example, you may analyze whether people unlock the door with fingerprints or with the keypad, and whether you can drop one of these features in a future product.
It is generally a good idea to keep the systems and resources separate for monitoring, logging, and analytics.
IoT Monitoring Best Practice #2: Don’t Over-monitor
Monitoring comes with several costs (processing, storage, and implementation). It is important to determine what to monitor in your system. For example, if your devices are designed for indoor use, you may not need temperature monitoring for your sensors.
IoT Monitoring Best Practice #3: Only Store What’s Relevant
Data retention is an important aspect of an IoT monitoring strategy. If you store monitoring logs for a year but only use the last 7 days’ data to make decisions, you’re wasting effort and storage.
To reduce storage costs and resource utilization, make sure to only store relevant data. If you’re not using most of the data you store today, consider reducing your log retention period or archiving old logs.
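A retention policy can be enforced with a simple pruning step. This is a minimal in-memory sketch, assuming records are hypothetical `(timestamp, payload)` tuples and a 7-day retention period; a production system would more likely rely on the retention settings of its database or log store.

```python
from datetime import datetime, timedelta


def prune_old_records(records, now, retention=timedelta(days=7)):
    """Keep only the (timestamp, payload) records that fall within the retention window."""
    cutoff = now - retention
    return [(ts, payload) for ts, payload in records if ts >= cutoff]


# Made-up records for illustration.
now = datetime(2024, 1, 10)
records = [
    (datetime(2024, 1, 1), "old"),
    (datetime(2024, 1, 5), "recent"),
    (datetime(2024, 1, 9), "new"),
]
kept = prune_old_records(records, now)
```

Here the cutoff is January 3, so only the last two records are kept. The pruned records could be archived to cheaper storage instead of being discarded.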
IoT Monitoring Best Practice #4: Categorize Your Alerts
Some alerts are more important than others. Categorize alarms as critical, warning, and info levels to help operators prioritize their attention during an incident.
Additionally, proper categorization can help reduce alert fatigue and simplify automation and alerting. For example, maybe only a subset of your team needs to get info-level emails.
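One way to put categorization into practice is to map each severity level to its own set of notification channels. In the sketch below, the channel names (such as `slack:#iot-alerts`) and the routing choices are hypothetical placeholders.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical routing table: which channels receive each severity level.
ROUTES = {
    Severity.INFO: ["email:monitoring-digest"],
    Severity.WARNING: ["email:team", "slack:#iot-alerts"],
    Severity.CRITICAL: ["email:team", "slack:#iot-alerts", "sms:on-call"],
}


def route_alert(severity: Severity):
    """Return the list of channels that should receive an alert of this severity."""
    return ROUTES[severity]


def prioritize(alerts):
    """Sort (severity, message) alerts so the most severe come first."""
    return sorted(alerts, key=lambda alert: alert[0], reverse=True)


# Made-up alerts for illustration.
alerts = [
    (Severity.INFO, "firmware update available"),
    (Severity.CRITICAL, "server unreachable"),
    (Severity.WARNING, "battery below 20%"),
]
ordered = prioritize(alerts)
```

Keeping the routing table separate from the alert logic makes it easy to change who receives info-level messages without touching the code that raises alerts.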
IoT Monitoring Best Practice #5: Don’t Reinvent The Wheel
At this point, it is easy to get overwhelmed thinking about setting up a monitoring and alerting system. But there are open source tools you can use to get the job done.
For example, Metabase is a tool that can be used for both analytics and monitoring. It has built-in features for dashboarding and sending email alerts. Your cloud service provider (AWS, Azure, etc.) also comes with alert triggers. Similarly, Google Play Store and Apple App Store provide crash and ANR reports for mobile devices.
Monitoring is essential if you wish to build a scalable and reliable system. Each component of your IoT system requires monitoring, and by using the concepts we’ve covered here, you can begin to design your own IoT monitoring strategy.