Chapter 2 of IoT Infrastructure
IoT applications are often designed for scale. A single enterprise organization can be responsible for thousands, if not millions, of IoT devices. IoT monitoring is a must to ensure reliability at this kind of scale. Without a monitoring solution to provide visibility into IoT device health and availability, outages last longer, and teams are inherently reactive instead of proactive.
Effective IoT monitoring enables teams to reduce mean time to recovery (MTTR), proactively respond to issues before they become outages, and improve overall system observability.
Monitoring IoT infrastructure often requires many more protocols, tools, and techniques than traditional IT infrastructure. IoT devices tend to be embedded systems with limited resources that may require more lightweight protocols and limit what tooling IT can use for monitoring.
In this article, we’ll take a closer look at the different aspects of IoT monitoring and how enterprises can monitor IoT infrastructure more effectively. To help you avoid common mistakes, we’ll also review IoT monitoring best practices.
IoT Monitoring: Executive Summary
What to Monitor
|Thing Monitoring||Includes hardware, firmware, and the application. Examples include voltage levels, software reboots, etc.|
|Network Monitoring||Network metrics like latency, checksum mismatches, etc.|
|Cloud Monitoring||Server and database metrics like CPU utilization, number of active connections, etc.|
|User Interface Monitoring||App metrics like crashes and ANRs (App Not Responding errors)|
How to Monitor
|Set up pipelines||Set up resources to gather, process and store the data required for monitoring|
|Set up alerts||Notify the right set of people as soon as something goes wrong|
|Set up dashboards||Form an aggregate overview of the performance of your system and define directions for the work going ahead|
|Set up automation||Mitigate some damage caused by an error by automatically taking some remedial measures|
IoT Monitoring: What to Monitor
An IoT system consists of these major components:
- The thing - The IoT controller and sensor
- The network - The medium used for transmitting data
- The cloud - The remote server collecting and processing data
- The user interface - The mobile or web app for the end-users
In the sections below, we’ll look at the key metrics you should monitor for each of these components.
IoT monitoring at a glance
The Monitoring of Things
Thing monitoring includes hardware, firmware, and application monitoring. For hardware, you can monitor parameters like voltage, current levels, temperature, and humidity. Monitoring enables you to detect fluctuations and configure alert thresholds. If these levels cross a threshold, you can issue an alert and take corrective action (like replacing batteries).
Another hardware monitoring use case is monitoring the connectivity of individual sensors. For example, you may periodically scan an I2C bus to ensure I2C sensors are active and raise an alarm if the sensor doesn’t respond (which may indicate a physical connection issue). You may also check the range of measurements transmitted by a sensor. If they are far out of range (continuously 0 or a max value), it may indicate a sensor malfunction.
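The range check described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function name `sensor_looks_faulty`, the `stuck_count` window of five readings, and the idea of passing the sensor's valid range as `lo` and `hi` are all assumptions made for the example.

```python
from typing import Sequence


def sensor_looks_faulty(readings: Sequence[float], lo: float, hi: float,
                        stuck_count: int = 5) -> bool:
    """Return True if the most recent readings suggest a sensor malfunction.

    The sensor is flagged when its last `stuck_count` readings are all pinned
    at the lower bound, all pinned at the upper bound, or all outside [lo, hi].
    The window size is a hypothetical default, not a recommended value.
    """
    recent = list(readings[-stuck_count:])
    if len(recent) < stuck_count:
        return False  # not enough data to judge yet
    pinned_low = all(r == lo for r in recent)
    pinned_high = all(r == hi for r in recent)
    out_of_range = all(r < lo or r > hi for r in recent)
    return pinned_low or pinned_high or out_of_range
```

Requiring several consecutive bad readings, rather than one, is a design choice that avoids raising an alarm on a single glitch.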
For IoT firmware monitoring, you may monitor things like the number of reboots (both hardware and software reboots), memory usage, and driver-level error codes. For example, if the firmware is experiencing excessive software reboots, then perhaps there is a bug in the code that may require an Over-The-Air (OTA) update. If the memory usage crosses a threshold, teams may need to disable some lower priority tasks until memory usage is back below the threshold.
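As a rough sketch of the two firmware checks above, the function below maps reboot and memory figures to suggested actions. The `FirmwareStats` fields, the action names, and the default thresholds (3 reboots per day, 85% memory) are hypothetical and would need tuning for a real device.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FirmwareStats:
    software_reboots_24h: int   # software reboots seen in the last 24 hours
    memory_used_pct: float      # current memory usage, 0-100


def firmware_actions(stats: FirmwareStats, max_reboots: int = 3,
                     memory_limit_pct: float = 85.0) -> List[str]:
    """Return suggested actions for the given firmware metrics."""
    actions = []
    if stats.software_reboots_24h > max_reboots:
        # Excessive software reboots may point to a bug needing an OTA update.
        actions.append("flag_for_ota_review")
    if stats.memory_used_pct > memory_limit_pct:
        # Shed load until memory usage drops back below the threshold.
        actions.append("suspend_low_priority_tasks")
    return actions
```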
Depending on your use case, there may be other metrics that should be monitored. For example, consider an application where the device is in sleep mode when idle and switches to active mode on user interaction. The time taken to transition from sleep mode to active mode is an application-specific metric.
A good example of such a device is a smart door lock.
A smart door lock is designed to conserve energy by remaining in sleep mode until it detects user interaction. In sleep mode, the device uses minimal power just to keep its essential functions running. This generally includes waiting for specific signals to wake up, such as a signal from its sensors indicating user interaction.
A smart door lock might enter active mode when a user approaches the door with a registered smartphone. The lock can use Bluetooth Low Energy (BLE) to detect the proximity of the user's smartphone. Once the user's smartphone is detected, the smart lock wakes up and becomes ready for use. After the user has successfully unlocked the door, the device will then go back into sleep mode after a set period of time, thereby conserving its battery power.
This is just one example of how IoT devices can utilize sleep and active modes to optimize power consumption. This technique is particularly important for battery-powered devices or devices where energy conservation is a key concern.
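The sleep-to-active transition time mentioned earlier can be summarized like any other latency metric. The sketch below assumes the device (or a test rig) already reports each wake-up duration in milliseconds. The function name and the choice of median, 95th percentile, and maximum are illustrative assumptions.

```python
import statistics
from typing import Dict, Sequence


def wake_latency_summary(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize sleep-to-active transition times given in milliseconds."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }
```

Tracking a high percentile alongside the median helps surface the occasional slow wake-up that an average would hide.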
IoT Monitoring: Network Monitoring
IoT data flows need to be monitored for errors. For network monitoring, you should track metrics like latency, packet errors, number of connection timeouts, jitter, and retransmissions.
For example, suppose an IoT device uses cellular communication. Each device has a SIM card from a cellular provider. If the devices in a particular region are experiencing poor connectivity and high latency, it may suggest a need to change the network provider for the devices shipped to that region.
Alternatively, a high number of checksum mismatches may indicate that the device is operating in an electrically noisy environment. In a vehicle tracking application, for example, the device may need to be mounted farther away from the engine.
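To make the network metrics concrete, here is a minimal sketch that derives mean latency, jitter, and a checksum mismatch rate from raw samples. Jitter is computed here as the mean absolute difference between consecutive latency samples, which is one simple definition among several. The function and key names are assumptions for illustration.

```python
from typing import Dict, Sequence


def network_summary(latencies_ms: Sequence[float], packets_received: int,
                    checksum_mismatches: int) -> Dict[str, float]:
    """Derive basic network quality metrics from raw samples."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    mean_latency = sum(latencies_ms) / len(latencies_ms)
    # Jitter: mean absolute change between consecutive latency samples.
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    jitter = sum(diffs) / len(diffs) if diffs else 0.0
    mismatch_rate = (checksum_mismatches / packets_received
                     if packets_received else 0.0)
    return {
        "mean_latency_ms": mean_latency,
        "jitter_ms": jitter,
        "checksum_mismatch_rate": mismatch_rate,
    }
```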
IoT Monitoring: Cloud Monitoring
Whenever an application involves cloud connectivity, monitoring the cloud is critical. Cloud downtime can affect multiple — or even all — IoT devices. Important metrics for IoT cloud monitoring include:
- CPU utilization
- Memory utilization
- Input/Output Operations per Second (IOPS)
- Number of active and idle connections
- Number of failed requests (requests returning 4xx or 5xx errors in case of HTTP)
- Number of authentication failures
For example, if there are 1,000 devices in the field but 50,000 active connections and a large percentage of failed requests, it may indicate the server is under a Distributed Denial of Service (DDoS) attack. If there are 1,000 active connections and 500 idle connections, it may indicate that devices are not terminating past connections when initiating new ones. To address the issue, IoT device firmware may need an update. If IOPS are higher than expected, perhaps there are redundant requests to eliminate or optimize.
If, as the number of devices increases, CPU and memory utilization starts crossing comfortable levels, it may suggest that it is time to either shift to higher capacity resources (vertical scaling) or increase the number of resources with the same capacity (horizontal scaling).
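The connection-count heuristics above can be expressed as simple rules. This is only a sketch: the finding labels, the `active_multiplier` of 10, and the `failure_limit` of 20% are hypothetical values chosen to match the examples in the text, and a real system would confirm any finding before acting on it.

```python
from typing import List


def connection_findings(device_count: int, active: int, idle: int,
                        failed_request_ratio: float,
                        active_multiplier: int = 10,
                        failure_limit: float = 0.2) -> List[str]:
    """Flag suspicious connection patterns relative to the device fleet size."""
    findings = []
    # Far more active connections than devices, plus many failed requests.
    if (active > device_count * active_multiplier
            and failed_request_ratio > failure_limit):
        findings.append("possible_ddos")
    # More total connections than devices, with idle ones lingering.
    if idle > 0 and active + idle > device_count:
        findings.append("possible_connection_leak")
    return findings
```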
IoT Monitoring: User Interface Monitoring
User interface monitoring covers end-user application metrics and usage. For example, you may monitor metrics like the number of app crashes, the number of “application not responding” (ANR) errors, and API failures.
If you notice all devices running Android 10 or lower are experiencing a large number of crashes, you may need to test out your app specifically on these Android versions. Similarly, if an API fails after an app update, it may suggest a gap in test coverage.
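Spotting a pattern like the Android example above usually comes down to grouping crash data by a dimension such as OS version. The sketch below assumes each app session is reported as a hypothetical `(os_version, crashed)` pair. Real crash reporting tools expose richer data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def crash_rate_by_os(sessions: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of crashed sessions for each OS version."""
    totals: Counter = Counter()
    crashes: Counter = Counter()
    for os_version, crashed in sessions:
        totals[os_version] += 1
        if crashed:
            crashes[os_version] += 1
    return {version: crashes[version] / totals[version] for version in totals}
```

Comparing rates rather than raw crash counts avoids blaming an OS version simply because it has the most users.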
What IoT Metrics Should You Monitor?
The above sections provide you with direction for selecting monitoring attributes, but the specific metrics you choose ultimately depend on your IoT use case and requirements. To help you understand where to start, here is a list of common IoT metrics:
Common Metrics For IoT Monitoring
|Component||Category||Metrics|
|Thing||Hardware||Connectivity of wired sensors, voltage and current levels, critical temperature levels, unplug detection, peripheral errors like SD card mount failure|
|Thing||Firmware||Count of hardware and software reboots, flash data corruption (through checksums), number of flash/EEPROM writes, OTA failures, rollback to factory settings, memory usage, driver errors|
|Thing||Application||Specific to your application. You should define metrics that directly impact overall user experience. Example: GPS lock percentage in a vehicle tracking application|
|Network||Quality||Latency, signal strength, number of connection timeouts, ratio of delayed packets to real-time packets|
|Network||Environment||Signal-to-noise ratio, number of checksum mismatches in exchanged packets, number of abruptly dropped connections|
|Cloud||Utilization||CPU utilization, memory utilization, IOPS, data in and data out per second (in bytes), packet queue length|
|Cloud||Connectivity||Number of active connections with the compute server, number of active connections with the database, ratio of active to idle connections|
|Cloud||API Health||Number of APIs with 4xx and 5xx responses, API queue length, number of authentication failures|
|User Interface||Mobile App||Number of crashes (grouped by OS, brand, and ROM), number of ANRs, number of API failures, average API response time|
|User Interface||Web App||Number of screen loading timeouts (similar to ANRs), average page load time, utilization and connectivity metrics of the web server hosting the web app|
IoT Monitoring: How to Monitor
Now that we’ve covered the what of IoT monitoring, let’s explore the how.
Setting Up Pipelines
An IoT monitoring system requires data pipelines so alerts can quickly reach the right audience. At a high level, there are three common approaches to setting up IoT monitoring pipelines:
- Continuous - With continuous monitoring, data is continuously sent to your cloud in every data packet.
- Alert-based - With alert-based monitoring, only priority packets are sent to your cloud.
- Poll-based - With poll-based monitoring, the cloud or an agent polls the devices for metrics.
Monitoring API calls can follow a similar strategy. Network and cloud monitoring metrics can be defined and continuously monitored in the cloud.
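The three pipeline approaches differ mainly in when a device transmits monitoring data. A minimal sketch of that decision is shown below. The `Mode` enum and the `should_transmit` function are hypothetical names used only to illustrate the distinction.

```python
from enum import Enum


class Mode(Enum):
    CONTINUOUS = "continuous"
    ALERT_BASED = "alert_based"
    POLL_BASED = "poll_based"


def should_transmit(mode: Mode, is_priority: bool, poll_requested: bool) -> bool:
    """Decide whether the device should send monitoring data right now."""
    if mode is Mode.CONTINUOUS:
        return True               # metrics ride along with every data packet
    if mode is Mode.ALERT_BASED:
        return is_priority        # only priority packets go to the cloud
    return poll_requested         # poll-based: wait for the cloud or an agent
```

Continuous monitoring gives the most visibility at the highest bandwidth and power cost, while alert-based and poll-based monitoring trade some visibility for lower overhead on constrained devices.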
Setting up Alerts
Once teams set up these pipelines, it's essential that the right people are notified when something goes wrong. Start simple: add email and mobile notifications for subject matter experts. Depending on how your team works, you can also configure less conventional alerting channels, like Slack notifications.
Create an escalation matrix for each issue, and program your monitoring system to escalate issues that remain unresolved for longer than a threshold duration.
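An escalation matrix can be as simple as a list of time thresholds and contacts. The sketch below is illustrative only: the roles and the 30- and 120-minute thresholds in `ESCALATION_MATRIX` are hypothetical, and a real matrix would typically vary per issue type.

```python
from typing import List, Sequence, Tuple

# (minutes unresolved, contact to add at that point) - hypothetical values
ESCALATION_MATRIX: List[Tuple[int, str]] = [
    (0, "on_call_engineer"),
    (30, "team_lead"),
    (120, "engineering_manager"),
]


def contacts_to_notify(minutes_unresolved: int,
                       matrix: Sequence[Tuple[int, str]] = ESCALATION_MATRIX
                       ) -> List[str]:
    """Return everyone who should be notified for an issue of this age."""
    return [contact for threshold, contact in matrix
            if minutes_unresolved >= threshold]
```

For example, `contacts_to_notify(45)` returns the on-call engineer and the team lead, because the issue has passed the 30-minute threshold but not the 120-minute one.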
Setting up Dashboards
While alerts help draw attention to the immediate issue, dashboards provide the broader picture and help define the direction for the team to work on.
Setting up Automation
With time, your monitoring system can mature to automate issue response and remediation. For example, when a server goes down, the system can trigger an SMS message or a notification conveying to the users that your team is looking into the issue.
Similarly, if the temperature of a critical component is seen increasing beyond a threshold, the system can be shut down for a short duration. This can protect the component from permanent damage.
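The temperature example above can be sketched as a small state machine. Note one assumption that departs from the text: instead of shutting down for a fixed duration, this version resumes once the temperature falls below a lower `resume_at` threshold, which avoids rapid on/off cycling near the limit. The state names and the thresholds of 85 °C and 70 °C are hypothetical.

```python
def next_state(state: str, temperature_c: float,
               shutdown_at: float = 85.0, resume_at: float = 70.0) -> str:
    """Return the next state ('running' or 'cooling_down') for a component."""
    if state == "running" and temperature_c >= shutdown_at:
        return "cooling_down"     # protect the component from damage
    if state == "cooling_down" and temperature_c <= resume_at:
        return "running"          # cooled enough to resume safely
    return state                  # otherwise stay in the current state
```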
Note that monitoring will help teams understand the symptoms of an issue, but may not highlight the issue’s root cause. For example, a high load on your cloud server is only a symptom. The root cause may be a DDoS attack (requiring changes in security infrastructure) or a fault in your device that causes it to open multiple connections with your cloud (requiring changes in your firmware). What matters is a holistic understanding of the system, spending time understanding the problem before attempting to solve it, and learning from past mistakes.
IoT Monitoring: Best Practices
By following these best practices, teams can avoid common IoT monitoring pitfalls and improve overall system uptime and observability.
IoT Monitoring Best Practice #1: Don’t Confuse Monitoring With Logging And Analytics.
Monitoring, logging, and analytics are all important and all three require data. In simple terms, monitoring is about alerting and capturing data from metrics, logging is about recording state changes and errors, and analytics is about analyzing the data you capture and record.
To understand the differences, let’s consider the example of the battery-operated smart door lock application.
- Monitoring alerts you and the end-user if there is a break-in attempt, if the battery is about to drain, or if the server is down. Monitoring would also give you an overview of how many errors your system encountered, segregated by error codes.
- Logging collects and stores info and debug logs from the system. Examples include firmware events like BLE connection and disconnection, interrupt triggers or state changes; app events like lifecycle changes, OS interactions, user interactions, and so on. These can provide more context to each error detected with monitoring. Logging and monitoring can work together to help you find the root cause of errors.
- Analytics enables you to process data to improve your system and features. For example, you may analyze whether people unlock the door with their fingerprint or with the keypad, and whether you can drop one of these features in a future product.
It is generally a good idea to keep the systems and resources for monitoring, logging, and analytics separate.
IoT Monitoring Best Practice #2: Don’t Over-monitor
Monitoring comes with several costs (processing, storage, and implementation). It is important to determine what to monitor in your system. For example, if your devices are designed for indoor use, you may not need temperature monitoring for your sensors.
IoT Monitoring Best Practice #3: Only Store What’s Relevant
Data retention is an important aspect of an IoT monitoring strategy. If you store monitoring logs for one year but only use the last seven days of data to make decisions, you’re wasting effort and storage.
For better resource utilization, make sure only to store relevant data. If you’re not using most of the data you store today, consider reducing your log retention period or archiving old logs.
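A retention policy like this can be sketched as a function that separates records inside the retention window from those ready to be archived or deleted. The record shape (a dict with a `timestamp` key) and the seven-day default are assumptions for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def split_by_retention(records: List[Dict], now: datetime,
                       retention_days: int = 7) -> Tuple[List[Dict], List[Dict]]:
    """Split records into (keep, archive) based on their timestamp."""
    cutoff = now - timedelta(days=retention_days)
    keep = [r for r in records if r["timestamp"] >= cutoff]
    archive = [r for r in records if r["timestamp"] < cutoff]
    return keep, archive
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the logic easy to test.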
IoT Monitoring Best Practice #4: Categorize Your Alerts
Some alerts are more important than others. Categorize alerts as critical, warning, or info to help operators prioritize their attention during an incident.
Additionally, proper categorization can help reduce alert fatigue and simplify automation and alerting. For example, maybe only a subset of your team needs to get info-level emails.
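One way to act on these categories is to route each severity level to a different set of notification channels. The mapping below is a hypothetical example, not a recommendation. The channel names and the `channels_for` helper are assumptions.

```python
from typing import Dict, List

# Hypothetical routing table: noisier channels are reserved for higher severity.
ROUTES: Dict[str, List[str]] = {
    "critical": ["sms", "email", "chat"],
    "warning": ["email", "chat"],
    "info": ["chat"],
}


def channels_for(severity: str) -> List[str]:
    """Return the notification channels configured for a severity level."""
    try:
        return ROUTES[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Rejecting unknown severities, rather than silently dropping them, makes a mislabeled alert visible instead of lost.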
Monitoring is essential if you wish to build a scalable and reliable system. Each component of your IoT system requires monitoring, and by using the concepts we’ve covered here, you can begin to design your own IoT monitoring strategy.