How PostNL processes billions of IoT occasions with Amazon Managed Service for Apache Flink

This publish is co-written with Çağrı Çakır and Özge Kavalcı from PostNL.

PostNL is the designated common postal service supplier for the Netherlands and has three major enterprise items providing postal supply, parcel supply, and logistics options for ecommerce and cross-border options. With 5,800 retail factors, 11,000 mailboxes, and over 900 automated parcel lockers, the corporate performs an vital function within the logistics worth chain. It goals to be the supply group of alternative by making it as simple as potential to ship and obtain parcels and mail. With virtually 34,000 staff, PostNL is on the coronary heart of society. On a typical weekday, the corporate delivers a mean of 1.1 million parcels and 6.9 million letters throughout Belgium, Netherlands, and Luxemburg.

On this publish, we describe the legacy PostNL stream processing resolution, its challenges, and why PostNL selected Amazon Managed Service for Apache Flink to assist modernize their Web of Issues (IoT) knowledge stream processing platform. We offer a reference structure, describe the steps we took emigrate to Apache Flink, and the teachings realized alongside the way in which.

With this migration, PostNL has been in a position to construct a scalable, sturdy, and extendable stream processing resolution for his or her IoT platform. Apache Flink is an ideal match for IoT. Scaling horizontally, it permits processing the sheer quantity of knowledge generated by IoT gadgets. With occasion time semantics, you may appropriately deal with occasions within the order they have been generated, even from often disconnected gadgets.

PostNL is happy concerning the potential of Apache Flink, and now plans to make use of Managed Service for Apache Flink with different streaming use instances and shift extra enterprise logic upstream into Apache Flink.

Apache Flink and Managed Service for Apache Flink

Apache Flink is a distributed computation framework that enables for stateful real-time knowledge processing. It offers a single set of APIs for constructing batch and streaming jobs, making it simple for builders to work with bounded and unbounded knowledge. Managed Service for Apache Flink is an AWS service that gives a serverless, absolutely managed infrastructure for operating Apache Flink purposes. Builders can construct extremely out there, fault-tolerant, and scalable Apache Flink purposes with ease and with no need to change into an professional in constructing, configuring, and sustaining Apache Flink clusters on AWS.

The problem of real-time IoT knowledge at scale

Immediately, PostNL’s IoT platform, Curler Cages resolution, tracks greater than 380,000 belongings with Bluetooth Low Vitality (BLE) expertise in close to actual time. The IoT platform was designed to supply availability, geofencing, and backside state occasions of every asset through the use of telemetry sensor knowledge corresponding to GPS factors and accelerometers which can be coming from Bluetooth gadgets. These occasions are utilized by completely different inside customers to make logistical operations simple to plan, extra environment friendly, and sustainable.

How PostNL processes billions of IoT occasions with Amazon Managed Service for Apache Flink

Monitoring this excessive quantity of belongings emitting completely different sensor readings inevitably creates billions of uncooked IoT occasions for the IoT platform in addition to for the downstream techniques. Dealing with this load repeatedly each inside the IoT platform and all through the downstream techniques was neither cost-efficient nor simple to keep up. To scale back the cardinality of occasions, the IoT platform makes use of stream processing to mixture knowledge over fastened time home windows. These aggregations should be primarily based on the second when the system emitted the occasion. This kind of aggregation primarily based on occasion time turns into complicated when messages could also be delayed and arrive out of order, which can continuously occur with IoT gadgets that may get disconnected briefly.

The next diagram illustrates the general circulate from edge to the downstream techniques.

PostNL IoT workflow

The workflow consists of the next parts:

The sting structure consists of IoT BLE gadgets that function sources of telemetry knowledge, and gateway gadgets that join these IoT gadgets to the IoT platform.
Inlets comprise a set of AWS companies corresponding to AWS IoT Core and Amazon API Gateway to gather IoT detections utilizing MQTTS or HTTPS and ship them to the supply knowledge stream utilizing Amazon Kinesis Knowledge Streams.
The aggregation software filters IoT detections, aggregates them for a set time window, and sinks aggregations to the vacation spot knowledge stream.
Occasion producers are the mix of various stateful companies that generate IoT occasions corresponding to geofencing, availability, backside state, and in-transit.
Retailers, together with companies corresponding to Amazon EventBridge, Amazon Knowledge Firehose, and Kinesis Knowledge Streams, ship produced occasions to customers.
Shoppers, that are inside groups, interpret IoT occasions and construct enterprise logic primarily based on them.

The core part of this structure is the aggregation software. This part was initially carried out utilizing a legacy stream processing expertise. For a number of causes, as we focus on shortly, PostNL determined to evolve this important part. The journey of changing the legacy stream processing with Managed Service for Apache Flink is the main focus of the remainder of this publish.

The choice emigrate the aggregation software to Managed Service for Apache Flink

Because the variety of linked gadgets grows, so does the need for a sturdy and scalable platform able to dealing with and aggregating huge volumes of IoT knowledge. After thorough evaluation, PostNL opted emigrate to Managed Service for Apache Flink, pushed by a number of strategic concerns that align with evolving enterprise wants:

Enhanced knowledge aggregation – Utilizing Apache Flink’s robust capabilities in real-time knowledge processing permits PostNL to effectively mixture uncooked IoT knowledge from varied sources. The power to increase the aggregation logic past what was offered by the present resolution can unlock extra refined analytics and extra knowledgeable decision-making processes.
Scalability – The managed service offers the power to scale your software horizontally. This enables PostNL to deal with growing knowledge volumes effortlessly because the variety of IoT gadgets grows. This scalability implies that knowledge processing capabilities can broaden in tandem with the enterprise.
Deal with core enterprise – By adopting a managed service, the IoT platform staff can give attention to implementing enterprise logic and develop new use instances. The educational curve and overhead of working Apache Flink at scale would have diverted helpful energies and sources of the comparatively small staff, slowing down the adoption course of.
Price-effectiveness – Managed Service for Apache Flink employs a pay-as-you-go mannequin that aligns with operational budgets. This flexibility is especially useful for managing prices consistent with fluctuating knowledge processing wants.

Challenges of dealing with late occasions

Widespread stream processing use instances require aggregating occasions primarily based on after they have been generated. That is known as occasion time semantics. When implementing this sort of logic, it’s possible you’ll encounter the issue of delayed occasions, by which occasions attain your processing system late, lengthy after different occasions generated across the identical time.

Late occasions are widespread in IoT because of causes inherent to the surroundings, corresponding to community delays, system failures, briefly disconnected gadgets, or downtime. IoT gadgets usually talk over wi-fi networks, which might introduce delays in transmitting knowledge packets. And typically they might expertise intermittent connectivity points, leading to knowledge being buffered and despatched in batches after connectivity is restored. This will likely lead to occasions being processed out of order—some occasions could also be processed a number of minutes after different occasions that have been generated across the identical time.

Think about you wish to mixture occasions generated by gadgets inside a selected 10-second window. If occasions may be a number of minutes late, how will you ensure you will have obtained all occasions that have been generated in these 10 seconds?

A easy implementation could anticipate a number of minutes, permitting late occasions to reach. However this methodology means you could’t calculate the results of your aggregation till a number of minutes later, growing the output latency. One other resolution can be ready just a few seconds, after which dropping any occasions arriving later.

Growing latency or dropping occasions that will comprise important info usually are not palatable choices for the enterprise. The answer should be an excellent compromise, a trade-off between latency and completeness.

Apache Flink gives occasion time semantics out of the field. In distinction to different stream processing frameworks, Flink gives a number of choices for coping with late occasions. We dive into how Apache Flink take care of late occasions subsequent.

A robust stream processing API

Apache Flink offers a wealthy set of operators and libraries for widespread knowledge processing duties, together with windowing, joins, filters, and transformations. It additionally consists of over 40 connectors for varied knowledge sources and sinks, together with streaming techniques like Apache Kafka and Amazon Managed Streaming for Apache Kafka, or Kinesis Knowledge Streams, databases, and likewise file system and object shops like Amazon Easy Storage Service (Amazon S3).

However crucial attribute for PostNL is that Apache Flink gives completely different APIs with completely different degree of abstractions. You can begin with the next degree of abstraction, SQL, or Desk API. These APIs summary streaming knowledge as extra acquainted tables, making them simpler to study for less complicated use instances. In case your logic turns into extra complicated, you may swap to the decrease degree of abstraction of the DataStream API, the place streams are represented natively, nearer to the processing occurring inside Apache Flink. In the event you want the finest-grained degree of management on how every single occasion is dealt with, you may swap to the Course of Perform.

A key studying has been that selecting one degree of abstraction in your software shouldn’t be an irreversible architectural choice. In the identical software, you may combine completely different APIs, relying on the extent of management you want at that particular step.

Scaling horizontally

To course of billions of uncooked occasions and develop with the enterprise, the power to scale was a necessary requirement for PostNL. Apache Flink is designed to scale horizontally, distributing processing and software state throughout a number of processing nodes, with the power to scale out additional when the workload grows.

For this explicit use case, PostNL needed to mixture the sheer quantity of uncooked occasions with comparable traits and over time, to scale back their cardinality and make the info circulate manageable for the opposite techniques downstream. These aggregations transcend easy transformations that deal with one occasion at a time. They require a framework able to stateful stream processing. That is precisely the kind of use case Apache Flink was designed for.

Superior occasion time semantics

Apache Flink emphasizes occasion time processing, which permits correct and constant dealing with of knowledge with respect to the time it occurred. By offering built-in assist for occasion time semantics, Flink can deal with out-of-order occasions and late knowledge gracefully. This functionality was basic for PostNL. As talked about, IoT generated occasions could arrive late and out of order. Nonetheless, the aggregation logic should be primarily based on the second the measurement was really taken by the system—the occasion time—and never when it’s processed.

Resiliency and ensures

PostNL had to verify no knowledge despatched from the system is misplaced, even in case of failure or restart of the appliance. Apache Flink gives robust fault tolerance ensures by means of its distributed snapshot-based checkpointing mechanism. Within the occasion of failures, Flink can get well the state of the computations and obtain exactly-once semantics of the end result. For instance, every occasion from a tool is rarely missed nor counted twice, even within the occasion of an software failure.

The journey of choosing the proper Apache Flink API

A key requirement of the migration was reproducing precisely the conduct of the legacy aggregation software, as anticipated by the downstream techniques that may’t be modified. This launched a number of further challenges, particularly round windowing semantics and late occasion dealing with.

As we’ve seen, in IoT, occasions could also be out of order by a number of minutes. Apache Flink gives two high-level ideas for implementing occasion time semantics with out-of-order occasions: watermarks and allowed lateness.

Apache Flink offers a variety of versatile APIs with completely different ranges of abstraction. After some preliminary analysis, Flink-SQL and the Desk API have been discarded. These increased ranges of abstraction present superior windowing and occasion time semantics, however couldn’t present the fine-grained management PostNL wanted to breed precisely the conduct of the legacy software.

The decrease degree of abstraction of the DataStream API additionally gives windowing aggregation capabilities, and lets you customise the behaviors with customized triggers, evictors, and dealing with late occasions by setting an allowed lateness.

Sadly, the legacy software was designed to deal with late occasions in a peculiar approach. The end result was a hybrid occasion time and processing time logic that couldn’t be simply reproduced utilizing high-level Apache Flink primitives.

Fortuitously, Apache Flink gives an extra decrease degree of abstraction, the ProcessFunction API. With this API, you will have the finest-grained management on software state, and you should use timers to implement nearly any customized time-based logic.

PostNL determined to go on this path. The aggregation was carried out utilizing a KeyedProcessFunction that gives a technique to carry out arbitrary stateful processing on keyed streams—logically partitioned streams. Uncooked occasions from every IoT system are aggregated primarily based on their occasion time (the timestamp written on the occasion by the supply system) and the outcomes of every window is emitted primarily based on processing time (the present system time).

This fine-grained management lastly allowed PostNL to breed precisely the conduct anticipated by the downstream purposes.

The journey to manufacturing readiness

Let’s discover the journey of migrating to Managed Service for Apache Flink, from the beginning of the venture to the rollout to manufacturing.

Figuring out necessities

Step one of the migration course of centered on completely understanding the prevailing system’s structure and efficiency metrics. The objective was to supply a seamless transition to Managed Service for Apache Flink with minimal disruption to ongoing operations.

Understanding Apache Flink

PostNL wanted to familiarize themselves with the Managed Service for Apache Flink software and its streaming processing capabilities, together with built-in windowing methods, aggregation capabilities, occasion time vs. processing time variations, and eventually KeyProcessFunction and mechanisms for dealing with late occasions.

Completely different choices have been thought-about, utilizing primitives offered by Apache Flink out of the field, for occasion time logic and late occasions. The most important requirement was to breed precisely the conduct of the legacy software. The power to change to utilizing a decrease degree of abstraction helped. Utilizing the finest-grained management allowed by the ProcessFunction API, PostNL was in a position to deal with late occasions precisely because the legacy software.

Designing and implementing ProcessFunction

The enterprise logic is designed utilizing ProcessFunction to emulate the peculiar conduct of the legacy software in dealing with late occasions with out excessively delaying the preliminary outcomes. PostNL determined to make use of Java for the implementation, as a result of Java is the first language for Apache Flink. Apache Flink lets you develop and take a look at your software domestically, in your most popular built-in improvement surroundings (IDE), utilizing all of the out there debug instruments, earlier than deploying it to Managed Service for Apache Flink. Java 11 with Maven compiler was used for implementation. For extra details about IDE necessities, consult with Getting began with Amazon Managed Service for Apache Flink (DataStream API).

Testing and validation

The next diagram reveals the structure used to validate the brand new software.

Testing architecture

To validate the conduct of the ProcessFunction and late occasion dealing with mechanisms, integration exams have been designed to run each the legacy software and the Managed Service for Flink software in parallel (Steps 3 and 4). This parallel execution allowed PostNL to straight evaluate the outcomes generated by every software below an identical situations. A number of integration take a look at instances push knowledge to the supply stream (2) in parallel (7) and wait till their aggregation window is full, then they pull the aggregated outcomes from the vacation spot stream to check (8). Integration exams are mechanically triggered by the CI/CD pipeline after deployment of the infrastructure is full. Throughout the integration exams, the first focus was on attaining knowledge consistency and processing accuracy between the legacy software and the Managed Service for Flink software. The output streams, aggregated knowledge, and processing latencies have been in comparison with validate that the migration didn’t introduce any sudden discrepancies. For writing and operating the mixing exams, Robotic Framework, an open supply automation framework, was utilized.

After the mixing exams are handed, there’s yet one more validation layer: end-to-end exams. Just like the mixing exams, end-to-end exams are mechanically invoked by the CI/CD pipeline after the deployment of the platform infrastructure is full. This time, a number of end-to-end take a look at instances ship knowledge to AWS IoT Core (1) in parallel (9) and verify the aggregated outcomes from the vacation spot S3 bucket (5, 6) dumped from the output stream to check (10).

Deployment

PostNL determined to run the brand new Flink software on shadow mode. The brand new software ran for a while in parallel with the legacy software, consuming precisely the identical inputs, and sending output from each purposes to an information lake on Amazon S3. This allowed them to check the outcomes of the 2 purposes utilizing actual manufacturing knowledge, and likewise to check the steadiness and efficiency of the brand new one.

Efficiency optimization

Throughout migration, the PostNL IoT platform staff realized how the Flink software may be fine-tuned for optimum efficiency, contemplating elements corresponding to knowledge quantity, processing velocity, and environment friendly late occasion dealing with. A very attention-grabbing side was to confirm that the state dimension wasn’t growing unbounded over the long run. A threat of utilizing the finest-grained management of ProcessFunction is state leak. This occurs when your implementation, straight controlling the state within the ProcessFunction, misses some nook instances the place a state is rarely deleted. This causes the state to develop unbounded. As a result of streaming purposes are designed to run repeatedly, an increasing state can degrade efficiency and ultimately exhaust reminiscence or native disk area.

With this part of testing, PostNL discovered the proper stability of software parallelism and sources—together with compute, reminiscence, and storage—to course of the traditional every day workload profile with out lag, and deal with occasional peaks with out over-provisioning, optimizing each efficiency and cost-effectiveness.

Remaining swap

After operating the brand new software in shadow mode for a while, the staff determined the appliance was steady and emitting the anticipated output. The PostNL IoT platform lastly converted to manufacturing and shut down the legacy software.

Key takeaways

Among the many a number of learnings gathered within the journey of adopting Managed Service for Apache Flink, some are notably vital, and proving key when increasing to new and numerous use instances:

Perceive occasion time semantics – A deep understanding of occasion time semantics is essential in Apache Flink for precisely implementing time-dependent knowledge operations. This data makes positive occasions are processed appropriately relative to after they really occurred.
Use the highly effective Apache Flink API – Apache Flink’s API permits for the creation of complicated, stateful streaming purposes past primary windowing and aggregations. It’s vital to completely grasp the in depth capabilities provided by the API to deal with refined knowledge processing challenges.
With energy comes extra duty – The superior performance of Apache Flink’s API brings vital duty. Builders should be sure that purposes are environment friendly, maintainable, and steady, requiring cautious useful resource administration and adherence to finest practices in coding and system design.
Don’t combine occasion time and processing time logic – Combining occasion time and processing time for knowledge aggregation presents distinctive challenges. It prevents you from utilizing higher-level functionalities offered out of the field by Apache Flink. The bottom degree of abstractions amongst Apache Flink APIs permit for implementing customized time-based logic, however require a cautious design to realize accuracy and well timed outcomes, alongside in depth testing to validate good efficiency.

Conclusion

Within the journey of adopting Apache Flink, the PostNL staff realized how the highly effective Apache Flink APIs can help you implement complicated enterprise logic. The staff got here to understand how Apache Flink may be utilized to resolve a number of and numerous issues, and they’re now planning to increase it to extra stream processing use instances.

With Managed Service for Apache Flink, the staff was in a position to give attention to the enterprise worth and implementing the required enterprise logic, with out worrying concerning the heavy lifting of organising and managing an Apache Flink cluster.

To study extra about Managed Service for Apache Flink and choosing the proper managed service possibility and API in your use case, see What’s Amazon Managed Service for Apache Flink. To expertise hands-on find out how to develop, deploy, and function Apache Flink purposes on AWS, see the Amazon Managed Service for Apache Flink Workshop.

Concerning the Authors

Çağrı Çakır is the Lead Software program Engineer for the PostNL IoT platform, the place he manages the structure that processes billions of occasions every day. As an AWS Licensed Options Architect Skilled, he makes a speciality of designing and implementing event-driven architectures and stream processing options at scale. He’s obsessed with harnessing the ability of real-time knowledge, and devoted to optimizing operational effectivity and innovating scalable techniques.

Ozge Kavalci Özge Kavalcı works as Senior Resolution Engineer for the PostNL IoT platform and likes to construct cutting-edge options that combine with the IoT panorama. As an AWS Licensed Options Architect, she makes a speciality of designing and implementing extremely scalable serverless architectures and real-time stream processing options that may deal with unpredictable workloads. To unlock the complete potential of real-time knowledge, she is devoted to shaping the way forward for IoT integration.

Amit Singh works as a Senior Options Architect at AWS with enterprise clients on the worth proposition of AWS, and participates in deep architectural discussions to verify options are designed for profitable deployment within the cloud. This consists of constructing deep relationships with senior technical people to allow them to be cloud advocates. In his free time, he likes to spend time along with his household and study extra about the whole lot cloud.

Lorenzo Nicora works as Senior Streaming Options Architect at AWS serving to clients throughout EMEA. He has been constructing cloud-centered, data-intensive techniques for a number of years, working within the finance business each by means of consultancies and for fintech product firms. He has used open-source applied sciences extensively and contributed to a number of initiatives, together with Apache Flink.