What is the most effective way to add resilience for Java APIs?

Written by Irdeto Engineers | Apr 4, 2023 8:00:00 AM

Servers are designed to handle large volumes of users, but in some cases they can be overwhelmed and, as a result, return error messages to the connecting viewer. In the video entertainment industry in particular, this is irritating for viewers and can be damaging for both the host and the audience.

When servers experience a surge in the number of users, the Central Processing Unit (CPU) can struggle to keep up with the incoming requests. To combat this issue, our Irdeto engineers walk you through some of the available strategies and how they address it.

What does the platform look like?

The system our engineers work with is called a digital rights and rules manager, commonly referred to as Digital Rights Management (DRM) software. It’s an approach to copyright protection for digital media, tasked with preventing the unauthorized redistribution of said media and restricting how consumers interact with the content they have purchased.

The system operates with symmetric key encryption, where a key is used to encrypt and secure content within the operator’s content delivery network. When a user wants to interact with this content, they need a license containing the matching decryption key.
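
As a rough illustration of the single-key scheme described above (commonly called symmetric encryption), the Java sketch below encrypts and decrypts a small payload with one shared key. The class name SymmetricContentDemo, its method names, and the choice of AES-GCM with a 128-bit key are assumptions made for this example; the article does not say which cipher or key size the platform uses.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical sketch: one shared key both encrypts and decrypts the content.
final class SymmetricContentDemo {
    private static final int IV_BYTES = 12;   // nonce size commonly used with GCM
    private static final int TAG_BITS = 128;  // authentication tag length

    // Generates a fresh 128-bit AES key (assumed key size, for illustration only).
    static SecretKey newContentKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encrypts the payload and returns the random IV followed by the ciphertext and tag.
    static byte[] encrypt(SecretKey key, byte[] plain) {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] sealed = cipher.doFinal(plain);
            byte[] out = new byte[iv.length + sealed.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(sealed, 0, out, iv.length, sealed.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Decrypts a message produced by encrypt(); fails if the key does not match.
    static byte[] decrypt(SecretKey key, byte[] message) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, message, 0, IV_BYTES));
            return cipher.doFinal(message, IV_BYTES, message.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch, the key returned by newContentKey plays the role of the key delivered inside a license: decrypt recovers the payload only when it is given the same key that encrypt used, and throws otherwise.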

For video entertainment, the secure storage of both High Definition (HD) and, especially, Ultra-High Definition (UHD) content is essential. The DRM system regulates how that content is used by enforcing geo-restrictions and even time limits. To obtain a license to interact with the content, a user needs to come back to our service.

What are the top contributing server issues?

While a server is designed to handle traffic efficiently, there is only so much it can do. Here are the most common contributors to server errors:

Hardware failures can occur at any time, usually as a result of human error or power outages.
Network failures can occur when there are connectivity, configuration, bandwidth or security issues.
Partial failures occur when applications or systems deliver only part of their functionality.
Spike loads occur when there is a surge in the number of users within a very short timeframe.

Each of these failures can keep a system or server from carrying out its intended functionality. For example, when a high-profile sports game is on and millions of users tune in within a short timeframe, the number of users spikes sharply, which, if handled improperly, can cause shutdown errors.

How do you add application resilience?

While hardware and network failures are detrimental to the functionality of a server, our engineers focused on the problem of correctly handling spike loads on the server side. Handling a spike comes down to two things: flow control and request prioritization.

Flow control refers to the way in which concurrent client requests are handled, managing the memory and CPU resources so as not to overwhelm the API server.
Request prioritization refers to the order in which the CPU executes the pending requests.
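
The two ideas above can be sketched together in Java. This is a minimal illustration, not the engineers' implementation: the class PrioritizedRequestQueue and its methods are hypothetical names, the capacity limit stands in for flow control, and the numeric priority (lower runs first) stands in for request prioritization.

```java
import java.util.PriorityQueue;

// Hypothetical sketch: a bounded queue (flow control) that hands out
// pending requests in priority order (request prioritization).
final class PrioritizedRequestQueue {
    private static final class Entry implements Comparable<Entry> {
        final String name;
        final int priority;   // lower value = served first (assumed convention)
        final long sequence;  // arrival order, used to break ties

        Entry(String name, int priority, long sequence) {
            this.name = name;
            this.priority = priority;
            this.sequence = sequence;
        }

        @Override
        public int compareTo(Entry other) {
            int byPriority = Integer.compare(priority, other.priority);
            return byPriority != 0 ? byPriority : Long.compare(sequence, other.sequence);
        }
    }

    private final PriorityQueue<Entry> pending = new PriorityQueue<>();
    private final int capacity;
    private long nextSequence = 0;

    PrioritizedRequestQueue(int capacity) {
        this.capacity = capacity;
    }

    // Flow control: refuse new work once the backlog reaches the limit,
    // so the backlog cannot grow until memory or CPU is exhausted.
    synchronized boolean offer(String name, int priority) {
        if (pending.size() >= capacity) {
            return false;
        }
        pending.add(new Entry(name, priority, nextSequence++));
        return true;
    }

    // Prioritization: return the most urgent pending request, or null if none.
    synchronized String poll() {
        Entry next = pending.poll();
        return next == null ? null : next.name;
    }
}
```

A worker thread would call poll in a loop and execute whatever it returns, while the request-accepting code calls offer and answers immediately with an error when it returns false.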

Currently, HTTP servers provide flow control at the container level, which treats all the resources in our application with the same priority. As requests for those resources arrive, they build up in a queue, resulting in failures, timeouts or delayed processing. If the server cannot handle the requests, an error is displayed.

To better address the backlog of requests, our engineers applied logic to handle flow control per resource. This allowed requests for different resources to be prioritized and flow-controlled separately.
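
One way to picture per-resource flow control is a separate concurrency limit for each resource. The Java sketch below assumes this approach for illustration only; the class PerResourceFlowControl, its methods, and the use of java.util.concurrent.Semaphore are hypothetical choices, not the mechanism described in the eBook.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch: each resource gets its own concurrency limit, so a
// spike on one resource cannot consume the capacity set aside for another.
final class PerResourceFlowControl {
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    // Registers a resource with the number of requests it may serve at once.
    void register(String resource, int maxConcurrent) {
        permits.put(resource, new Semaphore(maxConcurrent));
    }

    // Runs the handler if the resource has a free permit; otherwise returns
    // the rejected value immediately instead of letting the request queue up.
    <T> T handle(String resource, Supplier<T> handler, T rejected) {
        Semaphore limit = permits.get(resource);
        if (limit == null) {
            throw new IllegalArgumentException("Unknown resource: " + resource);
        }
        if (!limit.tryAcquire()) {
            return rejected;
        }
        try {
            return handler.get();
        } finally {
            limit.release();
        }
    }
}
```

With limits registered for, say, a license resource and a catalog resource, a burst that saturates the license limit is rejected quickly for that resource only, while catalog requests continue to be served within their own limit.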

Want to learn more?

Our engineers have documented and explained the full process behind solving the problem of HTTP servers failing to handle spike loads. To read the full piece, download our eBook “Improving the resilience of Java API servers: The system at a glance”.
