Case Study: Grid Computing

Background

A large investment bank ran risk calculations on a compute farm with thousands of CPUs, but the jobs still took hours to finish. The objective was to improve both the latency and the throughput of these runs.

The calculation problem was “embarrassingly parallel” – they had about a million trades to value under a hundred different market scenarios, and many of the risk calculations could be split into dozens of independent parts, so the total number of compute tasks numbered in the billions.
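
As a rough sanity check on that figure (the exact split is illustrative – “dozens” is taken here as roughly twenty parts per calculation):

    # Rough task-count estimate; the per-calculation split is an assumption
    trades = 1_000_000      # about a million trades to value
    scenarios = 100         # under a hundred market scenarios
    parts = 20              # "dozens" of independent parts, taken as ~20 here
    print(f"{trades * scenarios * parts:,} compute tasks")  # 2,000,000,000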

Approach

I got the grid team to embed an instrumentation API so that each layer of the calculation stack could report what it was doing, along with a service to capture this data and log it to disk for analysis. Initially, simple versions of the client and service were deployed over a TCP socket connection to check that everything worked, but to scale we re-engineered these components several times: moving to UDP transport, adding multiple receivers, and using asynchronous IO to improve disk throughput.
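
The sketch below shows roughly what the later design looked like in miniature: a fire-and-forget UDP client that each layer can call, and an asyncio-based receiver that appends events to a log file. The event fields, port number and names (send_event, TelemetryReceiver) are illustrative assumptions rather than the bank’s actual API, and a production version would run several receivers rather than one.

    import asyncio
    import json
    import socket
    import time

    TELEMETRY_ADDR = ("127.0.0.1", 9999)  # illustrative host and port

    # --- client side: one fire-and-forget UDP datagram per telemetry event ---
    _sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send_event(layer: str, task: str, phase: str) -> None:
        """Report what a layer of the calculation stack is doing right now."""
        event = {"ts": time.time(), "layer": layer, "task": task, "phase": phase}
        _sock.sendto(json.dumps(event).encode(), TELEMETRY_ADDR)

    # --- receiver side: capture datagrams and append them to a log file ---
    class TelemetryReceiver(asyncio.DatagramProtocol):
        def __init__(self, log_path: str):
            # Large write buffer so disk writes happen in big batches
            self.log = open(log_path, "a", buffering=1 << 20)

        def datagram_received(self, data: bytes, addr) -> None:
            self.log.write(data.decode() + "\n")

    async def run_receiver(log_path: str = "telemetry.log") -> None:
        loop = asyncio.get_running_loop()
        await loop.create_datagram_endpoint(
            lambda: TelemetryReceiver(log_path), local_addr=TELEMETRY_ADDR
        )
        await asyncio.Event().wait()  # serve until interrupted

    if __name__ == "__main__":
        asyncio.run(run_receiver())

The appeal of UDP here is that a slow or restarting receiver can never block the risk calculation itself, and the per-event overhead on the compute nodes stays minimal, at the cost of tolerating the occasional lost event.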

Once I had the telemetry data, I built tools to analyse, visualise and summarise it, and over time this revealed many opportunities for performance improvement. These ranged from straightforward batch-size adjustments, which made task lengths more consistent, to new caching schemes for input data on each farm node.
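
To give a flavour of the summarisation, the sketch below pairs up start/end events per task and prints per-layer task-length statistics, which is how uneven batch sizes tend to show up. It assumes the JSON event format from the earlier sketch; the real tooling was considerably more involved.

    import json
    from collections import defaultdict
    from statistics import mean, quantiles

    def task_durations(log_path: str) -> dict:
        """Pair start/end events per (layer, task) and collect durations by layer."""
        starts, durations = {}, defaultdict(list)
        with open(log_path) as log:
            for line in log:
                e = json.loads(line)
                key = (e["layer"], e["task"])
                if e["phase"] == "start":
                    starts[key] = e["ts"]
                elif e["phase"] == "end" and key in starts:
                    durations[e["layer"]].append(e["ts"] - starts.pop(key))
        return durations

    def summarise(log_path: str = "telemetry.log") -> None:
        """Print per-layer task-length stats; wide p50/p90 gaps flag uneven batches."""
        for layer, d in sorted(task_durations(log_path).items()):
            q = quantiles(d, n=10)  # deciles
            print(f"{layer}: n={len(d)} mean={mean(d):.2f}s "
                  f"p50={q[4]:.2f}s p90={q[8]:.2f}s")

    if __name__ == "__main__":
        summarise()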

Results

  • Batch run-times reduced by 40%
  • Compute volumes increased 3x without adding servers

Next Step

If you would like to discuss how these techniques could improve your business, contact bryan@performance-ninja.com