Planning for the unplannable


Published on 13/03/2025 | Written by Heather Wright



CISO’s learnings from UniSuper black swan event…

Vijay Krishnan vividly remembers the moment he found out that UniSuper, a $135 billion Australian super fund with more than 647,000 members, had a big issue on its hands.

“It was May 1, 2024, 2am. We started receiving alerts. It started with gradual alerting. Something was wrong with our production systems. But the volume of alerts increased and we realised something serious was happening,” the UniSuper CISO says.

“Please don’t do disaster recovery for the sake of ticking the box.”

The incident attracted global attention, putting the spotlight firmly on UniSuper and its cloud provider.

While the outage wasn’t related to cybersecurity – a misconfiguration at the organisation’s cloud service provider, Google Cloud, saw UniSuper’s accounts deleted overnight and without warning – it proved a testing ground for UniSuper’s resilience, crisis management and teamwork, and required an extensive recovery and restoration effort. (Member-facing services were restored within two weeks.)

Krishnan, who took to the stage at Gartner’s Security & Risk Management Summit in Sydney recently to detail the experience, credited work done to build organisational resilience with enabling UniSuper to recover, but says even with all the preparation, there were still some big learnings for the organisation.

UniSuper has production systems dispersed across multiple cloud service providers, and within individual cloud service providers it also has multi-site redundancy.

“That was an important factor.”

The organisation’s alerting system is also hosted with a different service provider – something that was instrumental in ensuring alerts were received in a timely manner.

Robust backups were also key. The company had backups of its core production systems not just with Google Cloud, but also with two other service providers.
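As a rough illustration of that pattern – keeping independent, verified copies of critical backups with more than one provider so the loss of any single provider doesn’t take the backups with it – here is a minimal Python sketch. It is not UniSuper’s actual tooling; the target names are hypothetical and local directories stand in for separate storage providers.

import hashlib
import shutil
from pathlib import Path

# Illustrative sketch only: local directories stand in for independent
# storage providers. The point is the pattern -- every backup artefact is
# copied to more than one destination and verified, so no single provider
# failure can wipe out the backups.
BACKUP_TARGETS = [
    Path("replica_provider_a"),   # e.g. the primary cloud provider
    Path("replica_provider_b"),   # a second, independent provider
    Path("replica_provider_c"),   # a third, independent provider
]

def sha256(path: Path) -> str:
    """Checksum used to confirm each copy matches the source."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def replicate_backup(source: Path) -> None:
    """Copy one backup artefact to every target and verify it landed intact."""
    expected = sha256(source)
    for target in BACKUP_TARGETS:
        target.mkdir(parents=True, exist_ok=True)
        copy = target / source.name
        shutil.copy2(source, copy)
        if sha256(copy) != expected:
            raise RuntimeError(f"Backup copy to {target} failed verification")

if __name__ == "__main__":
    artefact = Path("core_system_backup.tar.gz")
    artefact.write_bytes(b"placeholder backup contents")  # demo artefact only
    replicate_backup(artefact)
    print("Backup replicated to", len(BACKUP_TARGETS), "independent targets")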

Rigorous disaster recovery planning and simulations were also instrumental, even though they hadn’t anticipated a black swan event.

“The key thing once you finish your disaster recovery testing is that there will be some lessons learned. Be sure you fix those before the next disaster or test!”

One lesson UniSuper learned: It hadn’t kept its disaster recovery plan up to date.

Business continuity planning and simulations also helped with recovery.

“We had done one previous to the incident, which was cyber security focused, but it really helped us in this scenario as well. There is not much difference – systems are down and we need to recover.”

Crisis management, and practicing ‘what good communication needs to look like’, was also something UniSuper had tested and fixed prior to the May outage.

Even with all the foundations in place, Krishnan says there were plenty of lessons learned from the May outage.

“Nothing prepares you for an event like this! It’s all good practice and you will learn a lot of things, but in an actual event as big as this, nothing prepares you for it.”

He urged attendees to consider the possibility of a black swan event and imagine that their entire cloud service provider – and with it their production systems – is down.

“How would you recover? That should be one of your disaster recovery test scenarios.

“Because we didn’t prepare for it there were some challenges.”

Among those was understanding what type of team was needed during the crisis.

Around 100 people were involved internally in the recovery process and Krishnan says understanding the skills and capabilities of those people – and their ability to handle the intense pressure of a major outage – was ‘a bit of a challenge’.

The organisation had also never practiced disaster recovery involving third parties.

“I think we all should do it, because in this event we needed a lot of third parties to help us recover. And they were excellent. We formed as one team. They were never looking at the contracts to see what they said. We worked as one team and recovered.”

The third parties quickly understood the magnitude of the problem and stood up their own recovery teams dedicated to the UniSuper incident.

Krishnan says he was surprised at how easy the collaboration was.

“We never expected it to be so easy. Everyone said ‘we are in this together and we will recover together’.”

That extended to UniSuper’s board, senior executives and management, who rather than pushing the ‘how fast can systems be restored’ angle, reiterated that ‘we are going to work together and recover’.

“That assurance, that confidence, greatly helped our teams. We were all working hard and we didn’t need pressure from senior management asking when we were going to restore it. It was about doing the best we could.”

He says crisis management teams should include the board because some cyber events, such as ransomware, require the board to make a big call – to pay or not to pay.

“So having that practice, getting them involved, is very important.”

On the comms front, Krishnan says while external communications were ‘excellent, very good’, internal comms needed improvement.

“We have a lot of stakeholders internally and the level of information and content of that information is different for the different stakeholder groups. We haven’t practiced that and it was one of the things we learned.”

Krishnan also pointed to the importance of clear, concise, accurate and consistent communications.

While members’ investments were ‘totally safe’, as they are held with another third party, and the investment team continued to operate as normal throughout, members were anxious – unable to check balances, clear deposits and withdrawals, or watch stock markets to manage their investments.

“What we learned and what we practiced is transparency is key. We have to be transparent with our members and external parties to explain to them what actually happened. And not just transparency: The information needs to be accurate as well.”

Ensuring communications were consistent across all channels, to prevent misinformation or panic, was also necessary, as was timeliness.

UniSuper stood up a 24/7 call centre for the event, and Krishnan says empathy and reassurance were the name of the game.

“You have to reassure them not just with words. You have to show them you have a clear plan – this is what we are doing now, this is the next step – to give them some confidence.”

Krishnan says the outage ‘reiterated most of our thinking, with some adjustments required’.

“The key thing is that having a robust, resilient architecture, especially for core, member- or customer-facing systems, is important. Multi-site, multi-cloud redundancy is very important because even systems that are 99.99% available can go down. There could be events beyond your control. Have a plan in place and design your systems for it,” he says.

“And have a strong disaster recovery plan. Please don’t do disaster recovery for the sake of ticking the box. You have to have a disaster recovery plan for a black swan.”
