DRP organisation

A DRP (Disaster Recovery Plan) exercise is performed by the IT teams to validate the resilience of the bank's IT assets.

Context

A bank is a critical pillar of a country's economy, and its IT is the foundation of that pillar.

In case of a major incident, IT must be ready to apply fixes as quickly as possible. Banks, at least French ones, have a legal obligation to test their IT resilience at least once per datacenter they own.

The exercise consists of stopping part of the infrastructure (for example, powering down a datacenter) to simulate a real infrastructure outage. It shows whether the critical services can stay available despite the crash (for example, by switching from the stopped datacenter to another available one).

It usually occurs during a weekend and takes place in two steps:

- FailOver: intentional stop of the infrastructure we wish to test.

- FailBack: return to the normal situation by restarting all the entities previously stopped.
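As a rough illustration of these two steps, here is a minimal Python sketch. It is not the tooling used during the real exercise: the `Service` and `Infrastructure` classes, the service names, and the datacenter names (`dc-a`, `dc-b`) are all hypothetical, and it assumes a service counts as available as long as at least one datacenter hosting it is still running.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    critical: bool
    datacenters: set[str]  # datacenters hosting an instance (hypothetical names)


@dataclass
class Infrastructure:
    services: list[Service]
    stopped: set[str] = field(default_factory=set)

    def fail_over(self, datacenter: str) -> None:
        """FailOver: intentionally stop the datacenter under test."""
        self.stopped.add(datacenter)

    def fail_back(self, datacenter: str) -> None:
        """FailBack: restart the datacenter that was stopped."""
        self.stopped.discard(datacenter)

    def is_available(self, service: Service) -> bool:
        # Assumption: one running datacenter is enough to serve the service.
        return bool(service.datacenters - self.stopped)

    def unavailable_critical_services(self) -> list[str]:
        """Names of critical services with no running datacenter left."""
        return [
            s.name
            for s in self.services
            if s.critical and not self.is_available(s)
        ]


infra = Infrastructure(
    services=[
        Service("payments", critical=True, datacenters={"dc-a", "dc-b"}),
        Service("reporting", critical=True, datacenters={"dc-a"}),
        Service("wiki", critical=False, datacenters={"dc-a"}),
    ]
)

infra.fail_over("dc-a")
down_during_failover = infra.unavailable_critical_services()
print(down_during_failover)  # ['reporting']

infra.fail_back("dc-a")
down_after_failback = infra.unavailable_critical_services()
print(down_after_failback)  # []
```

In this toy model, `payments` survives the FailOver because it also runs in `dc-b`, while `reporting` is flagged because it only lives in the stopped datacenter. This is the kind of gap the exercise is meant to reveal.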

Working in the Middleware/Container team, I was responsible for organising the biggest datacenter DRP for the whole Middleware team (for SGCIB) before, during, and after the event.

Team Work

Lead a Project

Production

Infrastructure Knowledge

The Project

Before

I had to prepare the DRP for my team:

- Analyse and create an up-to-date inventory of the impacted assets.

- Communicate with the control tower and with the different Middleware teams.

- Take care of the logistics before the event (ordering meal trays, planning the weekend according to my coworkers’ availability).

- Check that all DRP tools will be available during the event

During

I had to make sure the event ran smoothly:

- Manage the priorities of the incidents

- Manage my coworkers (stress, handling of customer requests)

- Above all, keep communication flowing with my coworkers so that we worked as one efficient team.

- Communicate with the control tower

Here my role was twofold:

- Step back from production to keep a global view of the event. This required managing the event as a project and being the pillar of the team's work.

- Help fix issues according to their priority. This required a good understanding of production concepts and a good knowledge of the assets we manage.

After

After the event, I had to write incident post-mortems where needed, to keep a record of each incident and explain how it had been permanently resolved.

Definitely a great project to have taken part in!

I learnt a lot during this project, and I have had the opportunity to manage others since then.