Atlassian has attributed an outage of its services – now in its eighth day and affecting around 400 customers – to a "communication gap" between engineering teams and a "faulty" script that permanently deleted customer data.
Now that the company is progressing in restoring the deleted customer sites from backup, it has published a more detailed writeup that it promised earlier today.
The seeds of the outage were sown when Atlassian folded a standalone product, Insight – Asset Management – into its Jira Software and Jira Service Management as native functionality.
“Because of this, we needed to deactivate the standalone legacy app on customer sites that had it installed”, CTO Sri Viswanath wrote.
He said the engineering teams decided to use an existing script to “deactivate instances of this standalone application”.
That turned out to be a disaster.
A miscommunication between two engineering teams – one asking for the deactivation of the instances, the other executing it – meant that instead of running the script against “the IDs of the intended app being marked for deactivation”, it was run with “the IDs of the entire cloud site where the apps were to be deactivated”.
The other mistake: the script offered two execution modes – one that marks sites for deletion (which preserves recoverability), and one that permanently deletes them.
“The script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted,” Viswanath wrote.
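Atlassian has not published the script itself, but the two failure axes it describes – the wrong kind of ID and the wrong execution mode – can be sketched in Python. Everything here is hypothetical: the ID prefixes, the mode names, and the guardrail are assumptions for illustration, not Atlassian's actual code.

```python
from enum import Enum

class Mode(Enum):
    MARK = "mark"    # recoverable: target is flagged for later deletion
    PURGE = "purge"  # permanent: data is removed immediately

def deactivate(ids, mode):
    """Hypothetical deactivation script.

    Assumes IDs are prefixed 'app-' for app instances and 'site-' for
    entire cloud sites. The incident combined both mistakes at once:
    a list of site IDs where app IDs were intended, run in the
    permanent-delete mode rather than the recoverable one.
    """
    # Guardrail (absent in the real incident): refuse to permanently
    # delete anything that looks like a whole site.
    if mode is Mode.PURGE and any(i.startswith("site-") for i in ids):
        raise ValueError("refusing to permanently delete whole sites")
    verb = "marked for deletion" if mode is Mode.MARK else "permanently deleted"
    return f"{len(ids)} id(s) {verb}"
```

With such a check in place, the fatal combination – site IDs plus permanent deletion – would fail loudly instead of executing.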
The reason behind the extended outage
Given the nature of its business, Atlassian had those sites backed up and was able to restore them.
Those backups are routinely used when individual customers accidentally delete their own environments, and in the event of a catastrophic failure they can restore all customers into a new environment.
However, the deletion of 400 customers’ sites presented Atlassian with a new scenario.
“What we have not (yet) automated is restoring a large subset of customers into our existing (and currently in use) environment without affecting any of our other customers,” Viswanath explained.
“Because the data deleted in this incident was only a portion of data stores that are continuing to be used by other customers, we have to manually extract and restore individual pieces from our backups.
“Each customer site recovery is a lengthy and complex process, requiring internal validation and final customer verification when the site is restored.”
At the moment, Viswanath wrote, customers are being restored in batches of 60, with a four-to-five day end-to-end restore time for each customer.
This is speeding up: “Our teams have now developed the capability to run multiple batches in parallel, which has helped to reduce our overall restore time”, the post stated.
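Some back-of-envelope arithmetic shows why parallel batches matter. Atlassian has not said how many batches it runs concurrently, so the degree of parallelism below is an assumption; the customer count, batch size, and per-batch restore time come from the figures quoted above.

```python
import math

CUSTOMERS = 400       # sites deleted in the incident
BATCH_SIZE = 60       # customers restored per batch
DAYS_PER_BATCH = 5    # upper end of the quoted 4-5 day end-to-end time

BATCHES = math.ceil(CUSTOMERS / BATCH_SIZE)  # 7 batches in total

def total_days(parallel_batches):
    """Elapsed days if `parallel_batches` batches run concurrently.

    Each wave of concurrent batches takes one batch's end-to-end time;
    the number of concurrent batches is an illustrative assumption.
    """
    waves = math.ceil(BATCHES / parallel_batches)
    return waves * DAYS_PER_BATCH
```

Run strictly one batch at a time, seven waves of five days would take 35 days; three batches in parallel cuts that to three waves, or 15 days – which is the kind of reduction the company is describing.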