Failed DNS server restarts caused Salesforce outage

Configuration change "exposed a design issue in the shutdown process".

Domain name servers that did not restart as expected after a configuration change caused Salesforce's services to go down worldwide on May 12, the company said in a final root cause analysis of the incident.

On that day, "a configuration change was made as an emergency fix at the network tier, which was designed to address a functional gap in preparation for an upcoming maintenance activity," Salesforce said.

Salesforce uses the Berkeley Internet Name Domain (BIND) software, whose DNS server daemon is called named.

A change was made to enable DNS resolution between an existing Salesforce Australia data centre and a new Hyperforce environment set to undergo maintenance, using a script.

The script, which Salesforce says had been used for the past three years without ill effects, relied on an internal method called a Metazone change.

This deploys new configuration data through a DNS zone transfer, but in the May 12 incident the script did not behave as expected.

A UNIX kill command issued during the change did not allow enough time for the BIND named process to exit cleanly or to remove its process identification (PID) file.
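To illustrate the kind of shutdown sequence that avoids this race, here is a minimal Python sketch. It is not Salesforce's script: the function name `stop_and_wait`, the timeout value and the idea of the caller cleaning up the PID file are all assumptions made for illustration. It signals a process, waits a bounded time for it to exit, and only then removes any leftover PID file.

```python
import os
import signal
import subprocess


def stop_and_wait(proc: subprocess.Popen, pid_file: str, timeout: float = 10.0) -> bool:
    """Ask a child process to exit, wait for it, then clean up its PID file.

    Returns True if the process exited within the timeout, False if it had
    to be killed forcibly. All names here are hypothetical.
    """
    graceful = True
    proc.send_signal(signal.SIGTERM)      # polite request to shut down
    try:
        proc.wait(timeout=timeout)        # block until the process has really gone
    except subprocess.TimeoutExpired:
        graceful = False
        proc.kill()                       # last resort after the grace period
        proc.wait()
    # Only once the process is confirmed dead is it safe to drop a stale PID file.
    if os.path.exists(pid_file):
        os.remove(pid_file)
    return graceful
```

The point of the sketch is the ordering: the restart should not begin until the old process is confirmed to have exited and its PID file is gone.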

On restart, the named startup script checks for an existing PID file to determine if an instance is already running.

If the script finds a PID file, it exits immediately, and as a result, the named DNS server process did not restart.
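A startup check can guard against this failure mode by testing whether the PID recorded in the file still belongs to a live process, rather than treating the file's mere existence as proof. The following Python sketch shows one way to do that on a POSIX system. The function names `pid_is_running` and `should_start` are hypothetical and this is not the actual named startup script.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False                      # 0 and negatives have special meaning to kill()
    try:
        os.kill(pid, 0)                   # signal 0 probes for existence, sends nothing
    except ProcessLookupError:
        return False                      # no such process: the PID is stale
    except PermissionError:
        return True                       # process exists but is owned by another user
    return True


def should_start(pid_file: str) -> bool:
    """Decide whether to start the daemon, removing a stale PID file if found."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return True                       # no PID file: nothing is running
    except ValueError:
        os.remove(pid_file)               # unreadable contents: treat as stale
        return True
    if pid_is_running(pid):
        return False                      # a live instance really exists
    os.remove(pid_file)                   # leftover from an unclean shutdown
    return True
```

One caveat: PIDs can be reused by unrelated processes, so a production check would typically also confirm that the process is the expected daemon.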

Salesforce said the script failure had global impact because the Metazone changes were deployed to named servers across all its data centres worldwide.

Many named services failed to restart, causing widespread disruption for Salesforce customers.

A lack of automation with safeguards for DNS changes to protect against unforeseen incidents was a contributing factor for the outage, along with insufficient guardrails to enforce the change management process.
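As a rough illustration of what automation with safeguards can look like, the Python sketch below applies a change to servers in batches and halts as soon as a health check fails, instead of pushing to every server at once. It does not describe Salesforce's tooling: `staged_rollout`, `apply_change` and `is_healthy` are hypothetical names, and the batching scheme is an assumption.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    batches: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a change batch by batch, stopping at the first unhealthy batch.

    Returns the list of servers that were changed. Raises RuntimeError if any
    server in a batch fails its health check, leaving later batches untouched.
    """
    changed: List[str] = []
    for batch in batches:
        for server in batch:
            apply_change(server)
            changed.append(server)
        # Verify the whole batch before widening the blast radius.
        unhealthy = [s for s in batch if not is_healthy(s)]
        if unhealthy:
            raise RuntimeError(f"rollout halted, unhealthy servers: {unhealthy}")
    return changed
```

With a small first batch, a fault such as a DNS server that fails to restart would surface on a handful of machines before the change reached the rest.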

Salesforce's Sales, Service, Marketing, Commerce, Government and Experience Clouds all became inaccessible for users, along with Heroku, Pardot and Industries.

Adding to Salesforce customers' woes, the status.salesforce.com site experienced such high traffic that it, too, became unavailable.

Customers were also unable to log support cases due to multi-factor authentication problems.

Salesforce has apologised for the outage, and has put a moratorium in place on all DNS changes across the company.

The script that triggered the outage has also been removed.

Copyright © iTnews.com.au. All rights reserved.