  1. DC/OS
  2. DCOS_OSS-2317

Pkgpanda lacks retries when downloading packages

    Details

      Description

      Deployments on Azure sometimes fail because pkgpanda cannot download some of the packages. The cause is transient DNS failures in Azure.

      This is the dcos-setup log from a failed pkgpanda initialization: http://paste.openstack.org/show/8wKDHAIwYip7uARrbKOQ/

      In that log, lines 10 and 18 show pkgpanda failing to download a package with the following error:

      Failed to establish a new connection: [Errno -3] Temporary failure in name resolution

      These are temporary DNS resolution failures in the Azure environment.

      Digging further into the pkgpanda code, I found the exact function call responsible for downloading the packages: https://github.com/dcos/dcos/blob/master/pkgpanda/util.py#L85

      When downloading from an external source, it is good practice to retry on failure. That way, transient networking problems or temporary DNS failures do not cause the pkgpanda initialization to fail.

      I suggest adding retry logic here to avoid deployment failures. In my case, the Linux masters were not properly initialized and the whole deployment failed: http://paste.openstack.org/show/lPnpnsUxE6MFwga8bA4F
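
      To illustrate the kind of retry logic meant here, below is a minimal sketch of a retry wrapper with exponential backoff. It is not taken from pkgpanda: the name download_with_retries, its parameters, and the fetch callable are all hypothetical, and the actual fix may look quite different. It assumes the download step raises an OSError subclass on failure, which holds for socket.gaierror (the "Temporary failure in name resolution" case) and for requests' RequestException.

```python
import time


def download_with_retries(fetch, url, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with exponential backoff.

    Hypothetical helper for illustration only. `fetch` is any callable that
    downloads `url` and raises on failure; `sleep` is injectable so the
    backoff can be tested without actually waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError:
            # socket.gaierror and requests.exceptions.RequestException are
            # both OSError subclasses, so transient DNS errors land here.
            if attempt == attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            sleep(base_delay * 2 ** (attempt - 1))
```

      With the defaults, a download that keeps failing is retried after 1, 2, 4 and 8 seconds before the error is raised, which should be enough to ride out a short DNS outage without hiding a persistent one.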

       


            People

            • Assignee: Branden Rolston (branden)
            • Reporter: ionutbalutoiu
            • Team: Cluster Ops Team
            • Watchers: Branden Rolston, ionutbalutoiu, Miguel Bernadin (Inactive)
