  1. DC/OS
  2. DCOS_OSS-3683

Agent node does not re-join the cluster after reboot for a standard CF template based cluster

    Details

    • Component Version:
    • Sprint:
      CLI Team Sprint 24
    • Story Points:
      2

      Description

      Just to elaborate a bit.

      The CF template specifies two EBS drives for the masters: 

      "MasterLaunchConfig" : {
      [...]
      "BlockDeviceMappings" : [ { "DeviceName" : "/dev/xvda", "Ebs" : { "VolumeSize": 150, "VolumeType": "gp2", "DeleteOnTermination": true} },
      { "DeviceName" : "/dev/xvdb", "Ebs" : { "VolumeSize": 80, "VolumeType": "gp2", "DeleteOnTermination": true} }],
      
      

      But only one for the Agent nodes:

      "SlaveLaunchConfig" : {
      [...]
      "BlockDeviceMappings" : [ { "DeviceName" : "/dev/xvda", "Ebs" : { "VolumeSize": 150, "VolumeType": "gp2", "DeleteOnTermination": true} } ],
      
      

       

      But the cloud-config.yaml for the CoreOS nodes expects there to be a /dev/xvdb:

      https://github.com/dcos/dcos/blob/99547a5a2311e706769f5382b8f761c8355d066b/gen/coreos-aws/cloud-config.yaml
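
The mismatch described above can be checked mechanically. The following is a minimal sketch, assuming the template is plain JSON and that the launch configurations use the standard `AWS::AutoScaling::LaunchConfiguration` layout, with `BlockDeviceMappings` under `Properties`. The function name and the trimmed-down sample template are hypothetical and only mirror the two snippets quoted in this issue.

```python
import json

# Device that the CoreOS cloud-config is reported to expect (from this issue).
REQUIRED_DEVICE = "/dev/xvdb"


def launch_configs_missing_device(template, device=REQUIRED_DEVICE):
    """Return the names of launch configurations that do not map `device`."""
    missing = []
    for name, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::AutoScaling::LaunchConfiguration":
            continue
        mappings = resource.get("Properties", {}).get("BlockDeviceMappings", [])
        devices = {mapping.get("DeviceName") for mapping in mappings}
        if device not in devices:
            missing.append(name)
    return sorted(missing)


# Hypothetical sample: only the parts of the template quoted in this issue.
SAMPLE_TEMPLATE = json.loads("""
{
  "Resources": {
    "MasterLaunchConfig": {
      "Type": "AWS::AutoScaling::LaunchConfiguration",
      "Properties": {
        "BlockDeviceMappings": [
          {"DeviceName": "/dev/xvda",
           "Ebs": {"VolumeSize": 150, "VolumeType": "gp2", "DeleteOnTermination": true}},
          {"DeviceName": "/dev/xvdb",
           "Ebs": {"VolumeSize": 80, "VolumeType": "gp2", "DeleteOnTermination": true}}
        ]
      }
    },
    "SlaveLaunchConfig": {
      "Type": "AWS::AutoScaling::LaunchConfiguration",
      "Properties": {
        "BlockDeviceMappings": [
          {"DeviceName": "/dev/xvda",
           "Ebs": {"VolumeSize": 150, "VolumeType": "gp2", "DeleteOnTermination": true}}
        ]
      }
    }
  }
}
""")

print(launch_configs_missing_device(SAMPLE_TEMPLATE))  # ['SlaveLaunchConfig']
```

Run against the sample, only the agent launch configuration is reported, which matches the observation that the agents lack the device the cloud-config expects.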

      In CoreOS 1632 (it is not clear whether this also affects earlier versions), the units added by the cloud-config end up in the systemd dependency chain for systemd-timesyncd:

      core@ip-10-0-2-77 ~ $ sudo systemctl list-dependencies systemd-timesyncd
      systemd-timesyncd.service
      ● ├─system.slice
      ● ├─tmp.mount
      ● ├─var-lib.mount
      ● └─time-sync.target
      

      There seems to be a race condition on the initial install of these machines: it looks like the time service is able to start before the cloud-config.yaml adds the var-lib.mount unit. The problem only shows up on a reboot, since by then the units are already on disk and systemd does not need to wait for cloud-config to lay them down. On an agent there is no /dev/xvdb, so var-lib.mount presumably cannot be satisfied at that point.

      Two possible fixes: add a second volume to the agent nodes (this would use more resources, but would match our recommendation to put /var/lib on separate spindles), or perhaps apply the extra cloud-config.yaml only to the master nodes.
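
The first option can be sketched as a small transformation of the template. This is an illustration only, under the same assumption that `BlockDeviceMappings` sits under `Properties` of the launch configuration. The function name is hypothetical, and the 80 GB gp2 volume is copied from the master mapping quoted above rather than taken from any sizing recommendation for agents.

```python
import copy


def add_second_volume(template, config_name="SlaveLaunchConfig",
                      device="/dev/xvdb", size_gb=80):
    """Return a copy of `template` whose launch configuration maps `device`.

    The input is left untouched, and nothing is added if the device is
    already mapped.
    """
    patched = copy.deepcopy(template)
    properties = patched["Resources"][config_name]["Properties"]
    mappings = properties.setdefault("BlockDeviceMappings", [])
    if not any(m.get("DeviceName") == device for m in mappings):
        mappings.append({
            "DeviceName": device,
            # Assumption: same volume settings as the master's /dev/xvdb.
            "Ebs": {"VolumeSize": size_gb, "VolumeType": "gp2",
                    "DeleteOnTermination": True},
        })
    return patched


# Hypothetical sample: only the agent mapping quoted in this issue.
AGENT_ONLY_TEMPLATE = {
    "Resources": {
        "SlaveLaunchConfig": {
            "Type": "AWS::AutoScaling::LaunchConfiguration",
            "Properties": {
                "BlockDeviceMappings": [
                    {"DeviceName": "/dev/xvda",
                     "Ebs": {"VolumeSize": 150, "VolumeType": "gp2",
                             "DeleteOnTermination": True}},
                ],
            },
        },
    },
}

patched = add_second_volume(AGENT_ONLY_TEMPLATE)
agent_props = patched["Resources"]["SlaveLaunchConfig"]["Properties"]
for mapping in agent_props["BlockDeviceMappings"]:
    print(mapping["DeviceName"], mapping["Ebs"]["VolumeSize"])
```

The second option (applying the extra cloud-config only to the masters) would avoid the additional volume, but it depends on how the cloud-config is generated per role, which is not shown in this issue.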

            People

            • Assignee:
              agrillet Armand Grillet
              Reporter:
              agrillet Armand Grillet
              Team:
              CLI Team
              Watchers:
              Armand Grillet