Marathon / MARATHON-4802

Add migration to fix invalid constraints

    Details

    • Sprint:
      Marathon Sprint 1.10-10, Marathon Sprint 1.11-1
    • Story Points:
      2

      Description

      Constraint validation was tightened in 1.3.0 to reject a CLUSTER constraint that has no value. We need to add a migration that automatically removes such constraints from existing app definitions. Investigate whether there are any downsides to simply deleting the invalid value-less CLUSTER constraint. Last time I looked at this, I recall concluding that the constraint did nothing in 1.1.x when CLUSTER had no value.
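      A minimal sketch of what such a migration could do, written in Python for illustration only (the real migration would live in Marathon's Scala code base). It assumes the JSON shape of constraints used by the REST API, that is a list such as ["hostname", "CLUSTER", "value"], and it treats an empty string as a missing value, which is an assumption. The function name strip_valueless_cluster is hypothetical.

```python
from copy import deepcopy


def strip_valueless_cluster(app):
    """Return a copy of `app` without CLUSTER constraints that lack a value.

    Assumption: constraints use the REST API shape [field, operator, value?].
    Assumption: an empty-string value counts as "no value".
    """
    migrated = deepcopy(app)  # never mutate the stored definition in place
    if "constraints" not in migrated:
        return migrated

    kept = []
    for constraint in migrated["constraints"] or []:
        operator = constraint[1] if len(constraint) > 1 else None
        has_value = len(constraint) > 2 and constraint[2] != ""
        if operator == "CLUSTER" and not has_value:
            continue  # drop the invalid, value-less CLUSTER constraint
        kept.append(constraint)

    migrated["constraints"] = kept
    return migrated


# Hypothetical app definition used only to exercise the sketch.
app = {"id": "/web", "constraints": [["hostname", "CLUSTER"], ["rack_id", "UNIQUE"]]}
print(strip_valueless_cluster(app)["constraints"])  # [['rack_id', 'UNIQUE']]
```

      Other constraints are left untouched, so only the value-less CLUSTER entries change between the stored and the migrated definition.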

      We (@amitbose327 and @miroswan) were running Marathon 0.15.3 and wanted to upgrade to Marathon 1.3.6 (currently the latest).

      After upgrading to the latest Marathon version (1.3.6), I tried a new deployment and got an HTTP 422 status code. Every other operation (Scale, Kill, Delete) returned the same status. Here is the error I was getting:

      {"message":"Object is not valid","details":[
        {"path":"/apps(8)/constraints","errors":["Missing value"]},
        {"path":"/apps(18)/constraints","errors":["Missing value"]},
        {"path":"/apps(13)/constraints","errors":["Missing value"]},
        {"path":"/apps(10)/constraints","errors":["Missing value"]}
      ]}
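
      As a rough aid for reading a response like this, the sketch below pulls the app indices out of the "path" entries. The error body is copied from this ticket. What the index inside apps(...) refers to is not stated here, so treating it as a position in the validated app collection is an assumption. The function name failing_app_indices is hypothetical.

```python
import json
import re

# Error body as reported in this ticket.
ERROR_BODY = """
{"message":"Object is not valid","details":[
  {"path":"/apps(8)/constraints","errors":["Missing value"]},
  {"path":"/apps(18)/constraints","errors":["Missing value"]},
  {"path":"/apps(13)/constraints","errors":["Missing value"]},
  {"path":"/apps(10)/constraints","errors":["Missing value"]}
]}
"""

PATH_PATTERN = re.compile(r"/apps\((\d+)\)/constraints")


def failing_app_indices(body):
    """Return the sorted indices found in paths that report "Missing value"."""
    indices = []
    for detail in json.loads(body).get("details", []):
        match = PATH_PATTERN.fullmatch(detail.get("path", ""))
        if match and "Missing value" in detail.get("errors", []):
            indices.append(int(match.group(1)))
    return sorted(indices)


print(failing_app_indices(ERROR_BODY))  # [8, 10, 13, 18]
```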
      

      The request body I was sending had no constraints field in it. I started looking at the Marathon code for the 1.3.6 release and found the validation that produces this error: https://github.com/mesosphere/marathon/blob/v1.3.6/src/main/scala/mesosphere/marathon/state/AppDefinition.scala#L717-L722.

      While running tcpdump, I found a few applications that did not have values set for the CLUSTER constraint. I also confirmed this via the v2/apps Marathon endpoint. These applications were deployed earlier, when Marathon did not yet have the constraint validation code.
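
      The check against v2/apps can be scripted. The sketch below works on an already-parsed v2/apps response, which has the shape {"apps": [...]}, and lists the ids of apps that carry a CLUSTER constraint without a value. The sample data and the function name apps_with_valueless_cluster are hypothetical, and treating an empty string as a missing value is an assumption.

```python
import json


def apps_with_valueless_cluster(v2_apps_response):
    """Return ids of apps that have a CLUSTER constraint with no value.

    Assumption: constraints use the REST API shape [field, operator, value?].
    Assumption: an empty-string value counts as "no value".
    """
    affected = []
    for app in v2_apps_response.get("apps", []):
        for constraint in app.get("constraints") or []:
            is_cluster = len(constraint) > 1 and constraint[1] == "CLUSTER"
            has_value = len(constraint) > 2 and constraint[2] != ""
            if is_cluster and not has_value:
                affected.append(app["id"])
                break  # one hit is enough to flag the app
    return affected


# Hypothetical response; in practice this would be the body of GET /v2/apps.
sample = json.loads("""
{"apps": [
  {"id": "/legacy/api", "constraints": [["hostname", "CLUSTER"]]},
  {"id": "/web", "constraints": [["hostname", "CLUSTER", "node-1"]]},
  {"id": "/batch"}
]}
""")

print(apps_with_valueless_cluster(sample))  # ['/legacy/api']
```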

      After the upgrade, this validation caused issues for ALL of the applications in the cluster, not just the misconfigured ones. We tried deleting the misconfigured apps, but could not make any changes to the applications in the cluster, because the "Object is not valid" message was always returned.

      Since we understood which check in the new Marathon code was causing the problem, we decided to roll back to the last version (v1.1.5) that did NOT have this validation. This did the trick.

      To conclude: if some apps that were deployed earlier (pre Marathon 1.3.x) do not satisfy the CLUSTER validation check, then after the upgrade to 1.3.x I see the "Object is not valid" error on any operation (PUT, DELETE, etc.) against any other app, even one that has no validation problem.

              People

              • Assignee:
                ivanchernetsky Ivan Chernetsky
                Reporter:
                GitHub_amitbose327 Amit Bose (Inactive)
                Team:
                Orchestration Team
                Watchers:
                Ivan Chernetsky, Jason Gilanfarr (Inactive), Marco Monaco, Tim Harper
