1. 15 Oct, 2019 1 commit
  2. 10 Oct, 2019 1 commit
  3. 25 Sep, 2019 1 commit
  4. 20 Sep, 2019 1 commit
    • Check pushed metrics immediately and reject them if inconsistent · a112ae50
      beorn7 authored
      This is finally a solution to #202 (and as a byproduct to #94).
      
      The conventional wisdom used to be that checking pushed metrics
      during the push should not be done because it is inherently an
      O(n*m) problem: n
      is the number of metrics in the PGW, and m is the number of pushes in
      a certain time-frame. Upon each push, a check has to run, and that
      check has a cost linear with the total number of metrics. Thus, if you
      scale up something by a factor of 2, you can expect twice as many
      pushers pushing in total twice as many metrics, in twice as many push
      operations, resulting in 2*2=4 times the cost of checking
      metrics. Since we didn't want to let pushers wait for a potentially
      expensive check, we just accepted the push, as long as it was
      parseable, and deferred the error detection to scrapes.
      
      The result was that a single inconsistent push would cause an
      error when scraping the PGW. The bright side is that the problem
      will be spotted quite easily, and there is a strong incentive to fix
      it. The problematic part is that a single rogue pusher is affecting
      all systems that depend on the same PGW, possibly at inconvenient
      times.
      
      In sane use cases for the PGW, scrapes are happening more often than
      pushes. Let's say a database backup pushes its success or failure
      outcome once per day. Scrapes happen a couple of times per
      minute. While the whole O(n*m) calculation above is not wrong, the
      reality is that the scrapes will inflict the cost anyway. If the check
      is too expensive per push, it will be even more so per scrape. Also,
      pushers shouldn't be too sensitive to slightly enlarged push
      times. Finally, if a PGW is so heavily loaded with metrics that the
      check actually takes prohibitively long, we can be quite sure that
      this PGW is being abused for something it was not meant for (like
      turning Prometheus into a push-based system).
      
      The solution is conceptually quite easy: On each push, first feed the
      push into a copy of the current metrics state (which has a certain
      cost, but such a copy is created anyway whenever somebody looks at
      the web UI) and
      then simulate a scrape. In that way, we can reuse the validation power
      of prometheus/client_golang. We don't even need to re-architect the
      whole framework of queueing pushes and deletions. We only need to talk
      back via an optional channel. Pushes will now get a 200 or a 400
      rather than the "(almost) always 202" response we had before.
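
      To make this concrete, here is a minimal sketch of such a check in
      Go. It is an illustration, not the actual Pushgateway code; the
      function name `checkPush` and the map-based state are assumptions.
      The validation power comes from `prometheus.Gatherers`, whose
      `Gather` method rejects duplicate and inconsistent metrics just
      like a real scrape would:

      ```go
      package storage

      import (
          "fmt"

          "github.com/prometheus/client_golang/prometheus"
          dto "github.com/prometheus/client_model/go"
      )

      // checkPush simulates a scrape over the current metrics state
      // merged with the newly pushed metric families. It returns the
      // error a real scrape would produce. Both maps are keyed by
      // metric name.
      func checkPush(current, pushed map[string]*dto.MetricFamily) error {
          // Work on a copy so a rejected push leaves no trace behind.
          merged := make(map[string]*dto.MetricFamily, len(current)+len(pushed))
          for name, mf := range current {
              merged[name] = mf
          }
          for name, mf := range pushed {
              merged[name] = mf
          }
          families := make([]*dto.MetricFamily, 0, len(merged))
          for _, mf := range merged {
              families = append(families, mf)
          }
          // Gatherers.Gather checks for duplicate and inconsistent
          // metrics while gathering, i.e. exactly the validation a
          // scrape would run.
          gatherers := prometheus.Gatherers{
              prometheus.GathererFunc(func() ([]*dto.MetricFamily, error) {
                  return families, nil
              }),
          }
          if _, err := gatherers.Gather(); err != nil {
              return fmt.Errorf("pushed metrics are invalid or inconsistent: %v", err)
          }
          return nil
      }
      ```

      On success, the push handler can answer 200; on a non-nil error,
      400 with the error message in the body.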
      
      Easy, isn't it?
      
      Unfortunately, the easy change became _way_ more invasive than I
      anticipated. Here is an overview of the changes in this commit (also
      see the changes in README.md, which explain the usage side of things):
      
      - Previously, the PUT request was executed in two parts: First, delete
        all metrics in the group, then add the newly pushed metrics to the
        group. That's now a problem: if the second part fails, it leaves
        behind an empty group rather than the state from before the whole
        PUT request, which is what the user will certainly expect. Thus,
        the operation had to be made atomic. To accomplish that, the
        `storage.WriteRequest` now has an additional field `Replace` to
        mark replacement of the whole group if true. This field is in
        addition to the back channel mentioned above (called `Done` in
        the `WriteRequest`); see the sketch after this list.
      
      - To enable alerting on failed pushes, there is now a new metric
        `push_failure_time_seconds` with the timestamp of the last failed
        push (or 0 if there never was a failed push). Ideally, we would
        rename the existing metric `push_time_seconds` to
        `push_success_time_seconds`, but that would break all existing uses
        of that metric. Thus, I left the name as is, although
        `push_time_seconds` is only updated upon a successful push. (This
        allows alerting on `push_failure_time_seconds > push_time_seconds`
        and at the same time tracks the last successful push.) There are
        some subtle aspects to implementing this correctly, mostly around
        the rule that a successful push must leave a pre-existing failure
        timestamp intact and vice versa; this, too, is illustrated in the
        sketch after this list.
      
      - Previously, the push handler code already had a bunch of validation
        and sanitization code (blocking timestamps, sanitizing labels) and also
        added the `push_time_seconds` metric. Now that the storage layer is
        doing the consistency check for the pushed metric, it is way cleaner
        to have all the other validation and sanitization code there as
        well. (Arguably, it was already wrong to have too much logic in the
        push handler.) Also, the decision needed to properly set the
        `push_time_seconds` and `push_failure_time_seconds` metrics can now
        only be made after the consistency check, so those metrics have to
        be set in the storage layer anyway. This change of responsibility
        also changed the tests quite a bit. The push handler tests now test
        only that the right method calls happen to the storage, while the
        many tests for validation, consistency checks, adding the push
        timestamp metrics, and proper sanitization are now part of the storage
        tests. (Digression: One might argue that the cleanest solution is a
        separate validation layer in front of a dumb storage layer. Since
        the only implementation of the storage for now is the disk storage
        (and a mock storage for testing), I decided to not make that split
        prematurely.)
      
      - Logging of rejected pushes now happens at error level. Strictly
        speaking, a rejected push is just a user error and not an error
        of the PGW itself. But experience tells us that users check the
        logs first, and they usually don't run on debug or even info
        level. I hope this will avoid a whole lot of user support
        questions. Groups where the last push has failed are also marked
        in the web UI.
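
      As promised above, a rough sketch of the resulting shapes. Field
      and helper names here are illustrative and may differ from the
      actual implementation:

      ```go
      package storage

      import (
          "time"

          dto "github.com/prometheus/client_model/go"
      )

      // WriteRequest as described above: Replace marks an atomic
      // replacement of the whole group (the PUT case), and Done is the
      // optional back channel reporting the outcome of the consistency
      // check.
      type WriteRequest struct {
          Labels         map[string]string
          Timestamp      time.Time
          MetricFamilies map[string]*dto.MetricFamily
          Replace        bool       // if true, atomically replace the whole group
          Done           chan error // nil means the pusher doesn't wait for the result
      }

      // applyPushTimestamps illustrates the subtle rule above: a
      // successful push updates push_time_seconds but leaves a
      // pre-existing push_failure_time_seconds intact, and vice versa.
      // The map of float64 values stands in for the real timestamp
      // metrics of a push group.
      func applyPushTimestamps(group map[string]float64, t time.Time, failed bool) {
          ts := float64(t.UnixNano()) / 1e9
          if failed {
              group["push_failure_time_seconds"] = ts
              if _, ok := group["push_time_seconds"]; !ok {
                  group["push_time_seconds"] = 0 // never pushed successfully
              }
          } else {
              group["push_time_seconds"] = ts
              if _, ok := group["push_failure_time_seconds"]; !ok {
                  group["push_failure_time_seconds"] = 0 // never failed
              }
          }
      }
      ```

      With this bookkeeping, the alert expression
      `push_failure_time_seconds > push_time_seconds` fires exactly while
      the most recent push of a group has failed.
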
      Signed-off-by: beorn7 <beorn@grafana.com>
  5. 09 Sep, 2019 1 commit
    • Update dependencies · d1e8f2b8
      beorn7 authored
      Also, add a go version line. Tools tend to add this line
      automatically, but they insert the Go version currently in use,
      while the Pushgateway should build just fine with older Go
      versions. (Specifying anything earlier than Go1.11 doesn't make
      sense, though, as Go modules weren't supported before that.)
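
      For illustration, the relevant part of go.mod could then look like
      this (the module path is the real one; the exact version in the go
      directive is an example of a deliberately low minimum rather than
      the toolchain's current version):

      ```
      module github.com/prometheus/pushgateway

      go 1.11
      ```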
      Signed-off-by: beorn7 <beorn@grafana.com>
  6. 01 Aug, 2019 1 commit
  7. 28 Jun, 2019 1 commit
  8. 17 Jun, 2019 1 commit
  9. 09 May, 2019 1 commit
  10. 13 Apr, 2019 1 commit
  11. 07 Dec, 2018 1 commit
  12. 16 Nov, 2018 1 commit