1. 13 Aug, 2018 1 commit
  2. 10 Aug, 2018 1 commit
  3. 04 Aug, 2018 1 commit
  4. 01 Aug, 2018 1 commit
  5. 30 Jul, 2018 1 commit
  6. 28 Jul, 2018 1 commit
  7. 25 Jul, 2018 1 commit
  8. 18 Jul, 2018 2 commits
  9. 15 Jul, 2018 1 commit
  10. 07 Jul, 2018 1 commit
  11. 05 Jul, 2018 1 commit
  12. 06 Jun, 2018 2 commits
  13. 09 May, 2018 1 commit
  14. 08 May, 2018 1 commit
  15. 05 May, 2018 1 commit
  16. 30 Apr, 2018 1 commit
  17. 26 Apr, 2018 1 commit
  18. 25 Apr, 2018 1 commit
  19. 05 Apr, 2018 2 commits
  20. 31 Mar, 2018 2 commits
  21. 23 Mar, 2018 1 commit
    • consul: improve consul service discovery (#3814) · 60dafd42
      Corentin Chary authored
      * consul: improve consul service discovery
      
      Related to #3711
      
      - Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
        allows filtering by node-meta and returns a `map[string]string` of `service` -> `tags`).
        Tags and node-meta are also used in `/catalog/service` requests.
      - Do not require a call to the catalog if services are specified by name. This is important
        because on large clusters `/catalog/services` changes all the time.
      - Add an `allow_stale` configuration option to do stale reads. Non-stale
        reads can be costly, even more so when you are doing them against a remote
        datacenter with 10k+ targets over a WAN (which is common for federation).
      - Add `refresh_interval` to minimize the strain on the catalog and on the
        service endpoint. This is needed because of this kind of behavior from
        Consul: https://github.com/hashicorp/consul/issues/3712 and because the catalog
        of a large cluster changes essentially all the time. There is no need to discover
        targets within 1s if we only scrape them every minute. (A configuration sketch
        combining these options follows this list.)
      - Added plenty of unit tests.
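      
      A minimal sketch of a scrape config combining the new options. The server address, tag,
      and node-meta values are illustrative, and `node_meta` is assumed to be the configuration
      key for the node-meta filter:
      
      ```yaml
      scrape_configs:
        - job_name: "consul-filtered"
          consul_sd_configs:
            - server: "localhost:8500"    # illustrative address
              tag: "prometheus-scrape"    # keep only services carrying this tag
              node_meta:                  # keep only services on nodes with this metadata
                rack: "r1"
              allow_stale: true           # stale reads to spare the Consul leader
              refresh_interval: 30s       # at most one catalog refresh per 30s
      ```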
      
      Benchmarks
      ----------
      
      ```yaml
      scrape_configs:
      
      - job_name: prometheus
        scrape_interval: 60s
        static_configs:
          - targets: ["127.0.0.1:9090"]
      
      - job_name: "observability-by-tag"
        scrape_interval: "60s"
        metrics_path: "/metrics"
        consul_sd_configs:
          - server: consul.service.par.consul.prod.crto.in:8500
            tag: marathon-user-observability  # Used in After
            refresh_interval: 30s             # Used in After+delay
        relabel_configs:
          - source_labels: [__meta_consul_tags]
            regex: ^(.*,)?marathon-user-observability(,.*)?$
            action: keep
      
      - job_name: "observability-by-name"
        scrape_interval: "60s"
        metrics_path: "/metrics"
        consul_sd_configs:
          - server: consul.service.par.consul.prod.crto.in:8500
            services:
              - observability-cerebro
              - observability-portal-web
      
      - job_name: "fake-fake-fake"
        scrape_interval: "15s"
        metrics_path: "/metrics"
        consul_sd_configs:
          - server: consul.service.par.consul.prod.crto.in:8500
            services:
              - fake-fake-fake
      ```
      
      Note: tested with ~1200 services, ~5000 nodes.
      
      | Resource | Empty | Before | After | After + delay |
      | -------- |:-----:|:------:|:-----:|:-------------:|
      |`/service-discovery` size|5K|85MiB|27k|27k|
      |`go_memstats_heap_objects`|100k|1M|120k|110k|
      |`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
      |`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
      |`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
      |`process_open_fds`|16|*1236*|22|22|
      |`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
      |`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
      |`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
      |Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|
      
      Filtering by tag using `relabel_configs` uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. It also sends an additional *1Mbps* of traffic to Consul.
      Being a little bit smarter about this reduces the overhead quite a lot.
      Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.
      
      * consul: tweak `refresh_interval` behavior
      
      `refresh_interval` now does what is advertised in the documentation:
      there won't be more than one update per `refresh_interval`. It now
      defaults to 30s (which was also the current waitTime in the Consul query).
      
      This also makes sure we don't wait another 30s if we already waited 29s
      in the blocking call, by subtracting the number of elapsed seconds.
      
      Hopefully this will do what people expect it to do and will be safer
      for existing Consul infrastructures.
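      
      A minimal sketch of the relevant configuration (the server address is illustrative; the
      comments restate the behavior described above):
      
      ```yaml
      consul_sd_configs:
        - server: "localhost:8500"   # illustrative address
          # At most one catalog update per refresh_interval; defaults to 30s,
          # matching the waitTime already used for Consul blocking queries.
          # Time spent in the blocking call counts against the interval, so a
          # 29s blocking wait is followed by roughly 1s of extra sleep, not 30s.
          refresh_interval: 30s
      ```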
  22. 09 Mar, 2018 1 commit
  23. 08 Mar, 2018 1 commit
  24. 24 Feb, 2018 1 commit
  25. 21 Feb, 2018 3 commits
    • Conor Broderick authored · 99006d3b
    • Conor Broderick authored · 1fd20fc9
    • api: Added v1/status/flags endpoint. (#3864) · 93a63ac5
      Bartek Plotka authored
      Endpoint URL: /api/v1/status/flags
      Example Output:
      ```json
      {
        "status": "success",
        "data": {
          "alertmanager.notification-queue-capacity": "10000",
          "alertmanager.timeout": "10s",
          "completion-bash": "false",
          "completion-script-bash": "false",
          "completion-script-zsh": "false",
          "config.file": "my_cool_prometheus.yaml",
          "help": "false",
          "help-long": "false",
          "help-man": "false",
          "log.level": "info",
          "query.lookback-delta": "5m",
          "query.max-concurrency": "20",
          "query.timeout": "2m",
          "storage.tsdb.max-block-duration": "36h",
          "storage.tsdb.min-block-duration": "2h",
          "storage.tsdb.no-lockfile": "false",
          "storage.tsdb.path": "data/",
          "storage.tsdb.retention": "15d",
          "version": "false",
          "web.console.libraries": "console_libraries",
          "web.console.templates": "consoles",
          "web.enable-admin-api": "false",
          "web.enable-lifecycle": "false",
          "web.external-url": "",
          "web.listen-address": "0.0.0.0:9090",
          "web.max-connections": "512",
          "web.read-timeout": "5m",
          "web.route-prefix": "/",
          "web.user-assets": ""
        }
      }
      ```
      Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
  26. 19 Feb, 2018 1 commit
    • Add OS type meta label to Azure SD (#3863) · 575f6659
      Pedro Araújo authored
      There is currently no way to differentiate Windows instances from Linux
      ones. This is needed when you have a mix of node_exporters /
      wmi_exporters for OS-level metrics and you want to have them in separate
      scrape jobs.
      
      This change allows you to do just that. Example:
      
      ```yaml
        - job_name: 'node'
          azure_sd_configs:
            - <azure_sd_config>
          relabel_configs:
            - source_labels: [__meta_azure_machine_os_type]
              regex: Linux
              action: keep
      ```
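      
      For the Windows half of such a mixed setup, a companion job that keeps only Windows
      instances could look like this (the job name is illustrative, and it is assumed the
      label value for Windows machines is `Windows`):
      
      ```yaml
        - job_name: 'windows'
          azure_sd_configs:
            - <azure_sd_config>
          relabel_configs:
            - source_labels: [__meta_azure_machine_os_type]
              regex: Windows
              action: keep
      ```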
      
      The way the vendored Azure SDK exposes the OsType is a bit
      awkward: as far as I can tell, this information can only be obtained from
      the startup disk. Newer versions of the SDK appear to improve this a
      bit (by including OS information in the InstanceView), but the current way
      still works.
  27. 12 Feb, 2018 1 commit
  28. 11 Feb, 2018 1 commit
  29. 26 Jan, 2018 1 commit
  30. 24 Jan, 2018 1 commit
  31. 08 Jan, 2018 1 commit
  32. 26 Dec, 2017 1 commit
  33. 23 Dec, 2017 1 commit
  34. 19 Dec, 2017 1 commit
    • Updated alert templating docs (#3596) · c3f92387
      James Turnbull authored
      The docs suggest that alert templating only works in the summary and
      description annotation fields. Some testing and a review of the code
      suggest this is no longer true and that you can template any
      annotation field.
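      
      A minimal rule sketch (the alert name, expression, and annotation keys are illustrative)
      showing a template in an annotation other than `summary` or `description`:
      
      ```yaml
      groups:
        - name: example
          rules:
            - alert: HighErrorRate
              expr: rate(http_requests_total{status="500"}[5m]) > 0.05
              for: 10m
              annotations:
                summary: "High 500 rate on {{ $labels.instance }}"
                runbook: "https://example.com/runbooks/{{ $labels.job }}"  # templated custom annotation
                current_value: "{{ $value }}"
      ```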