I picked Loki for monitoring mainly because my 動物機 (home-lab box) hosts a web service, and the logs show signs of attacks every day. I use Loki's LogQL for the queries. Although there is already a layer that only serves content on the right domain, plus Authelia two-factor authentication added a while back, and even earlier iptables rules that only allowed Taiwanese IPs, now that TLS is involved I can no longer block by IP, and I am not sure whether Docker can be combined with fail2ban. I also recently noticed there is a fail2ban plugin for Traefik, which is worth looking into when I have time.
The attacks over the last couple of days have been pretty substantial.
To look up requests from unfriendly IPs, here are the LogQL queries I wrote earlier, parked here for reference. I meant to document them properly before, but I keep forgetting the syntax after a while; luckily, tweaking the example queries gets them working again.
{container_name="rpi-traefik_traefik_1",host="PI202"} | json | ClientHost!~"192.168.+|排除IP" RequestPath != "/"
{container_name="rpi-traefik_traefik_1",host="PI202"} | json | ClientHost!~"192.168.+|排除IP" RequestPath != "/" DownstreamStatus != "404" | line_format "{{.ClientHost}}"
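To get a count per client IP instead of raw log lines, the same selector can be wrapped in a metric query. A rough sketch; the label names assume the same Traefik JSON access log as above:

sum by (ClientHost) (
  count_over_time(
    {container_name="rpi-traefik_traefik_1",host="PI202"}
      | json
      | ClientHost !~ "192.168.+|排除IP"
      | RequestPath != "/"
    [24h]
  )
)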
The main focus of this note, though, is how the Kubernetes setup on the Open Beta host pulls log data from Loki for monitoring.
Flow
flowchart LR
Promtail ----> Loki
Promtail -- metrics exposed for scraping --> Prometheus
Prometheus --> AlertManager
This mainly follows 使用 Loki 进行日志监控和报警 (阳明的博客 | Kubernetes | Istio | Prometheus | Python | Golang | 云原生文章). Some settings have changed in the current chart versions, but after small adjustments it still works fine.
Implementation steps
Reference article
Configure loki-stack.yml (Helm values.yml)
The key part is this section:
promtail:
  enabled: true
  pipelineStages:
    - match:
        selector: '{app="nginx"}'
        stages:
          - regex:
              expression: '.*(?P<hits>GET /.*)'
          - metrics:
              nginx_hits:
                type: Counter
                description: "Total nginx requests"
                source: hits
                config:
                  action: inc
pipelineStages
See the official documentation for pipelineStages. It works like a pipeline: each stage hands its output to the next, so if an earlier stage fails to parse a line, the later stages never run and nothing gets counted. Check the log format in Grafana and adjust the stage configuration to match. Reference: Stages | Grafana Labs
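A quick way to confirm the metrics stage is actually producing the counter is to hit Promtail's own /metrics endpoint. A minimal sketch, assuming the default Promtail HTTP port (3101) used by this chart; the pod name is a placeholder:

# Find a promtail pod and forward its HTTP port
kubectl get pods | grep promtail
kubectl port-forward <promtail-pod> 3101:3101 &

# Promtail prefixes pipeline metrics with "promtail_custom_"
curl -s http://localhost:3101/metrics | grep promtail_custom_nginx_hits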
loki-stack.yml
loki:
  enabled: true
promtail:
  enabled: true
  pipelineStages:
    - docker: {}
    - match:
        selector: '{app="nginx"}'
        stages:
          - regex:
              expression: '.*(?P<hits>GET /.*)'
          - metrics:
              nginx_hits:
                type: Counter
                description: "Total nginx requests"
                source: hits
                config:
                  action: inc
fluent-bit:
  enabled: false
grafana:
  enabled: true
  sidecar:
    datasources:
      enabled: true
  image:
    tag: 8.3.5
prometheus:
  enabled: true
  server:
    persistentVolume:
      enabled: false
  alertmanager:
    persistentVolume:
      enabled: false
filebeat:
  enabled: false
  filebeatConfig:
    filebeat.yml: |
      # logging.level: debug
      filebeat.inputs:
      - type: container
        paths:
          - /var/log/containers/*.log
        processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
      output.logstash:
        hosts: ["logstash-loki:5044"]
logstash:
  enabled: false
  image: grafana/logstash-output-loki
  imageTag: 1.0.1
  filters:
    main: |-
      filter {
        if [kubernetes] {
          mutate {
            add_field => {
              "container_name" => "%{[kubernetes][container][name]}"
              "namespace" => "%{[kubernetes][namespace]}"
              "pod" => "%{[kubernetes][pod][name]}"
            }
            replace => { "host" => "%{[kubernetes][node][name]}"}
          }
        }
        mutate {
          remove_field => ["tags"]
        }
      }
  outputs:
    main: |-
      output {
        loki {
          url => "http://loki:3100/loki/api/v1/push"
          #username => "test"
          #password => "test"
        }
        # stdout { codec => rubydebug }
      }
helm upgrade --install loki grafana/loki-stack -f loki-stack.yml
Next, query Prometheus from Grafana and check whether the metric is being scraped; if data shows up, it worked. I originally thought I might need to add a time window (for example, only look at the last 5 minutes), but it turned out that is not necessary, since the notification fires and resolves before any window becomes an issue. I may check the docs later for what the default look-back period is.
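To double-check outside Grafana, the Prometheus HTTP API can also be queried directly. A minimal sketch; the Service name loki-prometheus-server assumes the release is named loki, so adjust it to whatever kubectl get svc shows:

# Forward the Prometheus server Service locally (port 80 is the chart's default service port)
kubectl port-forward svc/loki-prometheus-server 9090:80 &

# Instant query for the custom counter; a non-empty "result" array means Prometheus is scraping it
curl -s 'http://localhost:9090/api/v1/query?query=promtail_custom_nginx_hits'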
Deploying alertmanager-discord
Create the Deployment and Service, set the Discord webhook, and it is ready to use.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager-discord-deployment
  labels:
    app: alertmanager-discord
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager-discord
  template:
    metadata:
      labels:
        app: alertmanager-discord
    spec:
      containers:
        - name: alertmanager-discord
          image: benjojo/alertmanager-discord
          ports:
            - containerPort: 9094
          env:
            - name: DISCORD_WEBHOOK
              value: https://discord.com/api/webhooks/***
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-discord
spec:
  selector:
    app: alertmanager-discord
  ports:
    - protocol: TCP
      port: 9094
You can run kubectl get svc to find the Service IP and then do a quick curl IP:9094 to see whether you get an error message back. If nothing responds, check the labels for typos and confirm that the Endpoints object actually has a pod IP; a few commands for this are shown below.
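A minimal sketch, assuming the Service and labels from the manifests above:

kubectl get svc alertmanager-discord          # note the ClusterIP and port 9094
kubectl get endpoints alertmanager-discord    # an empty ENDPOINTS column means the selector does not match the pod labels
curl -s http://<cluster-ip>:9094              # any HTTP response means the pod is reachable; <cluster-ip> is the value from the first command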
Configuring Alertmanager notifications
Reference: helm-charts/values.yaml at main · prometheus-community/helm-charts
I noticed Kubernetes CRDs can also be used here, but I am not familiar with them yet, so I will get to them later. It feels like with CRDs you would not need these config files at all; I used CRDs before with Traefik.
See prometheus-operator/alerting.md at main · prometheus-operator/prometheus-operator; I will give it a simple try later.
Writing the config against an existing example is actually pretty easy; I could basically copy the one I already had on my 動物機.
動物機 config
global:
  # SMTP server
  smtp_smarthost: 'smtp.qq.com:465'
  # sender address
  smtp_from: 'your-email@foxmail.com'
  # SMTP username (your email address)
  smtp_auth_username: 'your-email@foxmail.com'
  # SMTP password
  smtp_auth_password: 'your-password'
  # require TLS
  smtp_require_tls: true
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 60m
  receiver: discord_webhook
receivers:
  - name: 'discord_webhook'
    webhook_configs:
      - url: 'http://192.168.1.203:9094'
The prometheus section of loki-stack.yml
prometheus:
  enabled: true
  server:
    persistentVolume:
      enabled: false
  alertmanager:
    config:
      receivers:
        - name: 'discord_webhook'
          webhook_configs:
            - url: 'http://alertmanager-discord:9094'
    persistentVolume:
      enabled: false
Configure Prometheus to alert on the custom metric
The official values recommend configuring rules via alerting_rules.yml rather than the older alerts key.
Reference: helm-charts/values.yaml at main · prometheus-community/helm-charts
There is probably a corresponding CRD for this as well, but since I have not learned that yet, I am not using that approach for now.
prometheus:
  enabled: true
  server:
    persistentVolume:
      enabled: false
  serverFiles:
    alerting_rules.yml:
      groups:
        - name: example
          rules:
            - alert: nginx_hits
              expr: promtail_custom_nginx_hits
              for: 30s
              annotations:
                summary: "Instance {{ $labels.instance }} nginx hits"
                description: "{{ $labels.instance }} of job {{ $labels.job }} has been nginx hits for more than 30 seconds."
  alertmanagerFiles:
    alertmanager.yml:
      receivers:
        - name: 'default-receiver'
          webhook_configs:
            - url: 'http://alertmanager-discord:9094'
  alertmanager:
    persistentVolume:
      enabled: false
With this we get Loki to fire an alert straight off nginx requests. It is a simple, admittedly useless example, but the regex can be rewritten to catch the errors you actually care about, and it is not hard to use.
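Before redeploying, a quick syntax check on the rules can save a round trip. A rough sketch; promtool ships with Prometheus, and rules.yml is assumed to be a local file containing only the groups: block from alerting_rules.yml above:

# Render the chart locally to inspect the generated ConfigMaps without touching the cluster
helm template loki grafana/loki-stack -f loki-stack.yml | less

# Validate the rule syntax with promtool
promtool check rules rules.yml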
DEBUG
Some debugging methods collected from around the web:
- alertmanager - HackMD
You can POST a test alert straight to Alertmanager and watch whether Discord receives the notification. If it does not, keep digging... In my case it turned out I had typo'd the Service name inside alertmanagerFiles, which is why nothing arrived. 😅
alerts1='[
  {
    "labels": {
      "alertname": "test name",
      "severity": "Warning"
    },
    "annotations": {
      "Threshold": "<= 2",
      "dashboard": "test",
      "description": "test",
      "infoURL": "test",
      "summary": "this is a test mail",
      "value": "2"
    }
  }
]'
curl -XPOST -d "$alerts1" http://<alertmanager url>/api/v1/alerts
receivers:
  - name: 'null'            # do nothing
  - name: 'email'
    email_configs:          # handle via the email config
      - to: 'xxx@yowko.com'
  - name: 'webhook'
    webhook_configs:        # handle via the webhook config
      - url: 'http://blog.yowko.com'
        http_config:
          proxy_url: 'http://192.168.80.3:8866' # proxy is only for personal debugging
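Another handy check is asking Alertmanager directly what configuration it loaded and which alerts it currently knows about. A minimal sketch; the Service name loki-prometheus-alertmanager and port 80 assume this chart's defaults with release name loki, so adjust to whatever kubectl get svc shows:

# Forward the Alertmanager Service locally
kubectl port-forward svc/loki-prometheus-alertmanager 9093:80 &

curl -s http://localhost:9093/api/v2/status   # the configuration Alertmanager actually loaded
curl -s http://localhost:9093/api/v2/alerts   # alerts currently known to Alertmanager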
Final loki-stack.yml (Helm values.yml)
Pasting the final working version here:
loki:
  enabled: true
promtail:
  enabled: true
  pipelineStages:
    - match:
        selector: '{app="nginx"}'
        stages:
          - regex:
              expression: '.*(?P<hits>GET /.*)'
          - metrics:
              nginx_hits:
                type: Counter
                description: "Total nginx requests"
                source: hits
                config:
                  action: inc
fluent-bit:
  enabled: false
grafana:
  grafana.ini:
    server:
      root_url: "http://localhost:3000/grafana"
  enabled: true
  sidecar:
    datasources:
      enabled: true
  image:
    tag: 8.3.5
prometheus:
  enabled: true
  server:
    persistentVolume:
      enabled: false
  serverFiles:
    alerting_rules.yml:
      groups:
        - name: example
          rules:
            - alert: nginx_hits
              expr: promtail_custom_nginx_hits
              for: 30s
              annotations:
                summary: "Instance {{ $labels.instance }} nginx hits"
                description: "{{ $labels.instance }} of job {{ $labels.job }} has been nginx hits for more than 30 seconds."
  alertmanagerFiles:
    alertmanager.yml:
      receivers:
        - name: 'default-receiver'
          webhook_configs:
            - url: 'http://alertmanager-discord:9094'
  alertmanager:
    persistentVolume:
      enabled: false
filebeat:
  enabled: false
  filebeatConfig:
    filebeat.yml: |
      # logging.level: debug
      filebeat.inputs:
      - type: container
        paths:
          - /var/log/containers/*.log
        processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
      output.logstash:
        hosts: ["logstash-loki:5044"]
logstash:
  enabled: false
  image: grafana/logstash-output-loki
  imageTag: 1.0.1
  filters:
    main: |-
      filter {
        if [kubernetes] {
          mutate {
            add_field => {
              "container_name" => "%{[kubernetes][container][name]}"
              "namespace" => "%{[kubernetes][namespace]}"
              "pod" => "%{[kubernetes][pod][name]}"
            }
            replace => { "host" => "%{[kubernetes][node][name]}"}
          }
        }
        mutate {
          remove_field => ["tags"]
        }
      }
  outputs:
    main: |-
      output {
        loki {
          url => "http://loki:3100/loki/api/v1/push"
          #username => "test"
          #password => "test"
        }
        # stdout { codec => rubydebug }
      }
helm upgrade --install loki grafana/loki-stack -f loki-stack.yml
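After the upgrade it is worth confirming everything came up and the values were actually applied. A quick sketch; the release label assumes the standard labels these charts add to their pods:

helm get values loki                 # the values Helm applied to the release
kubectl get pods -l release=loki     # loki, promtail, grafana and prometheus pods should all be Running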