INDEX
Explanations
terms and phrases associated with danger and risk
Negative Logits
ãģ¡ãĤĩ    -0.16
лÑı    -0.15
Äįan    -0.15
ebek    -0.14
ijn    -0.14
IFI    -0.14
better    -0.14
éĢł    -0.14
å±    -0.14
pdb    -0.14
Positive Logits
-danger    0.23
ously    0.22
dangerous    0.20
éļª    0.19
stell    0.18
ous    0.17
danger    0.17
dangers    0.17
unsafe    0.17
Danger    0.16
Activations Density 0.027%