Explanations
phrases that indicate potential risks or threats to health and the environment
Negative Logits
Token    Logit
é³´      -0.16
ovsky    -0.16
rames    -0.15
lej      -0.15
hi       -0.15
uir      -0.14
UILT     -0.14
inho     -0.14
nee      -0.14
éł       -0.14
Positive Logits
Token      Logit
threat     0.25
idon       0.24
pose       0.21
threat     0.18
risks      0.17
threats    0.17
danger     0.17
Threat     0.17
questions  0.17
Danger     0.17
Activations Density: 0.018%