Explanations
discussions of threats and their implications for safety
Negative Logits
Token      Logit
ovsky      -0.16
égor       -0.15
argas      -0.15
565        -0.14
Victims    -0.14
532        -0.14
alez       -0.14
artin      -0.14
otos       -0.14
tmpl       -0.14
Positive Logits
Token          Logit
threat         0.59
threats        0.52
threat         0.50
Threat         0.49
å¨ģ            0.49
-threat        0.49
Th             0.48
danger         0.42
threatening    0.40
TH             0.40
Activations Density: 0.157%