INDEX
Explanations
phrases related to policy and government actions
phrases related to safety and well-being
Negative Logits
confir     -0.70
ende       -0.70
misunder   -0.65
dismant    -0.62
destro     -0.62
Learns     -0.59
OSP        -0.59
ather      -0.59
Rampage    -0.58
reluct     -0.58
Positive Logits
ibel       0.63
into       0.62
innocent   0.61
oths       0.58
Inn        0.58
evil       0.56
rum        0.56
Narr       0.56
rily       0.55
metadata   0.55
Activations Density 0.767%