Explanations
mentions of severe consequences or impactful actions
Negative Logits
ice      -0.15
ège      -0.15
fo       -0.14
-in      -0.14
Pap      -0.13
Fund     -0.13
icious   -0.13
fre      -0.13
archy    -0.13
å°       -0.13
Positive Logits
PCP      0.18
ôt       0.17
ãĤħ      0.16
/WebAPI  0.16
sil      0.15
ê·¹      0.15
Morales  0.15
wert     0.15
rog      0.15
ques     0.14
Activations Density: 0.022%