INDEX
Explanations
references to the concept of "devil" or something devilish
references to the concept of the devil
Negative Logits
uries    -0.79
Ô        -0.76
yles     -0.74
dL       -0.73
ij士     -0.73
atern    -0.72
skirts   -0.70
POR      -0.70
Ģ        -0.69
µ        -0.68
Positive Logits
ishly    1.20
incarn   0.92
esses    0.88
worsh    0.82
ESS      0.79
devil    0.79
horns    0.78
ish      0.77
gou      0.77
ibur     0.75
Activations Density 0.009%