Explanations
references to religious figures and titles
Negative Logits
Token      Logit
Atra       -0.67
Egli       -0.67
eraus      -0.65
rateful    -0.65
Bris       -0.63
ali        -0.61
Atra       -0.61
vrons      -0.61
Dalla      -0.60
]];        -0.59
Positive Logits
Token      Logit
Lord       2.16
LORD       1.97
lord       1.96
Lord       1.93
Lords      1.78
LORD       1.69
lords      1.63
lord       1.55
Seigneur   1.28
lords      1.19
Activations Density: 0.024%
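The density figure is usually read as the fraction of sampled tokens on which the feature fires at all. The sketch below illustrates that reading only; the function name `activation_density`, the zero threshold, and the toy data are assumptions made for illustration and are not taken from the dashboard that produced the number above.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a non-zero
    (above-threshold) activation for this feature.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical toy sample: the feature fires on 3 of 12,500 tokens.
toy = [0.0] * 12_497 + [1.3, 0.4, 2.1]
density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.024%
```

A density this low means the feature is silent on nearly every token, which fits a feature whose strongest positive logits are all variants of a single word.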