INDEX
Explanations
references to religious texts and figures
Negative Logits
?s    -0.20
�s    -0.19
ÂŃs    -0.17
Âĸ    -0.17
ÂĹ    -0.17
>NN    -0.16
$s    -0.15
ÂŃn    -0.15
ÂŃt    -0.15
�    -0.15
Positive Logits
’    0.45
'    0.44
Ê    0.41
ÑĬ    0.30
‘    0.30
`    0.29
Ь    0.28
'\    0.27
â̲    0.26
ÑĮ    0.26
Activations Density: 0.132%