INDEX
Explanations
mentions of specific geographic locations and proper nouns
Negative Logits
iros        -0.19
omor        -0.14
ictory      -0.14
rin         -0.14
oded        -0.14
ä¹IJ         -0.14
pta         -0.13
judgement   -0.13
ãĥ³ãĥIJ      -0.13
roken       -0.13
Positive Logits
uth         0.27
UTH         0.20
wich        0.19
oxetine     0.18
les         0.17
ces         0.17
mage        0.16
umb         0.15
quer        0.15
rosse       0.15
Activations Density 0.004%