INDEX
Explanations
references to surges or increases in context
Negative Logits
f         -0.18
orable    -0.18
velop     -0.17
arih      -0.17
lords     -0.17
urator    -0.16
fuse      -0.16
جام       -0.16
\<^       -0.15
ryption   -0.15
Positive Logits
ging      0.24
charges   0.22
feit      0.21
mount     0.20
rog       0.19
tí        0.19
ges       0.19
r         0.19
tir       0.18
er        0.18
Activations Density 0.004%