Explanations
percentage values in the text
Negative Logits
unan     -0.17
ambi     -0.16
acin     -0.15
onet     -0.15
onaut    -0.15
stown    -0.15
mant     -0.15
deer     -0.15
HO       -0.15
spiel    -0.14
Positive Logits
elow         0.16
ilio         0.16
ventional    0.15
Ñıж          0.15
atus         0.14
infeld       0.14
Dot          0.14
596          0.13
ourses       0.13
blow         0.13
Activations Density 0.004%