INDEX
Explanations
phrases that introduce examples or instances
Negative Logits
ija      -0.14
indeed   -0.14
para     -0.14
ico      -0.14
idi      -0.14
iglia    -0.14
aps      -0.14
AT       -0.13
edo      -0.13
息       -0.13
Positive Logits
sake      0.27
purposes  0.24
:         0.18
:↵        0.16
えば      0.16
而        0.15
orz       0.15
když      0.15
来说      0.14
forth     0.14
Activations Density 0.032%