INDEX
Explanations
phrases indicating actions of removal or displacement
Negative Logits
oyer    -0.18
inu     -0.15
734     -0.14
obre    -0.14
quo     -0.14
Rs      -0.13
avax    -0.13
bole    -0.13
اÙĪÙĬ    -0.13
opak    -0.13
Positive Logits
ropp    0.16
oth     0.15
çĬ¯     0.15
Cause   0.15
orne    0.14
rang    0.14
Duffy   0.14
verv    0.14
Sloan   0.14
üst     0.14
Activations Density: 0.123%