INDEX
Explanations
Phrases indicating contrasts or alternatives in context
Negative Logits
Token     Logit
oria      -0.18
erable    -0.16
ior       -0.15
raki      -0.15
νού       -0.15
rak       -0.14
istant    -0.14
odies     -0.14
ijing     -0.14
Cul       -0.14
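Several tokens in these lists come from a byte-level BPE vocabulary, where each raw byte is stored as a printable stand-in character. A raw token string such as `nÄĥ` only becomes readable after mapping those characters back to bytes and decoding the result as UTF-8. The sketch below assumes the model uses GPT-2's byte-to-unicode table, which is reimplemented here rather than imported from a library. `CHAR_TO_BYTE` and `decode_token` are hypothetical helper names. It also assumes that characters outside the table, such as Greek letters a dashboard has already decoded, are literal text.

```python
def bytes_to_unicode():
    """Rebuild GPT-2's table mapping each byte 0-255 to a printable character."""
    # Printable Latin-1 bytes stand for themselves.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    # Every remaining byte gets a stand-in character starting at U+0100.
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

# Inverse table: stand-in character -> original byte.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}

def decode_token(token):
    """Convert a byte-level BPE token string back into readable UTF-8 text.

    Assumption: characters outside the stand-in table were already decoded
    by the dashboard, so they are re-encoded as UTF-8 unchanged.
    """
    raw = b"".join(
        bytes([CHAR_TO_BYTE[ch]]) if ch in CHAR_TO_BYTE else ch.encode("utf-8")
        for ch in token
    )
    return raw.decode("utf-8", errors="replace")

print(decode_token("nÄĥ"))   # nă
print(decode_token("íĦ¸"))   # 털
```

Plain ASCII tokens such as `ISCO` pass through unchanged because every printable ASCII byte maps to itself.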
Positive Logits
Token     Logit
ISCO      0.15
endar     0.14
ileaks    0.14
nă        0.13
878       0.13
털        0.13
assist    0.13
spending  0.13
lom       0.13
iao       0.13
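Lists like these are commonly produced by projecting the feature's decoder direction through the model's unembedding matrix and then reading off the most negative and most positive entries. Below is a minimal sketch under that assumption. `top_logit_effects`, `w_dec` and `w_u` are hypothetical names, and the shapes given in the comments are assumed. The sketch covers only the direct path and ignores any final LayerNorm scaling, which the dashboard's own pipeline may treat differently.

```python
import numpy as np

def top_logit_effects(w_dec, w_u, vocab, k=10):
    """Rank tokens by how strongly one feature pushes their logits down or up.

    w_dec: (d_model,) decoder vector of a single SAE feature (assumed shape)
    w_u:   (d_model, vocab_size) unembedding matrix (assumed shape)
    vocab: list of vocab_size token strings
    """
    effects = w_dec @ w_u              # direct effect on every logit
    order = np.argsort(effects)        # ascending: most negative first
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return negative, positive

# Toy example: 2-dim residual stream, 4-token vocabulary.
w_dec = np.array([1.0, 0.0])
w_u = np.array([[0.5, -0.2, 0.1, -0.4],
                [9.0, 9.0, 9.0, 9.0]])
neg, pos = top_logit_effects(w_dec, w_u, ["a", "b", "c", "d"], k=2)
print(neg)  # [('d', -0.4), ('b', -0.2)]
print(pos)  # [('a', 0.5), ('c', 0.1)]
```

Because the decoder vector is zero along the second dimension, the large second row of `w_u` has no effect. This illustrates that only the direction the feature writes to matters.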
Activation density: 0.907%
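Activation density is the share of sampled tokens on which the feature fires. A value of 0.907% works out to roughly 9 activating tokens per 1,000. Below is a minimal sketch, assuming that "fires" means an activation strictly above a threshold of zero. The threshold and token sample actually used for this figure are not stated, and `activation_density` is a hypothetical name.

```python
import numpy as np

def activation_density(acts, threshold=0.0):
    """Percentage of tokens whose activation for this feature exceeds threshold.

    acts: 1-D array of one feature's activations over a sample of tokens.
    """
    acts = np.asarray(acts)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size

# Toy sample: the feature fires on 2 of 10 tokens.
print(activation_density([0, 0.3, 0, 0, 1.2, 0, 0, 0, 0, 0]))  # 20.0
```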