Explanations
comparisons expressed with the word "like."
Negative Logits
Token         Logit
antis         -0.18
GED           -0.16
っき          -0.14
mony          -0.14
timeofday     -0.14
opolitan      -0.14
istrovství    -0.14
osten         -0.14
_NV           -0.14
phóng         -0.14
Positive Logits
Token         Logit
üp            0.17
uto           0.15
/to           0.14
manner        0.14
con           0.14
bil           0.14
sg            0.14
oci           0.13
uten          0.13
Nav           0.13
Activations Density 0.038%