Explanations
words in a specific non-English language
Negative Logits
amma     -0.14
roj      -0.14
leh      -0.13
بت       -0.13
lap      -0.13
dro      -0.13
ate      -0.13
al       -0.13
.prop    -0.13
mer      -0.13
Positive Logits
ppard      0.17
hiba       0.16
isclosed   0.16
ofday      0.15
PÅĻed      0.15
follando   0.15
slog       0.15
esktop     0.15
Sloan      0.14
Vance      0.14
Activations Density: 0.111%