INDEX
Explanations
expressions of positive sentiment towards individuals or things
Negative Logits
ールド      -0.16
obic        -0.16
elsey       -0.15
privilege   -0.15
ULL         -0.15
クロ        -0.15
tera        -0.14
المل        -0.14
stav        -0.14
extr        -0.14
Positive Logits
guts        0.19
ays         0.17
ograd       0.15
ermann      0.15
_NT         0.15
astos       0.15
atos        0.15
arkin       0.14
KT          0.14
enty        0.14
Activations Density 0.110%