INDEX
Explanations
phrases indicating personal relationships and emotions
Negative Logits
室          -0.16
ELLOW       -0.15
ucht        -0.15
ataka       -0.15
qli         -0.15
issance     -0.15
raf         -0.14
manship     -0.14
way         -0.13
Dep         -0.13
Positive Logits
erge        0.15
lid         0.14
ipher       0.14
any         0.14
iph         0.14
isser       0.14
mere        0.14
ouro        0.14
passes      0.14
Ø£ÙĬ        0.14
Activations Density: 0.145%