INDEX
Explanations
personal expressions of preference or opinion
Negative Logits
견        -0.16
イド      -0.15
het       -0.14
oji       -0.14
ITE       -0.14
marvin    -0.14
造        -0.13
kvinne    -0.13
trak      -0.13
ウォ      -0.13
Positive Logits
like          0.24
typically     0.22
personally    0.21
usually       0.20
prefer        0.18
recently      0.18
likes         0.17
typically     0.17
agr           0.16
find          0.16
Activations Density: 0.072%