INDEX
Explanations
expressions of happiness and positive emotions
Negative Logits
aload  -0.15
ROP    -0.14
opi    -0.14
evin   -0.14
hti    -0.14
anik   -0.14
eg     -0.13
oss    -0.13
oping  -0.13
ain    -0.13
Positive Logits
about    0.21
to       0.18
overall  0.17
overall  0.17
kul      0.16
乎       0.15
ritel    0.15
ув       0.15
About    0.15
irty     0.15
Activations Density 0.043%