Index
Explanations
instances of emotional reactions and social observations
Negative Logits
Token     Logit
its       -0.20
here      -0.16
nt        -0.15
ade       -0.15
....      -0.15
ve        -0.14
bazen     -0.14
....↵     -0.14
2         -0.14
ge        -0.14
Positive Logits
Token     Logit
period    0.18
haha      0.17
PLUS      0.17
sans      0.17
Ãł        0.17
er        0.17
lol       0.16
LOL       0.16
ha        0.16
eh        0.16
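Logit lists like these are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix, then ranking the per-token scores from most negative to most positive. The sketch below assumes that method. It does not confirm that these particular numbers were computed this way. The function name `logit_effects` and the toy vectors are hypothetical, and the sketch uses plain Python so it runs without dependencies.

```python
def logit_effects(
    decoder_dir: list[float],
    unembed: list[list[float]],
    vocab: list[str],
    k: int = 10,
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Return the k most negative and k most positive tokens by direct logit effect.

    Assumption: `unembed` is d_model x vocab_size, so column j is token j's
    unembedding vector, and the score for token j is the dot product of the
    feature's decoder direction with that column.
    """
    d_model = len(decoder_dir)
    if len(unembed) != d_model or any(len(row) != len(vocab) for row in unembed):
        raise ValueError("unembed must be d_model x len(vocab)")

    # Per-token score: decoder_dir . unembed[:, j]
    scores = [
        sum(decoder_dir[i] * unembed[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    # Sort ascending: most negative effect first
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]
    positive = ranked[::-1][:k]  # most positive effect first
    return negative, positive


# Toy example with hypothetical numbers (d_model = 2, four tokens)
vocab = ["lol", "its", "haha", "here"]
decoder_dir = [1.0, 0.0]
unembed = [
    [0.5, -0.2, 0.4, -0.1],
    [0.0, 0.0, 0.0, 0.0],
]
neg, pos = logit_effects(decoder_dir, unembed, vocab, k=2)
print("negative:", neg)  # negative: [('its', -0.2), ('here', -0.1)]
print("positive:", pos)  # positive: [('lol', 0.5), ('haha', 0.4)]
```

In a real model the unembedding has tens of thousands of columns, so a vectorized matrix product and a partial top-k selection would replace the nested loops and full sort used here for readability.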
Activation Density: 0.224%
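Activation density is usually reported as the percentage of sampled tokens on which the feature's activation is above zero. Under that assumption, 0.224% means the feature fires on roughly 2 of every 1,000 tokens. The sketch below shows the calculation. The zero threshold and the name `activation_density` are hypothetical and do not come from this page.

```python
def activation_density(activations: list[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation is strictly above `threshold`.

    Assumption: a token counts as "active" when its feature activation > threshold.
    """
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Toy example (hypothetical data): 1 active token out of 1,000
acts = [0.0] * 999 + [2.5]
print(f"{activation_density(acts):.3f}%")  # 0.100%
```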