INDEX
Explanations
phrases that indicate requests or actions directed towards others
Negative Logits
Token        Logit
attery       -0.17
NOP          -0.16
geh          -0.16
Ãłi (ài)     -0.15
hausen       -0.15
rieve        -0.14
regor        -0.14
Birch        -0.14
askell       -0.14
λον          -0.14
Positive Logits
Token            Logit
ä¸įè¦ģ (不要)    0.17
consider         0.15
imat             0.15
çıį (珍)         0.14
stay             0.14
inated           0.14
take             0.14
703              0.14
ikut             0.14
Drop             0.14
Activations Density: 0.114%
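
A few token strings in the logit lists (`Ãłi`, `ä¸įè¦ģ`, `çıį`) look like byte-level BPE display forms rather than readable text. The sketch below assumes the model's tokenizer uses the GPT-2 style byte-to-character table, which this page does not state; the helper names `bytes_to_unicode` and `decode_token` are illustrative and not part of any dashboard API. Under that assumption, each display character maps back to one raw byte, and the byte sequence is then decoded as UTF-8.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Build the GPT-2 style table mapping each byte value to a printable character.

    Assumption: the tokenizer behind this dashboard uses this table.
    """
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # The remaining bytes are shifted to code points starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse table: display character -> raw byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(display: str) -> str:
    """Convert a byte-level display string back to readable text.

    Strings containing characters outside the table (for example, text that
    is already decoded) are returned unchanged.
    """
    try:
        raw = bytes(CHAR_TO_BYTE[ch] for ch in display)
    except KeyError:
        return display
    # A token may hold only part of a multi-byte character, so replace
    # undecodable bytes instead of raising.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["Ãłi", "ä¸įè¦ģ", "çıį", "consider", "λον"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Read this way, the top positive token `ä¸įè¦ģ` is the Chinese 不要 ("don't"), which fits the explanation's description of requests directed at others. Plain ASCII tokens such as `consider` pass through unchanged.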