INDEX
Explanations
direct address or references to an individual in conversations
Negative Logits
éd            -0.16
ocy           -0.15
              -0.14
fucked        -0.14
orgh          -0.14
ustry         -0.14
inus          -0.14
grily         -0.14
ynchronously  -0.14
ergus         -0.14
Positive Logits
sound         0.26
obviously     0.21
two           0.19
sir           0.19
said          0.19
ok            0.19
mentioned     0.19
ever          0.18
mean          0.18
poor          0.17
Activations Density: 0.117%