Explanations
phrases related to questioning someone's actions or appearance
Negative Logits
unsurprisingly  -0.78
anecd           -0.76
strikingly      -0.73
ideally         -0.73
uably           -0.70
surprisingly    -0.69
yrinth          -0.68
tantal          -0.67
markedly        -0.67
pmwiki          -0.67
Positive Logits
fuckin    1.16
.'"       1.04
gonna     1.03
fucking   1.03
!'"       1.01
'."       1.00
..."      0.99
…"        0.96
…"        0.96
-"        0.95
Activations Density 0.920%