Explanations
conversational dialogues
expressions of disbelief or surprise
Negative Logits
unquestion      -0.63
undeniably      -0.62
utterstock      -0.58
virt            -0.58
strikingly      -0.57
respective      -0.56
predictably     -0.55
unsurprisingly  -0.53
uniformly       -0.53
markedly        -0.52
Positive Logits
fuckin          0.95
gonna           0.92
haha            0.81
wanna           0.79
-"              0.77
kinda           0.76
laughs          0.74
gotta           0.74
ya              0.74
hin             0.74
Activations Density: 0.975%