INDEX
Explanations
phrases indicating surprise or disbelief
expressions of disbelief or the need for assistance
Negative Logits
rongh     -0.74
rane      -0.69
owa       -0.67
heading   -0.62
ighth     -0.62
azel      -0.60
isk       -0.58
bush      -0.57
eport     -0.57
chwitz    -0.57

Positive Logits
anymore    0.65
âĶľ        0.65
Louie      0.65
Tex        0.62
Surprise   0.60
aughs      0.60
uitous     0.60
Presents   0.59
Vaugh      0.59
sth        0.58
Activations Density: 0.083%