Explanations
phrases related to arguments or justifications
the word "that" in various contexts
Negative Logits
Token      Logit
Guard      -0.71
Plus       -0.68
Bonus      -0.67
EMBER      -0.67
gments     -0.66
guard      -0.65
Laughs     -0.62
RIP        -0.62
AND        -0.61
wn         -0.61
Positive Logits
Token      Logit
they       0.75
although   0.75
"[         0.74
cher       0.68
prevailed  0.67
eday       0.65
misunder   0.65
"...       0.65
there      0.64
it         0.63
Activations Density: 0.196%