Explanations
questions that are posed in a rhetorical or philosophical context
Negative Logits
Token   Logit
.       -0.54
        -0.52
In      -0.50
L       -0.47
D       -0.47
The     -0.46
K       -0.45
</i>    -0.44
I       -0.44
DED     -0.44
Positive Logits
Token   Logit
?       1.95
%?      1.75
?—      1.69
?}      1.69
?       1.62
?’      1.61
?&      1.60
?<      1.59
?”      1.59
?"      1.58
Activations Density 0.147%