Explanations
statements related to subjective opinions and perspectives
Negative Logits
niſſe        -1.05
müſſen       -0.98
<unused43>   -0.98
<pad>        -0.97
<unused41>   -0.97
geweſen      -0.97
<unused14>   -0.96
<unused3>    -0.96
[@BOS@]      -0.96
<unused1>    -0.96
Positive Logits
,            0.61
…            0.54
...          0.50
…            0.45
...          0.39
?            0.38
……           0.36
!            0.36
really       0.35
*            0.34
Activations Density: 0.409%