Explanations
instances and examples used in explanations or arguments
Negative Logits
Token        Logit
<unused43>   -0.96
<unused74>   -0.95
<unused41>   -0.95
<pad>        -0.95
[@BOS@]      -0.95
<unused42>   -0.94
<unused51>   -0.94
<unused28>   -0.94
<unused8>    -0.94
<unused17>   -0.94
Positive Logits
Token          Logit
,              0.84
первых         0.44
obstante       0.43
However        0.37
Therefore      0.35
However        0.35
however        0.34
Nevertheless   0.33
Moreover       0.31
Furthermore    0.31
Activations Density: 0.575%