INDEX
Explanations
comparative phrases indicating greater or lesser values
Negative Logits
)       -0.64
.       -0.64
{})     -0.63
])      -0.60
)}      -0.59
]_      -0.55
<eos>   -0.55
).      -0.55
)}}     -0.55
),      -0.54
Positive Logits
>=          1.05
$>          1.01
$>$         0.99
>$          0.98
(>          0.95
>           0.94
>/          0.94
»>          0.94
>>>>>>>>    0.93
>\          0.92
Activations Density 0.253%