INDEX
Explanations
questions or statements involving questioning about actions or situations
questions about information and understanding
Negative Logits
tails        -0.85
\\\\\\\\     -0.79
astered      -0.77
esm          -0.76
uned         -0.76
arget        -0.76
alde         -0.72
rovers       -0.72
icked        -0.69
tra          -0.69
Positive Logits
Baz          0.72
pige         0.71
calib        0.70
forgiveness  0.70
anybody      0.69
permission   0.66
possible     0.66
anyone       0.66
bothered     0.65
exactly      0.64
Activations Density: 0.085%