INDEX
Explanations
instances of questions or rhetorical questions
Negative Logits
hya     -0.16
alar    -0.14
oplay   -0.14
اگ      -0.14
ĵ¨      -0.14
oine    -0.14
ancy    -0.14
leys    -0.13
cke     -0.13
ih      -0.13
Positive Logits
well      0.21
Glad      0.21
Well      0.19
simple    0.18
answer    0.18
Well      0.17
Answer    0.17
nothing   0.17
exactly   0.17
simply    0.16
Activations Density: 0.128%