INDEX
Explanations
questions starting with "Why."
questions and phrases that express uncertainty or curiosity about reasons and explanations
Negative Logits
ILCS                -0.80
icing               -0.67
OLOGY               -0.62
ibus                -0.61
kilometres          -0.60
combe               -0.58
ylan                -0.58
achus               -0.57
\\\\\\\\\\\\\\\\    -0.57
atellite            -0.57
Positive Logits
ãĢij                 0.71
ppo                 0.67
]).                 0.66
Matters             0.66
so                  0.65
ãĤ»                  0.65
ug                  0.65
]),                 0.64
Hebdo               0.63
differently         0.63
Activation Density: 0.270%