INDEX
Explanations
phrases indicating a contrast or emphasizing a point
statements of fact or significant affirmations
Negative Logits
Token      Logit
LC         -0.71
oka        -0.69
oj         -0.66
apo        -0.66
tu         -0.63
unity      -0.62
rals       -0.59
riors      -0.58
rolling    -0.57
ogun       -0.56
Positive Logits
Token              Logit
downright          1.16
quite              0.88
worse              0.81
even               0.80
pretty             0.78
proverb            0.71
envy               0.67
opposite           0.67
counterproductive  0.67
almost             0.66
Activations Density: 0.315%