Explanations
negations and contrasts in value-based arguments
Negative Logits
avery          -0.17
453            -0.16
ibold          -0.15
435            -0.15
uben           -0.15
pau            -0.14
StateManager   -0.14
ree            -0.14
anas           -0.14
ffc            -0.14
Positive Logits
merely         0.19
只是           0.17
isol           0.15
solely         0.15
chased         0.15
simply         0.15
만             0.15
cookie         0.15
mere           0.14
juste          0.14
Activations Density: 0.151%