INDEX
Explanations
conditional statements or phrases expressing hypothetical scenarios
Negative Logits
ÏĢη      -0.17
Quarter  -0.15
ipline   -0.14
Cur      -0.14
indle    -0.14
since    -0.14
Succ     -0.13
aceae    -0.13
ider     -0.13
bo       -0.13
Positive Logits
only       0.22
only       0.20
_only      0.18
seulement  0.18
wishes     0.16
ONLY       0.16
Only       0.16
_ONLY      0.15
Only       0.15
ONLY       0.15
Activations Density: 0.066%