INDEX
Explanations
contrastive conjunctions and qualifiers that indicate complexity or exception in arguments
Negative Logits
boro      -0.18
bole      -0.18
unt       -0.18
gary      -0.15
bsp       -0.15
ushing    -0.15
unt       -0.14
hq        -0.14
im        -0.13
Vaugh     -0.13
Positive Logits
ĥn        0.16
oen       0.15
Denn      0.15
iffin     0.14
ifo       0.14
315       0.14
yleft     0.14
sWith     0.14
볬        0.13
imals     0.13
Activations Density 0.262%