Explanations
concessive phrases indicating a contrast or counterargument
Negative Logits
iginal      -0.15
pq          -0.15
sez         -0.14
aille       -0.14
uchs        -0.14
的地        -0.14
соответ     -0.13
plusplus    -0.13
ubre        -0.13
浦          -0.13

Positive Logits
ness        0.25
forth       0.20
NESS        0.20
amt         0.16
ening       0.16
theless     0.16
atten       0.15
Ù쨥ÙĨ      0.15
obel        0.15
umber       0.14
Activations Density 0.020%