Explanations
Contradictory statements or contrasting ideas
Negative Logits
Token     Logit
meta      -0.66
coat      -0.65
zero      -0.65
asks      -0.61
nat       -0.60
emn       -0.59
ILLE      -0.58
und       -0.58
unc       -0.58
ory       -0.58
Positive Logits
Token          Logit
rather         1.71
rather         1.39
instead        1.22
Rather         1.20
merely         1.08
nevertheless   1.07
nonetheless    1.01
Rather         1.00
suffice        0.97
Instead        0.96
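The logit lists above are usually read as the feature's direct effect on the model's output vocabulary. The following is a minimal sketch of one common way to produce such lists. It assumes that the feature's SAE decoder vector is projected straight through the unembedding matrix, with no final LayerNorm or rescaling. This page does not state its exact method, and `top_logit_effects`, `w_dec_row`, `w_unembed`, and `vocab` are hypothetical names.

```python
import numpy as np

def top_logit_effects(w_dec_row, w_unembed, vocab, k=10):
    """Return the k most negative and k most positive token logit effects.

    Hypothetical shapes (assumption, not from the source page):
      w_dec_row: (d_model,)            decoder direction of one SAE feature
      w_unembed: (d_model, vocab_size) unembedding matrix
      vocab:     list of vocab_size token strings
    Assumes no final LayerNorm is folded in before the projection.
    """
    logits = w_dec_row @ w_unembed          # direct effect on each token's logit
    order = np.argsort(logits)              # ascending: most negative first
    bottom = [(vocab[i], float(logits[i])) for i in order[:k]]
    top = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return bottom, top

# Toy usage with made-up weights.
w_dec_row = np.array([1.0, 0.0])
w_unembed = np.array([[1.71, -0.66, 0.5, 0.0],
                      [0.0, 0.0, 0.0, 0.0]])
bottom, top = top_logit_effects(w_dec_row, w_unembed, ["rather", "meta", "x", "y"], k=1)
print(bottom, top)
```

Sorting once and slicing both ends of the order yields the negative and positive lists from a single pass over the vocabulary.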
Activation Density: 0.085%
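One way to read the density figure is as the share of tokens on which the feature fires. The following is a minimal sketch under the assumption that "fires" means an activation strictly above a threshold (hypothetically 0.0). `activation_density` and `threshold` are illustrative names, not taken from the source page.

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose activation exceeds `threshold`.

    Assumption: density = count(activation > threshold) / total tokens.
    """
    acts = np.asarray(activations, dtype=float)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size

# Toy usage: 85 active tokens out of 100,000.
acts = np.zeros(100_000)
acts[:85] = 1.0
print(activation_density(acts))
```

Under this reading, a density of 0.085% means the feature is active on roughly 85 of every 100,000 tokens.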