Explanations
words indicating a contradiction or alternative viewpoint
contrasting phrases that introduce shifts in arguments
Negative Logits
meta      -0.66
zero      -0.62
ILLE      -0.61
pert      -0.60
emn       -0.59
imgur     -0.57
ogg       -0.57
SI        -0.57
coat      -0.56
ruit      -0.56
Positive Logits
rather        1.90
rather        1.50
instead       1.39
Rather        1.31
merely        1.19
Rather        1.12
instead       1.07
Instead       1.06
nevertheless  1.04
Instead       1.02
Activations Density 0.077%