INDEX
Explanations
contrasting relationships or distinctions between concepts
instances of the word "but" indicating contrastive statements
Negative Logits
Token       Logit
uters       -0.79
uter        -0.77
unction     -0.76
minent      -0.74
roy         -0.74
velt        -0.72
alty        -0.72
enter       -0.71
uther       -0.71
ct          -0.70
Positive Logits
Token         Logit
nor           0.99
suffice       0.92
nevertheless  0.85
alas          0.80
rather        0.78
merely        0.78
luckily       0.77
hey           0.76
fortunately   0.75
chery         0.75
Activations Density: 0.105%
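
As a rough illustration of what an activation density figure like this usually measures, the sketch below computes the fraction of token positions on which a feature fires. It assumes that "density" means the share of tokens with an activation above zero. The page does not state its exact definition, and the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: a feature counts as "active" on a token when its activation
    is strictly greater than `threshold` (zero by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 2000 token positions, of which 2 activate the feature.
sample = [0.0] * 1998 + [3.2, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.100%
```

Under this reading, a density of 0.105% would mean the feature is active on roughly one token in a thousand, which fits a sparse feature tied to contrastive words.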