INDEX
Explanations
conjunctions connecting different elements or concepts
phrases emphasizing contrast or exceptions
Negative Logits
coat     -0.76
orc      -0.75
irm      -0.73
ct       -0.71
wake     -0.71
velop    -0.70
oresc    -0.67
wrap     -0.67
ctor     -0.66
Plex     -0.66
Positive Logits
also          1.00
ALSO          0.86
actively      0.81
secondly      0.79
importantly   0.75
downright     0.75
moreover      0.74
yon           0.71
possibly      0.71
also          0.70
Activations Density 0.051%