INDEX
Explanations
phrases that introduce information or present conclusions
clauses or phrases that introduce defining characteristics or explanations
Negative Logits
ugu       -0.76
aq        -0.73
oug       -0.64
iq        -0.63
ahime     -0.63
UG        -0.61
ablish    -0.59
MQ        -0.59
roying    -0.58
hent      -0.58
Positive Logits
horr          1.14
extends       0.94
accompanies   0.90
ought         0.89
encompasses   0.89
haun          0.87
culmin        0.86
echoes        0.86
coincides     0.85
occurs        0.85
Activations Density: 0.145%