INDEX
Explanations
phrases indicating explanation or cause
phrases indicating reasons or justifications
Negative Logits
chn           -0.85
OLOGY         -0.82
busters       -0.72
astics        -0.71
alsa          -0.70
thumbnails    -0.70
ILCS          -0.70
framework     -0.70
forts         -0.70
tle           -0.69
Positive Logits
why           1.22
preferring    1.12
wanting       1.02
inaction      0.97
optimism      0.97
choosing      0.96
dismissing    0.96
excluding     0.95
cance         0.95
believing     0.95
Activations Density: 0.082%