INDEX
Explanations
verbs related to action or decision making
phrases or expressions related to causality and consequences
Negative Logits
ector       -0.82
allery      -0.82
atform      -0.76
INAL        -0.76
vantage     -0.74
eatures     -0.72
uid         -0.70
ributed     -0.70
aic         -0.69
ELD         -0.67
Positive Logits
hating      1.54
worrying    1.48
forgetting  1.45
messing     1.40
pretending  1.39
thinking    1.36
wanting     1.35
liking      1.35
wasting     1.35
wondering   1.33
Activations Density: 0.449%