INDEX
Explanations
phrases related to future plans or predictions
expressions related to influence and decision-making
Negative Logits
indo        -0.34
)."         -0.32
Afgh        -0.31
"/>         -0.31
]."         -0.30
disadvant   -0.30
vulner      -0.30
unemploy    -0.29
destro      -0.29
undermin    -0.29
Positive Logits
ivating     0.33
utterstock  0.32
ideshow     0.31
agonist     0.31
urable      0.30
asting      0.30
eaturing    0.30
heimer      0.29
iven        0.29
ering       0.29
Activations Density 3.802%