INDEX
Explanations
statements indicating a morally or practically "right" action or decision
phrases indicating correctness or justification in decision-making
Negative Logits
Token        Logit
gyn          -0.85
stories      -0.85
Lists        -0.80
olics        -0.79
ographics    -0.77
Races        -0.76
fam          -0.75
bows         -0.73
images       -0.72
»Ĵ           -0.72
Positive Logits
Token        Logit
move         1.46
decision     1.44
approach     1.34
step         1.34
course       1.29
choice       1.28
tactic       1.25
option       1.24
strategy     1.23
stance       1.20
Activations Density: 0.226%