INDEX
Explanations
phrases indicating causation or consequences
Negative Logits
Leadership    -0.21
leadership    -0.20
imos          -0.16
Leaders       -0.16
acha          -0.16
enga          -0.15
sein          -0.15
leaders       -0.14
iens          -0.14
imedia        -0.14
Positive Logits
nowhere       0.27
directly      0.26
gers          0.25
us            0.24
astr          0.22
them          0.21
ultimately    0.20
toward        0.20
towards       0.20
straight      0.20
Activations Density: 0.020%