INDEX
Explanations
themes related to systemic inequality and social power dynamics
Negative Logits
attacking    -0.13
reau         -0.13
Duty         -0.13
íļĮ          -0.13
deficient    -0.13
rarian       -0.13
ndl          -0.13
INCT         -0.13
ourn         -0.12
pai          -0.12
Positive Logits
rule         0.43
control      0.42
exercise     0.39
controls     0.39
control      0.34
Controls     0.34
RULE         0.33
controlling  0.33
controls     0.33
rule         0.32
Activations Density: 0.371%