INDEX
Explanations
phrases about societal structures and inequalities
Negative Logits
otherwise    -0.22
Otherwise    -0.18
enough       -0.17
awa          -0.16
OTHERWISE    -0.16
dernier      -0.16
last         -0.16
Otherwise    -0.16
lint         -0.15
ady          -0.15
Positive Logits
orative      0.18
legate       0.18
world        0.17
ramework     0.17
892          0.16
ihn          0.16
lessness     0.16
opat         0.15
igator       0.15
suite        0.15
Activations Density: 0.027%