Explanations
phrases related to injustice or accountability in societal issues
Negative Logits
bih        -0.14
Plenty     -0.14
oons       -0.13
arin       -0.13
uet        -0.13
922        -0.13
oS         -0.13
iere       -0.13
Independ   -0.13
oten       -0.13
Positive Logits
escape     0.34
escapes    0.32
escaping   0.32
escaped    0.30
escape     0.29
Escape     0.27
Escape     0.26
immunity   0.26
impunity   0.25
escaped    0.25
Activation density: 0.085%