INDEX
Explanations
terms related to criticism or analysis of stereotypes and their implications
Negative Logits
Token       Logit
ertino      -0.20
stack       -0.16
avery       -0.16
itan        -0.15
.AF         -0.15
ãĥ¡ãĥ©      -0.14
ismet       -0.14
γμα         -0.14
stress      -0.14
itor        -0.14
Positive Logits
Token       Logit
wart        0.18
Ster        0.18
viso        0.17
ois         0.16
Ïĩει        0.15
hread       0.14
θε          0.14
ster        0.14
hed         0.14
vant        0.14
Activations Density: 0.010%
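
The activation density reported here is a very small percentage. As a minimal sketch, assuming density means the fraction of token positions on which the feature's activation is above zero (the page does not state its definition), it could be computed as below. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" is the share of token positions where the
    feature fires at all; the dashboard may define it differently.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 1 of 10,000 token positions.
acts = [0.0] * 9999 + [3.2]
density = activation_density(acts)
print(f"{density:.3%}")
```

Under this assumed definition, a density of 0.010% corresponds to roughly one active position per 10,000 tokens, which is what the made-up sample above reproduces.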