INDEX
Explanations
references to historical and societal critiques, particularly of oppression or unethical practices
Negative Logits
Bias        -0.16
intrig      -0.16
bias        -0.15
/welcome    -0.15
Spo         -0.15
Bias        -0.15
ussen       -0.14
istani      -0.14
biased      -0.14
่า          -0.14
Positive Logits
demon       0.25
romantic    0.25
lion        0.24
commod      0.23
glam        0.23
妖          0.23
trivial     0.22
valor       0.22
normal      0.22
NORMAL      0.22
Activations Density 0.172%