INDEX
Explanations
protected characteristics and hatred
Negative Logits
Token            Value
percents         0.44
genders          0.41
সড়ক              0.40
procent          0.39
Haem             0.39
personalities    0.38
ğinde            0.38
BIO              0.38
createElement    0.38
etx              0.38
Positive Logits
Token            Value
protected        0.67
protected        0.64
Protected        0.58
hatred           0.55
unfairly         0.52
Protected        0.52
immutable        0.49
unjustly         0.49
grounds          0.47
religion         0.47
Activations Density 0.025%