Explanations
proper names of individuals
names of individuals and instances of derogatory or disparaging language
NEGATIVE LOGITS
Tok                               -0.82
aido                              -0.72
bos                               -0.71
replication                       -0.70
Tok                               -0.70
rawdownloadcloneembedreportprint  -0.68
slave                             -0.68
Phase                             -0.68
BF                                -0.68
injection                         -0.68
POSITIVE LOGITS
Moreno      2.13
Hayden      2.04
derogatory  1.58
dispar      1.32
depl        1.26
Flat        1.24
Chin        1.04
Jerome      0.97
Fern        0.94
Chester     0.94
Activations Density 0.042%