Explanations
terminology related to inhumane treatment or categorization of individuals
references to human and non-human distinctions
Negative Logits
Token     Logit
Rolls     -0.69
rypt      -0.69
CHAT      -0.69
Decker    -0.68
Reb       -0.67
Roll      -0.63
Channel   -0.63
Mine      -0.63
Clerk     -0.63
Line      -0.61
Positive Logits
Token     Logit
itarian   1.09
human     1.08
beings    1.03
thood     0.94
ity       0.93
zee       0.89
theless   0.89
ciating   0.89
icity     0.89
humans    0.88
Activations Density: 0.009%