INDEX
Explanations
references to human-centric concepts and rights
Negative Logits
aniversario     -0.73
Trotz           -0.68
ValueStyle      -0.68
Diwali          -0.67
łość            -0.65
aikana          -0.64
productivo      -0.64
isles           -0.63
Nomenclature    -0.62
kulum           -0.62
Positive Logits
human           2.47
human           2.21
Human           2.19
HUMAN           2.17
Human           2.15
HUMAN           2.06
humans          1.93
Humans          1.74
humano          1.72
humanos         1.69
Activations Density: 0.084%