INDEX
Explanations
mentions of humans
references to humans and their characteristics or behaviors
Negative Logits
Token          Logit
Comprehensive  -0.69
paragraph      -0.66
Neigh          -0.65
Statements     -0.64
onite          -0.63
Sop            -0.62
olla           -0.61
FU             -0.61
Coun           -0.61
ANC            -0.60
Positive Logits
Token          Logit
folk           1.16
beings         0.96
oids           0.85
zee            0.80
Humans         0.76
omorphic       0.76
anguages       0.76
readable       0.75
inhab          0.74
mite           0.74
Activations Density: 0.021%