Explanations
references to humans and their interactions with other beings or entities
Negative Logits
iverz     -0.17
eyse      -0.17
icast     -0.15
jours     -0.15
uzey      -0.15
agnar     -0.15
arness    -0.15
ork       -0.15
allah     -0.14
agn       -0.14
Positive Logits
human     0.63
humans    0.59
human     0.50
Humans    0.49
Human     0.48
-human    0.47
Human     0.44
人类      0.43
Humans    0.40
_human    0.40
Activations Density: 0.170%