INDEX
Explanations
references to humans and human-related concepts
Negative Logits
Token       Logit
strecke     -0.40
Біографія   -0.37
gối         -0.36
azgo        -0.36
boste       -0.33
vueltas     -0.32
IBOutlet    -0.32
ebbe        -0.32
vanguardia  -0.32
menetap     -0.32
Positive Logits
Token       Logit
human       4.13
Human       3.72
Human       3.70
human       3.69
HUMAN       3.34
HUMAN       3.23
humans      2.67
humaine     2.59
humain      2.59
humano      2.58
Activations Density: 0.103%