INDEX
Explanations
phrases or words related to the concept of human traits or characteristics
mentions of human-related concepts or attributes
Negative Logits
Token         Logit
slice         -0.71
launcher      -0.71
blocks        -0.67
Block         -0.66
markup        -0.65
Wall          -0.63
cartels       -0.62
Specialist    -0.62
AE            -0.61
aligned       -0.61
Positive Logits
Token         Logit
hum           4.65
Hum           1.89
Hum           1.69
hum           1.28
HUM           1.23
hor           1.21
hus           1.14
odor          1.11
nat           1.10
hist          1.09
Activations Density: 0.009%