INDEX
Explanations
words related to specific attributes or features of objects or individuals
references to defining traits or features of people, objects, or concepts
Negative Logits
ת      -0.72
ה      -0.71
GS     -0.69
pg     -0.68
kos    -0.68
NAS    -0.67
ski    -0.66
POR    -0.65
gur    -0.65
ש      -0.64
Positive Logits
characteristics   1.28
istics            1.12
traits            1.10
charact           1.02
eatures           0.99
qualities         0.96
weaknesses        0.93
similarities      0.88
attributes        0.88
Features          0.87
Activations Density 0.011%
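
For readers unfamiliar with the density figure, here is a minimal sketch of one way it could be computed. It assumes density means the share of tokens on which the feature's activation is above zero. The function name, the threshold parameter, and the sample data are all hypothetical and do not come from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the fraction of tokens where the feature fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: the feature fires on 11 of 100,000 tokens.
sample = [1.5] * 11 + [0.0] * 99_989
print(f"{activation_density(sample):.3f}%")
```

Under this assumed definition, a density of 0.011% would correspond to roughly one activating token in every nine thousand.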