Explanations
references to the concept of "humans" and their qualities or conditions
Negative Logits
Token        Logit
kệ           -0.42
respective   -0.38
distinción   -0.38
ท้าย          -0.37
illage       -0.36
IKI          -0.36
Besten       -0.35
ilaire       -0.35
tilles       -0.35
retum        -0.35
Positive Logits
Token        Logit
Human        1.16
Human        1.11
human        1.10
HUMAN        1.00
human        0.99
HUMAN        0.94
Humans       0.90
Humans       0.90
humans       0.82
umani        0.81
Activations Density: 0.108%