Explanations
references to the prevalence and characteristics of specific concepts or phenomena
Negative Logits
itself     -0.35
的一个     -0.22
它         -0.21
its        -0.20
一個       -0.18
Its        -0.18
一个       -0.18
一个人     -0.17
Its        -0.17
которое    -0.16
Positive Logits
themselves  0.50
ones        0.33
些          0.30
those       0.29
thems       0.27
những       0.25
nt          0.24
mga         0.23
those       0.23
few         0.23
Activations Density 0.756%