INDEX
Explanations
References to specific groups or categories introduced by the word "these".
Negative Logits
αυτή     -0.17
ation    -0.15
uz       -0.14
dest     -0.14
ンツ     -0.14
liest    -0.14
ica      -0.14
о        -0.13
적       -0.13
odal     -0.13
Positive Logits
curity   0.29
quence   0.26
kinds    0.25
verity   0.25
sorts    0.24
cond     0.24
same     0.22
guys     0.22
/th      0.20
days     0.20
Activations Density 0.091%