Explanations
references to individuals or groups
Negative Logits
Token       Logit
idor        -0.18
bilt        -0.16
atik        -0.15
éϵ          -0.15
åĬ±         -0.14
?action     -0.14
baz         -0.14
undi        -0.14
.getID      -0.14
оÑĤоÑĢ      -0.14
Positive Logits
Token         Logit
who           0.23
involved      0.19
who           0.18
whom          0.17
responsible   0.17
helm          0.17
het           0.16
Who           0.15
joining       0.14
elper         0.14
Activations Density 0.276%