INDEX
Explanations
references to individuals and their roles or relationships
Negative Logits
inx       -0.15
(*(       -0.15
/pkg      -0.15
ylon      -0.14
äº        -0.14
èī¯       -0.13
urd       -0.13
edik      -0.13
assing    -0.13
(*((      -0.13
Positive Logits
prefer        0.29
fancy         0.26
prefer        0.25
preference    0.21
Prefer        0.21
looking       0.19
Require       0.18
already       0.18
like          0.18
simply        0.18
Activations Density 0.087%