Explanations
references to people's names or pronouns in a conversational context
Negative Logits
Token      Logit
AAA        -0.76
Lex        -0.67
ãĥ¼ãĥ³     -0.63
fif        -0.58
lehem      -0.57
Seym       -0.57
CAP        -0.56
SOC        -0.56
ELF        -0.55
NAT        -0.55
Positive Logits
Token      Logit
himself    1.25
enegger    1.22
testified  1.15
admits     1.08
's         1.07
joked      1.04
concedes   1.03
wrote      0.98
says       0.97
explained  0.96
Activations Density 2.048%
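
The density figure can be read as the share of token positions on which the feature fires at all. The source does not say how it is computed, so the following is only a minimal sketch under that assumption: the function name `activation_density`, the zero threshold, and the sample activations are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions with a non-zero
    activation. The dashboard itself does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical activations: the feature fires on 2 of 100 token positions.
sample = [0.0] * 98 + [1.3, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 2.000%
```

Under this reading, a density of 2.048% would mean the feature is active on roughly one token in fifty.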