INDEX
Explanations
proper nouns or names of people
references to notable individuals and their actions
Negative Logits
alys       -0.70
rap        -0.64
ento       -0.63
itself     -0.62
itiz       -0.62
earcher    -0.61
rina       -0.61
stem       -0.59
duc        -0.57
afety      -0.56
Positive Logits
respectively    1.44
together        1.28
together        1.14
Together        1.00
jointly         0.99
selves          0.94
respective      0.93
apiece          0.88
Together        0.88
mutually        0.88
Activations Density 0.676%