INDEX
Explanations
instances of the word "our"
possessive pronouns, particularly "our"
Negative Logits
Token        Logit
PLA          -0.70
WAR          -0.69
FU           -0.67
Rasmussen    -0.64
Daly         -0.61
citation     -0.60
DERR         -0.59
Verse        -0.58
Citation     -0.58
minster      -0.58
Positive Logits
Token        Logit
selves       1.43
neys         1.19
neau         1.00
our          0.98
dain         0.80
self         0.79
¯¯¯¯¯¯¯¯     0.77
izont        0.77
dan          0.77
idine        0.76
Activations Density 0.017%