INDEX
Explanations
pronouns 'we' and 'our'
references to collective responsibility or shared experiences
Negative Logits
REDACTED        -0.77
Publication     -0.66
gratification   -0.64
odor            -0.64
Crush           -0.61
cum             -0.60
Owner           -0.60
personal        -0.59
Hole            -0.58
Levine          -0.58
Positive Logits
're             1.22
've             1.21
'll             0.99
athered         0.98
akening         0.98
asel            0.96
ourselves       0.95
ird             0.93
lder            0.92
IRD             0.92
Activations Density: 0.239%
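
The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled token positions on which the feature's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and chosen only to illustrate the arithmetic:

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires at all. The source page does not confirm this definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Made-up sample: the feature fires on 239 of 100,000 token positions.
sample = [1.5] * 239 + [0.0] * (100_000 - 239)
density = activation_density(sample)
print(f"{density:.3%}")
```

Under that assumption, a density this low would mean the feature fires on roughly one token in four hundred, which would fit a feature tied to a narrow set of tokens such as first-person plural pronouns and contractions.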