INDEX
Explanations
web URLs containing specific keywords
instances of the pronoun "we."
Negative Logits
gratification    -0.70
contradictions   -0.67
ional            -0.65
conflicts        -0.61
contradiction    -0.60
quo              -0.60
derail           -0.59
rug              -0.58
sucker           -0.58
LSD              -0.58
Positive Logits
bsite            1.43
eping            1.27
aving            1.18
aning            1.17
athered          1.14
lder             1.13
igh              1.10
avers            1.10
akens            1.10
akening          1.09
Activations Density 0.084%