INDEX
Explanations
phrases related to being warned, surprised, challenged, or advised by others
references to "us" or collective experiences and actions
Negative Logits
fect       -0.71
tein       -0.70
lets       -0.65
CPC        -0.63
livest     -0.62
ussen      -0.60
ye         -0.59
chaired    -0.58
served     -0.58
stick      -0.57
Positive Logits
selves     1.17
hers       1.10
ern        0.93
aning      0.90
ourselves  0.84
leep       0.82
selves     0.80
urious     0.79
ury        0.78
eleph      0.78
Activations Density: 0.066%