Explanations
the use of first-person pronouns indicating personal involvement or experiences
Negative Logits
=-=-=-=-    -0.71
illac       -0.66
sacrific    -0.64
flix        -0.64
horizont    -0.60
wherein     -0.60
hedon       -0.59
dfx         -0.58
Hayward     -0.58
Camer       -0.57
Positive Logits
not         0.76
suppose     0.70
iking       0.69
starting    0.69
reporting   0.67
ussian      0.67
nt          0.67
eling       0.67
DEN         0.66
still       0.65
Activations Density 0.069%