INDEX
Explanations
references to experiences or actions in a first-person perspective
references to the concept of "person" in various contexts
Negative Logits
DERR        -0.73
enthal      -0.69
Tx          -0.68
ORK         -0.66
tty         -0.65
Mb          -0.65
YP          -0.64
CCC         -0.64
avoidance   -0.63
Phill       -0.63
Positive Logits
nel         1.08
hood        1.01
ality       0.94
atives      0.92
uscript     0.87
alities     0.86
izontal     0.85
acles       0.81
atural      0.80
alties      0.78
Activations Density: 0.032%