Explanations
pronouns (primarily "His") followed by specific descriptions or actions
references to a specific individual's experiences or actions
Negative Logits
dding       -0.70
女          -0.69
fitting     -0.66
ÙIJ         -0.66
qi          -0.62
ãĥ´ãĤ¡      -0.61
Õ           -0.61
м           -0.60
е           -0.60
и           -0.59
Positive Logits
panic       1.01
Majesty     0.95
own         0.94
Own         0.94
resy        0.93
self        0.89
itage       0.84
anmar       0.84
millenn     0.82
rera        0.81
Activations Density 0.025%