Explanations
words related to personal actions or self-referential behavior
references to self-identification or self-involvement
Negative Logits
ammy         -0.78
heny         -0.77
ulton        -0.73
artisan      -0.73
illery       -0.72
microsoft    -0.72
cemic        -0.70
apple        -0.69
sweet        -0.69
rought       -0.68
Positive Logits
tremend      0.85
profess      0.79
selves       0.75
underwater   0.75
åĤ           0.73
submar       0.71
ens          0.70
personally   0.69
worshipped   0.68
creatively   0.67
Activations Density: 0.048%