INDEX
Explanations
affirmative or descriptive phrases about individuals' roles or statuses
Negative Logits
oyer      -0.17
istine    -0.14
ovice     -0.14
vn        -0.14
Unnamed   -0.14
aper      -0.14
APER      -0.13
ÙĪØ±Ùĩ    -0.13
áº        -0.13
armor     -0.13
POSITIVE LOGITS
among      0.18
professor  0.17
both       0.15
een        0.15
Professor  0.14
owner      0.14
our        0.14
eous       0.14
nothing    0.14
a          0.14
Activations Density 0.060%