INDEX
Explanations
words related to specific professions or specific scenarios involving those professions
elements associated with authority figures and societal structures
Negative Logits
urations      -0.74
Dover         -0.69
sequ          -0.67
KC            -0.64
respectively  -0.59
Codex         -0.57
ophon         -0.55
Ply           -0.55
Simpl         -0.54
ollow         -0.54
Positive Logits
knows         0.92
dies          0.89
thinks        0.88
who           0.88
who           0.85
decides       0.84
masturb       0.84
whom          0.83
wears         0.81
wants         0.80
Activations Density: 0.766%