INDEX
Explanations
helpful and harmless purpose
Negative Logits
Token       Logit
различ      0.71
their       0.66
respective  0.65
çeşitli     0.64
Meanwhile   0.64
your        0.64
subseteq    0.63
various     0.62
iyong       0.62
各类        0.61
Positive Logits
Token       Logit
job         1.29
goal        1.13
priority    1.05
motto       0.98
biggest     0.93
dad         0.93
JOB         0.91
job         0.89
mom         0.89
motivation  0.88
Activations Density 0.225%