Explanations
adjectives and terms that imply capability or potential
Negative Logits
ed        -0.27
ing       -0.25
arily     -0.23
edb       -0.21
ical      -0.20
ically    -0.19
ese       -0.17
emann     -0.17
ers       -0.17
ede       -0.17
Positive Logits
0.23
able
0.20
atable
0.20
/edit
0.19
/read
0.18
-bodied
0.17
options
0.17
mente
0.17
/un
0.17
/use
0.17
Activations Density 0.159%