INDEX
Explanations
words that express perception or likelihood
Negative Logits
Token     Logit
ught      -0.17
sein      -0.16
pent      -0.14
itol      -0.14
ppe       -0.14
pot       -0.14
ses       -0.14
ewidth    -0.14
omer      -0.14
LETE      -0.14
Positive Logits
Token     Logit
lessly    0.17
ingly     0.17
ance      0.15
vậy       0.15
URRENT    0.15
razione   0.14
ively     0.13
alf       0.13
417       0.13
cref      0.13
Activations Density 0.045%