INDEX
Explanations
statements indicating outcomes or consequences
Negative Logits
thing        -0.17
ish          -0.17
est          -0.16
idge         -0.15
ertz         -0.14
former       -0.14
eli          -0.14
essler       -0.14
esh          -0.13
Lei          -0.13
Positive Logits
ively        0.21
agli         0.17
ogle         0.17
antly        0.17
ModelIndex   0.16
çuk          0.16
aneously     0.15
uate         0.14
antro        0.14
iveness      0.14
Activations Density 0.043%