INDEX
Explanations
words related to reasoning and justification
Negative Logits
ammy     -0.78
kick     -0.74
yang     -0.71
jri      -0.67
chu      -0.67
JO       -0.67
rael     -0.65
hops     -0.64
hire     -0.64
along    -0.64
Positive Logits
izations    1.36
isations    1.26
ization     1.22
isation     1.16
izing       1.15
izers       1.12
istic       1.11
izer        1.10
isers       1.06
izes        1.05
Activations Density: 0.022%