INDEX
Explanations
assertive statements or declarations that are straightforward or explicit
Negative Logits
eries       -0.72
resil       -0.71
passively   -0.69
Loft        -0.68
INTER       -0.67
tremend     -0.66
ITAL        -0.65
nesota      -0.64
inse        -0.64
rology      -0.64
Positive Logits
ances         1.16
cut           1.05
iary          0.95
deline        0.93
ance          0.92
indication    0.82
headed        0.82
distinction   0.81
ively         0.79
faced         0.77
Activations Density: 1.732%