Explanations
proper nouns or names of individuals
mentions of significant events, arrests, and consequences in societal contexts
Negative Logits
Token         Logit
!.            -0.69
inis          -0.65
}.            -0.64
+.            -0.60
};            -0.59
cellaneous    -0.57
utterstock    -0.57
''.           -0.56
.$            -0.56
.''           -0.55
Positive Logits
Token         Logit
lacks         0.70
hadn          0.70
lacked        0.69
should        0.69
shouldn       0.66
cannot        0.65
had           0.64
hasn          0.60
behaved       0.60
exists        0.58
Activations Density 0.993%