INDEX
Explanations
words followed by inappropriate or special characters
Negative Logits
nurs          -0.69
icing         -0.65
ensical       -0.64
utterstock    -0.63
sacrific      -0.61
recip         -0.61
undergrad     -0.61
pulp          -0.61
illac         -0.61
iewicz        -0.60
Positive Logits
´             0.84
rio           0.83
Rah           0.83
¯             0.77
tri           0.77
âĤ¬           0.76
til           0.76
raid          0.76
ready         0.74
¢             0.74
Activations Density 0.009%