INDEX
Explanations
phrases related to caution or warning
phrases that caution against negative actions or behaviors
NEGATIVE LOGITS
upon            -0.70
ilogy           -0.68
ourses          -0.67
leground        -0.66
stabilized      -0.64
albeit          -0.63
unparalleled    -0.60
ially           -0.59
correspond      -0.58
ancest          -0.56
POSITIVE LOGITS
yourselves      1.24
yourself        1.17
Yourself        1.02
fooled          0.91
anymore         0.83
blindly         0.81
your            0.78
ãĤ®             0.77
ANY             0.76
fool            0.76
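The token shown as ãĤ® in the positive logits is not readable as written. It looks like the raw vocabulary string of a byte-level BPE tokenizer, in which every byte is displayed as a stand-in printable character. The sketch below assumes a GPT-2 style byte-to-character table (the dashboard does not state which tokenizer it uses) and maps such a string back to UTF-8 text. The function names are illustrative, not part of any dashboard API.

```python
def bytes_to_unicode():
    """Build a GPT-2 style table mapping each byte (0..255) to a printable character.

    Assumption: the vocabulary uses the GPT-2 byte-level convention.
    """
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes are shifted into the range starting at U+0100.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


def decode_byte_level_token(token: str) -> str:
    """Convert a byte-level vocabulary string back to the text it encodes."""
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    raw = bytes(char_to_byte[ch] for ch in token)
    # A single token may hold only part of a multi-byte character,
    # so undecodable bytes are replaced rather than raising an error.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_byte_level_token("ãĤ®"))
```

Under that assumption the three characters stand for the bytes E3 82 AE, which is the UTF-8 encoding of the katakana character ギ (U+30AE).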
Activations Density 0.251%
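For intuition about the density figure, here is a minimal sketch. It assumes that density means the fraction of sampled tokens on which the feature's activation is above zero. The dashboard does not define the term, so both that definition and the function name are assumptions.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of sampled tokens on which the
    feature fires at all; the source does not define it.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


if __name__ == "__main__":
    # Toy data: the feature fires on 1 of 4 tokens.
    print(activation_density([0.0, 0.0, 0.0, 1.5]))
```

Read this way, a density of 0.251% would mean the feature fires on roughly one token in 400.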