Explanations
phrases related to challenges or controversial statements
Negative Logits
Rite           -0.82
urgy           -0.78
effic          -0.68
VERTISEMENT    -0.67
ulatory        -0.67
ulators        -0.67
ulator         -0.65
usterity       -0.65
sav            -0.65
OTOS           -0.64
Positive Logits
defy           1.01
dare           0.99
evil           0.92
daring         0.86
boldly         0.86
Dare           0.84
ously          0.83
provoke        0.81
dared          0.81
presume        0.81
Activations Density 0.023%