INDEX
Explanations
phrases related to backing down or retracting statements
phrases related to refusal or persistence in backing down
NEGATIVE LOGITS
anon      -0.83
arth      -0.77
nesota    -0.74
anan      -0.73
oven      -0.70
lav       -0.68
marks     -0.67
oran      -0.67
teenth    -0.66
ross      -0.65
POSITIVE LOGITS
blindly       0.80
hesitate      0.79
apologise     0.75
hesitation    0.74
apology       0.73
forcefully    0.73
uncond        0.73
sooner        0.72
antic         0.72
vigorously    0.72
Activations Density: 0.159%