INDEX
Explanations
apologies or statements of regret
expressions of apology
NEGATIVE LOGITS
ccording    -0.84
eele        -0.80
edience     -0.78
eely        -0.73
irrel       -0.72
inct        -0.72
cffff       -0.71
kefeller    -0.71
weeney      -0.70
hig         -0.68
POSITIVE LOGITS
guys        0.97
sorry       0.96
folks       0.87
excuse      0.83
:(          0.82
about       0.80
ladies      0.79
sir         0.79
fully       0.78
bout        0.78
Activations Density 0.020%