INDEX
Explanations
apologies or expressions of regret
expressions of apology and regret
Negative Logits
lite        -0.89
fman        -0.84
eele        -0.78
irtual      -0.77
ngth        -0.77
kefeller    -0.76
hill        -0.74
itect       -0.73
rients      -0.73
Goal        -0.73
Positive Logits
sorry       0.87
missed      0.83
unres       0.82
sorry       0.80
inconven    0.78
omission    0.77
miscar      0.76
Chr         0.72
Sorry       0.72
Crimes      0.72
Activations Density: 0.127%