Explanations
references to apologies and accountability in statements
Negative Logits
Token     Logit
.spy      -0.13
tier      -0.13
ativ      -0.12
duit      -0.12
ожет      -0.12
ạnh       -0.12
imir      -0.12
orney     -0.12
elter     -0.12
鼓        -0.12
Positive Logits
Token       Logit
apology     0.69
apologies   0.66
apolog      0.61
apologize   0.60
apologized  0.60
apologise   0.54
Ap          0.47
sorry       0.46
remorse     0.46
repent      0.45
Activations Density: 0.238%