INDEX
Explanations
conversations involving expressions of regret or apologies
Negative Logits
(       -0.93
(&      -0.81
&       -0.78
[       -0.75
).[     -0.74
)[      -0.72
([      -0.68
)&      -0.66
“       -0.65
')[     -0.64
Positive Logits
-"      1.64
—”      1.61
-”      1.61
--"     1.52
—"      1.52
-“      1.44
-",     1.28
——”     1.25
…”      1.20
—“      1.18
Activations Density 0.309%