INDEX
Explanations
sentences that express personal experiences and reflections
Negative Logits
)\}$     -0.78
("")]    -0.77
'],      -0.77
'),      -0.76
"),      -0.74
;";      -0.70
/>";     -0.69
]}"      -0.68
)";      -0.67
/";      -0.67
Positive Logits
because      0.70
they         0.69
They         0.67
because      0.66
he           0.66
I            0.65
本当は       0.61
Because      0.59
They         0.58
originally   0.58
Activations Density 0.373%