Explanations
phrases that highlight reasons or justifications
Negative Logits
Token            Logit
[                -0.51
**************   -0.49
[                -0.47
[*               -0.45
*                -0.44
[[               -0.43
&                -0.42
de               -0.42
Forst            -0.42
tabular          -0.42
Positive Logits
Token            Logit
raisonnable      0.78
saveiro          0.78
démocr           0.76
citoy            0.75
pédagogique      0.75
pédagog          0.73
Consejos         0.72
sánchez          0.72
pegat            0.71
desmotivaciones  0.71
Activations Density: 0.520%