Explanations
contextual references to theoretical analysis and models
Negative Logits
Token              Logit
myſelf             -0.96
indígen            -0.94
desmotivaciones    -0.93
increí             -0.93
ſta                -0.90
avoient            -0.90
faſt               -0.87
ſelf               -0.86
ientras            -0.85
itſelf             -0.85

Positive Logits
Token              Logit
↵                   0.67
(blank)             0.61
'                   0.60
of                  0.60
,                   0.57
"                   0.56
the                 0.56
.                   0.53
0                   0.51
other               0.50

Activations Density 0.548%
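
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed. It assumes that activation density means the fraction of sampled token positions on which the feature's activation is above zero. The source does not define the metric, so that reading, the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is the share of positions where the feature fires.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1,000 token positions, 5 of which activate the feature.
sample = [0.0] * 995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.500%
```

Under this reading, a density of 0.548% would mean the feature fires on roughly 5 to 6 of every 1,000 sampled tokens.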