INDEX
Explanations
references to survival and danger in narratives
Negative Logits
itudes      -0.19
cciones     -0.18
ções        -0.18
rella       -0.17
udes        -0.17
thers       -0.17
uela        -0.17
nds         -0.16
uries       -0.16
enza        -0.16
Positive Logits
cimiento    0.23
amento      0.23
imiento     0.19
acimiento   0.19
issement    0.19
isme        0.18
amiento     0.18
onnement    0.17
ogram       0.17
Ī           0.17
Activations Density: 0.095%