Explanations
phrases indicating causation or attribution to actions
Negative Logits
Token      Logit
málo       -0.16
empo       -0.16
amiliar    -0.15
nelly      -0.15
ibrary     -0.15
irit       -0.14
ively      -0.14
KeySpec    -0.14
å¿Ļ        -0.14
доÑĢ       -0.14
Positive Logits
Token        Logit
-products    0.25
means        0.24
products     0.21
gone         0.20
chance       0.20
-election    0.20
virtue       0.20
/on          0.19
rne          0.18
voor         0.17
Activations Density 0.316%
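
The density figure above is usually read as the fraction of token positions on which the feature fires. The sketch below illustrates that reading only. It assumes "fires" means an activation strictly above a threshold of zero, and the function name `activation_density` and the sample activations are hypothetical, not taken from this page or from any particular tool.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: a feature "fires" when its activation is strictly greater
    than `threshold` (0.0 by default). The source page does not state this.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: one feature evaluated on 1000 token positions,
# nonzero on only three of them.
sample = [0.0] * 997 + [0.8, 1.3, 0.2]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.316% would correspond to the feature firing on roughly 3 of every 1000 tokens.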