Explanations
abstraction and representation
Negative Logits
Token           Logit
σταν            0.45
unethical       0.43
तकनीकी          0.42
aini            0.42
technischen     0.41
modernization   0.40
friendliness    0.40
μφωνα           0.40
पास             0.39
spécifiques     0.39
Positive Logits
Token           Logit
cognitive       1.07
Cognitive       1.00
Cogn            0.98
cognitive       0.93
cognition       0.91
cognit          0.88
mental          0.86
Reasoning       0.80
cognitiva       0.80
Mental          0.80
Activations Density 0.042%