INDEX
Explanations
concepts related to deception and credibility
Negative Logits
ób        -0.14
بعد        -0.13
especially        -0.13
onec        -0.13
λω        -0.13
ालन        -0.13
_BEFORE        -0.12
прежде        -0.12
_named        -0.12
oví        -0.12
Positive Logits
Conversely        0.58
convers        0.53
whereas        0.44
Whereas        0.43
meanwhile        0.43
Meanwhile        0.41
naopak        0.40
Meanwhile        0.39
Likewise        0.38
likewise        0.38
Activations Density 0.213%