INDEX
Explanations
ostensibly or purportedly true
Negative Logits
क्षित  0.50
терпе  0.47
disheart  0.44
bluntly  0.43
deset  0.43
irrever  0.42
прямо  0.42
disrespectful  0.42
stigma  0.41
í  0.41
Positive Logits
supposedly  0.60
purportedly  0.60
दावा  0.59
якобы  0.58
ostensibly  0.57
pretends  0.55
purporting  0.54
claiming  0.54
pretend  0.52
claimed  0.51
Activations Density 0.297%