Explanations
phrases indicating distortion or manipulation of facts
Negative Logits
अपर           -0.15
Overrides     -0.15
MOCK          -0.14
Planning      -0.14
开            -0.14
imetype       -0.14
blas          -0.13
velopment     -0.13
rál           -0.13
Mock          -0.13
Positive Logits
distortion    0.38
cherry        0.32
distort       0.32
distorted     0.31
selective     0.29
selectively   0.28
sensational   0.26
dist          0.26
dist          0.25
biased        0.24
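The page does not say how these lists were produced. Below is a minimal sketch that assumes the common approach: take the dot product of the feature's decoder direction with each token's unembedding vector, then rank the results. The helpers `logit_effects` and `top_and_bottom` and all vectors are hypothetical toy values, not this feature's actual weights.

```python
# Sketch under an ASSUMPTION: "positive/negative logits" = top-k / bottom-k of
# (feature decoder direction) . (token unembedding vector) over the vocabulary.
# All names and numbers here are hypothetical toy data.
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]


def logit_effects(decoder_dir: List[float],
                  unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Dot product of the decoder direction with each token's unembedding vector."""
    return {
        tok: sum(d * u for d, u in zip(decoder_dir, vec))
        for tok, vec in unembed.items()
    }


def top_and_bottom(effects: Dict[str, float], k: int) -> Tuple[Ranked, Ranked]:
    """Return (k most promoted tokens, k most suppressed tokens)."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # largest first
    return positive, negative


if __name__ == "__main__":
    # Hypothetical 3-dimensional toy model with a four-token vocabulary.
    decoder_dir = [0.5, -0.2, 0.1]
    unembed = {
        "distortion": [0.6, -0.1, 0.2],
        "biased": [0.3, 0.0, 0.4],
        "Mock": [-0.2, 0.3, 0.0],
        "Planning": [-0.1, 0.2, -0.3],
    }
    pos, neg = top_and_bottom(logit_effects(decoder_dir, unembed), k=2)
    print("positive:", pos)
    print("negative:", neg)
```

With real weights, the same ranking over the full vocabulary would produce two ten-row lists like the ones above.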
Activation Density: 0.385%
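The page does not define activation density. One common reading is the percentage of token positions on which the feature fires. This minimal sketch assumes "fires" means an activation strictly greater than zero. The `activation_density` helper and its `threshold` parameter are hypothetical.

```python
# Sketch under an ASSUMPTION: density = % of token positions with activation > threshold.
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of token positions whose activation exceeds `threshold`."""
    acts = list(activations)
    if not acts:
        raise ValueError("need at least one activation")
    active = sum(1 for a in acts if a > threshold)
    return 100.0 * active / len(acts)


if __name__ == "__main__":
    # Hypothetical toy activations: 1,000 positions, 3 of them above zero.
    toy = [0.0] * 996 + [1.2, 0.3, 0.0, 2.1]
    print(f"{activation_density(toy):.3f}%")
```

Under this definition, 0.385% means the feature is active on roughly 4 of every 1,000 tokens.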