INDEX
Explanations
statements of personal opinion and disagreement
Negative Logits
Token      Logit
Ðĭ         -0.16
olf        -0.16
omez       -0.15
plevel     -0.14
BoxFit     -0.14
idth       -0.14
icious     -0.14
itag       -0.13
getRoot    -0.13
ÑĩиÑħ      -0.13
Positive Logits
Token      Logit
majority   0.27
Majority   0.25
minority   0.22
isol       0.21
Alone      0.18
isolate    0.18
alone      0.17
lone       0.17
popular    0.17
opinion    0.17
Activations Density: 0.158%