INDEX
Explanations
expressions of negativity or pessimism
Negative Logits
Sig             -0.58
Corpus          -0.58
Rolls           -0.57
Dug             -0.57
Fritz           -0.56
ITED            -0.56
Sigma           -0.56
Rodgers         -0.55
supplemented    -0.55
Rohingya        -0.55
Positive Logits
toe             0.93
browser         0.82
desc            0.82
instance        0.82
sent            0.82
equ             0.80
inf             0.80
combat          0.79
depth           0.79
des             0.79
Activations Density: 0.016%