Explanations
actions or events involving public disclosure or communication
Negative Logits
Wikimedia    -0.76
Pg           -0.64
gra          -0.60
Chau         -0.57
quot         -0.56
"}           -0.55
inf          -0.54
ibal         -0.53
atre         -0.52
hift         -0.52

Positive Logits
DERR          0.91
why           0.78
how           0.76
beforehand    0.73
why           0.72
orally        0.69
whether       0.68
»Ĵ            0.67
whether       0.65
cerning       0.64
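
The dashboard does not say how these logit values were obtained. A common approach is to project the feature's decoder direction through the model's unembedding matrix and rank the resulting per-token scores. The sketch below assumes that approach. The function name `top_logits` and the toy matrices are hypothetical and are not taken from this dashboard.

```python
def top_logits(decoder_dir, unembed, vocab, k=10):
    """Rank vocabulary tokens by the logit a feature direction contributes.

    Assumption (not from the source): logits = decoder_dir @ unembed.
    decoder_dir: list of d_model floats (hypothetical feature decoder direction)
    unembed:     d_model rows, each a list of len(vocab) floats (unembedding W_U)
    vocab:       list of token strings, one per unembedding column
    Returns (most_negative, most_positive), each a list of (token, logit) pairs.
    """
    d_model = len(decoder_dir)
    # Dot product of the feature direction with each token's unembedding column.
    logits = [
        sum(decoder_dir[i] * unembed[i][t] for i in range(d_model))
        for t in range(len(vocab))
    ]
    ranked = sorted(zip(vocab, logits), key=lambda pair: pair[1])
    most_negative = ranked[:k]         # lowest logits first
    most_positive = ranked[::-1][:k]   # highest logits first
    return most_negative, most_positive


# Toy example: d_model = 2, vocabulary of four tokens (all values made up).
vocab = ["why", "how", "Pg", "gra"]
unembed = [[1.0, 0.5, -1.0, -0.2],
           [0.0, 0.5, 0.0, -0.4]]
neg, pos = top_logits([0.8, 0.2], unembed, vocab, k=2)
print([tok for tok, _ in neg])  # tokens the feature pushes down most
print([tok for tok, _ in pos])  # tokens the feature pushes up most
```

Repeated entries such as "why" and "whether" in the lists above are most likely distinct vocabulary items that render identically, for example with and without a leading space.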

Activation Density: 5.098%
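
Activation density is usually read as the share of token positions on which the feature fires. A density of 5.098% therefore means the feature is active on roughly one token in twenty. The sketch below assumes "fires" means an activation strictly above a threshold of 0. The function name and the sample values are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions where the feature's activation exceeds threshold.

    Assumption (not from the source): a feature "fires" when activation > threshold.
    """
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Hypothetical activations over four tokens: the feature fires on two of them.
density = activation_density([0.0, 1.2, 0.0, 0.3])
print(f"{density:.3%}")  # formats the fraction as a percentage
```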