INDEX
Explanations
references to George Orwell's works and related themes
Negative Logits
iales     -0.16
ten       -0.15
(FALSE    -0.14
ONGL      -0.14
ecta      -0.14
Ñħо       -0.14
ertino    -0.13
éϵ        -0.13
à¥įतम     -0.13
zd        -0.13
Positive Logits
180       0.40
400       0.40
500       0.39
[blank]   0.39
300       0.37
130       0.36
120       0.36
160       0.35
150       0.35
600       0.34
Activations Density 0.489%