INDEX
Explanations
numbers in textual formats
occurrences of various numerical references and related terms
Negative Logits
hips       -0.89
loo        -0.83
WARD       -0.70
wards      -0.69
ioned      -0.68
Denis      -0.67
fully      -0.66
Hilton     -0.63
lain       -0.63
Leopard    -0.61
Positive Logits
eral       1.05
ero        1.04
posium     1.03
pty        1.03
mus        0.95
phony      0.92
bs         0.92
pt         0.87
asm        0.86
urg        0.86
Activations Density 0.042%