Explanations
references to cats or feline-themed terms
Negative Logits
Token      Logit
idan       -0.19
tz         -0.17
eds        -0.16
orer       -0.15
eration    -0.15
ureka      -0.14
hower      -0.14
738        -0.14
HL         -0.14
eping      -0.14
Positive Logits
Token      Logit
égorie      0.26
apult       0.25
nip         0.19
amount      0.19
fish        0.18
ting        0.17
calls       0.17
ucci        0.17
-corner     0.17
elog        0.17
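The two lists rank tokens by how strongly this feature pushes their output logits down or up. A minimal sketch of how such lists could be produced, assuming (this is an assumption, not something the page states) that each token's logit effect is the plain dot product of the feature's decoder direction with that token's column of the unembedding matrix, ignoring the final LayerNorm. The function names and the toy numbers below are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def logit_effects(decoder_dir: Sequence[float],
                  unembed: Sequence[Sequence[float]],
                  vocab: Sequence[str]) -> Dict[str, float]:
    """Hypothetical helper: dot the feature's decoder direction with each
    token's unembedding column. `unembed` is d_model x vocab_size, stored
    as a list of rows."""
    d_model = len(decoder_dir)
    return {
        token: sum(decoder_dir[i] * unembed[i][t] for i in range(d_model))
        for t, token in enumerate(vocab)
    }


def top_and_bottom(effects: Dict[str, float], k: int
                   ) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and the k most negative logit effects."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending
    bottom = ranked[:k]          # most negative first
    top = ranked[::-1][:k]       # most positive first
    return top, bottom


# Toy example: a 2-dimensional model with a 4-token vocabulary.
vocab = ["nip", "fish", "idan", "tz"]
decoder_dir = [1.0, 0.5]
unembed = [[0.1, 0.2, -0.1, 0.0],
           [0.2, -0.1, -0.2, -0.3]]
top, bottom = top_and_bottom(logit_effects(decoder_dir, unembed, vocab), k=2)
print(top)     # "nip" then "fish"
print(bottom)  # "idan" then "tz"
```

On a real model, the vocabulary has tens of thousands of tokens and k is around 10, which matches the length of each list above.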
Activation Density: 0.044%
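Activation density is read here as the share of tokens on which the feature fires at all. A minimal sketch, assuming "fires" means an activation strictly above zero (the threshold is a parameter in case a dashboard uses a different cutoff). At 0.044%, the feature would be active on about 44 of every 100,000 tokens. The function name is hypothetical.

```python
from typing import Iterable


def activation_density(activations: Iterable[float],
                       threshold: float = 0.0) -> float:
    """Hypothetical helper: percentage of tokens whose feature activation
    is strictly greater than `threshold`."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return 100.0 * active / total


# A feature active on 44 of 100,000 tokens.
acts = [1.0] * 44 + [0.0] * (100_000 - 44)
print(activation_density(acts))  # 0.044
```

The same kind of low density reported above suggests a sparse, narrowly triggered feature, consistent with a cat-specific explanation.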