INDEX
Explanations
phrases indicating types of categories or classifications
Negative Logits
elli      -0.14
330       -0.14
ieber     -0.14
Timber    -0.14
tim       -0.13
iber      -0.13
unger     -0.13
both      -0.12
pur       -0.12
gens      -0.12
Positive Logits
rome      0.15
xec       0.14
ÃŃž       0.14
ichert    0.14
müc       0.14
Enemies   0.13
Provid    0.13
coverage  0.12
TION      0.12
Writes    0.12
Activations Density 0.070%