INDEX
Explanations
mentions of names, particularly those starting with "Tre"
words or phrases related to measurements or evaluations
Negative Logits
Token        Logit
ategory      -0.75
nect         -0.72
skim         -0.66
Tsukuyomi    -0.64
HAHA         -0.64
disadvant    -0.63
ortium       -0.62
challeng     -0.60
eming        -0.58
AIDS         -0.58
Positive Logits
Token        Logit
inarily      0.74
initialized  0.69
andum        0.69
andise       0.68
zx           0.67
edit         0.65
ese          0.63
ousy         0.63
uries        0.62
horn         0.62
Activations Density: 0.196%