INDEX
Explanations
phrases expressing strong opinions or beliefs
phrases emphasizing the necessity or importance of certain actions or considerations
Negative Logits
Wid        -0.74
Wick       -0.66
plex       -0.66
maze       -0.62
Bomber     -0.62
urrence    -0.61
Kand       -0.59
Others     -0.58
Or         -0.58
Puzzle     -0.58
Positive Logits
able       1.05
fitting    1.01
judged     1.00
treated    0.95
acons      0.91
regarded   0.91
hemoth     0.90
ashamed    0.89
leeve      0.89
arers      0.86
Activations Density: 0.076%