Here we examine the performance of different activation functions on the Fashion-MNIST classification task, across several model depths and architectures.
	
		
fc(N) = nn.Sequential(
	flatten(),
	linear(28*28, ... ), activation(), batchnorm(), dropout(),
	linear( ... , ... ), activation(), batchnorm(), dropout(),
	linear( ... , 10  ),  # output layer: no activation/batchnorm/dropout before softmax
	softmax(),
)
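A minimal runnable sketch of the fc(N) template, assuming N is the hidden-layer width (the fc(4)/fc(16) below); the layer sizes elided with "..." in the sketch, the default ReLU activation, and the argument names here are assumptions:

```python
import torch
from torch import nn

def fc(n_hidden, p_drop=0.0, activation=nn.ReLU):
    """Fully connected Fashion-MNIST classifier.

    Assumption: n_hidden fills the "..." hidden sizes in the sketch;
    the output layer gets no activation/batchnorm/dropout before softmax.
    """
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, n_hidden), activation(), nn.BatchNorm1d(n_hidden), nn.Dropout(p_drop),
        nn.Linear(n_hidden, n_hidden), activation(), nn.BatchNorm1d(n_hidden), nn.Dropout(p_drop),
        nn.Linear(n_hidden, 10),  # logits for the 10 Fashion-MNIST classes
        nn.Softmax(dim=1),
    )

model = fc(16)
out = model(torch.randn(8, 1, 28, 28))  # a batch of 8 grayscale 28x28 images
```

Note that in practice `softmax()` is often dropped and the logits fed directly to a cross-entropy loss, which applies log-softmax internally.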
	
		For the following analysis, we only run experiments on fc(4) and fc(16), and keep dropout = 0.
	
	
cn(N, m_size, M) = nn.Sequential(
	# feature extraction part: M convolution blocks
	conv2d(  1  , ..., kernel=3), activation(), maxpool2d(m_size),
	conv2d( ... , ..., kernel=3), activation(), maxpool2d(m_size),
	conv2d( ... , 128, kernel=3), activation(), maxpool2d(m_size),
	# classification part: N linear layers
	flatten(),
	linear( ... , ... ), activation(),
	linear( ... , ... ), activation(),
	linear( ... , 10  ),  # output layer: no activation before softmax
	softmax(),
)
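A runnable sketch of the cn(N, m_size, M) template for the M = 3, N = 3 case shown above, with m_size = 2. The intermediate channel counts (32, 64) and the hidden width are assumptions; only the final 128 channels appear in the sketch:

```python
import torch
from torch import nn

def cn(activation=nn.ReLU, m_size=2, hidden=64):
    """Convolutional Fashion-MNIST classifier (M = 3 convs, N = 3 linears).

    With kernel=3 (no padding) and m_size=2, the spatial size shrinks
    28 -> 26 -> 13 -> 11 -> 5 -> 3 -> 1, so the flattened feature vector
    is 128 * 1 * 1 = 128 (the hardcoded 128 below assumes m_size=2).
    """
    return nn.Sequential(
        # feature extraction part
        nn.Conv2d(1, 32, kernel_size=3), activation(), nn.MaxPool2d(m_size),
        nn.Conv2d(32, 64, kernel_size=3), activation(), nn.MaxPool2d(m_size),
        nn.Conv2d(64, 128, kernel_size=3), activation(), nn.MaxPool2d(m_size),
        # classification part
        nn.Flatten(),
        nn.Linear(128, hidden), activation(),
        nn.Linear(hidden, hidden), activation(),
        nn.Linear(hidden, 10),  # logits; no activation before softmax
        nn.Softmax(dim=1),
    )

model = cn()
out = model(torch.randn(4, 1, 28, 28))
```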
	
	
		The experiments vary three things: the model (fc or cn), the activation function, and the dropout rate.
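The three variables above (model, activation, dropout) define an experiment grid. A minimal sketch of enumerating it; the specific values on each axis are hypothetical, since the document only names the axes:

```python
from itertools import product

# Hypothetical choices for each axis; the document names only the axes.
models = ["fc(4)", "fc(16)", "cn(3, 2, 3)"]
activations = ["relu", "tanh", "sigmoid"]
dropouts = [0.0, 0.25, 0.5]

# One (model, activation, dropout) tuple per training run.
experiments = list(product(models, activations, dropouts))
print(len(experiments))  # 3 * 3 * 3 = 27 runs
```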