Monday, August 21, 2017

The Sun Eclipse of 2017

Credit: NASA/JPL/Space Science Institute
Released: December 18, 2009 (PIA 11648)


The webcast for this coming eclipse will start in 15 minutes here: https://eclipse2017.nasa.gov/
The eclipse itself will be viewable in an hour.
The first telescope to check seems to be the one in Madras, Oregon (with 2.02 minutes of total darkness).

It's also going to be viewable from the International Space Station, woohoo !









Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

SMASH: One-Shot Model Architecture Search through HyperNetworks - implementation -

Today, we have a video, a preprint and an implementation. What else can you ask for ?




Designing architectures for deep neural networks requires expert knowledge and substantial computation time. We propose a technique to accelerate architecture selection by learning an auxiliary HyperNet that generates the weights of a main model conditioned on that model's architecture. By comparing the relative validation performance of networks with HyperNet-generated weights, we can effectively search over a wide range of architectures at the cost of a single training run. To facilitate this search, we develop a flexible mechanism based on memory read-writes that allows us to define a wide range of network connectivity patterns, with ResNet, DenseNet, and FractalNet blocks as special cases. We validate our method (SMASH) on CIFAR-10 and CIFAR-100, STL-10, ModelNet10, and Imagenet32x32, achieving competitive performance with similarly-sized hand-designed networks. Our code is available at https://github.com/ajbrock/SMASH

An implementation of SMASH is here: https://github.com/ajbrock/SMASH

h/t Andrew, GitXiv




Friday, August 18, 2017

Learning Transferable Architectures for Scalable Image Recognition / The Cost of Everything and the Value of Nothing


Hardmaru mentioned the following preprint: 


to which David eventually replied (see the thread).

Here are the two papers:


Developing state-of-the-art image classification models often requires significant architecture engineering and tuning. In this paper, we attempt to reduce the amount of architecture engineering by using Neural Architecture Search to learn an architectural building block on a small dataset that can be transferred to a large dataset. This approach is similar to learning the structure of a recurrent cell within a recurrent network. In our experiments, we search for the best convolutional cell on the CIFAR-10 dataset and then apply this learned cell to the ImageNet dataset by stacking together more of this cell. Although the cell is not learned directly on ImageNet, an architecture constructed from the best learned cell achieves state-of-the-art accuracy of 82.3% top-1 and 96.0% top-5 on ImageNet, which is 0.8% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS. This cell can also be scaled down two orders of magnitude: a smaller network constructed from the best cell also achieves 74% top-1 accuracy, which is 3.1% better than the equivalently-sized, state-of-the-art models for mobile platforms.

The Cost of Everything and the Value of Nothing by David Moloney, CTO






Thursday, August 17, 2017

Job: Smart Grid Data Scientist, Neuchâtel, Switzerland

Rafael let me know of the following opportunity at this Swiss non-profit corporation:


Dear Igor, 
We have an opening at CSEM (Switzerland) that might be interesting for Nuit-Blanche readers. Can I ask you to post the following announcement in your blog?
http://www.csem.ch/Page.aspx?pid=31515&jobid=107129
Best regards,
Rafael
___________________________________________________
Rafael E. Carrillo, Ph.D.
R&D Engineer
Energy Systems
rafael.carrillo@csem.ch

Wednesday, August 16, 2017

On the Expressive Power of Deep Neural Networks

In a Twitter exchange with Mark, Surya mentioned four background papers from his group supporting his ICML presentation last week. Here they are:





We propose a new approach to the problem of neural network expressivity, which seeks to characterize how structural properties of a neural network family affect the functions it is able to compute. Our approach is based on an interrelated set of measures of expressivity, unified by the novel notion of trajectory length, which measures how the output of a network changes as the input sweeps along a one-dimensional path. Our findings can be summarized as follows:
(1) The complexity of the computed function grows exponentially with depth.
(2) All weights are not equal: trained networks are more sensitive to their lower (initial) layer weights.
(3) Regularizing on trajectory length (trajectory regularization) is a simpler alternative to batch normalization, with the same performance.
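To make the notion of trajectory length concrete, here is a minimal sketch that pushes a circular input path through a random tanh network and measures the length of its image after each layer. Everything in it is an assumption made for illustration rather than the paper's setup: the function names, the bias-free square layers, the N(0, sigma_w^2 / width) weights and the polygonal approximation of the path.

```python
import math
import random


def random_tanh_layers(width, depth, sigma_w, rng):
    """Draw `depth` square weight matrices with i.i.d. N(0, sigma_w^2 / width) entries."""
    std = sigma_w / math.sqrt(width)
    return [
        [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        for _ in range(depth)
    ]


def circle_path(width, num_segments):
    """Unit circle in the first two input coordinates, sampled at num_segments + 1 points."""
    points = []
    for i in range(num_segments + 1):
        theta = 2.0 * math.pi * i / num_segments
        points.append([math.cos(theta), math.sin(theta)] + [0.0] * (width - 2))
    return points


def path_length(points):
    """Length of the polygonal path through consecutive points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def trajectory_lengths(layers, points):
    """Return the path length at the input and after each tanh layer."""
    lengths = [path_length(points)]
    for W in layers:
        points = [
            [math.tanh(sum(w * x for w, x in zip(row, p))) for row in W]
            for p in points
        ]
        lengths.append(path_length(points))
    return lengths


if __name__ == "__main__":
    rng = random.Random(0)
    layers = random_tanh_layers(width=20, depth=6, sigma_w=3.0, rng=rng)
    for depth, length in enumerate(trajectory_lengths(layers, circle_path(20, 200))):
        print(depth, round(length, 3))
```

Trying several values of sigma_w and depth is a quick way to see how the measured length depends on both; the sketch only measures the quantity and makes no claim about matching the paper's numbers.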


We study the behavior of untrained neural networks whose weights and biases are randomly distributed using mean field theory. We show the existence of depth scales that naturally limit the maximum depth of signal propagation through these random networks. Our main practical result is to show that random networks may be trained precisely when information can travel through them. Thus, the depth scales that we identify provide bounds on how deep a network may be trained for a specific choice of hyperparameters. As a corollary to this, we argue that in networks at the edge of chaos, one of these depth scales diverges. Thus arbitrarily deep networks may be trained only sufficiently close to criticality. We show that the presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth for random networks. Finally, we develop a mean field theory for backpropagation and we show that the ordered and chaotic phases correspond to regions of vanishing and exploding gradient respectively.


We combine Riemannian geometry with the mean field theory of high dimensional chaos to study the nature of signal propagation in generic, deep neural networks with random weights. Our results reveal an order-to-chaos expressivity phase transition, with networks in the chaotic phase computing nonlinear functions whose global curvature grows exponentially with depth but not width. We prove this generic class of deep random functions cannot be efficiently computed by any shallow network, going beyond prior work restricted to the analysis of single functions. Moreover, we formalize and quantitatively demonstrate the long conjectured idea that deep networks can disentangle highly curved manifolds in input space into flat manifolds in hidden space. Our theoretical analysis of the expressive power of deep networks broadly applies to arbitrary nonlinearities, and provides a quantitative underpinning for previously abstract notions about the geometry of deep functions.




Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
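As a rough illustration of the random orthogonal initial conditions mentioned in this abstract, the sketch below builds orthogonal weight matrices by Gram-Schmidt on Gaussian vectors and pushes a vector through a deep linear stack. It only shows the forward-pass fact that orthogonal layers preserve the norm of a signal; it does not reproduce the learning dynamics analysed in the paper. The function names, sizes and the scaled Gaussian baseline are assumptions chosen for illustration.

```python
import math
import random


def random_orthogonal(n, rng):
    """Return an n x n matrix with orthonormal rows (Gram-Schmidt on Gaussian vectors)."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Two orthogonalization passes keep the result accurate in floating point.
        for _ in range(2):
            for q in rows:
                dot = sum(a * b for a, b in zip(v, q))
                v = [a - dot * b for a, b in zip(v, q)]
        length = math.sqrt(sum(a * a for a in v))
        if length > 1e-8:  # redraw in the unlikely event of a degenerate vector
            rows.append([a / length for a in v])
    return rows


def scaled_gaussian(n, rng):
    """Baseline for comparison: i.i.d. N(0, 1/n) entries."""
    std = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(n)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def propagate(layers, x):
    """Apply a stack of linear layers to x, first layer first."""
    for W in layers:
        x = matvec(W, x)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    n, depth = 16, 50
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    orth = [random_orthogonal(n, rng) for _ in range(depth)]
    gauss = [scaled_gaussian(n, rng) for _ in range(depth)]
    print("input norm:       ", norm(x))
    print("orthogonal stack: ", norm(propagate(orth, x)))
    print("Gaussian stack:   ", norm(propagate(gauss, x)))
```

The orthogonal stack returns a vector with the same norm as the input at any depth, up to rounding, while the norm under the Gaussian stack varies from draw to draw.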



Thesis: Towards Scalable Gradient-Based Hyperparameter Optimization in Deep Neural Networks by Jie Fu

Although a little late, congratulations Dr. Fu !



Towards Scalable Gradient-Based Hyperparameter Optimization in Deep Neural Networks by Jie Fu
It is well-known that the performance of large-sized deep neural networks (DNNs) is sensitive to the setting of their hyperparameters. Hyperparameter optimization is thus recognized as a crucial step in the process of applying DNNs to achieve best performance and drive industrial applications. For many years, the de-facto standard for hyperparameter tuning in deep learning has been a simple grid search. Recently, Bayesian optimization has been proposed for automatic hyperparameter tuning. However, it can hardly tune more than 20 hyperparameters simultaneously. Furthermore, the elementary- and hyper-parameter optimization tasks are usually solved separately where the hyperparameter optimization process, defined as the outer loop, does not make full use of the inner elementary optimization process. To address these issues, we propose effective, efficient and scalable gradient-based methods for optimizing elementary- and hyper-parameters in DNNs in a unified manner. The first is a novel approximate method, DrMAD, for obtaining gradients with respect to hyperparameters based on asymmetric reverse-mode automatic differentiation. It is 15 ∼ 45 times faster and consumes 50 ∼ 100 times less memory on a variety of benchmark datasets compared to the state-of-the-art methods for optimizing hyperparameters with minimal compromise to its effectiveness. Inspired by the approximate nature of DrMAD, we develop an adaptive and approximate gradient-based method for optimizing elementary parameters in DNNs, which is more effective. We also propose an effective, efficient and scalable neural optimizer using a recurrent neural network (RNN) for tuning dynamic parameter-wise hyperparameters of another DNN. The proposed neural optimizer is trained using the approximate hypergradients obtained from DrMAD.
Extensive experiments show that our approach outperforms the state-of-the-art neural optimizer in terms of classification accuracy of the DNN being optimized for long horizons, but converges at least 20 times faster and consumes about 100 times less memory. To the best of our knowledge, the works described in this thesis represent the first forays into the scalable gradient-based methods for elementary- and hyper-parameter optimization in DNNs in a unified manner.



The performance of deep neural networks is well-known to be sensitive to the setting of their hyperparameters. Recent advances in reverse-mode automatic differentiation allow for optimizing hyperparameters with gradients. The standard way of computing these gradients involves a forward and backward pass of computations. However, the backward pass usually needs to consume unaffordable memory to store all the intermediate variables to exactly reverse the forward training procedure. In this work we propose a simple but effective method, DrMAD, to distill the knowledge of the forward pass into a shortcut path, through which we approximately reverse the training trajectory. Experiments on several image benchmark datasets show that DrMAD is at least 45 times faster and consumes 100 times less memory compared to state-of-the-art methods for optimizing hyperparameters with minimal compromise to its effectiveness. To the best of our knowledge, DrMAD is the first research attempt to make it practical to automatically tune thousands of hyperparameters of deep neural networks. The code can be downloaded from https://github.com/bigaidream-projects/drmad

The DrMAD GitHub is here: https://github.com/bigaidream-projects/drmad

Monday, August 14, 2017

Randomization or Condensation?: Linear-Cost Matrix Sketching Via Cascaded Compression Sampling / A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication / Effective sketching methods for value function approximation

We are starting the week with some sketching and randomized approaches ! 







Matrix sketching is aimed at finding compact representations of a matrix while simultaneously preserving most of its properties, which is a fundamental building block in modern scientific computing. Randomized algorithms represent state-of-the-art and have attracted huge interest from the fields of machine learning, data mining, and theoretic computer science. However, it still requires the use of the entire input matrix in producing desired factorizations, which can be a major computational and memory bottleneck in truly large problems. In this paper, we uncover an interesting theoretic connection between matrix low-rank decomposition and lossy signal compression, based on which a cascaded compression sampling framework is devised to approximate an m-by-n matrix in only O(m+n) time and space. Indeed, the proposed method accesses only a small number of matrix rows and columns, which significantly improves the memory footprint. Meanwhile, by sequentially teaming two rounds of approximation procedures and upgrading the sampling strategy from a uniform probability to more sophisticated, encoding-orientated sampling, significant algorithmic boosting is achieved to uncover more granular structures in the data. Empirical results on a wide spectrum of real-world, large-scale matrices show that by taking only linear time and space, the accuracy of our method rivals those state-of-the-art randomized algorithms consuming a quadratic, O(mn), amount of resources. 



In recent years, randomized methods for numerical linear algebra have received growing interest as a general approach to large-scale problems. Typically, the essential ingredient of these methods is some form of randomized dimension reduction, which accelerates computations, but also creates random approximation error. In this way, the dimension reduction step encodes a tradeoff between cost and accuracy. However, the exact numerical relationship between cost and accuracy is typically unknown, and consequently, it may be difficult for the user to precisely know (1) how accurate a given solution is, or (2) how much computation is needed to achieve a given level of accuracy. In the current paper, we study randomized matrix multiplication (sketching) as a prototype setting for addressing these general problems. As a solution, we develop a bootstrap method for directly estimating the accuracy as a function of the reduced dimension (as opposed to deriving worst-case bounds on the accuracy in terms of the reduced dimension). From a computational standpoint, the proposed method does not substantially increase the cost of standard sketching methods, and this is made possible by an "extrapolation" technique. In addition, we provide both theoretical and empirical results to demonstrate the effectiveness of the proposed method.
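The idea in this abstract can be sketched in a few lines: approximate A^T B from a small random sample of rows, then resample those sketched rows to estimate how large the approximation error is likely to be. The code below is a simplified illustration under stated assumptions and not the authors' algorithm. It uses uniform row sampling, the entrywise maximum as the error measure and a plain bootstrap without the paper's extrapolation step, and all function names are hypothetical.

```python
import random


def sample_rows(A, B, t, rng):
    """Uniformly sample t row pairs (with replacement), rescaled so that the
    sketched product is an unbiased estimate of A^T B."""
    n = len(A)
    scale = (n / t) ** 0.5
    idx = [rng.randrange(n) for _ in range(t)]
    SA = [[scale * x for x in A[i]] for i in idx]
    SB = [[scale * x for x in B[i]] for i in idx]
    return SA, SB


def transpose_product(A, B):
    """Return A^T B for row-major lists A (n x d) and B (n x p)."""
    d, p = len(A[0]), len(B[0])
    out = [[0.0] * p for _ in range(d)]
    for a_row, b_row in zip(A, B):
        for j, a in enumerate(a_row):
            for k, b in enumerate(b_row):
                out[j][k] += a * b
    return out


def max_abs_diff(X, Y):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for xr, yr in zip(X, Y) for x, y in zip(xr, yr))


def bootstrap_error(SA, SB, n_boot, quantile, rng):
    """Estimate a quantile of the sketching error by resampling the sketched rows."""
    t = len(SA)
    base = transpose_product(SA, SB)
    errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(t) for _ in range(t)]
        boot = transpose_product([SA[i] for i in idx], [SB[i] for i in idx])
        errors.append(max_abs_diff(boot, base))
    errors.sort()
    return errors[min(n_boot - 1, int(quantile * n_boot))]


if __name__ == "__main__":
    rng = random.Random(0)
    n, d, p, t = 2000, 4, 3, 200
    A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    B = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    SA, SB = sample_rows(A, B, t, rng)
    actual = max_abs_diff(transpose_product(SA, SB), transpose_product(A, B))
    estimate = bootstrap_error(SA, SB, n_boot=100, quantile=0.9, rng=rng)
    print("actual error:", actual)
    print("bootstrap 90% estimate:", estimate)
```

The point of the comparison is that the bootstrap estimate is computed from the sketched rows alone, whereas the actual error requires the full product that sketching was meant to avoid.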

High-dimensional representations, such as radial basis function networks or tile coding, are common choices for policy evaluation in reinforcement learning. Learning with such high-dimensional representations, however, can be expensive, particularly for matrix methods, such as least-squares temporal difference learning or quasi-Newton methods that approximate matrix step-sizes. In this work, we explore the utility of sketching for these two classes of algorithms. We highlight issues with sketching the high-dimensional features directly, which can incur significant bias. As a remedy, we demonstrate how to use sketching more sparingly, with only a left-sided sketch, that can still enable significant computational gains and the use of these matrix-based learning algorithms that are less sensitive to parameters. We empirically investigate these algorithms, in four domains with a variety of representations. Our aim is to provide insights into effective use of sketching in practice.




Saturday, August 12, 2017

Saturday Morning Videos: Convolutional Neural Networks for Visual Recognition (Spring 2017 Stanford CS 231n)

Justin mentioned this on his Twitter feed:

















Friday, August 11, 2017

30 new Numerical Tours in Julia - implementation -

Gabriel mentioned it on his Twitter feed:
More than 30 new Numerical Tours in Julia have just been released. They include all the most important Numerical Tours and cover all the topics, from wavelet image denoising to 3D mesh parameterization. Enjoy!
This conversion has been funded by the ERC project SIGMA-Vision, and it has been performed by Ayman Chaouki and Quentin Moret. Congratulations on this nice work!
These are the Julia tours, which can be browsed as HTML pages but can also be downloaded as iPython notebooks. Please read the installation page for more information about how to run these tours.
Note that porting all the Numerical Tours to Julia is a work in progress. Help is welcome; please refer to the GitHub repository for how to proceed.
Topics covered: Basics; Wavelets; Approximation, Coding and Compression; Denoising; Inverse Problems; Optimization; Shapes; Audio Processing; Computer Graphics; Mesh Parameterization and Deformation; Geodesic Processing; Optimal Transport.


