﻿<?xml version="1.0" encoding="utf-8" standalone="no"?>
<rss version="2.0">
  <channel>
    <title>Microsoft Research Publications</title>
    <link>http://research.microsoft.com/apps/dp/pu/publications.aspx</link>
    <description>Keep current with all the latest Microsoft Research Publications and Technical Reports</description>
    <copyright>© 2014 Microsoft Corporation. All rights reserved.</copyright>
    <language>en-US</language>
    <lastBuildDate>Tue, 19 Aug 2014 23:00:08 GMT</lastBuildDate>
    <pubDate>Tue, 19 Aug 2014 23:00:08 GMT</pubDate>
    <ttl>2880</ttl>
    <item>
      <title>DeLorean: Using Speculation to Enable Low-Latency Continuous Interaction for Cloud Gaming</title>
      <description>Gaming is very popular. Cloud gaming – where remote servers perform game execution and rendering on behalf of thin clients that simply send input and display output frames – promises any device the ability to play any game at any time. Unfortunately, the reality is that wide-area network latencies are often prohibitive; cellular, Wi-Fi and even wired residential end-host round-trip times (RTTs) can exceed 100ms, a threshold above which many gamers tend to deem responsiveness unacceptable. In this paper, we present DeLorean, a speculative execution system for mobile cloud gaming that is able to mask up to 250ms of network latency. DeLorean produces speculative rendered frames of future possible outcomes, delivering them to the client one entire RTT ahead of time; clients perceive no latency. To achieve this, DeLorean combines: 1) future input prediction; 2) state space subsampling and time shifting; 3) misprediction compensation; and 4) bandwidth compression. To evaluate the prediction and speculation techniques in DeLorean, we use two high-quality, commercially released games: a twitch-based first-person shooter, Doom 3, and an action role-playing game, Fable 3. Through user studies and performance benchmarks, we find that players overwhelmingly prefer DeLorean to traditional thin-client gaming where the network RTT is fully visible, and that DeLorean successfully mimics playing across a low-latency network.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226843</link>
      <pubDate>Thu, 21 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Telepathwords: preventing weak passwords by reading users' minds</title>
      <description />
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=216722</link>
      <pubDate>Wed, 20 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Towards reliable storage of 56-bit secrets in human memory</title>
      <description>Challenging the conventional wisdom that users cannot remember cryptographically-strong secrets, we test the hypothesis that users can learn randomly-assigned 56-bit codes (encoded as either 6 words or 12 characters) through spaced repetition. We asked remote research participants to perform a distractor task that required logging into a website 90 times, over up to two weeks, with a password of their choosing. After they entered their chosen password correctly we displayed a short code (4 letters or 2 words, 18.8 bits) that we required them to type. For subsequent logins we added an increasing delay prior to displaying the code, which participants could avoid by typing the code from memory. As participants learned, we added two more codes to comprise a 56.4-bit secret. Overall, 94% of participants eventually typed their entire secret from memory, learning it after a median of 36 logins. The learning component of our system added a median delay of just 6.9s per login and a total of less than 12 minutes over an average of ten days. 88% were able to recall their codes exactly when asked at least three days later, with only 21% reporting having written their secret down. As one participant wrote with surprise, "the words are branded into my brain."</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=216723</link>
      <pubDate>Wed, 20 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Towards reliable storage of 56-bit secrets in human memory (extended version)</title>
      <description />
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=220380</link>
      <pubDate>Wed, 20 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Appendix of Paper “Traffic Engineering with Forward Fault Correction”</title>
      <description>This is an appendix to the paper "Traffic Engineering with Forward Fault Correction", published in SIGCOMM 2014.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226739</link>
      <pubDate>Fri, 15 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Data constrained ecological models at the speed of thought</title>
      <description>Background/Question/Methods The development of formally data-constrained models of ecological phenomena is increasingly popular. While debate about good and bad practice in data-constrained modelling rages on, few doubt that the direct incorporation of empirical evidence into the formulation of ecological models is a good idea. However, the methods involved have traditionally been technically demanding, sometimes soaking up months to years of research time, and so there is keen interest in developing computational methods that make it less technically demanding to develop models and combine them with data appropriately within ecological research. Results/Conclusions I will describe the success we have had to date in employing the simple and fast inference library Filzbach within various research projects. Filzbach is a code library that enables Bayesian parameter inference via Markov chain Monte Carlo, using the Metropolis-Hastings algorithm, and in comparisons it is one of the fastest and easiest-to-use inference libraries available. I will underline why we have found it particularly useful in recent published research, particularly in the development of the first fully data-constrained global terrestrial carbon model, but will also highlight limitations, both with the algorithm and with the user interface, that point to new and improved formal data-constraining approaches. I will then go on to show our next generation of tools, designed to lower the technical overhead of conducting data-constrained modelling even further. As an example, I will show results from a recent study, made within the tool, investigating the mechanisms underpinning phytoplankton blooms in the North Atlantic.
In this case, the only code I wrote was the formal functional specification of the classical nutrient-phytoplankton-zooplankton model – all the rest, from climate and environmental data fetching, through parameter inference, to probabilistic forecasting, was done within our new tool. We hope innovations such as these will hasten the pace at which we produce demonstrably useful predictive models in ecology.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=217350</link>
      <pubDate>Sun, 10 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Post-quantum key exchange for the TLS protocol from the ring learning with errors problem</title>
      <description>Lattice-based cryptographic primitives are believed to offer resilience against attacks by quantum computers. We demonstrate the practicality of post-quantum key exchange by constructing ciphersuites for the Transport Layer Security (TLS) protocol that provide key exchange based on the ring learning with errors (R-LWE) problem; we accompany these ciphersuites with a rigorous proof of security. Our approach ties lattice-based key exchange together with traditional authentication using RSA or elliptic curve digital signatures: the post-quantum key exchange provides forward secrecy against future quantum attackers, while authentication can be provided using RSA keys that are issued by today's commercial certificate authorities, smoothing the path to adoption. Our cryptographically secure implementation, aimed at the 128-bit security level, reveals that the performance price when switching from non-quantum-safe key exchange is not too high. With our R-LWE ciphersuites integrated into the OpenSSL library and using the Apache web server on a 2-core desktop computer, we could serve 506 RLWE-ECDSA-AES128-GCM-SHA256 HTTPS connections per second for a 10 KiB payload. Compared to elliptic curve Diffie-Hellman, this means an 8 KiB increase in handshake size and a reduction in throughput of only 21%. This demonstrates that post-quantum key exchange can already be considered practical.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226372</link>
      <pubDate>Tue, 05 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>U-Prove extensions</title>
      <description>The U-Prove Cryptographic Specification focuses on the core U-Prove capabilities; the specified features were selected to simplify implementation and integration into existing systems, while meeting the needs of a wide array of scenarios. By design, the specification provides extension points, making it possible to extend the core capabilities to meet additional needs. This paper describes recently released features compatible with the U-Prove technology. The reader is assumed to be familiar with the technology, and is referred to the technology overview for an introduction.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226360</link>
      <pubDate>Tue, 05 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Well Begun is Half Done: Generating High-Quality Seeds for Automatic Image Dataset Construction from Web</title>
      <description>We present a fully automatic approach to construct a large-scale, high-precision dataset from noisy web images. Within the entire pipeline, we focus on generating high-quality seed images for subsequent dataset growing. High-quality seeds are essential, as we reveal, but previous works have paid relatively little attention to how to generate them automatically. In this work, we propose a density score based on rank-order distance to identify positive seed images. The basic idea is that images relevant to a concept are typically tightly clustered, while the outliers are widely scattered. Through adaptive thresholding, we ensure that the selected seeds are as numerous and accurate as possible. Starting with the high-quality seeds, we grow a high-quality dataset by dividing seeds and conducting iterative negative and positive mining. Our system can automatically collect thousands of images for one concept/class, with a precision rate of 95% or more. Comparisons with recent state-of-the-art methods also demonstrate our method’s superior performance.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226897</link>
      <pubDate>Fri, 01 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Structured Information Extraction from Natural Disaster Events on Twitter</title>
      <description>As soon as natural disaster events happen, users are eager to know more about them. However, search engines currently provide a ten blue links interface for queries related to such events. Relevance of results for such queries can be significantly improved if users are shown a structured summary of the fresh events related to such queries. This would not just reduce the number of user clicks to get the relevant information but would also help users get updated with more fine-grained attribute-level information. Twitter is a great source that can be exploited for obtaining such fine-grained structured information for fresh natural disaster events. Such events are often reported on Twitter much earlier than on other news media. However, extracting such structured information from tweets is challenging because: 1. tweets are noisy and ambiguous; 2. there is no well-defined schema for various types of natural disaster events; 3. it is not trivial to extract attribute-value pairs and facts from unstructured text; and 4. it is difficult to find good mappings between extracted attributes and attributes in the event schema. We propose algorithms to extract attribute-value pairs, and also devise novel mechanisms to map such pairs to manually generated schemas for natural disaster events. Besides the tweet text, we also leverage text from URL links in the tweets to fill such schemas. Our schemas are temporal in nature and the values are updated whenever fresh information flows in from human sensors on Twitter. Evaluation on ∼58000 tweets for 20 events shows that our system can fill such event schemas with an F1 of ∼0.6.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226896</link>
      <pubDate>Fri, 01 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Travel Time Estimation of a Path using Sparse Trajectories</title>
      <description>In this paper, we propose a citywide, real-time model for estimating the travel time of any path (represented as a sequence of connected road segments) in a city, based on the GPS trajectories of vehicles received in current time slots and over a period of history, as well as map data sources. Though this is a strategically important task in many traffic monitoring and routing systems, the problem has not been well solved yet given the following three challenges. The first is the data sparsity problem, i.e., many road segments may not be traveled by any GPS-equipped vehicles in the present time slot. In most cases, we cannot find a trajectory exactly traversing a query path either. Second, for a fragment of a path covered by trajectories, there are multiple ways of using (or combining) the trajectories to estimate the corresponding travel time. Finding an optimal combination is a challenging problem, subject to a tradeoff between the length of a path and the number of trajectories traversing the path (i.e., support). Third, we need to instantly answer users’ queries, which may concern any part of a given city. This calls for an efficient, scalable and effective solution that can enable citywide, real-time travel time estimation. To address these challenges, we model different drivers’ travel times on different road segments in different time slots with a three-dimensional tensor. Combined with geospatial, temporal and historical contexts learned from trajectories and map data, we fill in the tensor’s missing values through a context-aware tensor decomposition approach. We then devise and prove an objective function to model the aforementioned tradeoff, with which we find the optimal concatenation of trajectories for an estimate through a dynamic programming solution.
In addition, we propose using frequent trajectory patterns (mined from historical trajectories) to scale down the candidate concatenations, and a suffix-tree-based index to manage the trajectories received in the present time slot. We evaluate our method through extensive experiments, using GPS trajectories generated by more than 32,000 taxis over a period of two months. The results demonstrate the effectiveness, efficiency and scalability of our method compared with baseline approaches.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=217493</link>
      <pubDate>Fri, 01 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing</title>
      <description>Many fields of science and engineering, ranging from predicting protein structures to building machine translation systems, require large amounts of labeled data. These labeling tasks have traditionally been performed by experts; the limited pool of experts limits the size of the datasets, and makes the process slow and expensive. In recent years, there has been rapidly increasing interest in using crowds of semi-skilled workers recruited through the Internet. While this 'crowdsourcing' can cheaply produce large amounts of labeled data in short times, it is typically plagued by the problem of low quality. To address this fundamental challenge in crowdsourcing, we design a novel reward mechanism for acquiring high-quality data, which incentivizes workers to censor their own low-quality data. Our main results are the mathematical proofs showing that, surprisingly, under a natural and desirable 'no-free-lunch' requirement, this is the one and only mechanism that is incentive-compatible. The simplicity of the mechanism is an additional attractive property. In preliminary experiments involving over 900 worker-tasks, we observe up to a three-fold drop in the error rates under this unique incentive mechanism.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226860</link>
      <pubDate>Fri, 01 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Ziria: Language for Rapid Prototyping of Wireless PHY</title>
      <description>Software-defined radios (SDR) have the potential to bring major innovation to wireless networking design. However, their impact so far has been limited due to complex programming tools. Most of the existing tools are either too slow to achieve the full line speeds of contemporary wireless PHYs or too complex to master. In this demo we present our novel SDR programming environment called Ziria. Ziria consists of a novel programming language and an optimizing compiler. The compiler is able to synthesize very efficient SDR code from high-level PHY descriptions written in the Ziria language. To illustrate its potential, we present the design of an LTE-like PHY layer in Ziria. We run it on the Sora SDR platform and demonstrate on a test-bed that it is able to operate in real time.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226858</link>
      <pubDate>Fri, 01 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Sketch-based Influence Maximization and Computation: Scaling up with Guarantees</title>
      <description>Propagation of contagion through networks is a fundamental process. It is used to model the spread of information, influence, or a viral infection. Diffusion patterns can be specified by a probabilistic model, such as Independent Cascade (IC), or captured by a set of representative traces.  Basic computational problems in the study of diffusion are influence queries (determining the potency of a specified seed set of nodes) and Influence Maximization (identifying the most influential seed set of a given size). Answering each influence query involves many edge traversals, and does not scale when there are many queries on very large graphs. The gold standard for Influence Maximization is the greedy algorithm, which iteratively adds to the seed set a node maximizing the marginal gain in influence. Greedy has a guaranteed approximation ratio of at least (1−1/e) and actually produces a sequence of nodes, with each prefix having approximation guarantee with respect to the same-size optimum. Since Greedy does not scale well beyond a few million edges, for larger inputs one must currently use either heuristics or alternative algorithms designed for a pre-specified small seed set size. We develop a novel sketch-based design for influence computation. Our greedy Sketch-based Influence Maximization (SKIM) algorithm scales to graphs with billions of edges, with one to two orders of magnitude speedup over the best greedy methods. It still has a guaranteed approximation ratio, and in practice its quality nearly matches that of exact greedy. We also present influence oracles, which use linear-time preprocessing to generate a small sketch for each node, allowing the influence of any seed set to be quickly answered from the sketches of its nodes.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226623</link>
      <pubDate>Fri, 01 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Distance Queries from Sampled Data: Accurate and Efficient</title>
      <description>Distance queries are a basic tool in data analysis. They are used for detection and localization of change for the purpose of anomaly detection, monitoring, or planning. Distance queries are particularly useful when data sets such as measurements, snapshots of a system, content, traffic matrices, and activity logs are collected repeatedly. Random sampling, which can be efficiently performed over streamed or distributed data, is an important tool for scalable data analysis. The sample constitutes an extremely flexible summary, which naturally supports domain queries and scalable estimation of statistics, which can be specified after the sample is generated. The effectiveness of a sample as a summary, however, hinges on the estimators we have.  We derive novel estimators for estimating L_p distance from sampled data. Our estimators apply with the most common weighted sampling schemes: Poisson Probability Proportional to Size (PPS) and its fixed sample size variants. They also apply when the samples of different data sets are independent or coordinated. Our estimators are admissible (Pareto optimal in terms of variance) and have compelling properties.  We study the performance of our Manhattan and Euclidean distance (p=1,2) estimators on diverse datasets, demonstrating scalability and accuracy even when a small fraction of the data is sampled. Our work, for the first time, facilitates effective distance estimation over sampled data.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226637</link>
      <pubDate>Fri, 01 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Skolemization Modulo Theories</title>
      <description>Combining classical automated theorem proving techniques with theory based reasoning, such as satisfiability modulo theories, is a new approach to first-order reasoning modulo theories. Skolemization is a classical technique used to transform first-order formulas into equisatisfiable form. We show how Skolemization can benefit from a new satisfiability modulo theories based simplification technique of formulas called monadic decomposition. The technique can be used to transform a theory dependent formula over multiple variables into an equivalent form as a Boolean combination of unary formulas, where a unary formula depends on a single variable. In this way, theory specific variable dependencies can be eliminated and consequently, Skolemization can be refined by minimizing variable scopes in the decomposed formula in order to yield simpler Skolem terms.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=217932</link>
      <pubDate>Fri, 01 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Single cell analyses of regulatory network perturbations using enhancer targeting TAL Effectors suggest novel roles for PU.1 during haematopoietic specification</title>
      <description />
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226358</link>
      <pubDate>Fri, 01 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Towards Synthesizing Executable Models in Biology</title>
      <description />
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226690</link>
      <pubDate>Fri, 01 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>An Introduction to Computational Networks and the Computational Network Toolkit</title>
      <description>We introduce the computational network (CN), a unified framework for describing arbitrary learning machines – such as deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM), logistic regression, and maximum entropy models – that can be illustrated as a series of computational steps. A CN is a directed graph in which each leaf node represents an input value or a parameter and each non-leaf node represents a matrix operation upon its children. We describe algorithms to carry out forward computation and gradient calculation in a CN and introduce the most popular computation node types used in a typical CN. We further introduce the computational network toolkit (CNTK), an implementation of CN that supports both GPU and CPU. We describe the architecture and the key components of CNTK, the command-line options for using it, and the network definition and model editing language, and provide sample setups for acoustic models, language models, and spoken language understanding. We also describe the Argon speech recognition decoder as an example of integration with CNTK.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226641</link>
      <pubDate>Fri, 01 Aug 2014 07:00:00 GMT</pubDate>
    </item>
    <item>
      <title>A Probabilistic Model for Learning Multi-Prototype Word Embeddings</title>
      <description>Distributed word representations have been widely used and proven useful in quite a few natural language processing and text mining tasks. Most existing word embedding models aim at generating only one embedding vector for each individual word, which, however, limits their effectiveness because a huge number of words are polysemous (such as \emph{bank} and \emph{star}). To address this problem, it is necessary to build multiple embedding vectors to represent the different meanings of a word respectively. Some recent studies attempted to train multi-prototype word embeddings by clustering context window features of the word. However, due to the large number of parameters to train, these methods offer limited scalability and are inefficient to train with big data. In this paper, we introduce a much more efficient method for learning multiple embedding vectors for polysemous words. In particular, we first propose to model word polysemy from a probabilistic perspective and integrate it with the highly efficient continuous Skip-Gram model. Under this framework, we design an Expectation-Maximization algorithm to learn a word's multiple embedding vectors. With far fewer parameters to train, our model can achieve comparable or even better results on word-similarity tasks compared with conventional methods.</description>
      <link>http://research.microsoft.com/apps/pubs/default.aspx?id=226629</link>
      <pubDate>Fri, 01 Aug 2014 07:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>