We have been working on Microsoft's cloud platform: Azure. In this section we introduce cloud computing, the Azure services and storage, and some key ideas of cloud application design. Some specificities of cloud computing hardware impose new constraints on our algorithm. We describe the hardware performance and how it impacts our algorithm, before building a cost model and running experiments.

\subsection{Cloud/Azure Introduction}

Cloud computing is a set of technologies (virtualization, distributed storage, etc.; see \cite{lenk2009witcaamotcl}) merged into a coherent framework to commoditize information technology (IT). Just as electricity became available to consumers in the 1930s when national electricity grids were developed, cloud computing is a crucial step toward providing computation and storage as a service to any customer. Many papers have recently been published on cloud HPC performance and its potential impact on scientific research: \cite{BridgingtheGap}, \cite{BenchmarkingAmazonEC2}, \cite{HighPerformanceComputingWithClouds}, \cite{TOP500}, \cite{BenchmarkingCloudServices}. We also recommend the introduction to cloud computing given in \cite{lenk2009witcaamotcl}.

As reported in \cite{BenchmarkingCloudServices}, each cloud system has developed different strategies and APIs to address different needs, which makes the various cloud offers difficult to compare. Yet three cloud levels can be isolated:

\begin{itemize}
\item \textbf{Software as a Service} (SaaS): applications are exposed as a service running on a cloud infrastructure. Little freedom is left to users beyond using the application built on top of the cloud. Examples include GMail, Salesforce.com and Zoho.
\item \textbf{Platform as a Service} (PaaS): users are given abstractions and/or a framework to develop applications; the framework provides an ``easy'' environment in which to run their code, and storage and/or workers are abstracted. Main examples are Amazon Simple Storage Service (S3), Microsoft Azure and BungeeLabs.
\item \textbf{Infrastructure as a Service} (IaaS): computing resources (compute, storage and network) are exposed as a capability, relieving users of the burden of owning, managing or controlling the underlying infrastructure. The main example is Amazon Elastic Compute Cloud (EC2).
\end{itemize}

Since SaaS offers little beyond already-built applications, we are interested in the choice between PaaS and IaaS. These two approaches resolve some tradeoffs differently:

\begin{itemize}
\item \textbf{Abstraction/control tradeoff}: the more control users get, the fewer off-the-shelf abstractions they can rely on. PaaS provides higher abstractions to develop applications than IaaS, but the application has to conform to the framework provided, and some things cannot be achieved in PaaS; for instance, a PaaS user typically cannot choose the operating system, install arbitrary system software or control on which physical machine the code runs.
\item \textbf{Scalability/cost-to-develop tradeoff}: running an application on a PaaS grants great scalability comfort. Yet this scalability only comes as the result of a strong redesign of the application so that it fits the framework provided (the Azure framework is explained below): scalability comes as a side effect of the redesign. On the contrary, an IaaS offer lets you choose: you can run your application as it is, with scalability bounded by its present design, or redesign it so that scalability is improved.
\end{itemize}

We run our experiments on Azure, through Lokad.Cloud, an open-source .NET execution framework that abstracts the Azure queues, workers and storage, and spares us most of the low-level plumbing described below.

Presently, Azure is designed to provide PaaS. As explained in the Azure FAQs, Microsoft plans to add Virtual Machine (VM) functionality to enrich the possible uses of Azure; Azure may therefore also become an IaaS. Yet, since no such service has been released so far, we will assume Azure is purely PaaS-oriented.

\subsection{Other tradeoffs in Azure}

\textbf{Read/write tradeoff}: as reported in \cite{BenchmarkingCloudServices}, a distributed storage must choose a tradeoff between read throughput and write throughput. Taking the example of a blob in BlobStorage, a blob cannot be highly available for read operations and write operations at the same time: high read availability implies the blob is replicated, so write operations must update several replicas and thus take more time. Azure lets the traffic arbitrate this tradeoff: the more requests a blob receives per time unit, the more it is replicated, hence the more available it becomes for reads and the longer write operations take.

\textbf{Responsiveness/efficiency tradeoff}: \cite{BenchmarkingCloudServices} also reports a tradeoff between service responsiveness and efficiency. Efficiency, defined as the ratio of work completed per hour per worker, is a decreasing function of the number of workers and an increasing function of the total workload. To achieve the best possible efficiency, we should therefore make sure every worker is overworked. Yet the more overworked workers are, the less responsive they can be to an additional task. Since the first goal of our algorithm is responsiveness (we want to minimize the wall-clock time of the algorithm), we provide a pool of workers slightly larger than the number of jobs we can parallelize, so that all jobs that can run in parallel actually do.
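This sizing rule follows from a one-line makespan computation. Assuming $P$ independent jobs of identical duration $T$ processed by $W$ interchangeable workers, the completion time is
\[
T_{\mathrm{total}}(W) \;=\; \left\lceil \frac{P}{W} \right\rceil T,
\qquad\text{so}\qquad
T_{\mathrm{total}}(P-1) \;=\; 2T \;=\; 2\,T_{\mathrm{total}}(P).
\]
Hence running $P$ jobs on $P-1$ workers takes roughly twice the time needed with $P$ or $P+1$ workers, which is why we slightly over-provision the pool.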
\textbf{Synchronous replication/replication latency tradeoff}: this tradeoff is closely related to consistency. Replicating a write synchronously keeps all replicas consistent at the price of a higher write latency, while replicating it asynchronously lowers the write latency but only provides weaker (eventual) consistency guarantees.

The AzureScope website \cite{AzureScope} reports performance benchmarks run on the Azure hardware from February/March 2010 up to the present (June 2010). Papers with similar measures have also been released very recently \cite{Early_Observations_On_Azure}. The numbers reported in \cite{Early_Observations_On_Azure} seem more accurate than those of \cite{AzureScope}, and the methodology is better explained, so we will use them as our reference to understand whether bandwidth is an issue on Azure.

\begin{itemize}
\item BlobStorage read bandwidth: 13 MBytes/sec for 1 worker; the aggregated read bandwidth is bounded by 393 MBytes/sec when using more workers (each worker using several threads to load data from storage). This quantity describes the bandwidth for several workers downloading the same blob from the storage.
\item BlobStorage write bandwidth: 5 MBytes/sec for 1 worker; the aggregated write bandwidth is bounded by 124 MBytes/sec when using more workers (each worker using several threads to upload data). This quantity describes the bandwidth for one or several workers writing to different blobs within the same container.
\item Communication throughput between two workers: 55 MBytes/sec on average for large messages (a very impressive bandwidth).
\item CPU power depends on the type of VM allocated: from 1 core on a small VM to 8 cores on an extra-large VM.
\item Queue throughput is presently about 500 messages per second, but may increase substantially soon.
\end{itemize}
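To get a feeling for these numbers, the following back-of-envelope sketch (in Python) turns the bandwidth figures above into transfer times; the dataset size and worker count in the usage example are illustrative assumptions, not measurements.

\begin{verbatim}
# Back-of-envelope transfer times derived from the figures quoted above.
READ_MBS_1W, READ_MBS_MAX = 13.0, 393.0    # BlobStorage read (MBytes/sec)
WRITE_MBS_1W, WRITE_MBS_MAX = 5.0, 124.0   # BlobStorage write (MBytes/sec)

def read_seconds(mbytes, workers):
    """Time to read `mbytes` in parallel, capped by the aggregate bound."""
    return mbytes / min(workers * READ_MBS_1W, READ_MBS_MAX)

def write_seconds(mbytes, workers):
    """Time to write `mbytes` in parallel, capped by the aggregate bound."""
    return mbytes / min(workers * WRITE_MBS_1W, WRITE_MBS_MAX)

# E.g., 1000 MBytes of points re-read at each K-Means iteration by 50 workers:
print(read_seconds(1000.0, 50))    # ~2.5 seconds per iteration
print(write_seconds(1000.0, 50))   # ~8.1 seconds if everything were rewritten
\end{verbatim}

Under these illustrative assumptions, re-reading the whole dataset at each iteration costs on the order of seconds, so storage bandwidth is a real but not necessarily prohibitive cost.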
Note: since concurrent writes to a blob are very costly, we will never write twice to the same place, except when using counters.

Before trying to estimate possible speed-ups, we have to redesign the algorithm so that it fits these hardware constraints. In the remainder of this subsection, we describe some key points in the design of cloud applications in general, and of our K-Means in particular.

\textbf{Services Oriented Architecture.} Cloud applications are inspired by SOA (Services Oriented Architecture) and are divided into cloud services. Each cloud service is associated with a specific queue and is responsible for one logical part of the application. Services communicate through queues and storage: when a service A is done and needs to notify another service B that it can start, A pushes its computation results into one or several blobs in the storage, then pushes into the queue associated with B a message containing the keys of those blobs, so that B can load the data back from storage. This design keeps the different services of the application loosely coupled and scalable. The storage on Azure has been designed not only to store data and results, but also to play the shared-memory role of SMP architectures (with some limitations, though). A minimal sketch of this messaging pattern is given below.
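The sketch is in Python, whereas our actual implementation relies on the .NET framework Lokad.Cloud; \texttt{blob\_storage} and \texttt{queues} are hypothetical stand-ins for the Azure storage and queue clients.

\begin{verbatim}
import json, uuid

# Hypothetical stand-ins: `blob_storage` offers put/get on blobs and
# `queues[name]` offers push/pop on the queue of service `name`.

def notify(blob_storage, queues, results, target_service):
    """Service A: persist results as blobs, then tell service B where they are."""
    keys = []
    for result in results:
        key = str(uuid.uuid4())
        blob_storage.put(key, result)              # 1. results go to storage
        keys.append(key)
    queues[target_service].push(json.dumps(keys))  # 2. message = blob keys only

def serve(blob_storage, queues, service_name, process):
    """Service B: pop a message, load the referenced blobs, run the job."""
    message = queues[service_name].pop()
    data = [blob_storage.get(key) for key in json.loads(message)]
    process(data)
\end{verbatim}

Messages stay small (a list of keys) while the payload itself goes through the storage, which is what keeps the services loosely coupled.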
\textbf{No master.} There is no master process orchestrating the computation: all workers run the same code, and coordination happens exclusively through queues and storage. This removes a single point of failure and lets workers be added or removed without reconfiguring anything.

\textbf{No worker is more efficient than another on a specific task.} Work is divided into tasks, and it does not matter which worker is assigned to which task, since every worker can be as efficient as any other on each task. A very naive load balancer can therefore dispatch the available workers on the different queues to process the messages.

\textbf{No worker/data affinity is provided.} Azure does not provide any affinity between data and workers: data are abstracted into BlobStorage and TableStorage, and it is impossible by design to run jobs on CPUs that are physically close to the place where the data are stored, unlike frameworks such as Google's MapReduce, which schedule computation close to the data. Our design therefore has to move data explicitly between storage and workers, through messages.

\textbf{Scaling up is a developer initiative.} Through the management API, an administrator can change the number of workers available. We are thus no longer in the fixed-hardware setting of the K-Means experiments reported in other papers: we can modulate the hardware so that it better fits our needs. This raises a new question: how many workers do we need to minimize the overall running time of the algorithm?

\textbf{There is no strict guarantee on the number of workers actually running.} Because of the large amount of time currently needed to set up workers \cite{Early_Observations_On_Azure} (which is expected to improve), workers are instantiated before the experiments are run, and their number is supposed to remain constant over time. Yet, since a worker can be shut down, and a VM moved or paused arbitrarily by Azure without notification, the number of workers is not guaranteed to equal the requested number at every instant.

\textbf{Pinging queues as a tradeoff between cost and simplicity.} Workers ping the queues to detect whether there are messages to process. Since even QueueStorage operations are charged, we can wonder whether pinging queues can be avoided, and whether it is affordable. Indeed, we could open a TCP connection on each worker and set up a listener, so that each worker listens for notifications: either each notification tells the worker to ping the queue and fetch the message, or the notification is the message itself and the worker is directly told to process it. In the first case, queues are only pinged when a message needs to be processed; in the second case, they are not pinged at all. For simplicity, we chose not to implement listener patterns and TCP connections. Yet pinging a queue every 100 ms amounts to about 315 million requests per worker per year, charged around 3000 dollars (see \cite{Azure_Pricing}). We therefore adopt a back-off strategy: each time a worker pings an empty queue, it doubles the delay before pinging that queue again, as sketched below.
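A minimal sketch of this polling loop follows (in Python); the 100 ms floor matches the figure above, while the cap and the \texttt{try\_pop} client call are illustrative assumptions.

\begin{verbatim}
import time

def poll_with_backoff(queue, handle, min_delay=0.1, max_delay=30.0):
    """Poll `queue`, doubling the wait after each miss, resetting on a hit."""
    delay = min_delay
    while True:
        message = queue.try_pop()      # hypothetical: None when queue is empty
        if message is None:
            time.sleep(delay)
            delay = min(2 * delay, max_delay)   # back off, up to a cap
        else:
            handle(message)
            delay = min_delay          # traffic detected: poll eagerly again
\end{verbatim}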
\textbf{Monitoring how many messages have been processed.} It is not possible to get an exact count of the messages stored in a queue: asking the queue for its metadata only returns an approximate count. When a service needs to process P messages and must then push a message into another queue to launch a new service, we need an exact count of the remaining messages. To achieve this, we use blob counters: blobs storing an integer, decremented by each worker (using etag properties for optimistic concurrency) when the worker is done with a message. Since a worker can fail at any moment, and a failure means the message is requeued after some delay and processed again, we decrement the counter at the very end, when the job is finished and the message is about to be deleted. Thus, the only way a single message can decrement the counter twice is for the worker to fail exactly between the decrement and the deletion of the message; theoretically, the counter could therefore be decremented twice, although with a very low probability. One way to remove this issue entirely would be to use an idempotent counter:

\begin{enumerate}
\item we push P messages into the queue, each message carrying a unique id between 1 and P;
\item we push into the BlobStorage a binary counter of P bits, initialized to 1111...1111;
\item each time the job stored in message $i$ is completed, we update the counter using its etag and apply an AND with the mask whose only 0 is at index $i$. Decrements of this counter are idempotent: clearing a bit twice leaves the counter unchanged.
\end{enumerate}

Note also that, even if the number of messages stored in a queue is only approximate, Azure provides a transactional count of how many times a message has been dequeued.

\textbf{Jobs must be idempotent.} Since messages can be consumed more than once, and a worker can stop in the middle of processing a message before another worker restarts it, we must design our jobs to be idempotent, that is to say, running a job twice must not affect the final result.

\textbf{Workers communication.} On DMM architectures, we saw that K-Means was distributed by using MPI to broadcast data across the different CPUs. Communication between workers was the key point of the algorithm: the bandwidth/CPU-frequency ratio was driving the speed-up efficiency. When designing the algorithm for the cloud, we must keep in mind that we will run K-Means on hardware whose bandwidth is worse than in Dhillon's experiments, with higher CPU frequencies. Therefore, for a given number of points to cluster, the speed-up factor will probably get worse. Here, two designs are available:

\begin{itemize}
\item Workers communicate directly: whenever a loop is completed, local centroids are sent to the other workers directly through TCP/IP; workers behave as in a peer-to-peer environment.
\item Local results are stored in the Azure storage, and workers communicate through queues and storage to signal that data have been processed and where the results can be found.
\end{itemize}

This is a key point in the design of the algorithm. The first version is the ``direct'' translation of the DMM K-Means architecture. Yet direct communication comes with a lot of code overhead:

\begin{itemize}
\item we have to make sure each worker holds a list of the other workers and their ``addresses'';
\item we have to deal with failures: each worker must keep track of which other workers are available, failure-handling code must be added, etc.
\end{itemize}

The performance of the ``direct inter-worker communication'' K-Means is expected to be better, because the storage is no longer a middleman between each pair of communicating workers, and the inter-worker bandwidth is reported to be better than the storage bandwidth. Yet the code overhead required to get a robust direct-communication system under industrial constraints is so important that we are going to implement both designs and compare their performance. A sketch of the storage-mediated design is given below.
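To fix ideas, here is a minimal sketch of the storage-mediated design (in Python, with the same hypothetical \texttt{blob\_storage} and \texttt{queues} stand-ins as before): at the end of each local pass, a worker publishes its local centroids and point counts through the storage, and a reducing step computes their weighted average.

\begin{verbatim}
def share_local_results(blob_storage, queues, worker_id, iteration,
                        local_centroids, local_counts):
    """End of a local K-Means pass: publish partial results, signal the merge."""
    key = "kmeans/iter-%d/worker-%d" % (iteration, worker_id)
    blob_storage.put(key, (local_centroids, local_counts))
    queues["merge"].push(key)

def merge(blob_storage, keys):
    """Weighted average of the local centroids published by all workers."""
    partials = [blob_storage.get(k) for k in keys]  # [(centroids, counts), ...]
    n_clusters = len(partials[0][0])
    dim = len(partials[0][0][0])
    merged = []
    for j in range(n_clusters):
        total = sum(counts[j] for _, counts in partials)  # assumed nonzero
        merged.append([
            sum(cents[j][d] * counts[j] for cents, counts in partials) / total
            for d in range(dim)
        ])
    return merged
\end{verbatim}

The direct TCP/IP variant would replace the blob write and the queue message by a send to every peer, at the price of the membership and failure-handling code discussed above.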