Schedule
This is a tentative schedule and will change.
Grading details
| Assessment Type | Weightage | Additional Information |
|---|---|---|
| Presentations and Discussions | 40% | |
| Programming Assignments | 50% | |
| Class Participation | 10% | |
Class Schedule
Overview
- Overview of the rest of the semester
- Lecture
Initial Presentations
Traffic Prediction
Tuesday, Feb 27:
Paper 1
- Paper 1: AI/ML-based real-time classification of Software Defined Networking traffic Link Slides Discussion Notes
- Speaker: Yoshi
- Slides
- Questions:
- In Section 3.3, what do the delta packets, instant bytes per second, and instant packets per second features mean? (See the sketch after this list.)
- Q2: What are the preprocessing steps in this study?
- Q3: What are the applications/benefits of categorizing the traffic?
- Q4: How do the Machine Learning algorithms compare with state-of-the-art traditional algorithms for SDN traffic classification?
- Q5: What are the tradeoffs of each of the supervised learning algorithms, and how do they apply to a specific network scenario?
- Q6: Is it real-time? How can we measure whether it is actually real-time?
- Q7: The data was divided into four categories, but is there any indication of what most influenced those categories?
- Q8: Does the report offer any explanation for the telnet misclassifying for the LR and Naive Bayes and/or did they offer any insight on what could be improved upon for better results?
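To make the feature question above concrete, here is a hypothetical sketch (not taken from the paper; the exact definitions should be checked against Section 3.3) of how per-flow "delta" and "instant" features could be computed from counters sampled at a fixed interval:

```python
# Hypothetical feature computation; the interval length and counter names are assumptions.
INTERVAL = 1.0  # seconds between successive samples of a flow's counters

def interval_features(prev_pkts, prev_bytes, curr_pkts, curr_bytes):
    delta_packets = curr_pkts - prev_pkts                 # packets observed in this interval
    instant_pps = delta_packets / INTERVAL                # instant packets per second
    instant_bps = (curr_bytes - prev_bytes) / INTERVAL    # instant bytes per second
    return delta_packets, instant_pps, instant_bps

print(interval_features(prev_pkts=120, prev_bytes=90_000, curr_pkts=180, curr_bytes=150_000))
```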
Thursday, Feb 29:
Code Review
Tuesday, Mar 5:
Paper 2
- Predicting Future Traffic using Hidden Markov Models Link Slides Discussion Notes
- Speaker: Jestus
- Slides
- Questions:
- What is the extent of the state that the Markov state is encapsulating?
- Why is it expensive to capture traffic directly? What is their definition of a flow count, and how is collecting that easier than collecting traffic volume?
- What are the trade-offs of each supervised learning algorithm, and how does each apply to the scenario?
- My understanding of the Markov property is that the future state depends only on the current state. For predicting flow volume, does this assumption hold in real-world network scenarios? (See the sketch after this list.)
- How is KBR related to HMM?
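For the Markov-property question above, here is a toy sketch (not the paper's model, with made-up transition and emission matrices): because of the Markov assumption, the next flow-count level is predicted from the current state belief alone, with no further history.

```python
import numpy as np

A = np.array([[0.8, 0.2],        # transition matrix: P(next hidden state | current state)
              [0.3, 0.7]])
B = np.array([[0.7, 0.2, 0.1],   # emission matrix: P(observed flow-count level | hidden state)
              [0.1, 0.3, 0.6]])
belief = np.array([0.9, 0.1])    # current belief over the two hidden "traffic regime" states

next_belief = belief @ A          # Markov property: the next step depends only on the current belief
obs_dist = next_belief @ B        # predicted distribution over flow-count levels
print(obs_dist, "-> most likely level:", int(np.argmax(obs_dist)))
```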
Paper 3
- Flow-based Throughput Prediction using Deep Learning and Real-World Network Traffic Link Slides Discussion Notes
- Speaker: Nick
- Slides
- Questions:
- Given that the paper tries to predict the bit rate of a network traffic flow, I'd like to understand why this is treated as a classification task and not as a regression task. What are the advantages of this approach? (See the sketch after this list.)
- Comparing binary classification and multi-class classification, what are the advantages of using multi-class classification?
- What is the output, a number representing bitrate or a class attribute?
- Can the model be helpfully accurate when predicting continuous variables?
- Why are some features better than others?
- What are the scatter plot clusters supposed to demonstrate?
- Why would bit-rate-informed routing decisions be better than what is currently in place? How do you expect these developments to affect routing efficiency? Will the performance gains be enough to justify the overhead of the proposal?
- What are the anticipated impacts of the model not considering the longest-lasting flows (those longer than 5 minutes)? I would think these would have the highest impact on resource utilization.
- How close are the inaccurate predictions?
- Are flow features alone (like the 5-tuple described in the paper) enough for throughput prediction? Can we use this model under dynamic network conditions if we include topology information as additional features?
- The paper mentions quantizing bit rates into three classes instead of the common "mice" and "elephant" binary flow classification. Why were these three classes chosen?
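To illustrate the classification framing asked about above, a hedged sketch with invented features and cut-off thresholds: flow bit rates are quantized into three classes (rather than regressed directly) and a standard classifier is trained on early-flow features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
features = rng.random((500, 5))                       # stand-in for early-flow features
bitrates = rng.lognormal(mean=12, sigma=2, size=500)  # fake per-flow bit rates (bits/s)

# Quantize into low / medium / high classes; the cut-offs here are assumptions.
classes = np.digitize(bitrates, bins=[1e5, 1e7])      # 0 = low, 1 = medium, 2 = high
clf = RandomForestClassifier(n_estimators=100).fit(features, classes)
print(clf.predict(features[:3]))                      # predicted throughput class per flow
```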
Traffic Classification
Thursday, Mar 7:
Paper 4
- Resource Management with Deep Reinforcement Learning Link Slides Discussion Notes
- Speaker: Rajat
- Slides
- Questions:
- Is it possible to apply this approach for non-preemptive scheduling?
- Is it possible to use this approach for real-time jobs, since it is based on reinforcement learning and learns online from the environment?
- What are the computational and resource requirements for training and deploying the DeepRM model in a real-world cluster environment, and how do they scale with the size and complexity of the cluster and workload?
- What are some potential problems that can arise if DeepRM were deployed on a real-world network instead of the pre-constructed environment it was trained in?
- What is the difference between the completion time (Cj) and the duration (Tj) of a job? (See the sketch after this list.)
- How does DeepRM handle scenarios where resources are dynamically changing or where resource demands fluctuate over time?
- Is there any indication in this paper of issues scaling this trained network into a larger system?
- How can we leverage Deep RL techniques to automatically learn efficient resource management in complex computer systems and networks?
- What are the impacts on wait times for longer jobs that are withheld? Is this a fair system?
- When do the actual jobs get run, and how does that feedback get sent to the scheduler?
- Are there advantages to continuously training the algorithm?
- What other things can you train the reinforcement algorithm on?
- Would the algorithm do better if it were seeded with a high-performance scheduler to start and then reinforced based on that?
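A small illustration of the Cj versus Tj question above, using the usual DeepRM-style notation (numbers invented): Tj is how long a job runs once it has resources, while Cj also includes the time it spent waiting, and their ratio is the job's slowdown.

```python
# Two fake jobs: the second one waits 5 time units before it is scheduled.
jobs = [
    {"arrival": 0.0, "start": 0.0, "duration": 4.0},
    {"arrival": 1.0, "start": 6.0, "duration": 2.0},
]
for j in jobs:
    completion = (j["start"] + j["duration"]) - j["arrival"]   # C_j: arrival to finish
    print(f"C_j={completion}  T_j={j['duration']}  slowdown={completion / j['duration']:.2f}")
```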
Paper 5
- Selecting critical features for data classification based on machine learning methods Link Slides Discussion Notes
- Speaker: Xi
- Slides
- Questions:
- What are Po and Pe in Equation 9? (See the sketch after this list.)
- How much does the feature selection depend on the model being used for prediction?
- Can we combine feature selection methods, based on the model we are preprocessing for, to get better performance?
- Considering the significant variation in accuracy observed across different feature selection methods (RF, RFE, Boruta) and classifier algorithms (RF, SVM, KNN, LDA), what can we infer, and when would an algorithm be preferred in a real-world scenario?
- Is the importance of features derived from machine learning algorithms consistent across all target classes of classification task? Some features can be more important for predicting one class than another. What impact does this have on the feature selection process?
- Comparing binary classification and multi-class classification, what are the advantages of using multi-class classification?
- Can the feature selection methods handle imbalanced datasets effectively?
- What peculiarities of the dataset might have influenced the performance of one methodology or another?
- Is there ever a time to use the lower-performing algorithms?
- The paper assumes the relationship between flow counts and traffic volumes, modeled by the transition and emission probabilities in the HMM; does this relationship remain stationary over time?
- What is the time saving we get from applying these algorithms?
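On the Po/Pe question above: these are the standard symbols of Cohen's kappa (presumably what Equation 9 computes): Po is the observed agreement between predictions and labels, and Pe is the agreement expected by chance. A minimal sketch from a confusion matrix:

```python
import numpy as np

def cohens_kappa(confusion):
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    p_o = np.trace(confusion) / total                                          # observed agreement
    p_e = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total ** 2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa([[45, 5], [10, 40]]))   # kappa for a made-up 2-class confusion matrix
```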
Thursday, Mar 19:
Paper 6
- A general approach for traffic classification in wireless networks using deep learning Link Slides Discussion Notes
- Speaker: Jacob
- Slides
- Questions:
- Is there any bias in the CNN model, and if so, where does it occur? When is trading time complexity for higher accuracy worth it in a real-time implementation?
- It looks like they used three models that do very related tasks: one classifies whether a packet is data or something else, the second whether it is video or not, and the third which video app it belongs to. Could a multi-task learning model be used here to train one model that does all these tasks at once?
- How does the performance of the proposed Traffic Classification (TC) framework change under different Signal-to-Noise Ratio (SNR) conditions?
- What are the preprocessing steps for the data collection step in the traffic classification system using spectrum data?
- Most of the wireless categories that the paper noted as classifiers were for entertainment applications. Why might these specific classifiers have been used?
- How can we leverage deep learning techniques directly on raw spectrum data in complex wireless network environments?
- Why are CNNs so good at classifying signals? Are there other ML models that could be similarly leveraged? (See the sketch after this list.)
- Are there other signals that can be leveraged for classification using this methodology (heart rate, light, etc.)?
- What are the authors’ plans for using the three models’ results?
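On the question above about why CNNs suit signal classification: a minimal 1-D CNN sketch over raw spectrum snapshots. The input size, class count, and layer sizes are invented and are not the paper's architecture; the point is that convolutions learn local patterns in the spectrum regardless of where they appear.

```python
import torch
import torch.nn as nn

class SpectrumCNN(nn.Module):
    """Toy classifier over 1-channel spectrum snapshots of 256 frequency bins."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
        )
        self.classifier = nn.Linear(32 * 16, num_classes)  # 256 bins / 4 / 4 = 16 remain

    def forward(self, x):
        x = self.features(x)       # (batch, 32, 16)
        return self.classifier(x.flatten(1))

model = SpectrumCNN()
dummy = torch.randn(8, 1, 256)     # a batch of 8 fake spectrum snapshots
print(model(dummy).shape)          # torch.Size([8, 4]) -> one score per traffic class
```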
Paper 7
- Pilot-Edge: Distributed Resource Management Along the Edge-to-Cloud Continuum Link Slides Discussion Notes
- Speaker: Sepideh
- Slides
- Questions:
- Based on the experimental results, what are some trade-offs that need to be considered when deploying machine learning models in edge-to-cloud environments? How can these trade-offs be optimized for better performance?
- How does Pilot-Edge support dynamism of resources, expanding and scaling down dynamically at runtime? Does it depend on Dask, or does it have its own mechanism? (See the sketch after this list.)
- Why did the researchers decide to go with the auto-encoder, isolation forests, and k-means models over other machine learning models?
- Considering the use case of Pilot-Edge is to distribute resource usage across available devices, why is there a significant distinction made between an edge device and a cloud, when both are made of distinct devices, even if they are geographically distinct?
- How can we design a framework for IoT applications, especially those involving heterogeneous machine learning tasks?
- I understand that mobile devices act as edge devices in the edge-to-cloud continuum. For data generated on mobile phones, how does Pilot-Edge determine whether to process this data locally or offload the task based on the processing task’s complexity? Does this require installing an agent on the mobile phones, or is there another method of implementation?
- What are examples of specific applications for this system?
- The conclusion mentions the future integration of other layer types; what are examples of these layers, and when would they be needed?
- Could this model be leveraged for offline applications (IoT devices and locally accessible resource stores)?
- I have a hard time seeing the direct ML implications of this paper (aside from allowing ML applications to be accessed via the cloud by IoT applications); can you explain them further?
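Since one question above asks whether Pilot-Edge depends on Dask: the sketch below is plain Dask, not Pilot-Edge's actual API, and only illustrates the general pattern of farming ML tasks out to distributed workers along the edge-to-cloud continuum.

```python
from dask.distributed import Client, LocalCluster

def detect_anomaly(window):
    # Placeholder for an inference task (e.g., an autoencoder or isolation-forest score).
    return max(window) - min(window) > 10.0

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2)        # stand-in for edge + cloud workers
    client = Client(cluster)
    futures = [client.submit(detect_anomaly, [1.0, 2.0, 15.0]) for _ in range(4)]
    print(client.gather(futures))              # results collected back from the workers
    client.close(); cluster.close()
```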
Resource management
Thursday, Mar 21:
Paper 8
- Detection and Classification of Botnet Traffic using Deep Learning with Model Explanation Link Slides Discussion Notes
- Speaker: Jonathan
- Slides
- Questions:
- How does pretraining a CNN on unlabelled data and then retraining it on labelled data work? (Section II.A, third paragraph; see the sketch after this list.)
- How does Multi-Task Learning (MTL) work? Is it just a term for a ML model that has more than one target variable?
- Does the radio network stack align with the internet’s 4 layer architecture, with the link layer being the only difference?
- To create the dataset, the paper uses the MATLAB WLAN Toolbox to generate L1 waveforms from L2 pcap files. What are the advantages of this approach compared to capturing L1 traffic directly?
- Does this paper consider data privacy and ethics in training the model?
- How can we make this approach more efficient in terms of scalability?
- How could the synthetic dataset generate better data for this task?
- Based on the findings of this study, what are some potential avenues for future research in the field of botnet detection and classification?
- The paper talks about the importance of model interpretability along with learning the model. Can there be trade-offs between model performance to accurately detect and classify botnet traffic and model interpretability?
- What is the feature extraction process from network traffic for botnet classification? The paper does not describe it very well.
- What constitutes a sample in the dataset, and what are the features? The paper states that the feature extraction module extracts 199 features.
- Why is it better to make Resource Central an independent and general system that is off the critical performance and availability paths of the systems that use it?
- Why was architecture 12, an approach with a high but not the highest F1-score, chosen? Were there advantages to choosing another architecture, such as 11?
- Can this be used to classify other types of flows like ransomware?
- Can this be adapted for use online as an IDS?
- How do the deep learning techniques classify botnet traffic in real-world network environments?
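On the pretraining question above, a hedged sketch of the general technique (unsupervised pretraining followed by supervised fine-tuning), not the paper's architecture: the 199-dimensional input matches the feature count mentioned in the questions, but the layers, data, and training loop are invented.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(199, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 199))

# Step 1: pretrain on unlabeled flows by minimizing reconstruction error.
unlabeled = torch.randn(256, 199)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
for _ in range(5):
    loss = nn.functional.mse_loss(decoder(encoder(unlabeled)), unlabeled)
    opt.zero_grad(); loss.backward(); opt.step()

# Step 2: reuse the pretrained encoder with a small head and fine-tune on labeled flows.
classifier = nn.Sequential(encoder, nn.Linear(16, 2))          # benign vs. botnet
labeled_x, labeled_y = torch.randn(64, 199), torch.randint(0, 2, (64,))
opt2 = torch.optim.Adam(classifier.parameters())
for _ in range(5):
    loss = nn.functional.cross_entropy(classifier(labeled_x), labeled_y)
    opt2.zero_grad(); loss.backward(); opt2.step()
```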
Paper 9
- Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms [Link](https://dl.acm.org/doi/pdf/10.1145/3132747.3132772) [Slides](https://docs.google.com/presentation/d/17dMuEX3PDFOoG8JycPVkCcqFcC2XHNBA/edit?usp=drive_link&ouid=100211778554961279713&rtpof=true&sd=true) [Discussion Notes]()
- Speaker: Zach
- [Slides](#)
- Questions:
- Can workload prediction models be generalized across different cloud platforms?
- How can workload prediction models be merged or extended to the cloud services?
- Were the metric outcomes for the models considered good enough for implementation in a real-time network, or are higher metrics required?
- The dataset used in the paper contains information on VMs for just three months. However, cloud environments are dynamic, with changing workload patterns, resources, and system configurations. How can we make sure Resource Central adapts to such changes and provides long-term predictions?
- In what ways can accurate predictions of VM behavior improve the performance and reliability of cloud services?
- The "smart power oversubscription and capping" on page 8 implies that some VMs' batch and background tasks would lose significant computational power. Isn't this unethical, given that the customer has already paid for these VMs via subscription?
- Does the paper imply that large cloud platforms trend towards similar utilizations of Virtual Machines, such that predicting them becomes simpler using the paper's approach? If not, then how does the approach accommodate for differing services beyond retraining the model for each large cloud platform?
- How can the key characteristics and patterns of VM workloads be learned and predicted more efficiently?
TBD
Tuesday, Mar 26:
Discussions
Tuesday, Apr 2:
Discussions
Thursday, Apr 4:
Discussions
Tuesday, Apr 9:
Project and Discussions
Thursday, Apr 11:
Papers
Network-accelerated Distributed Machine Learning for Multi-tenant Settings(#)
Speaker: Yoshi
- Slides
- Discussion Notes
- Questions:
- What is the scalability in this approach? Is there any limitation for their method to be scalable? How scalable is MLfabric in extremely large multi-tenant environments with thousands of tenants?
- Are there any specific security or privacy concerns associated with the replication and data aggregation strategies employed by MLfabric, especially in environments where data sensitivity is a critical issue?
- What preprocessing steps does this paper use?
- Given that the A-30 and A-60 approaches required the introduction of a bounded delay, what applications were mentioned that already have a delay that would make this approach easier to introduce and justify?
- Is there any technological crossover to other similarly designed ML models, architectures like those discussed in Polit Edge, and congestion management in traditional networks? Also, could this approach have value for IoT sensors where each computes its part and forwards it to the centralized mode?
- Why not use more centralized resources instead? What is the advantage of higher or different delay bounds? Why was the accuracy so low? Is the legitimacy of their results questionable because they were obtained in a controlled environment? Can dynamic update scheduling improve the convergence speed?
TensorExpress: In-Network Communication Scheduling for Distributed Deep Learning(#)
Speaker: Xi
- Slides
- Discussion Notes
- Questions:
- How adaptable is TensorExpress to dynamic network conditions such as varying bandwidth? Is P4 common, and is TensorExpress compatible with existing network infrastructure?
- How does TensorExpress ensure that communication optimizations do not compromise the integrity of model updates during training?
- Would controlling the burst rate of workers help alleviate some of the problems?
Speaker: Jestus
- Slides
- Discussion Notes
- Questions:
- In addition to local and global control problems, what other areas of networking could benefit from the interpretability that Metis provides? Can the trained decision trees be implemented on IoT hardware? What techniques can make the testing efficient while providing good runtime environments? (See the sketch after this list.)
- Does Metis need the same dataset for training as the DL-based system it is interpreting? The paper never really discussed the dataset.
- It strikes me that a network can be thought of as a type of sequence diagram. With that in mind, can similar results be expected for DNNs trained on time series data? What else does this tool provide for understanding the deep learning network? Does the interpretability layer introduce any significant overhead or latency?
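A hedged sketch of the general idea behind interpretation-by-distillation (not Metis's actual algorithm): imitate a DNN's decisions with a small decision tree trained on (state, decision) pairs, then read the tree as rules, which is also why the result could plausibly run on constrained IoT hardware.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
states = rng.random((1000, 4))                                    # fake network states
dnn_decisions = (states[:, 0] + states[:, 2] > 1.0).astype(int)   # stand-in for a DNN policy

tree = DecisionTreeClassifier(max_depth=3).fit(states, dnn_decisions)
# The distilled tree is a human-readable (and cheap-to-evaluate) surrogate for the DNN.
print(export_text(tree, feature_names=["queue_len", "rtt", "loss", "throughput"]))
```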
Tuesday, Apr 16:
Papers
MimicNet: Fast Performance Estimates for Data Center Networks with Machine Learning(#)
Speaker: Nick
- Slides
- Discussion Notes
- Questions:
- How well does MimicNet generalize across different types of data center networks, including those with varying architectures, scales, and traffic patterns?
- Does MimicNet adapt to dynamic network conditions?
- What are some potential future improvements for MimicNet in terms of speed and accuracy, particularly concerning incremental model updates and modeling network events at higher levels?
- Are all the clusters the same size?
- Can the models that are trained be reused for further tests, if so is there any way to adjust them without retraining them?
- Can this work be adapted for other grids/networks (i.e., electric/social)?
- How can machine learning enable fast and accurate performance estimates by only simulating a small observable subset of the network?
On Model Transmission Strategies in Federated Learning With Lossy Communications(#)
Speaker: Sepideh
- Slides
- Discussion Notes
- Questions:
- With the small accuracy boost, is it worth investing per-configuration time and extra equipment for systems that could use this?
- In the Forward Error Correction (FEC) approach, how is the amount of redundancy decided? Does it depend on changes in network conditions, such as varying degrees of packet loss or changes in bandwidth over time? (See the sketch after this list.)
- Does the paper predict how the model would behave in a larger-scale system?
- What strategies from the federated learning communication strategies are applicable elsewhere?
- Can the federated problem be further solved by integrating the proposals of this work and “Network-accelerated Distributed Machine Learning” together?
- How can the communication strategy in federated learning be optimized to balance these trade-offs under realistic lossy network conditions?
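On the redundancy question above, a purely hypothetical illustration (not the paper's scheme) of sizing FEC repair packets from an estimated loss rate: send enough extra packets that the expected number received still covers the k source packets of a model chunk.

```python
import math

def repair_packets(k: int, loss_rate: float, margin: float = 1.1) -> int:
    """Extra packets to send so ~k source packets survive the expected loss (margin is assumed)."""
    total = math.ceil(k * margin / (1.0 - loss_rate))
    return total - k

for loss in (0.01, 0.05, 0.20):
    print(f"loss={loss:.2f} -> {repair_packets(k=100, loss_rate=loss)} repair packets per 100")
```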
Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems(#)
Speaker: Rajat
- Slides
- Discussion Notes
- Questions:
- What parameters are observed to evaluate the performance improvements when applying Hoplite to SGD, RL, and ML models?
- How does Hoplite handle network heterogeneity, and what are the potential benefits of accommodating nodes with different bandwidths?
- Is there a specific mechanism for detecting partial failures within a task or node, or does it only respond to complete task or node failures?
- Is there any reasoning for why they specifically used a 2-layer Feed-Forward Neural Network and not any other configuration?
- Are there any limitations inherent to Hoplite which restrict the complexity of the tasks Hoplite is made to facilitate?
- Why not use the better-performing ring all-reduce algorithm? (See the sketch after this list.)
- Could the collective communication described here be leveraged in IoT Networks?
- It seems like this could be accomplished better in a multi-core setup, so under what circumstances do users have access to distributed resources but not a single high-powered resource?
- How to improve the efficiency and fault tolerance of collective communication operations in task-based distributed systems?
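On the ring all-reduce question above, a minimal NumPy simulation of the textbook algorithm (reduce-scatter followed by all-gather); this is not Hoplite's implementation, just a reference for what the primitive computes.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce: every worker ends up with the element-wise sum of all grads."""
    n = len(grads)
    chunks = [list(np.array_split(g.astype(float), n)) for g in grads]

    # Reduce-scatter: after n-1 steps, worker i holds the fully reduced chunk (i+1) % n.
    for step in range(n - 1):
        sends = [chunks[i][(i - step) % n].copy() for i in range(n)]   # snapshot this round
        for i in range(n):
            chunks[i][(i - step - 1) % n] += sends[(i - 1) % n]

    # All-gather: circulate the reduced chunks so every worker has every chunk.
    for step in range(n - 1):
        sends = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[i][(i - step) % n] = sends[(i - 1) % n]

    return [np.concatenate(c) for c in chunks]

grads = [np.arange(8) * (w + 1) for w in range(4)]   # fake per-worker gradients
print(ring_allreduce(grads)[0])                      # [0. 10. 20. ... 70.] on every worker
```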
Thursday, Apr 18:
Papers
Joint Optimization Of Routing and Flexible Ethernet Assignment In Multi-layer Multi-domain Networks(#)
Speaker: Zach
- Slides
- Discussion Notes
- Questions:
- The paper discusses the use of deep learning models to approximate network behavior. How does the choice of model affect the simulation’s accuracy and speed?
- What are the key parameters of the testbed, and how do they affect the evaluation of the proposed RFA algorithms?
- How can the integration of Flexible Ethernet be further optimized in multi-layer multi-domain networks to enhance network efficiency while maintaining intra-domain information privacy?
- Are PCEs deployed on dedicated servers, or are they integrated into other network devices such as routers?
- Does the paper discuss future work and possible contributions for FlexE?
- The purpose of using FlexE in this research seems to be to divide network links into sub-components that allow the division of available resources along explicitly declared paths. When this approach is used to coordinate multiple layers/domains, how does it establish the minimum bandwidth needed to support all traffic?
- How can FlexE support be integrated through the layered SDN control plane to improve network utilization?
Auter: Automatically Tuning Multi-layer Network Buffers in Long-Distance Shadowsocks Networks(#)
Speaker: Jonathan
- Slides
- Discussion Notes
- Questions:
- In the context of real-world experiments, what are the implications of Auter’s performance improvements, particularly for long-distance transmissions with low bandwidth utilization?
- How does the DQN algorithm handle scenarios where it may get stuck in a sub-optimal answer because dynamic tuning has such a large answer space?
- Is there overhead added by the network perception component while collecting network performance metrics?
- Do the authors present a way to achieve Auter’s performance, even to a lesser degree, through a specific technique that does not require ML to utilize proxy networks?
- Can similar systems be used in other lossy networks like Wi-Fi, or to implement assured transmission in UDP networks as mentioned in Section VI? In Figure 8, what happened at hour 16? Are there other parameters that the system could be trained on?
- How does Auter effectively improve Shadowsocks performance over long-distance networks?
Speaker: Jacob
- Slides
- Discussion Notes
- Questions:
- Network planning: How representative are these datasets of real-world scenarios, and what considerations were made in selecting them?
- How compatible is the Flexible Ethernet (FlexE) technology with existing network technologies across different domains, and what are the challenges in integrating FlexE into existing systems?
- Because long-distance networks will have more fluctuating traffic patterns and are more dynamic and unpredictable, what are the chances of the reinforcement learning algorithm suffering slow or unstable convergence in real environments?
- Beyond cost minimization, what other criteria could be important in Network Planning? Can NeuroPlan be enhanced to optimize for other objectives like energy efficiency, latency, or reliability or even multi-objective scenarios?
- For small network topologies, ILP does very well, but the problem grows with dynamic topology changes. For a bit of extra time cost, could we just use NeuroPlan to remove the human tuning?
- Let us say that new nodes and links are added/removed. How does the GNN adapt to such changes in the network topology over time? How quickly can it re-train or adjust its parameters in response to these changes?
- Why does the paper only go up to 0, 2, and 4 GNN layers? Did they discuss whether the gains beyond 4 layers become negligible? (See the sketch after this list.)
- Why is solution pruning specifically chosen as the approach to network planning, despite the paper saying this is a difficult approach?
- The first paragraph on page 263 says they only care about edges; is there a way for this system to consider hardware capabilities in addition to link capacity?
- Figures 9 and 10 reference cost; what exactly is included in that?
- What is the optimizer function that they compare their results to, and why don’t they just use that? Further, if it is an optimizer, how can the genetic algorithm have better latency than it?
- Can we discuss the MLMDs in a little depth with examples?
- Can we discuss the algorithms they mention at a high level with examples?
- How can techniques such as graph neural networks be utilized to effectively solve complex network planning problems in multi-layer multi-domain networks?
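On the GNN-layer questions above, a toy NumPy sketch of the message passing a single GNN layer performs (not NeuroPlan's architecture): each node aggregates its neighbors' embeddings, so stacking L layers propagates information L hops across the topology, which is one way to read the 0/2/4-layer comparison.

```python
import numpy as np

def gnn_layer(adj, h, w):
    """One message-passing layer: mean-aggregate neighbor features, project, apply ReLU."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    return np.maximum((adj / deg) @ h @ w, 0.0)

adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)   # a 3-node topology
h = np.random.default_rng(0).random((3, 4))                      # initial node features
w = np.random.default_rng(1).random((4, 4))                      # shared layer weights
for _ in range(2):                                               # two layers = two hops
    h = gnn_layer(adj, h, w)
print(h)
```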
Tuesday, Apr 23:
Thursday, Apr 25: