|1||Information Extraction for Database Population from Scientific Articles|
Scientific data are scattered among huge amount of articles produced by different research groups. Keeping track of all studies is extremely challenging yet it is important for certain research questions. E.g. to study an influence of climate change on animal communities it is important to know the basic information, since different types of animals live in different kinds of environments that are bound to different climate. This information is spread among numerous papers but could be collected and extracted automatically—this task is called Information Extraction (IE).
The aim of the thesis is to implement one of the state-of-the-art IE methods populate a database developed at the Finnish Museum of Natural History. The method will most probably utilize transformers, RNN, or other deep learning network. This thesis is a good opportunity to start learning NLP, since IE is one of the core NLP tasks.
Greedy algorithm for Shortest Cyclic Superstring
|The Shortest Linear Superstring problem is one of the most known problems in stringology (string algorithms). The aim is, for a set of strings as input, to find a shortest linear string which contains all the strings of the input. Even if this problem has been strongly studied, a lot of questions stays open. In particular, a 30 old conjecture says that one of the easiest algorithm (the greedy algorithm) has an approximation ratio better than all the approximation ratio prooved for all the other algorithms. The topic of this thesis is to understand how this greedy algorithm works in the Shortest Circular Superstring problem where the superstring is not linear but circular.|
|Bastien Cazaux, Veli Mäkinen||https://arxiv.org/abs/1809.08669|
Fast nearest neighbor search.
|There are several approaches to speed up (approximate) nearest neighbor queries in large data sets. Generally, they involve an initial stage where an index data structure is constructed. The index can be used to perform queries when new points arrive. The thesis can review and compare various different approaches (tree-based, locality-sensitive |
hashing, random projections, ...) and/or experiment with new variantions of the theme.
Zezula, P., Amato, G., Dohnal, V., Batko, M., Similarity Search: The
Metric Space Approach, Springer, 2006.
V. Hyvönen, T. Pitkänen, S. Tasoulis, E. Jääsaari, R. Tuomainen, L.
Wang, J. Corander, and T. Roos (2016). Fast nearest neighbor search
through sparse random projections and voting, in Proc. 2016 IEEE
International Conference on Big Data (IEEE Big-Data 2016), Washington
DC, Dec. 5–8.
|4||Data mining historical textual traditions||Historical textual traditions, including manuscripts, early printed materials and collections of oral traditions, offer a rich source of information for data mining. Various exploratory data analysis methods can be used to summarize and visualize the data. A thesis project could apply known methods to new data collections available through public repositories or ongoing collaborative projects.|
J. Tehrani, Q. Nguyen, and T. Roos, (2016). Oral fairy tale or literary
fake? Investigating the origins of Little Red Riding Hood using
phylogenetic network analysis, Digital Scholarship in the Humanities
"White-box" machine learning
Deep learning methods based on massive neural network architectures tend to be very hard to understand and interpret. They can be described as "black-boxes" that are good for mapping inputs into ouputs but it is nearly impossible to see (and understand) what happens inbetween. There have been initiatives to construct "white-box" methods, i.e.,
|6||Solving data analysis / machine learning problems cost-optimally using constraint solvers|
Constraint optimization, maching learning, and data mining are today well-established and thriving research fields within computer science. Each of the fields have contributed fundamental techniques and algorithmic solutions that are today routinely applied for addressing hard computational problems in various real-world context. However, the possibilities of harnessing the highly efficient constraint solving technology available today in providing generic and efficient solutions to various machine learning and data mining problems, such as different kinds of classification, structure learning, probabilistic reasoning, and pattern mining tasks, have only recently been realized, and there is plenty of opportunities for developing novel algorithmic solutions with optimality-guarantees to machine learning tasks via employing constraint solving.
This area offers topics for several MSc theses, and the specifics (including the problem focused on) can be agreed on together with the thesis supervisor.
Instance structure and empirical hardness of NP-hard problems
NP-complete problems, such as the Boolean satisfiability problem (SAT), are deemed "intractable" by classical complexity theory. Despite this, modern exact algorithms for SAT, i.e., SAT solvers, are today routinely used for solving extraordinarily large problem instances of NP-hard problems arising from various industrial (e.g., software and hardware verification) and AI (e.g., automated planning) problem domains. However, the question of why exactly SAT solvers are so powerful in solving instances of NP-hard problems in the real world is not well understood.
This problem setting gives rise to various research questions suitable for MSc theses, ranging from empirical to theoretical studies aiming at understanding the relationship between the underlying structure of real-world problem instances and the algorithmic techniques implemented in modern SAT solvers.
|8||Testing and debugging constraint optimization solvers|
Constraint solvers, such as those for the integer programming and propositional satisfiability (SAT) paradigms, offer off-the-self generic tools for solving NP-hard problems at large via declaratively expressing problems using mathematical constraints and feeding them to a constraint solver for obtaining an exact solution to the original problem. While this approach has turned out to be a successful approach to solving various hard computational problems, obtaining correct answers (especially in cases when no solutions exist) relies on the correctness of the solver implementations. This MSc thesis topic combines testing and debugging techniques from software development and constraint solving. The aim is to develop practical fuzz testing and delta debugging tools for constraint optimization (especially, Boolean optimization) which help solver developers to test and debug their solver implementations.
This topic is suited for both algorithms and software systems students.
|9||Algorithmic Uses of the Gumbel-Max Trick|
There are powerful paradigms for solving optimization problems, like (integer) linear programming and branch-and-bound, which however do not readily allow random sampling from the distribution proportional to the objective function. Recently, an old, simple reduction from sampling to optimization, called the Gumbel-Max trick has found non-trivial algorithmic uses in machine learning and artificial intelligence. The thesis will review that literature.
1. Chris J. Maddison, Daniel Tarlow, Tom Minka: A* Sampling. NIPS 2014: 3086-3094.
2. Carolyn Kim, Ashish Sabharwal, Stefano Ermon: Exact Sampling with Integer Linear Programs and Random Perturbations. AAAI 2016: 3248-3254
|10||Strong Exponential Time Hypothesis|
The Strong Exponential Time Hypothesis (SETH) asserts, roughly, that the classical CNF Satisfiability problem with n variables cannot be solved in time O(c^n) for any constant c < 2. Recent research has proved SETH-based lower bounds for various important problems, often using ingenious reductions. The thesis will review the state of the art and discuss the key open problems in the area.
1. Mihai Patrascu, Ryan Williams: On the Possibility of Faster SAT Algorithms. SODA 2010: 1065-1075
|11||Compensating for small labeled data|
Project context: a system for web-scale surveillance of news media. Massive volumes of news stories are analyzed, classified and clustered according to various criteria using deep learning neural networks.
For example, we try to determine what type of event is reported in the story (from a large set of different event types), or the sentiment – whether a story is positive/negative for a given entity (person/company/etc.) mentioned in the story.
A major bottleneck in deep learning is the shortage of labeled data (as in most supervised learning). The project will explore methods for data augmentation – leveraging small amounts of labeled data by transforming it to produce more novel data usable for learning. Data augmentation has been successful in some applications in image analysis.
An alternative for a similar purpose is transfer learning – improving performance on a given task by using data that was labeled for a different task.
|12||Models of language evolution|
Over the last 15 years, several models have been proposed for language evolution. Methods have been developed for modeling historical relationships among languages in a language family. But little has been done in the way of formal comparison of the effectiveness of the models.
The models can be "applied" in various ways: e.g., to build family trees, which can then be compared against trees proposed by linguists, based on manual analysis. Some models may be used to predict unseen data, or to reconstruct ancestor word forms.
The models typically find regularities in lexical data – in lists of related words. Better models should find more regularity in the data. Alternatively, a probabilistic model assigns probability to observed data, which gives us a measure of model quality: better models assign higher probability to the data.
Question: do intrinsic and extrinsic measures of model quality correlate as we expect? For example, does a theoretically better model produce trees that agree better with the linguists' results?
The project may involve literature surveys, empirical studies using existing models and language data, or both.
|13||Modeling knowledge states in language learning|
Project context: a system for learning languages. Learners use arbitrary texts, which they upload to the system, from which the system then creates a wide variety of exercises for practice in a game-like environment, while the system tracks the learners' progress.
Problem 1: modeling the learners' competence.
While standardized tests treat learner's competence as a value on a linear scale, in sophisticated models of learning, competence is never a scalar. It can be a state in a large space of possible knowledge states, or a probability distribution over a set of possible states. The paths through the space are not arbitrary, they are constrained by the nature of the subject being learned. Can we infer the structure of the knowledge space from data – by observing correct vs. incorrect answers to many exercises from many learners ?
We also seek multi-criterion estimates of the complexity of a text that a student chooses for learning. We consider both objective and subjective criteria. As objective criteria, people previously used lexical and syntactic complexity – frequent usage of rare words, complex syntactic constructions, etc.
Among a population of students, the system can rank students' competence: we expect that learners with "lower" competence make more mistakes on any given text (on average) than learners with "higher" competence.
Therefore, subjective measures may predict that text A is more complex than text B if students – at a fixed level of competence – make more mistakes on A than on B.
How well do objective and subjective criteria that have been proposed in the literature correlate on actual user data? (One motivating goal is: when a learner uploads a text, the system should rate the text on a scale – too easy/just right/too difficult for you.)
The project may involve a literature survey, an empirical study, or both.
|14||Human powered hierarchical clustering with relative distance comparisons||The input to a hierarchical clustering algorithm is typically a distance matrix that contains pairwise distances between the data items to be clustered. Commonly this matrix is defined e.g. in terms of the Euclidean distance between feature vectors. However, in some applications computing absolute (Euclidean or other) distances is not feasible. This happens for example if the distance information is collected using a crowd of human annotators. Humans are rather poor at consistently evaluating some notion of absolute semantic distance (on some arbitrary scale) between, say, the contents of two photographs. On the other hand, humans are fairly adept at comparing items relative to each other. The objective of this project is to design and analyse an agglomerative hierarchical clustering algorithm that uses (possibly noisy) relative distance comparisons to compute a clustering dendrogram.||Antti Ukkonen|
|15||Machine learning in classic games|
One of the early successes in applying machine learning to board games was Tesauro's NeuroGammon. In chess, on the other hand, champion level performance was obtained mainly without machine learning. Recently, reinforcement learning has been used in breakthrought in computer go and poker. This area offers various possible thesis topics, for example focussing on a particular game, or a particular learning technique.
|16||Automating a music composition process||There are numerous methods for automated or algorithmic composition of music. This thesis project will approach this goal from a unique point of view: modelling and implementing the composition process of an actual human composer, who will participate in this project. The work will include conceptualisation and formalisation of the composition process together with the composer, in sufficient detail so that it can be implemented as a computational model, and experimented with. Some background in music is useful, so is interest in computational creativity.||Hannu Toivonen|
|17||Variant projection in pan-genomics|
Added September 16, 2019
Variant calling is the process of identifying how an individual differs from the consensus (reference) genome of the species. The standard approach is to align high-throughput sequencing reads on the reference, and then detect places where many alignments support the same variant. With such methods, several sequencing projects have gathered a huge catalogue of human variation and there is now an active field of research pondering how to exploit this pan-genomic information for the next projects. This research aims to identify ways to amend the linear reference genome with extra information to improve accuracy of variant calling. One such approach is to replace the reference genome with an ad hoc genome tailored for the sub-population under study . Such ad hoc genome is closer to the sequenced individual, and hence the alignment and variant calling can be conducted more reliably. However, the variant calls need to be projected back to the reference genome in order to be compatible for downstream analysis tasks. We have implemented one projection strategy, but there is an alternative that might alleviate some of the challenges of the current approach.
In this project, we want to implement the alternative projection strategy and compare it experimentally with the current one. In addition, the current framework does not optimally support all heterozygous variants, and the goal is to incorporate handling of these in both the current and the new strategy.
The tool (PanVC), where these new features are to be added, is mainly implemented in C++.
In addition to the rather experimental implementation work, the thesis content can focus on more theoretical considerations around the theme.
 Valenzuela, Norri, Välimäki, Pitkänen, Mäkinen. Towards pan-genome read alignment to improve variant calling. BMC Genomics, 19:87, 2018. https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4465-8
|18||Parallelization of string matching algorithms extended to graphs|
Added September 18, 2019
Current research in algorithmic bioinformatics focuses on extending string algorithms to work on a string and a labeled directed graph (instead of just two strings). This is motivated, for example, by the rise of pan-genomics, where a “pan-genome” is a labeled graph encoding all genomes observed in a population.
However, several such problems, for example exact string matching, can only be solved in quadratic time on graphs, as opposed to linear time on strings. This motivates the search for alternative strategies that work in practice, on inputs of billions of characters.
In this thesis you will develop methods to “decompose” a graph in several components such that these algorithms can be run in parallel on each such “component”, and then combine the results for each “component”. The focus is on aligning a string in a labeled directed acyclic graph (DAG) under the models of exact matching and edit distance. You will review (serial) algorithms for a string and a graph, review parallelization strategies for computing the edit distance between two strings, develop such decomposition strategies, prove their correctness, implement them and then perform experimental evaluations.
Ideally, you are already familiar with string algorithms (e.g. through the String Processing Algorithms course) and have experience with parallel programming. Moreover, ideally you also have some experience with GPU programming, in case the algorithms can also be implemented there.
Alexandru Tomescu (email@example.com)
|19||Learning and utilizing document plans in end-to-end text generation|
Added December 4, 2019
An overview (holds for this and the two following topics):
We offer several MSc thesis topics related to automated generation of textual reports describing given statistical data. This so-called data-to-text generation is an extensively studied topic in natural language generation. Most of the research is done using standardized data, such as weather forecasts or sport games outcomes. The thesis topics we offer take place in a more general context: automated production of a textual report from statistical data of various nature – from unemployment rate to greenhouse emissions. An additional difficulty is to make these methods work in a multilingual setting, e.g. Finnish and Swedish in addition to English.
One of the main drawbacks of end-to-end models, i.e. systems that learn to generate text from given structured data, is that their output lacks high-level structure: even though each sentence looks natural, the text as a whole fails to maintain a coherent narrative. The goal of this work is to develop a machine-learning approach for deciding the information content of the text from training examples. These plans would then be used in a hybrid NLG system that is partially rule-based. Plans will consist of a sequence of database records that should be one by one transformed into natural language sentences and/or phrases. Plan generation can be interpreted as classification and ranking machine learning problems. First, given a set of data tables and a set of texts, where each sentence is linked to one table record, it is necessary to train a classifier, which would decide for each record whether it should be included into the plan. Second, the selected records should be sorted (ranked) into a final sequence. Alternatively, this can be seen as an iterative contextual ranking problem; given a specific context (the present state of the document plan), the goal is to find the most suitable next record to include in the plan. The student would implement one or more methods to solve this problem and then test them on the dataset mentioned above.
|20||Learning sentence templates for data-to-text generation|
Using manually produced or at least verified templates to generate text makes the output controllable and interpretable. Learning templates is done by replacing certain word groups in the text with names of data slots. This could be done by a set of rules using mapping between sentences and data rows.
E.g. consider sentence “External operating revenue declined by 3.7 per cent.”
and the following database record:
Given this database record the sentence could be converted into template “<PARAMETER> declined by <NUMBER> per cent”. Templatization could be done using manually compiled rules. Alternatively, a small set of seed rules could be extended using bootstrapping. The student will implement one or more template-learning methods and test them on the dataset mentioned above.
|21||Information Extraction for verification of generated texts|
Information Extraction (IE) can be seen as a task opposite to data-to-text generation since it transforms unstructured text into a structured form. For instance, if a text generation task is to generate a news article from a table with sport match outcome, then the corresponding IE task would be to construct such table from a free text describing the event. This thesis topic will investigate the use of IE to verify the output of a natural language generation system. The student will implement an IE system, that could be trained using texts produced by humans from the dataset mentioned above. A possible approach to building an IE system is to start from manually constructed patterns and then automatically bootstrap more complex templates. Once the IE system is working with reasonable performance on human-generated text it can be used to assess performance of text generation by transforming text into a data table and then measuring the overlap between this reconstructed table and the original input data. In a more advanced setting the IE system could be used to improve a baseline data-to-text generation model by rejecting sentences that do not have support in the data.
|22||Event detection in historical news using permutation tests|
Added December 4, 2019
The main objective of this thesis is to find significant changes in word usages in historical newspapers, which potentially may correspond to events interesting for historical research. The main idea of permutation test is straight-forward: a time point corresponds to signal increase or decrease if the difference between its left and right area is significantly higher than an average difference that can be obtained in a large number of permutations. However, historical newspaper impose a number of problems related to incomplete and noisy data, multidimensionality and data imbalance. The student will have to develop a basic idea into a working algorithm, which would be applicable to digital humanities research.
|NewsEye - a Digital Investigator for Historical Newspapers|
Experimental Evaluation of Dynamic Graph Algorithms
Added February 26, 2020
For the past three decades various dynamic graph algorithms have been developed but most of the work have focused on only theoretical evaluation of such algorithms. This often results in complicated algorithms that improves the theoretical bounds but are impractical for use in applications. Thus, an equally important aspect is the empirical performance of an algorithm on real world graphs. After all, the ideal goal is to design an algorithm having a theoretical guarantee of efficiency in the worst case as well as superior performance on real graphs. Often such an empirical analysis also leads to the design of simpler algorithms that are extremely efficient in real applications. Thus, such an analysis bridges the gap between theory and practice.
The aim of the project would be to do experimental evaluation of some state of the art algorithms for a given graph problem, and to understand their behavior on the various kinds of graphs which may result in the development of simpler algorithms. Based on the progress and performance of the student, we can also consider a paid Research Assistant position as the project progresses.
Alexandru Tomescu (firstname.lastname@example.org)
|24||Implementation, optimization, and experimental evaluation of graph algorithms for genome assembly|
Added on August 5th, 2020
In this project you will implement some very recent algorithms for graph problems motivated by the genome assembly problem in Bioinformatics, optimize these algorithms, and modify them to deal with some aspects of real data.
There is freedom to the student to steer the project either in a direction focused more on algorithm engineering, or on practical bioinformatics aspects.
|Alexandru Tomescu (email@example.com)||https://arxiv.org/abs/2007.04726|
2. Networking and services
|6G Systems and Cloud Continuum|
Given the emerging high-density networks with the advent of versatile connectivity and new verticals, such as IoT and Industry 4.0, and their evolution, it is evident that data gathering and processing will be inherently distributed in 6G. The distributed processing is supported by a programmable network in terms of fine-grained network slicing and edge/fog computing for local processing capabilities.
We have many M.Sc. thesis topics available pertaining to cloud continuum in 6G.
|Sasu Tarkoma and Ashwin Rao|
|Cellular network meets AI|
The developing 5G and future 6G, while improving networking capacities, also pose unprecedented challenges to the base stations, namely more complicated mobility management and protocol adaptation with lower latency tolerance. In this project, we investigate the potential of different machine learning techniques in tackling such challenges.
|Pengyuan Zhou and Pan Hui||https://pengyuan-zhou.github.io/|
|Effectiveness of Opensource 5G core implementation||This work will include working with the Kumpula cellular test network and will be done in the context of the 5G Force project. The indoor cellular base stations in Exactum are currently offering networking connectivity using three different core networks: one from Nokia, one from OpenAirInterface , and one from a custom core network built by researchers at Aalto University. The proposed work will include expanding the network to include the 5G implementation of OpenAir, and also the free5Gc. This work will give the thesis work experience on building and maintaining networks along with the ability to work on tools and techniques to evaluate and improve the performance of communication networks. Updated November 11, 2019|
Sasu Tarkoma and Ashwin Rao
|Efficient monitoring in the far-edge.|
This work will be done in collaboration with Nokia Bell Labs and will be based around a system called Infobus developed by them. Infobus uses a publish-subscribe mechanism built on top of Apache Kafka to disseminate the information gathered by nodes spread across the network. This thesis work is aimed at proposing, implementing, and evaluating a solution to extend the capabilities of the Infobus for edge networks including the far-edge. The project will also give the student the experience of collaborating with industrial research labs including the possibility to file for patents. Updated on October 3, 2019.
Sasu Tarkoma and Ashwin Rao
Previous thesis based on Infobus: https://helda.helsinki.fi/handle/10138/228837
Reference article summarizing edge computing: https://queue.acm.org/detail.cfm?id=3313377
Applications of blockchains in mobile systems
|Blockchain technology provides, among other purposes such as cryptocurrencies, means to verify integrity of log information from potentially large systems. One example of an important log in mobile communication systems, is the log of phone calls and data usage. This is important because the log relates to the eventual bills sent to the mobile user. In case of roaming, fees may become surprisingly big, although inside European Union the high roaming fees are vanishing. The study would find out how blockchain technology could be used for such purpose in mobile systems. Alternatively, the study could focus on identifying other use cases of blockchains in mobile communications systems.||Valtteri Niemi|
|Novel uses of human-computer interaction technologies|
Humans are creative tool users and able to discover novel solutions for practical problems in their work and everyday life tasks. This creative potential remains however scientifically weakly understood.
A master's thesis in in this area may be, for example, about development of interactive prototypes and services for creative practices; studies of artists and/or knowledge workers and their uses of IT in problem solving; or studies aiming to understand human creativity in interactive settings.
Studying the future of IT use with interactive prototypes
Computer science is probably the only discipline that is able to create futuristic prototypes and study their use in action, with end users.
Master's theses in this area may focus on developing research methods for studying futuristic use contexts with end users or development of futuristic technologies and studying how they could be used.
Understanding changes in human life through social media and messaging
Master's theses in this area may involve uses of machine learning to understand messaging-based communication; qualitative and/or quantitative analyses of large text masses; or development of methods to visualize the ongoing large-scale changes in communications.
|Security services in 5G testbed||Department of Computer Science runs a 5G technology testbed that contains 20+ base stations and a core network that is built using virtualization and cloud technologies. Among other things, the testbed allows experiments with new security and privacy services that could be built on top of future 5G networks, or alternatively, as native parts of such networks. An example of such service is extended protection against location and identity tracking of 5G users. The study would identify an interesting potential service, build a simple implementation and run an experiment over the 5G testbed. Finally, results of the experiment are analysed and reported.||Valtteri Niemi|
|Overview of CINCO group thesis topics|
This general overview of the CINCO research groups working area indicates three categories of topics:
Specific MSc thesis topics available in connection with the CINCOLab resources
|Lea Kutvonen||introductory slideset.|
|Efficiency of computational service monitoring||In inter-enterprise computing, monitoring of services have a key role in intercepting messaging not confirming to eContracts that govern the collaboration. Monitoring can thus enable privacy-preservation, prevent unfortunate transactions when a collaborating partner proves unworthy of trust, and control quality of service agreements. In inter-enterprise collaboration environments, each party is resposible of protecting its own services and resources, and observing the fulfilment of eContract rules. Therefore, monitors can be bottlenecks in the system, unless carefully designed and placed.|
The research question of this thesis focuses on the analysis of monitoring cost, utilising both calculus and practical measurements.
|Tracing ownership of information||Digital rights management is an area where access to information - or rather, media - can be restricted to an identified group of users. The domain of research provides solutions focusing on access control while other solutions focus on marking the target object itself with copyright or ownership information.|
In the inter-enterprise collaboration environments, an essential aspect is controlling the legitimate and noticing illegitimate flows of information. This thesis should provide a survey of DRM techniques applicable in this kind of environment.
|Trust calculus / Privacy calculus / Reputation calculus|
In service ecosystems, trust and privacy-preservation must be systematically supported by the ecosystem infrastructure - no isolated, partner-wide solution is sufficient. In recent litterature, steps have been taken to address the modeling principles in a formal manner. The purpose of these two topic settings is to analyse this litterature (survey), and critically select the appropriate ones for Pilarcos style ecosystems. The thesis should cover only trust or privacy - or reputation restricted to reputation-based trust systems.
|Investigating intra-blockchain transactions and network performance|
Investigating intra-blockchain transactions and network performance: Understanding the blockchain solutions for decentralised exchanges which enables intra-block chain exchanges. Investigate or simulate the network performance and propose a new routing solution for them.
This work requires investigate the source code analysis of different block chain implementations and implementing a new exchange network.
|Understanding the synchronisation techniques of blockchains|
The fundamental backbone of any blockchain network is the distributed consensus mechanism by which a consensus on the order of the blocks in the chain is reached in a distributed fashion. Every participant in the network must have a synchronised view on this order of the blocks in the chain. With the absence of any centralised node participants leverage the consensus algorithm to achieve the required synchronisation. However, the network latency involved on broadcasting any newly created block introduces “forks” at different nodes creating different views, a symptom for inconsistencies among the participants in the network. Hence, it should be avoided whoever possible. Toward this aim, it is important to study different network properties which could minimise forks as much as possible.
This work requires investigating the source code different block chain implementations and implementing new syncing mechanism. Knowledge about different synchronisation techniques, such as RSYNC, RDC are plus.
|Big data analysis on block chain transactions|
Our particular interest is Ethereum based tokens (ERC). Analaysing Ethereum transactions and finding out the number of different tokens per block. Identifying incentivise transactions and other interesting facts such as congestion in the network by identifying transaction delay and correlating with real life incidents. And investigate the any kind of patterns in the transaction for some particular patters so the we can predict future fate for some tokens.
The student must have taken the Big data framework course and motivated to the take the challenges.
|Mohammad Hoque, Sasu Tarkoma|
|Privacy-preserving indoor positioning|
Indoor positioning is a challenging task because global satellite based systems, such as GPS, are not available. Various methods based on, e.g. WiFi signal strength, have been proposed but in these kind of systems a specific server is needed to calculate user's exact location. This violates user's location privacy. The study is about combining privacy-preserving techniques based on cryptography with state of the art indoor location algorithms.
|Edge cloud server placement in wild|
Edge cloud computing is an up and coming research field which proposes to place smaller compute cloud servers on network edge such as base station, wireless access points etc. along with a typical centralized cloud. The availability of computing and storage resources near users helps provide low latency for several time-constrained applications such as automated vehicles, drones, etc. One of the prominent research questions that still requires an answer is the placement of these servers in the real world.
The study may involve evaluating the findings by integrating real world network traces and available simulators
|Implementing cross layer multipath TCP scheduler in Linux|
Multipath TCP (MPTCP) is a variant of TCP which aims at allowing a TCP connection to use multiple paths on all available network interfaces such as WiFi, cellular etc. simultaneously. MPTCP aims to maximize available network usage and provide inherent redundancy. In our previous studies, we found that the default scheduler for MPTCP hinders its capability to achieve maximum performance and several locally available system parameters can be leverage to provide a more holistic network connectivity.
|Critical data detection|
Edge Computing is a new computing paradigm where server resources, ranging from a credit-card size computer to a small data center, are placed closer to data and information generation sources. With edge computing we can significantly decrease the volumes of data that must be moved. However, how to identify the critical data that we do not want to move.
A thesis project could apply known methods of anomaly or critical data detection to a console log data repository of a distributed monitoring system.
|Samu Varjonen, Francois Christophe|
3. Software systems
MLOps -- as a derivative of DevOps -- is about practice and tools for ML-based systems that technically enable iterative software engineering practice. We have several funded positions in the area of MLOps in our research projects (IMLE4 https://itea4.org/project/iml4e.html and AIGA https://ai-governance.eu/) that can be tailored to the interest of the applicant. For further details, contact Mikko (firstname.lastname@example.org).
(Jukka K Nurminen)
|AI System Testing|
As machine learning moves from laboratories to real use, good ways to deploy, test, and maintain AI solutions become increasingly important. There are multiple topics about this research theme:
|Jukka K Nurminen|
When computers are increasingly making decisions, such as who gets a loan or how does an autonomous vehicle behave in case of emergency, the ethical questions of such decision making become important. The thesis could look at technical issues and solutions to assist in ensuring fair decisions. Alternatives include:
|Jukka K Nurminen|
|Programming of Quantum Computers|
Quantum computing is considered a promising direction for efficient solution of e.g. combinatorial optimization problems, which are common in machine learning and operations research. The aim is to look at the issue from practical perspective: what can be done today (e.g. with D-WAVE, IBM-Q ), how to formulate the problems for quantum computing, understand what are the main bottlenecks, and what are the most promising future directions. The work could include experimentation with quantum computers and their simulators and software development toolkits.
|Jukka K Nurminen|
Creatively Self-Adaptive Software Architectures
|We have recently started exciting research in the intersection between the research fields of self-adaptive software and computational creativity, with the goal of developing novel software architectures that can creatively adapt themselves in unforeseen situations. This initiative is a new research collaboration between Discovery Group of Prof. Hannu Toivonen and ESE. There are different options for thesis work with either of the groups.|
Tomi Männistö (Self-adaptive, architecture),
Hannu Toivonen (Computational Creativity)
|Robotics software and software architectures|
We are building an interesting line of research in the area of software and software architectures for robotics. This area is an intersection of software engineering and artificial cognitive systems, and takes into account knowledge from different domain areas where robots perform tasks in the physical world. Thesis work in this area can range from more technical and practical to theoretical. The perspectives include both questions about traits of the robotics platform architecture that make development of robotics applications easier and questions about implementing software for robotics systems in different kinds of physical environments.
Open Source Software Development
Open Source Software development is characterised by openly available online collaboration and communication systems. There is a growing body of work examining the data accumulating in such systems. Descriptive studies have examined, e.g., how the development process unfolds and how the social communication structure corresponds to technical actions in the code. Other studies have tried to leverage the the repository data for improving software quality, easing communication, or automating development tasks. Theses in this area could focus on, e.g., analysis of communication patterns using Natural Language Processing techniques, collecting and using software metrics, automated development process support, or methods for analysing specific kinds of repository data.
Keywords: Mining software repositories, Open Source Software
Guzzi et al., Communication in open source software development mailing lists, in Mining Software Repositories (MSR), 2013. - https://ossmeter.com/
|The emergence of millions of remotely programmable devices in our surroundings will pose signicant challenges for software developers. A roadmap from today’s cloud-centric, data-centric Internet of Things systems to the Programmable World highlights those challenges that haven’t received enough attention.||Tommi Mikkonen|
Digitalization and Digital Transformations: Impacts on Software Engineering And Systems Development
|How should digitalization be taken into account in software development processes? What is the role of customer/user involvement in software-intensive systems development (e.g., digital services)? What are the key quality attributes? What new software engineering skills and competencies may be needed? What is the role of software (and IT) in general in different digital transformations (e.g., vs. business process development)? How is digitalization related to traditional software engineering and computer science disciplines in different contexts?||Petri Kettunen|
High Performing Software Teams
|How is (high) performance defined and measured in software development (e.g., productivity)? Which factors affect it - either positively or negatively - and how strongly (e.g., development tools, team composition)? Can we "build" high-performing software teams in systematic ways, or do they merely emerge under certain favorable conditions? What are suitable organizational designs and environments for hosting and supporting such teams?||Petri Kettunen|
|Modeling human brain-signals||Our current interaction with information systems and digital information rely on explicit interaction. Could we mine the interest of the user directly from the human mind? Could unsupervised machine learning methods reveal interesting patterns of neural signals when we are engaged with digital information?||Tuukka Ruotsalo|
|Conversational search||There is a gradual shift towards searching and presenting the information in a conversational form. Chatbots, personal assistants in our phones and eyes-free devices are being used increasingly more for different purposes, including information retrieval and exploration. With the recent success of deep learning in different areas of natural language processing, this appears to be the right foundation to power search conversationalization.||Tuukka Ruotsalo||http://augmentedresearch.hiit.fi/|
|Crowdsourced natural language training data for machine learning||Natural language user interfaces (e.g. chatbots) allow the user to simply talk to the computer, much like they would to another person. To use machine learning for building a natural language UI requires usually very large amounts of training data. This training data consists of pairs of utterances in natural language, and their "meaning", e.g. some user interface action. Generating such training data for specialised applications is challenging, because in the absence of a working system it is often difficult to predict in advance what kind of language the users will use. The objective of this project is to study the use of crowdsourcing techniques, for example "games with a purpose" for collecting such training data without implementing the, possibly costly, final user interface.||Antti Ukkonen||http://dx.doi.org/10.1145/1378704.1378719|
|Testing and debugging constraint optimization solvers|
See topic description above under "Algorithms" specialization.
This topic is suited for both algorithms and software systems students.
|Transforming business-level policies to monitoring rules|
Inter-enterprise collaborations are governed by eContracts that define what are the required business processes between partners, what business services each partner provides, and especially, what are the nonfunctional properties required in the collaboration. The nonfunctional properties traditionally involve technical quality of service levels, but when enhanced to business area, interesting examples include nonfunctional properties capture trust, privacy, transactionality of interactions, and dependability of service.
The collaborations can be controlled through eContracts and enterprises' local policies. Technically, the automated control is performed by low-level monitors. From administrative perspective, the policy rules must be declared using a high-level language - or rather, a family of languages working together.
The goal of this thesis is to a) select a small set of languages fitting together and b) implementing transformation tools to translate these administrator designed rules to a low-level language run by the monitors.
|Performance optimization on big data platform|
A cloud-hosted application is expected to support millions of end users with terabytes of data. To accommodate such large-scale workloads, it is common to deploy thousands of servers in one data center. Meanwhile, existing big data platforms (e.g., Hadoop or Spark) employ naive scheduling algorithms, which consider neither heterogeneity of resources nor differences of jobs. This motivates a more advanced scheduling scheme and performance optimization algorithms in big data environments.
The goal of this thesis is to a) understand the main scheduling algorithms on Hadoop and Spark; and b) to implement performance optimization tools to improve the performance of the systems.
|Artifact Recognition Based on Images Captured by Vision System in Autonomous ships or harbour|
Cameras are used in state-of-the-art autonomous systems in order to get detailed information of the environment. However, when cameras are used outdoors they easily get dirty or scratches on the lens which leads to image artifacts that can deteriorate a system's performance. Work has been done on how to detect and cope with specific artifacts, such as raindrops.
A thesis project could apply known methods of weather recognition or artifact recognition to a repository of maritime images and review and compare different approaches. Can be extended to include the thermal images in the repository.
|Samu Varjonen, Francois Christophe|