Thesis Topics 2023
This is an overview of topics offered for Bachelor and Master thesis projects.
Contact me for further topics, or if you have your own idea that fits into our Big Datacube research.
All work follows the procedures, so you may want to study these first.
Programming prerequisites should be taken seriously - in all cases, non-trivial implementation in one of several languages is involved.
Code will regularly add functionality to our rasdaman system and, as such, be used by our project partners and the general scientific and technical community;
hence code quality (including, e.g., concise tests and documentation) is an integral evaluation criterion.
Generally, I appreciate not only the result, but also the way towards it - therefore, showing continuous progress, initiative, and well-planned work is certainly an asset.
Knowledge characterized as "advantageous" means that it is not mandatory, but not bringing it along will increase workload significantly, and make deadlines tight.
We reserve the right not to assign a topic to a student, for the student's sake, if the risk is too high that a good result will not be achieved.
If your report is of sufficient quality to be submitted successfully to a conference or journal for publication, this will be considered a strong plus.
Note that only the topics below will be accepted for supervision, due to resource constraints.
Overview (strikethrough = topic taken)
Establishing a Function Repository for Datacube Analytics
- topic:
User-Defined Functions (UDFs) are a concept in databases which serve to link third-party code at runtime into the database kernel and allow their invocation from within database queries. To the user, this looks like an enhancement of the query language.
The rasdaman datacube engine supports UDFs through a C++ API, and they have already been used for various purposes;
examples include neural networks, CPU-specific acceleration code, etc.
Gradually, the set is getting so large that the functions need to be managed.
The task at hand is to establish a coherent schema for such packages, bring in the existing UDF packages, document them, and provide examples for each one.
Additionally, one new package is to be added, namely a selection of representative OpenCV functions.
- team size: 1 - 2
- prerequisites: solid C++ skills
- classification: application of existing code, some own programming
- particularities: task is scalable for 1 or 2 students
Geospatial Big Data Processing with R
- topic:
Next to Python, R is a primary language for data analytics today. In this topic, the focus is on multi-dimensional geo data, such as 3D satellite image timeseries and 4D weather data.
Such "Big Data" cannot be transported to and processed on the client; therefore, processing needs to be delegated to the server. One high-level, declarative language for this purpose is the Web Coverage Processing Service (WCPS).
See this tutorial for an overview on WCPS.
The task at hand is to establish Jupyter notebooks illustrating how to send WCPS queries to a server, receive results, and make them available in R for further processing and display. The result, if of sufficient quality, will become part of the open-source rasdaman community.
Available is:
- a coupling of R to rasql, another datacube processing language
- servers with plenty of spatio-temporal Earth data for testing and demonstration
- Jupyter notebooks for WCPS in Python
- team size: 1
- prerequisites: DBWS knowledge of Web services; R (or Python) knowledge is advantageous, but not mandatory
- classification: scientific Web services
- particularities: Jupyter notebook programming
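Sending a WCPS query to a server, as described in this topic, can be sketched as follows. This is only a minimal illustration in Python, assuming the server accepts WCPS queries through the WCS `ProcessCoverages` key-value request; the endpoint URL, coverage name, and axis label are hypothetical placeholders, and the notebooks themselves would do the equivalent from R.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the OWS URL of the actual server.
ENDPOINT = "https://example.org/rasdaman/ows"


def wcps_request_url(query, endpoint=ENDPOINT):
    """Build a GET URL submitting a WCPS query as a ProcessCoverages request."""
    params = {
        "service": "WCS",
        "version": "2.0.1",
        "request": "ProcessCoverages",
        "query": query,  # urlencode takes care of escaping the query string
    }
    return endpoint + "?" + urlencode(params)


# Coverage name "MyCoverage" and axis "ansi" are placeholders for illustration.
query = 'for $c in (MyCoverage) return encode($c[ansi("2020-01-01")], "image/png")'
url = wcps_request_url(query)
print(url)
```

The resulting URL can be fetched with any HTTP client; the response body is the encoded result (here a PNG image), which then has to be decoded on the client side for further processing.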
Semantic Modelling of Geospatial Coverages
- topic:
The term "coverage" denotes, in standardization, a concept for digital representations of multi-dimensional physical fields, like magnetism, wind speed, temperature, etc.
Practically, geo coverages resemble gridded (raster) data, point clouds, and general meshes.
A new standard, ISO 19123-1, is in the final stage of adoption. It describes coverages more concisely than before.
See this paper for an informal introduction, and this tutorial for coverage services.
This comes at a time of high interest in data representations suitable for automatic reasoning, so-called knowledge representation.
The task at hand is to model coverages as knowledge graphs, based on the UML model established with ISO 19123-1. Using this graph technique, a small example coverage should be described, and a few reasoning examples should be given.
Implementation tools include the RDF data standard and tools like Protege and Apache Jena. A survey should be conducted to pick the most suitable and convenient tool for the above task.
Emphasis is on grid data, such as 2D drone images, 3D satellite image timeseries, and 4D x/y/z/t weather data. Note that this task is NOT about image processing, image understanding, or ML, but about rule-based reasoning on structural information: spatio-temporal extent ("do these two coverages overlap?"), derivation of important parameters ("what is the size of this coverage, uncompressed?"), and completeness of a coverage data structure (as coverages are usually established piecemeal, they might be incomplete at some given state, and the tool needs to recognize this).
- team size: 1
- prerequisites: none in particular
- classification: automatic reasoning on geo raster data
- particularities:
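The two structural questions quoted in this topic ("do these two coverages overlap?", "what is the size of this coverage, uncompressed?") can be made concrete with a small sketch. It is written in plain Python purely for illustration; the `GridCoverage` class and its fields are hypothetical simplifications, not the ISO 19123-1 model, and in the thesis such rules would be expressed on top of an RDF knowledge graph instead.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class GridCoverage:
    """Hypothetical, simplified structural description of a gridded coverage."""
    axes: dict           # axis name -> (low, high) coordinate bounds, inclusive
    shape: dict          # axis name -> number of grid points along that axis
    bytes_per_cell: int  # size of one cell value, all range components together


def overlap(a, b):
    """True if both coverages have the same axes and their extents intersect."""
    if set(a.axes) != set(b.axes):
        return False
    return all(
        a.axes[n][0] <= b.axes[n][1] and b.axes[n][0] <= a.axes[n][1]
        for n in a.axes
    )


def uncompressed_size(c):
    """Size in bytes: number of grid cells times bytes per cell."""
    return prod(c.shape.values()) * c.bytes_per_cell


a = GridCoverage({"Lat": (40, 50), "Long": (0, 10)}, {"Lat": 1000, "Long": 1000}, 4)
b = GridCoverage({"Lat": (45, 55), "Long": (5, 15)}, {"Lat": 500, "Long": 500}, 1)
c = GridCoverage({"Lat": (60, 70), "Long": (0, 10)}, {"Lat": 100, "Long": 100}, 2)

print(overlap(a, b), overlap(a, c), uncompressed_size(a))
```

Here `a` and `b` share part of their extent, while `a` and `c` are disjoint along the Lat axis.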
Enhancing an Open-Source Web GIS with Raster Analytics Query Support
- topic:
QGIS is an open-source Web GIS which can display vector and raster data fetched from different servers.
One such server can be the rasdaman datacube service which supports, among others, the spatio-temporal geo analytics language WCPS.
QGIS has some rudimentary support for WCPS through a particular plugin which, however, is buggy, inconvenient, and poor in functionality.
The task at hand is to develop a new WCPS plugin which accepts queries (i.e., strings), sends them to a server, and displays the results appropriately.
Plugin implementation language is Python. Servers are available, with plenty of data to test with.
The expected result is well-working (Python) plugin code, published in the official QGIS plugin repo.
- team size: 1
- prerequisites: Python
- classification: client development
- particularities: -
Rim Redundancy for Enhanced Big Datacube Analytics
- topic:
Array databases allow analytics on massive multi-dimensional arrays via their specialized array query language.
The operators include not only per-pixel operations, but also more expensive "neighbourhood" operations where each new pixel is derived from a pixel region instead; one way to describe such operations is through convolutions, allowing operations like edge detection with the Sobel filter.
Evaluation of such operations is complicated by the partitioned storage on disk: at the rim of a partition, pixels are needed which reside in the neighbour partition, and accessing an additional partition is costly.
In order to find an efficient evaluation where each partition is read from disk no more than once, one idea is to extend each partition with a "rim" of adjacent pixels so that convolution operations up to some kernel size can be executed inside the partition. The overlap, i.e., the data redundancy introduced, complicates storage management during insert and update operations and therefore needs to be managed.
The task at hand is to experimentally demonstrate feasibility of this "rim redundancy" on the basis of rasdaman community by (i) extending the insert statement with a clause determining rim thickness in each dimension, (ii) storing this extra information in the DBMS, and (iii) evaluating such information during query evaluation by modifying the partition access operator. An extensive performance evaluation is expected.
- team size: 1
- prerequisites: Linux, C++
- classification: algorithmic
- particularities: the concept is well established, main effort is on implementation and testing
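The rim idea can be illustrated with a toy example. The sketch below is in Python and works on a 1-D list instead of stored multi-dimensional partitions; the kernel, the function names, and the tile layout are assumptions made for illustration only and do not reflect how rasdaman stores or accesses tiles.

```python
KERNEL = (-1, 0, 1)        # simple 1-D gradient kernel (illustrative choice)
RADIUS = len(KERNEL) // 2  # cells needed on each side of the centre cell


def apply_kernel(cells, j):
    """Weighted sum of the cells around position j."""
    return sum(k * cells[j - RADIUS + n] for n, k in enumerate(KERNEL))


def filter_whole(data):
    """Reference result: filter the complete array at once."""
    return [apply_kernel(data, i) for i in range(RADIUS, len(data) - RADIUS)]


def tiles_with_rim(data, tile_size, rim):
    """Yield (start, end, lo, cells): the tile [start, end) stored together
    with up to `rim` extra cells from each neighbour, beginning at index lo."""
    for start in range(0, len(data), tile_size):
        end = min(start + tile_size, len(data))
        lo = max(start - rim, 0)
        hi = min(end + rim, len(data))
        yield start, end, lo, data[lo:hi]


def filter_tiled(data, tile_size):
    """Filter tile by tile; each output cell reads exactly one stored tile."""
    result = []
    for start, end, lo, cells in tiles_with_rim(data, tile_size, RADIUS):
        for i in range(max(start, RADIUS), min(end, len(data) - RADIUS)):
            result.append(apply_kernel(cells, i - lo))
    return result


data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
assert filter_tiled(data, tile_size=4) == filter_whole(data)
```

With a rim as thick as the kernel radius, the tile-wise result equals the result on the whole array, and no tile needs a second tile to be read. The price is that rim cells are stored twice, which is exactly the redundancy that inserts and updates have to keep consistent.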
Aviation-Specific Map Visualization
- topic:
In aviation, the common 2-D geographic maps have special additions (see this example) which are important for pilots during flight planning and execution. In our research we strive for powerful visualization techniques for map data coming from multi-dimensional datacubes; currently, aviation-specific presentation methods are missing.
The task at hand is to demonstrate aviation-specific visualization of map data derived through datacube queries.
Implementation will be done using the rasdaman Array DBMS.
- team size: 1
- prerequisites: Web programming, JavaScript
- classification: algorithmic, visualization, GIS
- particularities: non-trivial presentation rules exist which commonly are taken for granted by the experts, but undocumented - so they have to be discovered in the requirements analysis phase
Dynamic Repartitioning of Large Arrays
- topic:
Large arrays - in particular: larger than main memory - are stored on disk partitioned ("tiled") into subarrays, which allows retrieving partial arrays through "subsetting" without loading the complete array into RAM. Even some data formats, like TIFF and NetCDF, support such an internal partitioning.
Array DBMSs can hide such partitioning by performing internal management. Advanced systems give support to the administrator for defining particular tiling schemes, thereby allowing to tune the storage structure to query workloads.
Typically, systems allow only regular tiling (i.e., equi-sized partitions); the most advanced system in this respect, rasdaman, supports arbitrary tile structures, defined through a storage layout sub-language. However, an initial tiling is not enough - sometimes query patterns change, and then the tiling should be re-adjusted. Obviously, this involves physical tile reshaping and copying on disk, which is expensive. Optimizing it is therefore highly desirable.
The task at hand is to devise an algorithm which, given an existing and a target tiling pattern, performs a minimum number of copying steps to transform the stored array from the former to the latter structure. This algorithm is to be embedded in the UPDATE statement of the array query language. Both theoretical considerations and a benchmark should substantiate that the result is optimal. Implementation will be done on the open-source rasdaman community Array DBMS.
- team size: 1
- prerequisites: C++
- classification: algorithm design, language integration
- particularities: the query parser is implemented in flex and bison
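As a starting point for thinking about retiling, the following Python sketch enumerates which regions have to move from which existing tile to which target tile. It is a naive baseline under simplifying assumptions (tiles as boxes of inclusive integer bounds, regular tilings only in the example), not the optimized algorithm the thesis asks for; all function names are made up for illustration.

```python
from itertools import product


def intersect(a, b):
    """Intersection of two boxes given as tuples of inclusive (lo, hi) bounds."""
    box = tuple((max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))
    return box if all(lo <= hi for lo, hi in box) else None


def regular_tiling(extent, tile_shape):
    """Cut the extent into equi-sized tiles (border tiles may be smaller)."""
    per_axis = [
        [(s, min(s + size - 1, hi)) for s in range(lo, hi + 1, size)]
        for (lo, hi), size in zip(extent, tile_shape)
    ]
    return [tuple(t) for t in product(*per_axis)]


def copy_steps(old_tiles, new_tiles):
    """List (old_tile, new_tile, region) triples describing what must be copied.
    Tiles that are identical in both layouts are left untouched."""
    unchanged = set(old_tiles) & set(new_tiles)
    steps = []
    for new in new_tiles:
        if new in unchanged:
            continue
        for old in old_tiles:
            region = intersect(old, new)
            if region is not None:
                steps.append((old, new, region))
    return steps


extent = ((0, 3), (0, 3))
old = regular_tiling(extent, (2, 2))  # four 2x2 tiles
new = regular_tiling(extent, (4, 1))  # four 4x1 stripes
steps = copy_steps(old, new)
print(len(steps))
```

In this example every stripe of the target layout collects data from two of the old tiles. An optimized algorithm would additionally have to decide the order of these steps and how many tiles are kept in memory at a time.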
Augmented Reality for geo-visualization using rasdaman
- topic:
The rasdaman services available (see these demos)
offer multi-dimensional, typically spatio-temporal datacubes.
The task at hand is to develop a front-end, using the standards-based APIs, which takes these data and provides a virtual-reality immersive experience using some appropriate device.
- team size: 1
- prerequisites: VR, geo data, API programming
- classification: system integration
- particularities: needs immersion into geo raster data
A Time Slider for Time Selection in Datacubes
- topic:
Spatio-temporal datacubes can be visualized in various ways.
Typically, user-friendly area selection is only available for the spatial dimensions and not for time.
The task is to integrate the EOX timeslider, based on D3, with NASA WorldWind, thereby replacing the initial implementation currently available.
This integration needs to be documented in a way that allows adding it to further tools later, such as Cesium.
- team size: 1
- prerequisites: JavaScript
- classification: browser GUI tool integration
- particularities: D3, CoffeeScript
Vector Files as Datacube Query Parameter
- topic:
OGC Web Coverage Processing Service (WCPS) is a geo datacube query language with integrated spatio-temporal semantics based on the notion of a multi-dimensional coverage which may represent a datacube.
Queries can be parametrized, among others with vector polygons, which allows "cutting out" arbitrary regions.
Currently, these vectors have to be provided in an ASCII representation called Well-Known Text (WKT).
However, the most widely used format in the geo universe is not WKT, but ESRI Shapefiles, a binary format.
The goal is to add support for the Shapefile format for vector upload in the petascope component of rasdaman, next to the existing WKT decoder.
Open-source libraries for decoding exist, for example GeoTools and shapelib; one of those should be used.
Appropriate tests should be established to demonstrate that the Shapefile decoder works properly.
- team size: 1
- prerequisites: Java, Linux
- classification: query language enhancement
- particularities: -
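For readers unfamiliar with WKT, the following sketch shows how a polygon ring, as a Shapefile decoder might deliver it, looks in that text form. It is written in Python for brevity (the actual work is in Java), handles only a single exterior ring, and the function name is made up for illustration; whether the new decoder should go through WKT at all is a design decision of the thesis, not something assumed here.

```python
def ring_to_wkt(ring):
    """Format a sequence of (x, y) vertices as a WKT POLYGON with one ring."""
    points = list(ring)
    if len(points) < 3:
        raise ValueError("a polygon ring needs at least three vertices")
    if points[0] != points[-1]:
        points.append(points[0])  # WKT rings repeat the first vertex at the end
    coords = ", ".join(f"{x!r} {y!r}" for x, y in points)
    return f"POLYGON(({coords}))"


triangle = [(0, 0), (4, 0), (4, 3)]
print(ring_to_wkt(triangle))
```

For the triangle above this yields `POLYGON((0 0, 4 0, 4 3, 0 0))`.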
Interval arithmetics in a Datacube Query Language
- topic:
The rasdaman array query language, rasql, uses multi-dimensional intervals (mintervals) to specify cutouts from a multi-dimensional array ("datacube").
In places it would be handy to have expressions available on mintervals, such as interval union, intersection, and difference.
The goal is to extend rasql with support for such minterval arithmetics.
Appropriate tests should be established to demonstrate that the code works properly.
- team size: 1
- prerequisites: C++, Linux
- classification: query language enhancement
- particularities: -
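What such minterval operations could compute is sketched below in Python, with a minterval like `[0:9,0:9]` modelled as a tuple of inclusive (lo, hi) pairs. The function names and the chosen semantics are assumptions for illustration: the union is represented by the bounding hull, and the difference by a list of disjoint mintervals, since neither is a single minterval in general. Defining the actual rasql semantics is part of the thesis.

```python
def intersection(a, b):
    """Common part of two mintervals, or None if they are disjoint."""
    box = tuple((max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))
    return box if all(lo <= hi for lo, hi in box) else None


def hull(a, b):
    """Smallest minterval containing both operands (bounding box of the union)."""
    return tuple((min(al, bl), max(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))


def difference(a, b):
    """Cells of a not in b, as a list of pairwise disjoint mintervals."""
    common = intersection(a, b)
    if common is None:
        return [a]
    pieces = []
    rest = list(a)  # narrowed to the common part dimension by dimension
    for d, ((lo, hi), (clo, chi)) in enumerate(zip(a, common)):
        if lo < clo:  # slab below the common part along dimension d
            pieces.append(tuple(rest[:d] + [(lo, clo - 1)] + rest[d + 1:]))
        if chi < hi:  # slab above the common part along dimension d
            pieces.append(tuple(rest[:d] + [(chi + 1, hi)] + rest[d + 1:]))
        rest[d] = (clo, chi)
    return pieces


a = ((0, 9), (0, 9))    # corresponds to [0:9,0:9]
b = ((3, 5), (4, 12))   # corresponds to [3:5,4:12]
print(intersection(a, b), hull(a, b), difference(a, b))
```

In this example the difference consists of three slabs which together contain 82 of the 100 cells of `a`; the remaining 18 cells form the intersection.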