# Caleb Stanford

I am a PhD student in computer science at the University of Pennsylvania since Fall 2016. My advisor is Rajeev Alur. View my CV online or as a pdf.

I graduated from Brown University in May 2016 (ScB Mathematics – Computer Science).

## Research

My research interests include: (1) programming languages and systems for data stream processing; (2) formal verification and testing; and (3) logical foundations of computing.

I am affiliated with the Penn PL Club:

.

### Current research: Correctness and performance guarantees for stream processing systems

Numerous specialized software platforms now exist for processing large quantities of data and responding in real time. Such stream processing systems are popular because they allow the programmer to specify the computation in an intuitive way (e.g., as a high-level query, as a sequence of stream transformations, or as a dataflow graph), and the system will deploy and distribute the computation automatically. Popular modern stream processing systems include Apache Spark Streaming and Apache Flink.

My long-term goal is to make stream processing systems easier to use by providing better guarantees to the developer about correctness and performance. I have past and ongoing projects in these dimensions:

• Correctness and reliability: We have built tool support on top of Apache Storm (PLDI 2019) and Apache Flink (OOPSLA 2020) for testing and verification to prevent bugs due to data parallelism. Our work in Apache Storm shows that, by manipulating data streams with extra type information, programmers can statically ensure that their code parallelizes correctly. Our work in Apache Flink focuses on testing, showing that programmers can use similar annotations to automatically test applications for bugs due to parallelism, without having to modify the application source code. Finally, we are currently designing a specialized stream processing system, called Flumina (in submission), which automatically distributes sequential streaming computations while preserving correctness.

• Performance: We have proposed techniques for compilation and optimization of streaming computations. For high-level query languages incorporating user-defined stateful and quantitative computations, we show that a new intermediate representation can be used to achieve efficient compilation with static bounds on performance (POPL 2019), and formally study the benefits of related program representations (ICALP 2017, TCS 2020). In the vein of distributed compilation, our Flumina system aims to distribute streaming programs while minimizing necessary synchronization between nodes. We are also investigating optimization strategies in this context for edge computing and IoT applications.

### Publications

*equal contribution authors in alphabetical order

### Submissions

*equal contribution authors in alphabetical order

### Software

• ∂Z3 (github): A new constraint solver for regular expressions, now the default in Z3.

• DiffStream (github): Differential testing for Apache Flink programs.

• Flumina (github): A programming model for stream processing with parallelizable synchronization primitives and predictable semantics.

## Other Projects

• Internship at Microsoft Research, Research in Software Engineering (RiSE) Group (summer 2020)

I worked on a new constraint solver called ∂Z3 for solving regular expression constraints on strings, based on symbolic derivatives. We support Boolean combinations of regular expression constraints more efficiently than competing solvers. For more information, see the Microsoft technical report (in submission). Our solver is now the default regular expression solver shipped with Z3. Try it out here.

• Internship at Amazon Web Services, Automated Reasoning Group (ARG) (summer 2019)

I developed tools to automate the security review process at AWS. I leveraged SMT-based technology built at ARG (see Semantic-based Automated Reasoning for AWS Access Policies (FMCAD 2018) and this AWS blog post) to analyze the permissions configurations of cloud resources. I used this analysis in conjunction with other account data to more easily detect AWS account configurations deviating from security best practice.

More information about ARG can be found at Byron Cook’s page.

• Formal verification of properties of knowledge (spring 2016)

My undergraduate capstone project at Brown was on verifying the solutions to several epistemic logic puzzles in the Alloy programming language. The abstract is here. You can also download the Alloy files and run them yourself; they are here.

• Undergraduate research (summer 2015)

I attended the Complexity Across Disciplines Undergraduate Math REU in summer 2015. The work was in graph theory and combinatorics related to computational biology. We studied a particular operation, called a ‘context-directed reversal’, on signed permutations — permutations of $[1,\ldots,n]$ where each element is additionally given a sign.

The main result classifies exactly which signed permutations are sortable by context-directed reversals. We are unable to provide a formula for the number of such signed permutations, but we relate signed permutations to a subclass of graphs, and provide a formula for the number of graphs sortable by an analogous graph operation. We prove that asymptotically it is 1/3 of all graphs.

Here are some slides on our work and a final poster which we presented at the Joint Math Meetings (JMM) in January 2016, winning an Outstanding Presentation Award.

## Exposition

• Geo-distributed stream processing systems (spring 2020)

Existing distributed stream processing systems (DSPS) rely on sending data to a central computing cluster or data center to be processed, which means there are fundamental limits to (1) the end-to-end latency and (2) the network bandwidth used to communicate with the data center, particularly when the data to be processed is highly geo-distributed. In this survey, I look at some works which extend DSPS to the geo-distributed setting by executing stream processing jobs over a geo-distributed network of nodes. This area is not well explored, and this selection of works is fairly preliminary. This was for my WPE II exam (a written and oral presentation requirement in the University of Pennsylvania CIS doctoral program). Here is the written report and slides.

• Online, no-regret machine learning on large sets of experts (spring 2018)

Classical algorithms for online, no-regret learning from expert advice (e.g. randomized weighted majority) work efficiently only for a small number of experts. Specifically, the streaming algorithm in this setting takes O(n) time to process each data item, where n is the number of experts. For a project in CIS 625 (Computational Learning Theory), we discussed some of the literature that attempts to extend the ideas of no-regret learning to efficient algorithms for large or even infinite classes of experts. To see our view on two very different approaches, see the project report.

• SPEED computational complexity estimation (fall 2016)

For a project in CIS 673 (Computer-Aided Verification), I gave a presentation on the paper SPEED: precise and efficient static estimation of program computational complexity. You can read my take on the paper in my project report, or take a look at the slides.

• Model theory notes (spring 2015)

With a group of 8 other students, I ran a group independent study project (GISP) on model theory, in the spring semester of 2015. The website for this class may be still accessible here.

Here are some of the notes I wrote for the class:

(The notes are not entirely free of errors.)

## Contact

castan at cis upenn edu

Last updated .