Academic writing in High Energy Physics

It turns out I have not written on this blog since last April, which is a bit disappointing.

Since then, I have become more and more involved in academic writing: I have a couple of draft articles (not within the Collaboration) that I am now polishing, and I got a contract with a prestigious press for a textbook due next year. As a result, I started writing almost every working day, which is something that you don’t really do as a particle physicist.

The life of a particle physicist in a large experimental Collaboration revolves around analysis work and service work. Typical service work consists of accessory tasks such as tuning some calibration of the detector, reviewing a specific aspect of analyses you did not perform yourself, or other menial tasks that are nevertheless extremely important for the Collaboration to keep functioning. Not much writing there (except for emails; you will always be writing emails).

The typical analysis work can be roughly schematized in a workflow like this:

  • Design an analysis targeting an interesting physics case, and read the relevant literature (old analyses targeting the same case, related theory papers, etc.);
  • Perform the analysis (select an interesting subset of your data sample, estimate some tricky accessory quantities you need, study the systematic uncertainties your analysis is affected by, extract estimates for the parameters you are targeting);
  • Present the analysis a few times in meetings to get feedback from other members of the Collaboration;
  • Write detailed internal documentation (the Analysis Note), and get some more feedback;
  • Write a draft of the public documentation (journal paper or preliminary analysis summary);
  • Get the analysis approved from the point of view of the physics;
  • Get the paper approved from the point of view of the writing (including the best way of relaying the desired concepts, and style/grammar considerations).

I don’t claim total generality; I just find that most of the colleagues I know, and I myself, follow this workflow. You might have a different one, probably a better one, and that’s just fine.

The implication of such a workflow is that you end up writing down the documentation (internal or external) only after having finalized the bulk of the analysis work; until that moment, the logical organization of the material is deferred to slides presented at meetings. When you write the documentation you are also generally under pressure to meet some deadline, usually a conference at which your result should be presented. Sadly, sometimes there is not even much organization of the material to be done, because most analyses have been performed and optimized in the past, and the modifications you can make are kind of adiabatic (plugging in a different estimate for a specific background, training a classification algorithm, and so on). For new analyses, the track is predetermined anyway (tune your object identification, tune your event selection, estimate backgrounds, plug in some analysis method specific to the case at hand, estimate systematic uncertainties, calculate the final numbers representing your result).

That’s all fine, but the unintended consequence of this workflow is, in my opinion and experience, that academic writing ends up relegated to the role of a task that you have to do pretty quickly and that is a mere accessory to an analysis you have already done.

Things are made worse by the last stage of the workflow; the review of the paper text carried out by the Collaboration (usually in the form of a Publication Committee) is designed to standardize the text of all the Collaboration’s papers and to ensure the highest standards of quality of the resulting text. The problem is that, while iterating with the internal reviewers on the text, you will often feel that your authorship is being taken away from you. What I mean is that the set of rules and comments is designed to produce a perfect Collaboration text, and this will strip most of your personality (reflected in your personal writing style) away from the paper, unless you discuss a lot and manage to slip some lively bits into it.

Just to make things clear, I am not complaining about the existence of these rules; it is certainly desirable that the Collaboration outputs papers with the highest standard of text quality, and setting internal reviews and writing rules is a necessity. It’s just that the papers end up being the Collaboration’s papers, not your papers.

In any case, my point is that this kind of workflow unwittingly teaches us that writing is the last thing you do after having done everything else, and that the final result is not entirely under your control, because it will be the product of the Collaboration.

If you look at other fields, maybe even going into the social sciences or the humanities, writing tends to be seen more as a necessary tool to organize your thoughts. This generally applies to the point of using writing to organize your thoughts into a paper-like format, which helps you identify, at any stage, what you need from an analysis point of view, but it also applies in general to taking random notes to fix your thoughts and reorganize them.

Once I started writing for my own projects regularly, I realized that what in high school was a vague unidentified feeling is actually a clear truth: writing is probably the best way of interacting with your own mind, and that is true regardless of what you are writing about (work, feelings, life in general). Writing activates your mind and enhances its capabilities.

In addition to the projects I am working on, I started to regularly jot down notes on pretty much anything (meetings, random thoughts, summaries of papers I have read, etc.). The result is that I feel more focussed, I feel like I am thinking more clearly about pretty much anything, and I am retaining information much more easily. A bonus is also that I can retrieve from my notes any information I have forgotten or not retained!

In high school I could write pretty easily, but I guess my ability has atrophied over the years; now I think I have regained it and pushed it even further. I can now probably be described as a writing junkie. A resource that helped me quite a lot in regaining momentum is Joli Jensen’s Write No Matter What, a very nice book whose main point is that in order to write you should have frequent, low-stress, and high-reward contacts with your writing.

How does all of this apply to this blog? Well, for a long time I thought that to write regularly I would need to regularly produce very long pieces of text, mainly because the blogs I usually enjoy reading are made of very long posts. Recently I started to follow, and enjoy a lot, a blog which mixes longer posts and very short random posts, and I finally came to terms with the idea that a blog can be entertaining and useful even if a post is very short or consists of a single random idea jotted down. I will try this new format. I actually started this post with the idea of writing just a few lines to kick off the blog again and look, here I am at 1310 words and a couple more paragraphs to go.

I even have plans for a whole series of posts. The COVID-19 boredom induced me to slip a couple of slides about The interesting paper of the week into the news slides of the weekly meeting I chair at my institution. It’s a meeting about the group’s CMS efforts, but all the papers I am slipping in are about Bayesian statistics or machine learning because that’s where my interests lie right now. Yesterday it suddenly dawned on me that porting those weekly slides to weekly posts would make for a great low-stress series.

So, basically, I’m back, with plans to finally kick this blog off on its intended course.

Possibly a bug in the R package “Appell”?

I am using R for a project of mine; I had used R a few years ago in a very elementary way, but I had never gone into it seriously.

Thanks to a statistician (Grzegorz Kotkowski, an ESR of the AMVA4NewPhysics network) who did an internship under my supervision at the Universidad de Oviedo last year, I got acquainted with RStudio, and decided to give it a try.

I had some trouble at the beginning, mostly in figuring out the peculiarities of R with respect to other languages, and nowadays I mostly fight with ggplot to tweak the graphics of my plots.

However, today I was unit-testing my code for a plot I made that was highly suspicious; the variable to be plotted appeared to have a value of Inf most of the time.

Digging inside the code, I figured out that the Bayes Factor in these cases was… negative!!! Now, this is bad in a huge way, because the Bayes Factor is supposed to be positive by definition (at least in usual Bayesian statistics, which obeys the Kolmogorov axioms); it turns out that the negative sign came from the Gauss hypergeometric function {}_2 F_1(a, b, c; x) that I was importing from the package Appell 0.0-4.
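To spell out why a negative value cannot be right, here is the usual definition of the Bayes Factor. The symbols are illustrative assumptions rather than the notation of my code: D is the data, M_0 and M_1 are the two models being compared, and \theta_i and \pi are their parameters and priors. Both marginal likelihoods are integrals of non-negative functions, so their ratio can never be negative.

```latex
B_{10}
  = \frac{p(D \mid M_1)}{p(D \mid M_0)}
  = \frac{\int p(D \mid \theta_1, M_1)\, \pi(\theta_1 \mid M_1)\, d\theta_1}
         {\int p(D \mid \theta_0, M_0)\, \pi(\theta_0 \mid M_0)\, d\theta_0}
  \;\geq\; 0
```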

According to the package documentation, two methods taken from the literature are used for computing the function: the Forrey method and the Michel-Stoitsov method, the latter being the default.

Now, what I am interested in is {}_2 F_1(30.5, 76.5, 31.5; -1), so I naively tried to change the default method:

> hyp2f1(30.5, 76.5, 31.5, -1, algorithm='michel.stoitsov')
[1] -3.575286e-19+0i
> hyp2f1(30.5, 76.5, 31.5, -1, algorithm='forrey')
[1] NaN+NaNi

As you can see, the alternative method (Forrey) doesn’t even work. However, I understand the input parameters have relatively large values, so I was not worried about that. What worried me was still the negative value for the real part as given by the Appell implementation of the Michel-Stoitsov algorithm!

I then cross-checked with Wolfram Alpha:

> hypergeometric2f1(30.5,76.5,31.5,-1)
7.84840 × 10^-22

Indeed, the result is positive. What to do? Should I assume Wolfram Alpha is the correct one? I had no idea, nor did I want to dig into the details of the two calculations, so I thought of cross-checking with Python. Why Python? Because in Python {}_2 F_1(a, b, c; x) is implemented in both mpmath and scipy, two packages that are routinely used and debugged by lots of people:

> library(reticulate)
> mpmath <- import('mpmath')
> scipy_special <- import('scipy.special')
> scipy_special$hyp2f1(30.5, 76.5, 31.5, -1)
[1] 7.848397e-22
> 
> mpmath$hyp2f1(30.5, 76.5, 31.5, -1)
7.84839654445994e-22
> 

OK, it definitely looks like the Appell implementation has an issue!

It might also be that Wolfram Alpha, mpmath, and scipy all have the wrong implementation, but they are under far more scrutiny (active developers, etc.) than Appell, and the value I get from them is the one that yields a meaningful result (positive probability…); at this point I would not bet on Appell’s implementation being correct.
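For an extra check that does not depend on any library, here is a minimal pure-Python sketch. It assumes the standard Pfaff transformation, {}_2 F_1(a, b, c; x) = (1 - x)^(-b) {}_2 F_1(c - a, b, c; x/(x - 1)), which for x = -1 moves the argument to 0.5, where the plain power series converges and all of its terms are positive for these parameters. The function names are hypothetical, and this is neither what Appell, mpmath, nor scipy do internally; it is only a sanity check for this one case.

```python
def hyp2f1_series(a, b, c, z, tol=1e-17, max_terms=10_000):
    """Sum the Gauss power series for 2F1(a, b; c; z); used only for |z| < 1."""
    if abs(z) >= 1:
        raise ValueError("the plain series is only used here for |z| < 1")
    term = 1.0   # the n = 0 term of the series
    total = 1.0
    for n in range(max_terms):
        # Ratio of consecutive terms: (a + n)(b + n) / ((c + n)(n + 1)) * z
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) <= tol * abs(total):
            return total
    raise RuntimeError("series did not converge within max_terms")


def hyp2f1_pfaff(a, b, c, z):
    """2F1(a, b; c; z) for real z < 0.5, via the Pfaff transformation."""
    if z >= 0.5:
        raise ValueError("this sketch only handles real z < 0.5")
    w = z / (z - 1.0)  # z = -1 maps to w = 0.5
    return (1.0 - z) ** (-b) * hyp2f1_series(c - a, b, c, w)


value = hyp2f1_pfaff(30.5, 76.5, 31.5, -1.0)
print(value)  # positive, of order 1e-22
```

Since every term in the transformed series is positive here, no cancellation can flip the sign, and the result agrees with the mpmath and scipy values quoted above.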

Now the situation forks into a practical solution and a proper solution. The practical solution is that my plot now uses the scipy implementation (the mpmath one had some issues in the R vectorization, whereas scipy works fine).

The proper solution is that I tried to file a bug report: however, the CRAN page for the Appell package does not provide any means of filing a bug. A quick Google search pointed me to a read-only CRAN GitHub repository, but it is not possible to open an issue there (the interface has not been activated, apparently), and I do not currently have the time to debug the (FORTRAN, ouch!) code and prepare a pull request.

I therefore wrote an email to Daniel, the package maintainer, and to Gábor, who, if I interpreted the GitHub repository and its commits correctly, is the person at CRAN responsible for committing the package.

Gábor actually answered, but he is not involved with the package itself, whereas the email to Daniel bounced back, because of “inexisting email address”. I think I’ll just wait until I have the time and willingness to stick my head into the FORTRAN code, and eventually prepare a pull request, but I don’t know when this will happen, nor do I know if and when I will get an answer back from the maintainers.

This post is a bit of a public bug report, and a bit of an attempt to get in touch with the maintainer; if you know of a (more proper?) way of reporting a bug for an R package, please let me know in the comments below!

Active inactivity

My introductory post, Welcome to This New Beginning, came on February 19th, 2018. Tomorrow will mark 10 months since then; should I feel bad about it?

Yes and no.

Let me explain. I had started this project out of frustration and out of the need for a new haven in which to rant about what I really care about, but then stuff piled up and I could not bring myself to produce content for this blog. The exception is 4 drafts that are still in a very crude form, and that have been expanded at the staggering rate of about 4 or 5 words every couple of months.

As I have been preaching to the AMVA4NewPhysics students (in my capacity as Outreach Officer of the network), the key to a successful blog is building engagement, and engagement is built in perhaps equal parts by interesting, high-quality content and by a frequent and regular update pace. I would not judge quality by the introductory post (nor by the post you are reading now), so all I am left with is the frequency, which is horrendously low.

Yet this has been a very productive year, in which I found stability in a newfound balance between CMS and non-CMS research and between life and work. I moved to Belgium in July, and am now a researcher at the Institut de recherche en mathématique et physique of the Université catholique de Louvain; the institute offers an amazing melting pot of experimental physicists, theoreticians, phenomenologists, and generator folks. I am very excited to be here, and am looking into bringing to light the non-CMS fruits of this melting pot (on the CMS side, I have worked on an update to the observation paper of ttH production, and most importantly on a paper on the WZ cross section measurement and search for anomalous triple gauge couplings that is being submitted to JHEP today or tomorrow).

Most importantly, thanks to the new work gig my girlfriend and I moved in together: we actually got married in September in Belgium (followed by a white-dress party with family and friends in Italy)!

All in all, Belgium is doing great so far: it gave me an exciting job and an exciting wife!

Now that many things have converged, I hope I will really kick off this blog with an amazing series of posts! I probably won’t follow up immediately on the four drafts I have, because I am getting excited about the idea of writing a series of posts on De Finetti’s definition of probability, so I will most likely start from those. But you never know; the only certainty so far is that I plan to release the next post before the new year, so I will leave you to calculate your posterior for me to actually release the next post in the timescale I advertised 😀

 

Welcome to This New Beginning

[EDIT: if you want to know what I am up to at any given moment, you can find updated biographical information on the About page. This post will NOT be updated with new affiliations and ventures.]

My name is Pietro, and I have a PhD in physics, but my current main research interest is statistics, with a focus on statistical learning techniques.

In my daily job, I am a researcher at the Universidad de Oviedo, Spain, where I work as a particle physicist within the CMS experiment, and am the delegated node PI for the AdvancedMVA4NewPhysics ITN network (you can find me blogging there as well, by the way). I am active in Standard Model physics (ttH search, WZ cross section, ttbar cross section) and BSM physics (2HDM in the Higgs sector, and SUSY in the top sector).

Drawing from my experiences in CMS data analysis, however, I grew fond of statistical techniques, both in their foundations and in the field of statistical learning. As a consequence, my research focus began to shift from the usual physics to these more interesting, fundamental, methodological topics. I joined the CMS Statistics Committee, where I found a rich landscape of interesting use cases and a fertile field for discussion. However, there is a specific area of statistics that is rarely applied or discussed in HEP: Bayesian statistics. I feel this is really a pity, so I started exploring this exciting area extensively.

I hope to be able, in this blog, to spark in you some interest in statistical learning and in Bayesian statistics, or at least to give you a good time reading my random thoughts about these topics.

If you are reading this post, you can safely skip the About page, which is mostly a cut-and-paste of this post, unless you want to find a list of my publications, which is only available on that page.

Oh, and the look-n-feel of the blog might change a bit in the next few days: I am not entirely satisfied with the tinkering I have done so far, so I will most likely tinker with fonts and colours some more.

To conclude, and without further ado, welcome to my new small project: I hope you will enjoy it 🙂