2016-04-21 12:31:39

WIP: rebuilding all PyPi modules as RPM packages

For some time now, Copr has been able to build RPM packages directly from PyPi (see the previous blog post), which means you can easily build individual modules on your own. But are we able to build all modules in PyPi? "All" means about 79,000 modules right now. Will Copr be able to cope with that? These days, the largest Copr repositories have hundreds of packages. One experimental repository had 2,000 packages. But 79,000?! No one has tried that so far. What problems will we run into? Let's try it!

I have created a quick'n'dirty script that gets the names of all modules from PyPi and submits them to Copr. Just submitting the builds to Copr takes almost a whole day (21 hours). Importing them into dist-git would take about 4 days (the average time to import one module is 4 seconds). Our dist-git import is single-threaded (because it is so fast that no one will ever be able to submit tasks quickly enough to swamp it - haha). So if I did this on the production server, no one would be able to submit a package to Copr for 4 days(!), because their tasks would be stuck behind mine in the import queue. So I've throttled the script with sleep(4), and the import queue is short now.
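As a rough sketch of that throttling (not the actual script): the loop below hands over one module name at a time and sleeps between submissions, so the single-threaded import queue is never flooded. The function names and the `submit` callable are hypothetical placeholders; this sketch does not talk to Copr or PyPi at all.

```python
import time
from typing import Callable, Iterable


def submit_throttled(
    names: Iterable[str],
    submit: Callable[[str], None],
    delay_seconds: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call submit(name) for each name, pausing delay_seconds after each one.

    `submit` is a hypothetical stand-in for whatever sends one build to Copr.
    `sleep` is injectable so the loop can be tried out without really waiting.
    Returns the number of submitted names.
    """
    count = 0
    for name in names:
        submit(name)
        sleep(delay_seconds)  # give the importer time to drain the queue
        count += 1
    return count


def import_queue_days(modules: int, seconds_per_import: float = 4.0) -> float:
    """Days a single-threaded importer needs for `modules` imports."""
    return modules * seconds_per_import / 86400  # 86400 seconds in a day


# Using the numbers from the text: 79,000 modules at 4 seconds each.
print(f"{import_queue_days(79_000):.1f} days of importing")
```

The delay simply matches the average import time, so on average the queue is drained as fast as it is filled.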

Importing is quite fast: 4 seconds per module and 4 days in total. What about the actual builds? Each package takes approximately 1-2 minutes to build. That means 55-110 days in total if the packages are built one at a time. Each user can run at most 4 tasks in parallel, which means it would take about 14-28 days to build all packages. And that is just for one chroot. On the development server, where there are no other users but me and my colleagues, I temporarily raised the limit to 12 parallel tasks per user (so I'm blocking my colleagues now). The batch should be finished in 5-10 days.
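The arithmetic behind those estimates can be checked with a few lines. This is only a back-of-the-envelope sketch: `build_days` is a made-up helper, and it assumes every build takes the same time and that the parallel slots are always busy.

```python
def build_days(packages: int, minutes_per_build: float, parallel_tasks: int = 1) -> float:
    """Wall-clock days needed to build `packages` packages.

    Assumes uniform build times and perfectly used parallel slots,
    which is optimistic for a real build system.
    """
    total_minutes = packages * minutes_per_build / parallel_tasks
    return total_minutes / (60 * 24)  # minutes per day


PACKAGES = 79_000  # number of PyPi modules quoted in the text

for tasks in (1, 4, 12):
    low = build_days(PACKAGES, 1, tasks)   # 1 minute per build
    high = build_days(PACKAGES, 2, tasks)  # 2 minutes per build
    print(f"{tasks:>2} parallel task(s): {low:.1f} to {high:.1f} days")
```

The printed values land close to the rounded figures quoted above for 1, 4 and 12 parallel tasks.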

But there is one problem. We want to build both python2 and python3 subpackages. But some packages fail to build with python2 and some fail to build with python3. Such packages need to be built with only one subpackage. But which one? There is only one reliable method: trial and error. So we are going to create a Copr project PyPi-2, where we will build all packages for python2 only, and another project PyPi-3, where we will build all packages for python3 only. Then we'll gather the statistics and try to build all packages in a final PyPi repository. Packages that succeed in both PyPi-2 and PyPi-3 will be built with both subpackages enabled. Packages that succeed in only one will be built with only that subpackage enabled. So it will take 28-56 days (two runs of 14-28 days each) just to gather those statistics.
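The decision rule for the final repository can be sketched like this. The function and the package names are invented for illustration, and the two sets stand for the lists of successful builds gathered from PyPi-2 and PyPi-3; how those lists are actually collected is not covered here.

```python
from typing import Dict, Iterable, List, Set


def choose_subpackages(
    all_packages: Iterable[str],
    ok_python2: Set[str],
    ok_python3: Set[str],
) -> Dict[str, List[str]]:
    """Map each package to the subpackages worth enabling in the final build.

    ok_python2 / ok_python3 are the packages that built successfully in the
    python2-only and python3-only trial projects. An empty list means the
    package failed in both trials.
    """
    plan: Dict[str, List[str]] = {}
    for name in all_packages:
        enabled = []
        if name in ok_python2:
            enabled.append("python2")
        if name in ok_python3:
            enabled.append("python3")
        plan[name] = enabled
    return plan


# Hypothetical trial results for four made-up packages.
plan = choose_subpackages(
    ["alpha", "beta", "gamma", "delta"],
    ok_python2={"alpha", "beta"},
    ok_python3={"beta", "gamma"},
)
for name, enabled in plan.items():
    print(name, "->", " + ".join(enabled) if enabled else "skip (failed in both)")
```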

Is the Copr WebUI able to handle such a big number of records in one project? Sort of. The "Builds" tab takes quite long to load, but it finishes in a reasonable time. And the JavaScript code is able to sort and search the results in real time (at least on my workstation). However, the "Packages" and "Monitor" tabs time out. We are working on performance improvements right now.

At the time of writing, Copr has finished 38k builds, of which 33.4k failed and 4.7k succeeded. That gives us a 12% chance of successfully building a package. This is not a big number. BTW, when we did this experiment with only those PyPi modules that are already packaged in Fedora, the success rate was more than 50%. We use pyp2rpm to convert Python modules to RPM packages. We started our experiment with version 1.x and provided lots of feedback to Michal Cyprian (upstream of pyp2rpm). We moved to version 2 recently and will soon move to version 3. So I hope that the success rate will be much higher in the future. Also on our TODO list is parsing build requirements and building packages in the correct order, which may improve the number a lot.
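Building in the correct order is a topological sort over the build requirements. Here is a minimal sketch using `graphlib` from the Python standard library (Python 3.9+). It is not what Copr does today; `build_batches` and the package names are made up, and it assumes the build requirements have already been parsed into a plain dictionary.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def build_batches(build_requires: Dict[str, Set[str]]) -> List[List[str]]:
    """Group packages into batches that can be built one after another.

    build_requires maps a package to the packages it needs at build time.
    Everything inside one batch is independent and could build in parallel.
    Raises CycleError if the requirements are circular.
    """
    sorter = TopologicalSorter(build_requires)
    sorter.prepare()  # detects circular requirements
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # unblocks packages that were waiting on these
    return batches


# Hypothetical requirements: "app" needs "lib" and "six", "lib" needs "six".
requirements = {"app": {"lib", "six"}, "lib": {"six"}, "six": set()}
for number, batch in enumerate(build_batches(requirements), start=1):
    print(f"batch {number}: {', '.join(batch)}")

try:
    build_batches({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("circular build requirements need manual bootstrapping")
```

Working in batches rather than one flat list keeps the parallel build slots busy while still respecting the dependencies.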

BTW: while we build these Python modules, Copr automatically creates package definitions, and when there is a new release in PyPi, Copr will automatically build the updated RPM package within a day!

To sum it up: in 2-3 months we may have a repository with at least ten thousand Python packages for python2 and/or python3.

Posted by Miroslav Suchý