ALICE Data Challenge Experience Stefano Bagnasco I.N.F.N. Torino


  • ALICE Data Challenge 2004

  • Production infrastructure: AliEn + LCG (+ INFNGRID)

  • Interfacing strategy

  • Software installation

  • Round 1 experience

  • Lessons learned

Available resources

  • Several AliEn “native” sites (some rather large)

    • CERN, CNAF, Catania, Cyfronet, FZK, JINR, LBL, Lyon, OSC, Prague, Torino
  • LCG-2 core sites

    • CERN, CNAF, FZK, NIKHEF, RAL, Taiwan (more than 1000 CPUs)
  • GRID.IT sites

    • LNL.INFN, PS.INFN and several smaller ones (about 400 CPUs not including CNAF)
  • Strategy: interface AliEn to LCG, use AliEn as a front-end for production, manage LCG resources through a “gateway”
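The gateway idea in the strategy bullet can be pictured with a minimal sketch. This is not the AliEn implementation: the class names, the `fake_lcg_submit` stand-in and the native queue name are hypothetical, and only the name `Alice::CERN::LCG` comes from these slides. The point is that the front-end sees the gateway as one more computing element, while every job it accepts is handed over to the other grid.

```python
from typing import Callable, List


class ComputingElement:
    """What the front-end sees: a named queue that accepts jobs."""

    def __init__(self, name: str):
        self.name = name
        self.accepted: List[str] = []

    def submit(self, job_id: str) -> str:
        # A native element handles the job itself and returns a local handle.
        self.accepted.append(job_id)
        return f"{self.name}:{job_id}"


class GatewayCE(ComputingElement):
    """Looks like a single CE to the front-end, but forwards each job elsewhere."""

    def __init__(self, name: str, forward: Callable[[str], str]):
        super().__init__(name)
        self.forward = forward

    def submit(self, job_id: str) -> str:
        self.accepted.append(job_id)
        return self.forward(job_id)  # hand the job over to the other grid


def fake_lcg_submit(job_id: str) -> str:
    # Hypothetical stand-in for a real submission to the LCG workload manager.
    return f"lcg-handle-for-{job_id}"


native = ComputingElement("Alice::Torino::PBS")  # hypothetical native queue name
gateway = GatewayCE("Alice::CERN::LCG", fake_lcg_submit)

# The front-end treats both elements the same way.
handles = [ce.submit(f"job{i}") for i, ce in enumerate([native, gateway])]
print(handles)
```

With this layout the production front-end needs no knowledge of what sits behind the gateway; everything specific to the other grid stays inside the `forward` callable.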

ALICE Data Challenge

  • Phase 1 first round (central events) ended April 5

    • 23912 events generated (current model is 1 event per job)
    • Average job running time 10 to 15 hours
    • Up to 1450 jobs running simultaneously, 627 on average (403 on AliEn resources; 207 on LCG, with a maximum of 767).
    • 157.8 MSI-2K Hours CPU (84.9 MSI-2K Hours on AliEn native resources, 72.9 MSI-2K Hours on LCG resources)
    • 15.6 TB of data generated in 789096 files
  • No reliable SE or Replica Management available on LCG-2

    • Produced files were always moved to CERN CASTOR upon completion (which on several occasions caused stops in data production)
    • Produced files registered in the AliEn Data Catalogue only
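A few derived quantities follow directly from the figures above. The short calculation below is only a cross-check of the quoted numbers; it assumes that "TB" means decimal terabytes (10**12 bytes), which the slides do not state.

```python
# Figures quoted on the slide (Phase 1, first round).
events = 23912        # one event per job
files = 789096
data_tb = 15.6        # assumption: decimal terabytes (10**12 bytes)
cpu_total = 157.8     # MSI-2K hours
cpu_alien = 84.9      # MSI-2K hours on AliEn native resources
cpu_lcg = 72.9        # MSI-2K hours on LCG resources

files_per_event = files / events
avg_file_mb = data_tb * 1e12 / files / 1e6
data_per_event_mb = data_tb * 1e12 / events / 1e6
lcg_cpu_share = cpu_lcg / cpu_total

print(f"files per event (job): {files_per_event:.1f}")
print(f"average file size:     {avg_file_mb:.1f} MB")
print(f"data per event:        {data_per_event_mb:.0f} MB")
print(f"LCG share of CPU time: {lcg_cpu_share:.1%}")
```

The file count is exactly 33 per event, and the two CPU contributions add up to the quoted total, with LCG resources supplying a little under half of the CPU time.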

PDC 2004 Statistics

  • Up to 1800 CPUs simultaneously under AliEn control

    • 1400 running jobs + 400 saving
    • About half “native AliEn”, half LCG-2+GRID.IT

PDC 2004 Statistics

  • Statistics after round 1 (ended April 4): job distribution

    • Alice::CERN::LCG is the interface to LCG-2
    • Alice::Torino::LCG is the interface to GRID.IT

PDC 2004 Statistics

  • Statistics after round 1: Failed jobs

    • Alice::CERN::LCG is the interface to LCG-2
    • Alice::Torino::LCG is the interface to GRID.IT

PDC 2004 Monitoring

  • LCG jobs seen through AliEn MonALISA monitoring

    • Ramp-up slope shows no major performance degradation

PDC 2004 Monitoring

  • Job distribution seen through MonALISA monitoring

    • [FreeCPUs || WaitingJobs] ranking fills large sites first
    • Default ranking (EstimatedResponseTime) not working on many sites
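Why a ranking based on free CPUs fills large sites first can be seen in a toy model. The sketch below assumes one plausible reading of "[FreeCPUs || WaitingJobs]": rank a site by its number of free CPUs and, when it has none, by the negative of its waiting-job count. That reading, the site names and the numbers are all illustrative assumptions, not the expression actually used in production.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    name: str
    free_cpus: int
    waiting_jobs: int


def rank(site: Site) -> int:
    """Higher is better: free CPUs if any, otherwise penalise the queue length."""
    return site.free_cpus if site.free_cpus > 0 else -site.waiting_jobs


def dispatch(sites: List[Site], n_jobs: int) -> List[str]:
    """Send jobs one at a time to the best-ranked site; return the chosen names."""
    order = []
    for _ in range(n_jobs):
        best = max(sites, key=rank)  # ties go to the first site in the list
        order.append(best.name)
        if best.free_cpus > 0:
            best.free_cpus -= 1      # the job starts running
        else:
            best.waiting_jobs += 1   # the job queues
    return order


# Hypothetical sites: one with 4 free CPUs, one with 2.
sites = [Site("large", 4, 0), Site("small", 2, 0)]
order = dispatch(sites, 6)
print(order)
```

In this model the smaller site receives nothing until the larger one has drained down to the same number of free CPUs, which matches the behaviour noted on the slide.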


Lessons learned

  • First major data production transparently carried out on very different middlewares (AliEn + LCG)

  • Performance of the Workload Management middleware has greatly improved since last year: it caused no problems.

  • Site configuration was the major culprit for failed jobs: the software installation infrastructure is still rudimentary, and there were problems with published information (e.g. ERT ranking)

  • Phase 2 will require local storage management.

  • Looking forward to testing the storage interface on the LCG EIS testbed

  • AliEn provides the functionality for distributed analysis, but it is still awkward to exploit on LCG-controlled sites

