How the Pipeline Works: in detail

The spectroscopic pipeline is a multi-purpose, highly automated pipeline
for processing the roughly $10^6$ galaxy, $10^5$
QSO, and $10^5$ stellar spectra from the SDSS spectrographs.
The pipeline is designed to extract, calibrate, and process all spectra
taken in the course of the Survey, and specifically to:
-
archive the reduced, red/blue merged, co-added 1d spectra;
-
spectroscopically classify objects independently of the target selection
pipeline;
-
estimate redshifts and provide spectral information, with required redshift
accuracy and success rates depending upon object type.
In addition to these science goals, the pipeline has the following roles
in survey operations:
-
Provide real-time diagnostic S/N outputs for the observers at the 2.5m
telescope so that the total required exposure time for each spectroscopic
plug plate can be determined;
-
Provide diagnostic Quality Control outputs for spectroscopic data processors
at Fermilab so that data quality on each plate can be rapidly assessed
with respect to the Survey Requirements;
-
Provide feedback to target selection on classification and redshift
success rates, enabling the Working Groups to optimize target selection
parameters for survey efficiency and completeness.
The Spectroscopic pipeline is split operationally into two parts, 2d and
1d.
The 2d pipeline reduces the raw data and calibration images from the
red and blue CCD cameras from each spectrograph and outputs merged, co-added,
flux-calibrated spectra and noise for analysis by the 1d pipeline.
The 1d pipeline determines emission and absorption redshifts, classifies
spectra by object type, and outputs spectral information about each object.
Spectroscopic Observations
Each spectroscopic plug plate with 640 fibers typically has 3-5 spectroscopic
exposures of 15 minutes duration, with the exact number determined by observing
conditions (weather, moon). This set of `science' exposures is preceded
and followed by a series of shorter exposures for calibration: arc lamp
exposures, flat-fields, and a 4-minute `smear' exposure on the sky for
spectrophotometric calibration, in which the telescope is moved so that
the 3" fiber on each object effectively covers an 8" aperture. The
`smear' exposures are meant to account for object light excluded from the
3" fibers: the smear frames are assumed to give an accurate measure of
the true spectral shape of the objects and are used for spectrophotometric
correction. The calibration and science exposures are immediately processed
through a quick version of the 2d pipeline run at the telescope (APO2d)
to inform the observers whether the calibrations were successful and
to provide S/N diagnostics on the science exposures.
For each science exposure, the $(S/N)^2$ through the SDSS imaging passbands
is measured by APO2d and fit as a function of fiber magnitude for each
spectrograph camera. The SDSS observers take repeated 15-minute exposures
until the cumulative median $(S/N)^2 > 15$ at $g'=20.2$ and $i'=19.9$ in
all 4 cameras. Although these fiber magnitudes are fainter
than most of the spectroscopic targets, measurement at these magnitudes
provides a robust measure of $S/N$ across the range of moon conditions
we encounter. These $S/N$ values in APO2d correspond to $(S/N)^2
> 20$ for the full spectro2d pipeline at the same fiber magnitudes, due
to the latter's use of optimal extraction. In clear, non-moony conditions,
the $(S/N)^2$ threshold is easily reached in 3 exposures; in partially cloudy
or moony conditions, more exposures may be required. Currently, the science
exposure time is kept fixed at 15 minutes, and a minimum of 3 science exposures
is taken to ensure adequate cosmic ray rejection.
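The stopping rule described above can be sketched as follows. This is only
an illustration, not APO2d code: the function name and camera labels are
hypothetical, and we assume that the cumulative $(S/N)^2$ is the simple sum
of the per-exposure values.

```python
# Hypothetical sketch of the observing stop rule; names are illustrative
# and are not taken from APO2d.
SN2_THRESHOLD = 15.0   # required cumulative (S/N)^2 in every camera (from the text)
MIN_EXPOSURES = 3      # minimum number of science exposures (from the text)
CAMERAS = ("b1", "r1", "b2", "r2")  # assumed labels for the 4 cameras


def exposures_needed(sn2_per_exposure):
    """Return the number of exposures after which observing can stop.

    sn2_per_exposure: list of dicts, one per 15-minute exposure, mapping
    camera label -> (S/N)^2 at the reference fiber magnitude.
    Assumes (S/N)^2 accumulates additively. Returns None if the
    threshold is never reached with the exposures supplied.
    """
    total = {cam: 0.0 for cam in CAMERAS}
    for n, exposure in enumerate(sn2_per_exposure, start=1):
        for cam in CAMERAS:
            total[cam] += exposure[cam]
        enough_signal = all(total[cam] > SN2_THRESHOLD for cam in CAMERAS)
        if n >= MIN_EXPOSURES and enough_signal:
            return n
    return None
```

With 6.0 per exposure in every camera the rule is met after 3 exposures;
with 4.0 per exposure a fourth exposure is needed.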
Spectro2d: Extraction and Calibration of Spectra
The Spectro2d pipeline, known as idlspec2d, comprises a series
of IDL processing routines and associated `utility' routines. The version
of the code used for the Early Data Release plates is v4.6.2.
The inputs for the 2d pipeline are the raw data files (fiber flats,
arcs, and science frames), a two-dimensional pixel flat field image for
each camera, the plPlugMap file (information on the spectroscopic target
for each fiber), a file describing the arc lamp lines, and 3 files describing
the spectrograph hardware: opBC.par provides information
on bad pixels/columns for a given spectrograph camera and observing date;
opECalib.par is the spectrograph CCD calibration file, specifying
the electronic characteristics for each chip (e.g., read noise, gain, bias
level, linearity corrections); and opConfig.par provides information
on the CCD dimensions (data, bias, overscan regions) and the amplifier
configuration.
The outputs of the 2d pipeline currently passed to the database
include: for each fiber, the flat-fielded, sky-subtracted, red-blue merged,
exposure-combined, flux-corrected spectrum binned in constant velocity
pixels ($\log \lambda$), the noise (inverse variance) in each pixel, mask
information (e.g., bad pixels, rejected outlier pixels such as those due
to cosmic rays, strong sky lines, etc), the wavelength dispersion, and
the target information from the plugmap file.
These outputs are passed to the 1d pipeline for processing. It is anticipated
that intermediate outputs from 2d (i.e., the spectrum for each exposure,
with associated sky, error, and flux-correction information) will soon
be available in the database as well. Spectro2d also outputs a number of
diagnostic plots for QA and QC.
Periodically (for example, when the camera electronics are changed),
a pixel-to-pixel flat field image is created from a stack of flat
field images using Spectro2d routines. A number of flat-field images are
taken, with the collimator moved between exposures, so that the images
of the fibers on each camera are shifted between readouts. Each flat-field
image is fit by a model along each column. A number of such images are
then stacked, so that the camera chip is essentially covered by the co-added
illumination, and the combined model is used to measure and, when applied,
remove pixel-to-pixel variations in the instrument response. These variations
are expected to be reasonably stable over time.
In normal operations, the Spectro2d pipeline carries out the sequence
of tasks below. In the first sequence, the pipeline reduces the individual
images (frames)
from each camera of each spectrograph separately. The subsequent procedures
are carried out using multiple frames.
-
Each raw image is read in and processed with the CCD calibration, configuration,
and bad pixel files. When it is read in, the image is checked for evidence
of possible hardware problems. The bias is subtracted using the unilluminated
bias regions of the CCD, and the image is divided by the previously generated
pixel flat. The resulting calibrated image is returned in electrons. An
image that contains the corresponding weights (from the errors) for each
pixel is also generated; e.g., bad pixels are given zero weight.
-
The flat-field images are spatially traced on the CCD: for each fiber,
the flat-field image centroid in column position is fit by a polynomial
in row number. The flat-field trace will be used as the first-order trace
for the arc and science exposures as well.
-
The flat-field image is optimally extracted: for each row on the CCD, the
fiber profiles are fit by 320 Gaussians plus a polynomial for background
light. These fits will also be used for the first-order object extraction
of the science and arc frames.
-
The arc lamp images are traced (tweaking up the trace from the flat-fields)
and optimally extracted, and the line centroids measured. The combination
of arc lamps yields up to 16 lines in the blue cameras (roughly 3800-6170
A) and 39 in the red (roughly 5780-9230 A). The centroids are matched
to air wavelengths of known arc lines from the arc lamp file, and a wavelength
solution as a function of pixel derived. When the wavelength solution is
subsequently applied to the science fibers, it can be tweaked using the
known positions of certain sky lines.
-
The flat-field spectra are wavelength-calibrated, normalized, and combined
to form a `superflat' for each spectrograph camera. This is done by `stacking'
the 320 normalized flat-field spectra and performing an iterative least-squares
bspline fit with outlier rejection on this $320\times 2048$ oversampled
data to get an effectively continuous function. For each fiber, the superflat
is resampled at every pixel and divided into the extracted flat-field spectrum
to form the `fiber flat'. In this way, flat-field variations between
fibers are removed.
-
For the science exposures, the object and sky fibers are spatially traced
(again with tweaking from the flat-field trace) and optimally extracted;
in the extraction, the Gaussian fiber profile fitting can also be tweaked
from the fiber-flat image. In the extraction, scattered light is removed
by a 4th order Chebyshev polynomial fit. Outlying pixels are rejected and
masked. In addition, the arc lamp calibration is used to subtract near-IR
scattered light in the CCD itself. The extracted object and sky spectra are
flat-fielded by dividing by the `fiber flats'. If more than one flat-field
calibration is available (they are taken before and after each set of science
exposures for a plate), the fiber flat with highest $S/N$ is chosen. The
object images are wavelength-calibrated using the arc lamp solutions, with
tweaking to sky lines found in the image and matched to known line positions.
In this process, the wavelength solution is refit to vacuum wavelengths
and corrected to the heliocentric frame. Again, the best available arc
calibration is used.
-
The object images are sky subtracted using a `supersky' built from
32 sky fibers per plate. The supersky is constructed using a bspline fit
with iterative rejection, similar to the procedure for constructing the
superflat. For each fiber, the supersky is resampled at every pixel in
the object spectrum and subtracted.
-
Telluric absorption in four wavelength regions in the red is divided out
using spectrophotometric and reddening standard stars: these are used to
construct four `superTelluric' spectra using the bspline fitting procedure.
-
At this stage, the individual flat-fielded, sky-subtracted, wavelength-calibrated,
telluric-corrected red and blue spectra for each frame (each science exposure)
are written out.
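The first step of the list above (bias subtraction, pixel-flat division,
and construction of a weight image) can be sketched as follows. This is a
minimal illustration, not the idlspec2d routine: the function name is
hypothetical, a single scalar bias level is assumed, and the noise model
(Poisson noise on the signal plus read noise) is our assumption.

```python
def calibrate_image(raw, bias_level, gain, read_noise, pixel_flat, bad_pixels):
    """Return (image_in_electrons, inverse_variance) as nested lists.

    raw:        2-D list of raw counts (ADU)
    bias_level: scalar bias estimate (ADU), e.g. from the bias region
    gain:       electrons per ADU
    read_noise: read noise in electrons
    pixel_flat: 2-D list, pixel-to-pixel flat field
    bad_pixels: set of (row, col) positions to mask
    """
    image, ivar = [], []
    for r, row in enumerate(raw):
        image_row, ivar_row = [], []
        for c, counts in enumerate(row):
            flat = pixel_flat[r][c]
            if (r, c) in bad_pixels or flat <= 0.0:
                image_row.append(0.0)
                ivar_row.append(0.0)  # bad pixels are given zero weight
                continue
            electrons = (counts - bias_level) * gain
            # Assumed noise model: Poisson variance plus read noise
            variance = max(electrons, 0.0) + read_noise ** 2
            image_row.append(electrons / flat)
            # Dividing by the flat scales the variance by 1 / flat^2
            ivar_row.append(flat ** 2 / variance)
        image.append(image_row)
        ivar.append(ivar_row)
    return image, ivar
```

Carrying the inverse variance alongside the image lets later steps
(extraction, combination) weight each pixel and ignore masked ones.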
The following procedures are carried out using multiple frames.
-
Spectrophotometric flux correction: the counts in each frame are put on
the same scale as a chosen frame; the chosen frame is either the smear
exposure or (if the smear does not exist or lacks sufficient S/N) the highest
S/N science frame. The chosen frame's spectrum is fit by a simple polynomial
in lambda, which is applied to the other frames. This procedure corrects
the overall amplitude and shape of the spectra.
Note: since the smear exposure procedure was implemented part
way through commissioning, many of the Early Data Release plates do not
have smear exposures. For the plates with available smear exposures, v4.6.2
of spectro2d does not differentiate (flag) the objects corrected by the
smear from those corrected by the best science frame. In addition, the
spectrophotometric accuracy achieved by the smear procedure has not yet
been measured in detail. For these reasons, the EDR spectra should not
be assumed to have precise spectrophotometric calibration.
-
Flux calibration. The highest $S/N$ exposure for each spectrophotometric
(F-type) and reddening standard star is used. To take into account possible
spectral variability of the standards (e.g., due to differing metallicities
and temperatures), a Principal Component Analysis is applied to these spectra.
The resulting template spectrum corresponding to the first (i.e., largest)
PCA component is taken as the standard spectrum, $C_S(\lambda)$. The calibrated
flux in an object spectrum, $F_O$, is then obtained from the uncalibrated
counts in that object, $C_O$, via $F_O = C_O (F_S/C_S)$, where $F_S$ is
the `true' standard star spectrum. Currently, the pipeline takes for $F_S$
the synthetic (composite) F8 subdwarf spectrum from Pickles (1995). (In
the future, we expect to improve on this by carrying out multiple smear
exposures of the fundamental SDSS standard stars.) This procedure removes
the instrument response as a function of wavelength in each spectrograph
camera.
-
For each object, the different science frames, both red and blue
halves, are `stacked' and fit with the iterative bspline, with inverse
variance weighting. In the process, outliers due to cosmic rays are rejected
and masked. The combined, merged spectra are resampled in constant velocity
pixels ($\log \lambda$), with a pixel scale of $69$ km/sec. Exposures on
multiple nights are combined. If a plate is re-plugged, however, only the
exposures with a given plugging are combined.
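The constant-velocity pixel scale quoted in the last step above fixes the
spacing of the output grid in $\log \lambda$. A minimal sketch of that
relation follows; the function names and the example wavelength range are
ours, chosen for illustration only.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def dloglam_from_velocity(pixel_km_s):
    """Step in log10(lambda) corresponding to a velocity pixel width."""
    # d(ln lambda) = dv / c, and d(log10 lambda) = d(ln lambda) / ln(10)
    return pixel_km_s / (C_KM_S * math.log(10.0))


def loglam_grid(lam_min, lam_max, pixel_km_s=69.0):
    """Wavelengths spaced uniformly in log10(lambda) from lam_min up to lam_max."""
    step = dloglam_from_velocity(pixel_km_s)
    start = math.log10(lam_min)
    n = int(math.floor((math.log10(lam_max) - start) / step)) + 1
    return [10.0 ** (start + i * step) for i in range(n)]
```

For a 69 km/sec pixel the step in $\log_{10} \lambda$ comes out very close
to $10^{-4}$, so a Doppler shift moves a spectrum by the same number of
pixels at every wavelength.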
Spectro1d
The 1d pipeline, which analyzes the combined spectra output by spectro2d,
is written in C and TCL. The version of the code used for the Early Data
Release plates is v5.3.2. The code outputs a FITS image for each fiber:
it includes the 1d spectrum, noise, and mask array passed from the 2d pipeline,
basic information about the target from photo and the Target
Selection pipelines, as well as line measurements, redshift determinations,
and warning flags.
The code is designed to attempt to measure an emission and absorption
redshift independently for every targeted (non-sky) object. That is, to
avoid biases, the absorption and emission codes operate independently,
and they are both independent of any target selection information.
The Spectro1d pipeline performs the following sequence of tasks for
each object spectrum on a plate:
-
The 1d spectrum and inverse variance are read in, along with the pixel
mask.
-
The continuum is fit with a 5th order polynomial least-squares fit,
with iterative rejection of outliers (e.g., strong lines). The fit continuum
is subtracted from the spectrum.
Emission line finding, fitting, and redshifts
-
Emission lines (peaks in the 1d spectrum) are found with a
wavelet filter. The wavelet transform is defined by
$$w(a,\sigma) = {1\over \sqrt{\sigma}} \int_{-\infty}^{\infty} f(x)\, g\left(x-a;\sigma\right) dx ~~,$$
where $f(x)$ denotes the continuum-subtracted spectrum as a function
of wavelength, and we apply a Mexican hat wavelet of the form
$$g(x;\sigma) = \left({2-{x^2\over \sigma^2}}\right) \exp \left(-{x^2\over 2\sigma^2}\right) ~~.$$
For fixed
wavelet scale $\sigma$, the wavelet transform is computed at each pixel
center $a$; the scale $\sigma$ is then increased in geometric steps
and the process repeated. Once the full wavelet transform is computed,
the code finds peaks above a threshold, and eliminates multiple counts (at
different $\sigma$) of the same peak by searching nearby pixels. The output
of this routine is a set of peak positions, which are candidate emission
lines.
-
Emission line fitting, identification, and redshift. A reference list of
optical emission lines for galaxies and QSOs is consulted. The more common
lines are given non-zero weights (separately for galaxies and QSOs) in the
procedure below. Each significant peak found by the wavelet routine is assigned
a trial line identification from the common list (e.g., MgII) and an associated
trial redshift. The peak is fit with a Gaussian, and the line center,
width, and height above the continuum are stored. If the code detects close
neighboring lines, it fits them with multiple Gaussians. Depending on the
trial line identification, the linewidth that it will try to fit is physically
constrained. The code then searches for the other expected common emission
lines at the appropriate wavelengths for that trial redshift, and computes
a Confidence Level (CL) by summing over the weights of the found
lines and dividing by the summed weights of the expected lines. The CL
is penalized if the different line centers do not quite match up. Once
all the trial line identifications/redshifts have been explored, an emission
line redshift is chosen as the one with the highest CL. The exact expression
for the emission line CL has been tweaked to match our empirical success
rate in assigning correct emission line redshifts, based on manual inspection
of a large number of EDR spectra.
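The wavelet peak search described in the first item above can be sketched
as follows. This is a simplified illustration rather than the Spectro1d
code: the integral is replaced by a sum over pixels, and the starting
scale, scale factor, number of scales, threshold, and merge radius are
all hypothetical values.

```python
import math


def mexican_hat(x, sigma):
    """g(x; sigma) = (2 - x^2/sigma^2) exp(-x^2 / (2 sigma^2)), as in the text."""
    u2 = (x / sigma) ** 2
    return (2.0 - u2) * math.exp(-0.5 * u2)


def wavelet_transform(flux, sigma):
    """Discrete approximation of w(a, sigma) at every pixel center a."""
    half = int(math.ceil(5 * sigma))  # kernel is negligible beyond ~5 sigma
    kernel = [mexican_hat(k, sigma) for k in range(-half, half + 1)]
    n = len(flux)
    out = []
    for a in range(n):
        total = 0.0
        for k in range(-half, half + 1):
            x = a + k
            if 0 <= x < n:
                total += flux[x] * kernel[k + half]
        out.append(total / math.sqrt(sigma))
    return out


def find_peaks(flux, sigma0=1.0, factor=2.0, n_scales=3, threshold=5.0, merge=3):
    """Candidate line positions (pixel indices) in a continuum-subtracted spectrum."""
    candidates = []  # (wavelet amplitude, pixel)
    sigma = sigma0
    for _ in range(n_scales):
        w = wavelet_transform(flux, sigma)
        for i in range(1, len(w) - 1):
            if w[i] > threshold and w[i] >= w[i - 1] and w[i] > w[i + 1]:
                candidates.append((w[i], i))
        sigma *= factor  # geometric step in scale
    # Keep the strongest detection; drop repeats of the same peak at other scales
    peaks = []
    for _amp, pix in sorted(candidates, reverse=True):
        if all(abs(pix - p) > merge for p in peaks):
            peaks.append(pix)
    return sorted(peaks)
```

Applied to a spectrum with two well-separated Gaussian lines, the routine
returns one pixel position per line even though each line is detected at
several wavelet scales.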
A separate routine searches for high-redshift ($z>2.3$) QSOs by identifying
spectra that contain a Lyman-alpha forest signature: a broad emission line
with more fluctuation on the blue side than on the red side of the line.
The routine outputs the wavelength of the Ly$\alpha$ emission line; while
this allows a determination of the redshift, it is not a high-precision
estimate, because the Ly$\alpha$ line is intrinsically broad. Spectro1d
effectively treats this as an additional emission-line redshift.
If the highest CL emission line redshift uses lines only expected for
QSOs (e.g., Ly$\alpha$, CIV, CIII), then the object is provisionally classified
as a QSO.
If any of the identified lines is broader than 500 km/sec (FWHM), then
the object is also provisionally classified as a QSO. These provisional classifications
will hold up if the final redshift assigned to the object agrees
with its emission redshift.
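The confidence level of a trial redshift and the provisional QSO
classification described above can be sketched as follows. The line names
and weights below are hypothetical (the actual reference list and weights
are not given here), and the penalty applied when line centers do not
match up is omitted.

```python
# Hypothetical weights for a few common lines; illustrative values only.
GALAXY_WEIGHTS = {"OII": 1.0, "Hbeta": 1.0, "OIII": 1.0, "Halpha": 2.0}
# Lines named in the text as expected only for QSOs
QSO_ONLY_LINES = {"Lyalpha", "CIV", "CIII"}


def confidence_level(found, expected_weights):
    """Summed weight of the found lines over summed weight of the expected lines."""
    expected_total = sum(expected_weights.values())
    if expected_total == 0.0:
        return 0.0
    found_total = sum(w for name, w in expected_weights.items() if name in found)
    return found_total / expected_total


def provisional_qso(line_fwhm_km_s):
    """line_fwhm_km_s: dict of identified line name -> FWHM in km/sec."""
    only_qso_lines = bool(line_fwhm_km_s) and set(line_fwhm_km_s) <= QSO_ONLY_LINES
    any_broad = any(fwhm > 500.0 for fwhm in line_fwhm_km_s.values())
    return only_qso_lines or any_broad
```

For example, finding two of the four hypothetical galaxy lines, carrying
3 of the 5 units of weight, gives CL = 0.6.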
Cross-correlation redshifts
The spectra are cross-correlated with stellar, emission-line galaxy,
and QSO template spectra to determine a cross-correlation redshift and error.
When an object spectrum is cross-correlated with the stellar templates,
its emission lines are masked out, i.e., the redshift is derived from the
absorption features. The cross-correlation routine follows the Tonry-Davis
technique:
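the continuum-subtracted spectrum is Fourier-transformed and convolved
with the transform of each template, and the peaks of the resulting
cross-correlation function (CCF) are found and output
with associated CLs. The corresponding redshift errors are given by the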
widths of the CCF peaks. The cross-correlation CLs as a function of peak
level are empirically calibrated based on manual inspection of a large
number of EDR spectra (see figure on CLs vs. success).
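The idea behind a cross-correlation redshift can be sketched as follows.
For brevity this illustration evaluates the correlation as a direct sum
over pixel lags instead of using Fourier transforms, and it assumes that
the spectrum and template share a grid uniform in $\log_{10} \lambda$ with
a step of $10^{-4}$ (an assumed value, close to the 69 km/sec pixel). The
function names are hypothetical.

```python
def cross_correlate(spectrum, template, max_lag):
    """Direct-sum cross-correlation for lags 0..max_lag (in pixels).

    Both inputs are continuum-subtracted fluxes on the same log10(lambda)
    grid. ccf[lag] = sum_i spectrum[i + lag] * template[i].
    """
    ccf = []
    for lag in range(max_lag + 1):
        n = min(len(template), len(spectrum) - lag)
        ccf.append(sum(spectrum[i + lag] * template[i] for i in range(n)))
    return ccf


def redshift_from_lag(lag, dloglam=1.0e-4):
    """A shift of `lag` pixels in log10(lambda) means 1 + z = 10**(lag * dloglam)."""
    return 10.0 ** (lag * dloglam) - 1.0


def best_redshift(spectrum, template, max_lag, dloglam=1.0e-4):
    """Redshift of the highest cross-correlation peak."""
    ccf = cross_correlate(spectrum, template, max_lag)
    lag = max(range(len(ccf)), key=ccf.__getitem__)
    return redshift_from_lag(lag, dloglam)
```

Because the grid is logarithmic, a redshift is a rigid shift of the whole
spectrum, which is what makes a single correlation over lags sufficient.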
The cross-correlation templates are obtained from SDSS commissioning
spectra of high $S/N$, and comprise roughly one for each stellar spectral
type from B to almost L, two late M-type templates, a non-magnetic and
a magnetic WD, an emission line galaxy, a composite Luminous Red Galaxy
(LRG) spectrum (from Eisenstein et al.), and a composite QSO spectrum (from
Vanden Berk et al.). The composites are based on co-additions of about 2000
spectra each. The template
redshifts are determined by cross-correlation with a large number of
stellar spectra from SDSS observations of the M67 star cluster, whose
radial velocity is precisely known.
The cross-correlation redshift is chosen as the one with the highest
CL from among all the templates.
If there are discrepant high-CL cross-correlation peaks, i.e., if the
highest peak has $CL < 0.99$ and the next highest peak corresponds to
a CL that is greater than 70% of the highest peak, then this is given a
warning flag (see below). In this case, the code extends the cross-correlation
analysis for the corresponding templates to lower wavenumber and includes
the continuum in the analysis, i.e., it chooses the redshift based on which
template provides a better match to the continuum shape of the object.
These flagged spectra are also manually inspected as a back-up.
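The test for discrepant high-CL peaks stated above amounts to a
two-condition check, sketched here with a hypothetical function name:

```python
def discrepant_peaks(peak_cls):
    """True when the two best cross-correlation peaks are ambiguous.

    peak_cls: CLs of the cross-correlation peaks over all templates.
    Rule from the text: highest CL < 0.99 and the next highest CL is
    greater than 70% of the highest.
    """
    if len(peak_cls) < 2:
        return False
    best, second = sorted(peak_cls, reverse=True)[:2]
    return best < 0.99 and second > 0.70 * best
```

A spectrum with peak CLs of 0.9 and 0.7 would be flagged, while one with
0.9 and 0.5 would not.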
Final Redshift and Classification
-
Spectro1d assigns a final redshift to each object spectrum, by choosing
the (emission or cross-correlation) redshift with the highest CL, and outputs
a redshift status (zstatus). The choices for the redshift status are: EMLINE-HIC
(emission line redshift, with CL > 0.75), EMLINE-LOC (0.35 < CL < 0.75),
XCORR-HIC (cross-correlation redshift, with CL > 0.75), XCORR-LOC, XCORR-EM
(both cross-correlation and emission CL > 0.75 but cross-correlation
CL is higher), EM-XCORR (the reverse), INCONSISTENT (emission and
cross-correlation CL both > 0.75 but discrepant), FAILED (CL < 0.35), or
NOT MEASURED (sky fiber or no spectrum). There are also redshift
status states that can be set if the object is manually inspected and its
redshift is manually assigned: MANUAL-HIC (which sets CL=0.95), and two
MANUAL-LOC states (CL=0.4 or 0.65). Objects for which the red or blue half
of the spectrum is missing have their CLs reduced by a factor of 2, so
they are automatically flagged as having low-confidence.
-
Classification. All objects are classified as QSO, high-z QSO, galaxy, star,
late-type star, or unknown.
-
If the object has been identified as a QSO by the emission line routine,
and if the emission line redshift is chosen as the final redshift, then
the object retains its QSO classification. Also, if the QSO cross-correlation
template provides the final redshift for the object, then the object
is classified as a QSO. If the object has a final redshift $z>2.3$ (so
that Lyman-alpha is or should be present in the spectrum), it is classified
as a high-z QSO.
-
If the object has a redshift $cz < 450$ km/sec, then it is classified
as a Star. If the final redshift is obtained from one of the late-type
stellar cross-correlation templates, it is classified as a Late-type Star.
-
If the object has a cross-correlation CL< 0.25, it is classified as
unknown.
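The redshift-status rules in the first item of the list above can be
sketched as follows. The status names follow that list; the tolerance used
to decide whether two redshifts are discrepant is a hypothetical value, and
the handling of CLs exactly equal to a threshold is our choice, since
neither is specified here.

```python
HIGH_CL = 0.75  # high-confidence threshold (from the text)
LOW_CL = 0.35   # below this the redshift is FAILED (from the text)


def assign_zstatus(em_cl, xc_cl, em_z, xc_z, tol=0.01):
    """Return the zstatus string for one object.

    em_cl, xc_cl: emission and cross-correlation CLs (None if unavailable)
    em_z, xc_z:   the corresponding redshifts
    tol:          hypothetical |dz| tolerance for calling them consistent
    """
    if em_cl is None and xc_cl is None:
        return "NOT MEASURED"
    em_cl = em_cl or 0.0
    xc_cl = xc_cl or 0.0
    if em_cl > HIGH_CL and xc_cl > HIGH_CL:
        if abs(em_z - xc_z) > tol:
            return "INCONSISTENT"
        return "XCORR-EM" if xc_cl >= em_cl else "EM-XCORR"
    # Otherwise the redshift with the higher CL decides the status
    best, kind = max((xc_cl, "XCORR"), (em_cl, "EMLINE"))
    if best > HIGH_CL:
        return kind + "-HIC"
    if best > LOW_CL:
        return kind + "-LOC"
    return "FAILED"
```

The manual states (MANUAL-HIC, MANUAL-LOC) and the halving of the CL for
spectra missing one half are not modeled in this sketch.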
Spectral Information
Once the final redshift has been determined, the pipeline computes additional
spectral information about the object:
For galaxies, we compute the following absorption-line strengths:
-
Lick indices (Trager et al. 1998, ApJS, 116, 1)
21 absorption-line strengths on the revised Lick/IDS line-strength
system
-
Brodie & Hance 1986, ApJ, 300, 258
CNB, H+K, CaI, G, Hb, MgG, MH, FC, NaD
-
Diaz, A. I., Terlevich, E., & Terlevich, R. 1989, MNRAS, 239, 325
CaII8498, CaII8542, CaII8662, MgI8807
-
4000 A break (as ratio of blue/red), where
blue = (3751 - 3951 A) and red = (4051 - 4251 A)
-
HK ratio (as ratio K/H), where K = (3921 - 3946
A) and H = (3956 - 3981 A)
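As an illustration of the band ratios in the list above, the 4000 A break
can be sketched as follows. We assume here that each band is summarized by
its mean flux and that the band limits are rest-frame wavelengths; the
function names are hypothetical.

```python
def band_mean(wave, flux, lo, hi):
    """Mean flux over wavelengths lo <= wave <= hi (Angstroms)."""
    values = [f for w, f in zip(wave, flux) if lo <= w <= hi]
    if not values:
        raise ValueError("band not covered by the spectrum")
    return sum(values) / len(values)


def to_rest_frame(obs_wave, z):
    """Shift observed wavelengths to the rest frame using the final redshift."""
    return [w / (1.0 + z) for w in obs_wave]


def break_4000(rest_wave, flux):
    """4000 A break as the blue/red ratio, with the bands quoted in the text."""
    blue = band_mean(rest_wave, flux, 3751.0, 3951.0)
    red = band_mean(rest_wave, flux, 4051.0, 4251.0)
    return blue / red
```

The HK ratio follows the same pattern with the K and H bands in place of
the blue and red bands.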
Gaussians are fit at the positions of all expected emission lines in the
reference list (not just the common lines).
Galaxies are classified by a PCA analysis, using cross-correlation with
eigentemplates (see Connolly et al.). The code outputs 5
eigencoefficients and a classification number.
Redshift Warning flags
Spectro1d outputs a series of warning flags. These provide additional
compact information about the spectra for end users and are used in certain
combinations to trigger manual inspection of a subset of spectra on
every plate.
Manual Inspection of Spectra
A small percentage of spectra on every plate are inspected manually
and, if necessary, the redshift and classification corrected. Currently
the algorithm used to trigger manual inspection is the following.
-
We check all spectra that have at least one of the following warning flags
set (200, 800, 1000, 4000, 10000) OR final $z > 3.2$ OR zstatus = EMLINE-LOC
OR zstatus = EMLINE-HIC OR zstatus = XCORR-LOC OR zstatus = NOT MEASURED.
However, if the object has a final $CL>0.98$ and zstatus of either XCORR-EMLINE
or EMLINE-XCORR, then despite the above, it is not manually checked. Note
that all objects with classification `unknown' and zstatus
`failed' are triggered for manual inspection.
-
Flagged spectra are examined interactively in a two-step process for each
plate. The inspector first goes through each flagged spectrum
and makes provisional changes to the final redshift and classification
if necessary. The inspector then examines a postscript file of all triggered
and provisionally changed spectra for a given plate and decides if further
changes are necessary. The Spectro1d output files for the corrected spectra
are then rewritten, with one of the manual zstatus states listed above.
Only after this procedure is completed are the spectro pipeline outputs
for a plate written to the SDSS database.
-
For the $100+$ EDR plates, on average about 7% of all spectra are checked
manually via this algorithm, and < 1% of all redshifts are changed manually.
-
Tests on the validation plates (see below) indicate that the trigger above
successfully finds > 95% of the spectra for which the automated pipeline
assigns an incorrect redshift.
-
Spectro1d provides diagnostic output for Spectroscopic data processors
to monitor QA (e.g., compare emission and absorption redshifts for all
objects where both are available and check dispersion and outliers; plot
redshift vs. magnitude and redshift histogram for each plate, etc).
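The manual-inspection trigger given in the first item of the list above
can be sketched as follows. We read the quoted warning flag values as
hexadecimal bit masks, which is an assumption (the base is not stated
here), and the function name is hypothetical.

```python
# Flag values quoted in the text, read here as hexadecimal bit masks
# (an assumption; the base is not stated).
TRIGGER_FLAGS = 0x200 | 0x800 | 0x1000 | 0x4000 | 0x10000
TRIGGER_ZSTATUS = {"EMLINE-LOC", "EMLINE-HIC", "XCORR-LOC", "NOT MEASURED"}
EXEMPT_ZSTATUS = {"XCORR-EMLINE", "EMLINE-XCORR"}


def needs_manual_inspection(warning_flags, final_z, zstatus, final_cl):
    """Apply the trigger: flags OR high redshift OR selected zstatus values."""
    # Exemption: very confident redshifts where both methods are high-CL
    if final_cl > 0.98 and zstatus in EXEMPT_ZSTATUS:
        return False
    return (
        bool(warning_flags & TRIGGER_FLAGS)
        or final_z > 3.2
        or zstatus in TRIGGER_ZSTATUS
    )
```

The exemption is tested first because it overrides every other condition
in the trigger.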
Spectroscopic Pipeline Testing and Performance
In order to assess the performance of the spectroscopic pipeline, we
carry out a number of internal and external checks. A subset of 39 EDR
plates (comprising about 23,000 spectra) is used extensively for validation
of the pipeline. Every spectrum on the validation plates has been
manually checked, and a `truth' table with the manually determined (or manually
confirmed) object classification and redshift has been constructed for
each of these plates. Whenever a new version of the 2d or 1d pipeline is
`tagged', the updated version of the entire pipeline is used to re-process
the validation plates and the results compared with the truth tables. This
validation procedure allows us to assess the performance of the hardware,
of the automated pipeline, and of the manually-corrected pipeline, to identify
systematic problems in the pipeline and the data, to check that the redshift
confidence levels are empirically accurate, etc.
Based on the validation plates, the version of the pipeline used for
the EDR (with manual inspection triggered as above) has the following estimated
performance statistics:
Classification:
99.7 $\%$ Galaxies correctly classified,
97.9 $\%$ QSOs correctly classified,
99.1 $\%$ Stars correctly classified
Redshifts:
99.7 $\%$ Galaxy redshifts correct,
98.0 $\%$ QSO redshifts correct,
99.6 $\%$ Star redshifts correct.