Empirical Photometric Redshifts for SDSS DR6

Hiroaki Oyaizu1,2, Marcos Lima2,3, Carlos Cunha1,2,
Huan Lin4, Joshua Frieman1,2,4, Erin Sheldon5


1Department of Astronomy and Astrophysics, University of Chicago
2Kavli Institute for Cosmological Physics, University of Chicago
3Department of Physics, University of Chicago
4Center for Particle Astrophysics, Fermi National Accelerator Laboratory
5Center for Cosmology and Particle Physics and Department of Physics, New York University


We describe the public galaxy photometric redshift (photo-z) catalog based on SDSS DR6 data. Further details and explanations can be found in Oyaizu et al. (2007a).



Photometric Selection

Our galaxy photometric sample was drawn from the SDSS CasJobs website http://casjobs.sdss.org/casjobs/ . We checked some of the SDSS photometric flags to ensure that we have obtained a reasonably clean galaxy sample. In particular, we selected all primary objects from DR6 that have the TYPE flag equal to 3 (the type for galaxy) and that do not have any of the flags BRIGHT, SATURATED, SATUR_CENTER, or NOPETRO_BIG set. We also took into account the nominal SDSS flux limit by only selecting galaxies with dereddened model magnitude r < 22.0. An example of the query we used to extract the data with RA in the range [0,170) is given below.


-- Bit masks for the photometric flags used to reject bad objects
DECLARE @BRIGHT bigint SET @BRIGHT=dbo.fPhotoFlags('BRIGHT')
DECLARE @SATURATED bigint SET @SATURATED=dbo.fPhotoFlags('SATURATED')
DECLARE @SATUR_CENTER bigint SET @SATUR_CENTER=dbo.fPhotoFlags('SATUR_CENTER')

-- Combined mask: an object is rejected if any of these bits is set
DECLARE @bad_flags bigint SET @bad_flags=(@SATURATED|@SATUR_CENTER|@BRIGHT)

-- Dereddened model magnitudes and Petrosian radii in all five bands
SELECT
objID, ra, dec, type, dered_u, dered_g, dered_r, dered_i, dered_z,
petroR50_u, petroR50_g, petroR50_r, petroR50_i, petroR50_z,
petroR90_u, petroR90_g, petroR90_r, petroR90_i, petroR90_z

INTO MyDb.all_ra_0_170
FROM PhotoPrimary
-- No bad flags, magnitude limit, RA slice [0,170), and galaxies only (type = 3)
WHERE ((flags & @bad_flags)) = 0 AND (dered_r<=22.0) AND (ra>=0.0) AND (ra<170.0)
AND (type = 3)



The final catalog contains 77,418,767 objects classified as galaxies by PHOTO.
The SDSS algorithms webpage provides further suggestions for flag cuts, which we strongly recommend that users consult. In particular, we recommend that users consider keeping only objects with the BINNED1 flag set and removing objects with the NODEBLEND flag. BINNED1 objects were detected at ≥ 5σ in the original imaging frame. BLENDED objects have multiple peaks detected within them, which PHOTO attempts to deblend into several CHILD objects. NODEBLEND objects are BLENDED objects on which no deblending was attempted, because they are too close to an EDGE, are too large, or have a child that overlaps an edge.
A complete description of all photometric flags is available in the SDSS flags documentation.

Spectroscopic Set


We have constructed a spectroscopic sample consisting of 639,911 galaxies that have SDSS photometry measurements (counting repeated photometric observations) and that have spectroscopic redshifts measured by the SDSS or by other surveys, as described below. We imposed a magnitude limit of r < 23.0 on the spectroscopic sample and applied additional cuts on the quality of the spectroscopic redshifts reported by the different surveys. For each survey, we chose a redshift quality cut roughly corresponding to 90% redshift confidence or greater. The number of objects in each catalog (total and unique) and the redshift quality cuts are given in Table 1. By unique we mean objects for which a single photometric measurement was made. Conversely, we refer to objects that have repeated photometric measurements as non-unique.

Survey                      Number of Objects   Number of Unique Objects   zspec Quality Cut
2SLAQ                       52,842              11,426                     qop >= 3
CFRS                        1,830               272                        Class >= 1
CNOC2                       21,123              1,435                      -
TKRS                        728                 389                        z > -1
DEEP + DEEP2                31,716              6,049                      qz = A,B (DEEP); Q >= 3 (DEEP2)
SDSS Spectroscopic Sample   531,672             531,672                    zconf >= 0.9
All                         639,911             551,243                    typically 90% confidence
Table 1: Various surveys comprising the spectroscopic sample used for the photo-z training set.

In Fig. 1, we provide the r magnitude as well as the g-r and r-i color distributions for the various spectroscopic samples. Notice that in combination they provide good coverage of the corresponding photometric sample distributions.

Fig.1: Distributions of r magnitude as well as g-r and r-i colors for the various catalogs comprising the spectroscopic sample and the SDSS photometric sample.


Methods



Photo-z


We estimate photo-z's with empirical methods trained on the spectroscopic sample described above. Photo-z's were computed using the Artificial Neural Network (ANN) method (Collister & Lahav 2004) and the Nearest Neighbor Polynomial (NNP) technique (Cunha et al. 2007). To avoid over-fitting the neural network to the training set, we use the technique of early stopping. We split the spectroscopic sample evenly into two independent parts: the training set and the validation set. The formal minimization is performed on the training set. After each minimization step, the network is evaluated on the validation set. The network configuration that performs best on the validation set is chosen as the final network. For further details of our implementation of these photo-z methods, we refer the user to our SDSS DR6 photo-z paper (Oyaizu et al. 2007a).
Note that only the ANN photo-z's are publicly available.
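
The early-stopping procedure described above can be illustrated with a minimal sketch. It is not the ANN implementation used for the catalog: a two-parameter linear model stands in for the network, and the function names, the learning rate, and the toy (color, redshift) pairs are all assumptions made for illustration. What it shows is the bookkeeping: minimize on the training set, evaluate on the validation set after every step, and keep the configuration with the lowest validation loss.

```python
def mse(w, b, data):
    """Mean squared error of the linear model z = w * x + b on (x, z) pairs."""
    return sum((w * x + b - z) ** 2 for x, z in data) / len(data)


def train_with_early_stopping(train, valid, steps=200, lr=0.1):
    """Gradient descent on `train`; keep the parameters that do best on `valid`."""
    w, b = 0.0, 0.0
    history = [mse(w, b, valid)]  # validation loss before any step
    best_loss, best_params = history[0], (w, b)
    for _ in range(steps):
        # The minimization step uses the training set only.
        grad_w = sum(2 * (w * x + b - z) * x for x, z in train) / len(train)
        grad_b = sum(2 * (w * x + b - z) for x, z in train) / len(train)
        w, b = w - lr * grad_w, b - lr * grad_b
        # After each step, evaluate on the independent validation set.
        loss = mse(w, b, valid)
        history.append(loss)
        if loss < best_loss:
            best_loss, best_params = loss, (w, b)
    return best_params, best_loss, history


# Toy (color, redshift) pairs, invented for illustration only.
train = [(0.2, 0.05), (0.6, 0.16), (1.0, 0.24), (1.4, 0.36)]
valid = [(0.4, 0.11), (0.8, 0.19), (1.2, 0.31)]
params, loss, history = train_with_early_stopping(train, valid)
print(params, loss)
```

The returned parameters are those of the best validation step, which need not be the last step taken.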

Photo-z errors


The photo-z errors were estimated using the Nearest Neighbor Error (NNE) method, described in more detail in Oyaizu et al. (2007a,b).

Estimating the Redshift Distribution


As an additional consistency check, we also estimate the true underlying redshift distribution of the photometric sample using an independent weighting method that does not rely on the photo-z's. This is done by simply weighting galaxies in the spectroscopic sample such that the weighted magnitude/color distributions of this sample match those of the photometric sample. The weighted redshift distribution of the spectroscopic sample then provides an estimate of the true redshift distribution of the photometric sample. Details can be found in Oyaizu et al (2007a) and Lima et al (2007).
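
The idea of the weighting method can be sketched as follows. This is a deliberately simplified stand-in, not the procedure of Lima et al. (2007): it matches a single magnitude distribution using fixed bins, whereas the text refers to matching magnitude and color distributions jointly. The function names, bin edges, and toy values are assumptions for illustration.

```python
def bin_index(value, edges):
    """Index of the bin [edges[k], edges[k+1]) containing value, else None."""
    for k in range(len(edges) - 1):
        if edges[k] <= value < edges[k + 1]:
            return k
    return None


def fractions(values, edges):
    """Fraction of the in-range values that fall in each bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        k = bin_index(v, edges)
        if k is not None:
            counts[k] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the binned range")
    return [c / total for c in counts]


def matching_weights(spec_mags, phot_mags, edges):
    """Weight each spectroscopic object by (photometric / spectroscopic) bin fraction."""
    spec_frac = fractions(spec_mags, edges)
    phot_frac = fractions(phot_mags, edges)
    weights = []
    for m in spec_mags:
        k = bin_index(m, edges)
        # Objects outside the binned range get zero weight.
        weights.append(0.0 if k is None else phot_frac[k] / spec_frac[k])
    return weights


def weighted_distribution(z_spec, weights, z_edges):
    """Weighted, normalized zspec histogram: the estimate of the photometric N(z)."""
    heights = [0.0] * (len(z_edges) - 1)
    for z, w in zip(z_spec, weights):
        k = bin_index(z, z_edges)
        if k is not None:
            heights[k] += w
    total = sum(heights)
    return [h / total for h in heights]


# Toy data: the spectroscopic sample is mostly bright, the photometric one mostly faint.
spec_mags, z_spec = [18.5, 18.6, 18.7, 21.5], [0.1, 0.1, 0.1, 0.6]
phot_mags = [18.5, 21.2, 21.4, 21.6]
w = matching_weights(spec_mags, phot_mags, [18.0, 20.0, 22.0])
print(weighted_distribution(z_spec, w, [0.0, 0.5, 1.0]))
```

In this toy case the single faint spectroscopic galaxy is up-weighted, so the weighted redshift histogram shifts toward the higher-redshift bin, as the fainter photometric sample requires.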

Results



Photo-z's


We characterize the quality of the photo-z estimates using 4 photo-z performance metrics, defined in Table 2.

Performance Metric   Definition
zbias                \frac{1}{N} \sum_{i=1}^{N} (z_{\rm phot}^{i} - z_{\rm spec}^{i})
σ²                   \frac{1}{N} \sum_{i=1}^{N} (z_{\rm phot}^{i} - z_{\rm spec}^{i})^2
σ_68                 Range in |zphot-zspec| containing 68% of objects
σ_95                 Range in |zphot-zspec| containing 95% of objects
Table 2: Photo-z performance metrics.
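
As a minimal sketch, the four metrics of Table 2 can be computed from paired lists of photometric and spectroscopic redshifts as shown here. The function name and the quantile convention used for the 68% and 95% ranges (the smallest |zphot-zspec| that contains at least that fraction of objects) are assumptions; the toy redshifts are invented.

```python
import math


def photoz_metrics(z_phot, z_spec):
    """Return zbias, sigma, sigma68 and sigma95 for paired redshift lists."""
    if len(z_phot) != len(z_spec) or not z_phot:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [p - s for p, s in zip(z_phot, z_spec)]
    n = len(diffs)
    z_bias = sum(diffs) / n
    sigma = math.sqrt(sum(d * d for d in diffs) / n)
    abs_sorted = sorted(abs(d) for d in diffs)

    def containing(fraction):
        # Smallest |zphot - zspec| such that at least `fraction` of the
        # objects lie at or below it (an assumed quantile convention).
        k = max(1, math.ceil(fraction * n))
        return abs_sorted[k - 1]

    return {
        "zbias": z_bias,
        "sigma": sigma,
        "sigma68": containing(0.68),
        "sigma95": containing(0.95),
    }


# Toy values, invented for illustration only.
print(photoz_metrics([0.10, 0.22, 0.29, 0.45], [0.10, 0.20, 0.30, 0.40]))
```

Unlike σ, the σ_68 and σ_95 ranges are insensitive to a small number of catastrophic outliers, which is why both kinds of metric are reported.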

To search for an optimal photo-z estimator, we computed photo-z's using the ANN method with different combinations of input photometric observables. The 2 cases that we present in the public data are described in Table 3. Other cases can be found in Oyaizu et al. (2007a).

Case   Inputs/Description                                  σ        σ_68
D1     ugriz + c_u c_g c_r c_i c_z; split training in r    0.0519   0.0209
CC2    u-g, g-r, r-i, i-z + c_g c_r c_i                    0.0593   0.0245
Table 3: ANN cases using different input parameters and training procedures. ugriz denote the magnitudes in the five passbands and differences between them denote the corresponding colors. Likewise c_u c_g c_r c_i c_z denote the corresponding concentration indices, defined below.

In Fig. 2, we show zphot plotted as a function of zspec for our validation set objects, using the two ANN cases described above (D1 and CC2) and also using the NNP photo-z. In the latter case we find neighbors in color space and use them to fit a relation between magnitudes, concentration indices and redshift, which is then applied to the validation-set object of interest. Both ANN D1 and CC2, as well as NNP, seem to agree well on the validation set. The concentration index for a given passband is defined as the ratio of PetroR50 and PetroR90, which are the radii that respectively encircle 50% and 90% of the Petrosian flux.
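
To make the CC2 inputs of Table 3 concrete, the sketch below assembles them for one object from its magnitudes and Petrosian radii, using the concentration index as defined above (PetroR50 divided by PetroR90). The function names, the dictionary layout, and the numerical values are assumptions for illustration; the ordering of the inputs is likewise arbitrary.

```python
def concentration_index(petro_r50, petro_r90):
    """Concentration index for one passband: PetroR50 / PetroR90."""
    if petro_r90 <= 0:
        raise ValueError("PetroR90 must be positive")
    return petro_r50 / petro_r90


def cc2_inputs(mags, r50, r90):
    """Four colors followed by the g, r, i concentration indices."""
    colors = [
        mags["u"] - mags["g"],
        mags["g"] - mags["r"],
        mags["r"] - mags["i"],
        mags["i"] - mags["z"],
    ]
    concentrations = [concentration_index(r50[b], r90[b]) for b in "gri"]
    return colors + concentrations


# Toy object, invented for illustration only.
mags = {"u": 21.0, "g": 19.5, "r": 18.5, "i": 18.0, "z": 17.75}
r50 = {"g": 1.0, "r": 1.5, "i": 2.0}
r90 = {"g": 2.0, "r": 3.0, "i": 8.0}
print(cc2_inputs(mags, r50, r90))
```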

Fig. 2: zphot as a function of zspec for the ANN (D1 and CC2 cases) and the NNP method, in different bins of r magnitude.

In Fig. 3 we show the performance metrics zbias, σ, and σ_68 as a function of r magnitude for the validation set for the D1 and CC2 ANN cases.

Fig. 3: The performance metrics zbias, σ, and σ_68 for the validation set are shown as a function of r magnitude for the D1 and CC2 cases.

In Fig. 4 we plot zbias, σ, and σ_68 as a function of zspec for the validation set, and show results for the D1 and CC2 cases and for both r < 20 and r > 20.

Fig. 4: The performance metrics zbias, σ, and σ_68 for the validation set are shown as a function of zspec for the D1 and CC2 cases. The increased scatter for objects with zspec > 0.6 is due to the 4000 Angstrom break shifting out of the r passband at around zspec = 0.7; beyond that redshift, the estimator effectively relies on only two passbands (i and z) to determine the photo-z's. Note that faint objects (r > 20) have worse scatter at low redshifts for both cases. This is likely because the faint, low-redshift objects in the validation set are predominantly blue dwarf or irregular galaxies that do not have strong 4000 Angstrom breaks; in this case, the photo-z estimator must rely on less pronounced spectral features, resulting in larger photo-z scatter.

In Fig. 5, we plot g-r color versus spectroscopic redshift for the validation set, for both bright (r < 20) and faint (r > 20) galaxies. The 2SLAQ and DEEP2 galaxies are highlighted by different colors, and the expected color-redshift relations for the four spectral templates from Coleman et al. (1980) (from early to late types) are indicated by the solid lines.

Fig. 5: g-r color vs zspec for galaxies in the validation set: left panel: galaxies with r < 20; right panel: galaxies with r > 20. The solid curves show expected color-redshift relations of galaxies with different SED types, calculated using the Coleman et al. (1980) spectral templates. The different colors indicate galaxies from the different spectroscopic surveys contributing to the validation set. The 2SLAQ objects, denoted by red triangles, were selected to be mostly early-type galaxies. They are responsible for the minimum in σ vs. zspec for the r > 20 subsample in Fig. 4.


Redshift Distributions


We define two additional metrics to quantify the quality of the predicted photo-z distribution. The first metric, σ_dist, is defined as
          \sigma_{\rm dist} = \frac{1}{N_{\rm bin}} \sum_{i=1}^{N_{\rm bin}} (P_{\rm phot}^{i} - P_{\rm spec}^{i})^2

where P_phot^i is the height of the ith redshift bin of the zphot distribution, P_spec^i is the height of the ith redshift bin of the zspec distribution, and N_bin is the total number of bins used. In the results shown here we used N_bin = 120 equally spaced redshift bins running from z = 0 to z = 2.
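
A minimal sketch of this metric follows, implementing the formula exactly as written above with the stated binning (120 equal bins from z = 0 to z = 2) as defaults. The normalization of the bin heights is not specified here, so the sketch assumes unit-area histograms; that choice, the function names, and the toy inputs are assumptions.

```python
def normalized_histogram(values, z_min=0.0, z_max=2.0, n_bins=120):
    """Bin heights of a unit-area histogram over [z_min, z_max)."""
    width = (z_max - z_min) / n_bins
    counts = [0] * n_bins
    for v in values:
        if z_min <= v < z_max:
            # min() guards against floating-point rounding at the upper edge.
            counts[min(n_bins - 1, int((v - z_min) / width))] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the redshift range")
    return [c / (total * width) for c in counts]


def sigma_dist(z_phot, z_spec, z_min=0.0, z_max=2.0, n_bins=120):
    """Mean squared difference between the zphot and zspec bin heights."""
    p_phot = normalized_histogram(z_phot, z_min, z_max, n_bins)
    p_spec = normalized_histogram(z_spec, z_min, z_max, n_bins)
    return sum((p - s) ** 2 for p, s in zip(p_phot, p_spec)) / n_bins


# Two coarse bins make the arithmetic easy to follow by hand.
print(sigma_dist([0.5, 0.5], [0.5, 1.5], n_bins=2))
```

With two bins of unit width, the zphot heights are (1, 0) and the zspec heights are (0.5, 0.5), so the metric is (0.25 + 0.25)/2 = 0.25.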

The second metric we employ is the KS statistic D, the maximum value of the absolute difference between the two (zphot and zspec) cumulative redshift distribution functions. An advantage of the KS statistic is that it uses unbinned data. However, our use of the KS statistic to quantify the difference between the zphot and zspec distributions of the validation set likely does not adhere to formal statistical practice, since it turns out that the probability for the KS statistic for both cases we consider is very close to zero (Press et al. 1992).
In Table 4 we show the values of σ_dist and the KS statistic for CC2 and D1. CC2 outperforms D1 on both metrics in every bin fainter than r = 18, whereas D1 performs better in the brightest bin (r < 18) and for the full sample.

r magnitude    σ_dist (CC2)   σ_dist (D1)   KS statistic (CC2)   KS statistic (D1)
r < 18         0.0392         0.0330        0.0632               0.0391
18 < r < 19    0.0390         0.0430        0.0520               0.0533
19 < r < 20    0.0391         0.0399        0.0366               0.0413
20 < r < 21    0.0403         0.0471        0.0363               0.0665
21 < r < 22    0.0652         0.0702        0.1051               0.1306
All            0.0383         0.0338        0.0485               0.0307
Table 4: σ_dist and KS statistics for CC2 and D1 in different r magnitude bins.
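
The KS statistic D used in Table 4 is the standard two-sample statistic, and a minimal sketch of its computation on unbinned redshifts is given here. The function name and toy values are assumptions; the sketch returns only D and does not compute the associated probability.

```python
def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between two empirical cumulative distributions."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both cumulative distributions past every value equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Toy zphot and zspec values, invented for illustration only.
print(ks_statistic([0.1, 0.3], [0.2, 0.4]))
```

Once one sample is exhausted its cumulative distribution equals 1, so the remaining differences can only shrink and the loop can stop.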

The redshift distributions for the validation set are shown in Fig. 6 for the same bins of r magnitude. The filled histograms correspond to the zphot distributions and the lines correspond to the zspec distributions.

Fig. 6: Redshift distributions for the galaxies in the validation set for different r magnitude bins. Left panels: ANN D1; Right panels: ANN CC2.

In Fig. 7 we show the estimated redshift distributions, for both the CC2 and D1 ANN cases, of a random subset containing ~ 1% of the objects in the full DR6 photometric sample. The solid lines in these plots correspond to the redshift estimate derived from the weighting technique described above (see the section 'Estimating the Redshift Distribution').

Fig. 7: Estimated redshift distributions for a random subsample of 1% of the galaxies in the DR6 photometric sample, in different r magnitude bins. Left panels: ANN D1; right panels: ANN CC2.


Photo-z Errors


Fig. 8 shows the performance of the photo-z error estimator by plotting the computed NNE error σ_z^NNE as a function of the corresponding empirical error for the validation set. Results are shown for the D1 and CC2 ANN photo-z's.

Fig. 8: The estimated error from the NNE method, σ_z^NNE, is shown against the empirical error for objects in the validation set. Left panel: D1 ANN; right panel: CC2 ANN. Each point corresponds to a bin of 100 objects with similar σ_z^NNE. The black squares show results for bright objects (r < 20), the red triangles for faint objects (r > 20). As expected, faint objects have larger errors, but the NNE error correlates well with the empirical error over the full magnitude range.

In Fig. 9, we plot the normalized error distribution, i.e., the distribution of (zphot-zspec)/σ_z^NNE, for objects in the spectroscopic sample, using the D1 ANN estimator. The solid black lines are the data, and the dotted red lines show Gaussian distributions with zero mean and unit variance. The upper panels show results for the galaxies in the SDSS Main and LRG spectroscopic samples. The lower panels show results for all validation-set galaxies, divided into bright (r < 20) and faint (r > 20) samples.

Fig. 9: Distributions of (zphot-zspec)/σ_z^NNE, for objects in the spectroscopic sample, with photo-z's calculated using ANN D1; the results for ANN CC2 are very similar. The solid black lines are the data, and the dotted red lines are Gaussians with zero mean and unit variance. Top left: SDSS Main spectroscopic sample; top right: SDSS LRG sample; bottom left: validation-set galaxies with r < 20; bottom right: validation-set galaxies with r > 20.
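
The check behind Fig. 9 can be sketched in a few lines: divide each redshift residual by its estimated error and compare the mean and variance of the result with the values 0 and 1 expected for a unit Gaussian. The function names and toy values are assumptions for illustration.

```python
def normalized_errors(z_phot, z_spec, sigma_nne):
    """(zphot - zspec) / sigma for each object; sigma must be positive."""
    if not (len(z_phot) == len(z_spec) == len(sigma_nne)):
        raise ValueError("input lists must have equal length")
    if any(s <= 0 for s in sigma_nne):
        raise ValueError("error estimates must be positive")
    return [(p - s) / e for p, s, e in zip(z_phot, z_spec, sigma_nne)]


def mean_and_variance(values):
    """Sample mean and (population) variance of a non-empty list."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


# Toy values, invented for illustration only.
x = normalized_errors([0.12, 0.18, 0.33], [0.10, 0.20, 0.30], [0.02, 0.02, 0.03])
print(mean_and_variance(x))
```

A variance well above 1 would indicate underestimated errors, and a variance well below 1 would indicate overestimated errors.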


Flags and Caveats


When querying the SDSS data server to produce the photometric sample for which we estimated photo-z's, we set the most relevant flags needed to produce a clean galaxy sample. Some applications may require more stringent selection of objects. We advise users of the catalog to read the documentation about producing a clean galaxy sample on the SDSS website http://cas.sdss.org/dr6/en/help/docs/algorithm.asp?key=flags.
In particular, users should consider requiring the BINNED1 flag (object detected at ≥ 5σ) and removing objects with the NODEBLEND flag (object is a blend but deblending was not possible). Finally, we note that the training of the photo-z estimators used only galaxies, not stars. As a result, photo-z estimates for stars that contaminate the photometric galaxy sample will be wrong, and cutting out objects with low zphot will not remove such stars.

Recommendations


The D1 photo-z estimates have lower scatter for bright galaxies (r < 20), and scatter similar to but slightly smaller than that of CC2 for faint objects (r > 20). However, for faint galaxies (r > 20), we recommend using the CC2 photo-z estimate, since the CC2 zphot distribution most closely resembles both the zspec distribution of the validation set (Fig. 6) and the weighted zspec estimate of the redshift distribution of the full photometric sample (Fig. 7). We also recommend using the CC2 photo-z's if one wants photo-z's from a single method for the full DR6 sample.
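
One possible way to apply these recommendations to rows of the photoz2 table is sketched below. It uses the column names and flag values of the photoz2 table (flag 0 for r <= 20, flag 2 for r > 20), but the function, its `use_d1_for_bright` option, and the example row are hypothetical. By default it returns CC2 for every object; opting into D1 for bright objects is a user choice, not a recommendation made here.

```python
def choose_photoz(row, use_d1_for_bright=False):
    """Return (photo-z, error, method) for one photoz2 row given as a dict."""
    is_bright = row["flag"] == 0  # flag 0: r <= 20; flag 2: r > 20
    if use_d1_for_bright and is_bright:
        return row["photozd1"], row["photozerrd1"], "D1"
    # CC2 is used for faint objects and, by default, for the whole sample.
    return row["photozcc2"], row["photozerrcc2"], "CC2"


# Hypothetical row, invented for illustration only.
row = {"objID": 1, "photozcc2": 0.31, "photozerrcc2": 0.04,
       "photozd1": 0.33, "photozerrd1": 0.03, "flag": 0}
print(choose_photoz(row))
print(choose_photoz(row, use_d1_for_bright=True))
```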

We believe that CC2 is the better photo-z estimate despite the zphot vs. zspec plots seen above (e.g., Fig. 2). When comparing the photo-z estimates on spectroscopic data not included in the training or validation sets, we have found that D1 photo-z's tend to systematically overestimate the true redshifts. The overestimates correlate noticeably with magnitude and are comparable to the scatter in the photo-z's. We do not yet have a large enough sample to quantify the bias precisely, but we do have an understanding of its causes. The training-set photo-z estimators are Bayesian estimators, i.e., they derive the best redshift estimate by using the full posterior probability distribution that an object has a particular redshift given the photometric observables. The posterior distribution is proportional to the product of the likelihood and the prior. When the errors in the data are large (particularly at fainter magnitudes), the posterior distribution becomes more influenced by the prior. The prior implied by the training set predicts that magnitudes grow with redshift; hence estimators that use magnitudes can become biased if the magnitude-redshift relation of the training set is not representative of the photometric sample. We suspect that this is indeed the case, probably because spectroscopic surveys selectively target intrinsically brighter objects (e.g., the SDSS LRG sample) at higher redshifts.

Another important point to note is that the photo-z's in our catalog were not optimized for any specific galaxy sample, so specific types of objects may have a bias even though the overall galaxy distribution is unbiased. An example of a potentially problematic sample is that of galaxies in clusters. We are currently working on creating a photo-z catalog optimized for cluster galaxies.

Accessing the Catalog


The photo-z catalog can be accessed from the photoz2 table in the DR6 context on the SDSS CasJobs site, at http://casjobs.sdss.org/casjobs/ . We describe the columns of the photoz2 table in CasJobs in Table 5.

Column name     Type     Description
objID           bigint   unique ID pointing to PhotoObjAll table
photozcc2       real     photometric redshift using ANN-CC2 method
photozerrcc2    real     σ_68 error estimate for ANN-CC2 photo-z
photozd1        real     photometric redshift using ANN-D1 method
photozerrd1     real     σ_68 error estimate for ANN-D1 photo-z
flag            int      0 = objects with r<=20, 2 = objects with r>20
Table 5: Description of columns of the photoz2 table in CasJobs.

A query similar to the one in the Photometric Selection section provides all objects for which we computed photo-z's. Alternatively, one can simply perform a query that searches for objects with a photoz2 entry.

In addition to the photoz2 table in the SDSS CAS, an independent photoz table is also available, for which the photo-z's have been computed using a template-based technique; see Csabai et al. (2007), Adelman-McCarthy et al (2007).

References


Adelman-McCarthy et al 2007, in press.
Collister, A. A. & Lahav, O. 2004, PASP, 116, 345
Csabai et al. 2007, in prep.
Cunha, C., Oyaizu, H., Lima, M., Sheldon, E., Lin, H., Frieman, J., 2007 in prep.
Lima, M., Cunha, C., Oyaizu, H., Sheldon, E., Lin, H., Frieman, J., 2007 in prep.
Oyaizu, H., Lima, M., Cunha, C., Lin, H., Frieman, J., Sheldon, E., 2007a in prep.
Oyaizu, H., Lima, M., Cunha, C., Lin, H., Frieman, J., 2007b in prep.
Press, W. H. et al. 1992, Numerical Recipes in C: The Art of Scientific Computing (Cambridge University Press)