---
title: "Star Night, Starlink Bright"
author: "Lucas Hung"
bibliography: bibl/bibl.bib
csl: modern-language-association-9th-in-text.csl
output:
flexdashboard::flex_dashboard:
theme:
version: 4
bootswatch: flatly
primary: "#35ba9f"
orientation: columns
vertical_layout: fill #scroll
source_code: embed
---
```{=html}
<style>
.chart-title { /* chart_title */
font-size: 18px;
}
body{ /* Normal */
font-size: 16px;
}
.navbar{
--bslib-navbar-dark-bg:#2c3e4f
}
.ref-section{
margin-left: 30px;
text-indent: -30px;
}
</style>
<head>
<base target="_blank">
</head>
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = FALSE,
fig.align = "center",
message = FALSE)
# use thematic to
# make plots follow the theme of the page
thematic::thematic_rmd(
font = "auto",
sequential = NA,
qualitative = NA)
```
# Background
## Column {data-width="500"}
### Abstract
Satellite numbers have increased exponentially in recent years. Interestingly, over half of the operational satellites recorded in 2023 were SpaceX Starlink satellites. This study investigated the orbits of satellites in low Earth orbit (LEO) to identify possible differences between Starlink satellites and other satellites, using a database of all operational satellites in 2023 collected by the Union of Concerned Scientists. After variable selection and data cleaning, logistic regression, k-nearest neighbors, and random forest models were fitted/trained on the data, and performance metrics were compared on both training and test data. The logistic regression model was determined to be the best model, balancing simplicity and accuracy, as verified using decision boundary plots. From this model, it was found that Starlink satellites tend to have lower-altitude, lower-inclination orbits, even when compared to two other satellite communication constellations. This suggests that Starlink satellites' orbits were designed with a different approach from those of other satellite communication constellations, providing unique insight and posing questions about what future LEO satellite constellations will look like.
### Terminology: Rocket science in 1 minute
Different satellites orbit the Earth in different ways, and there are some terms that are commonly used to describe them [@UCS:manual:2017]:
- **Circular orbits:** Satellites in these orbits stay at approximately the same distance from the Earth and orbit it in a circular fashion. Within this category, there are three main sub-categories:
- **Low Earth orbit (LEO):** Satellites that are approximately less than 1,700 km (1,056 mi) above the Earth's surface.
- **Medium Earth orbit (MEO):** Satellites that are approximately between 1,700 km (1,056 mi) and 35,700 km (22,183 mi) above the Earth's surface.
- **Geosynchronous orbit (GEO):** Satellites that are approximately 35,700 km (22,183 mi) above the Earth's surface and stay stationary or almost stationary above a certain point on the Earth's surface.
- **Elliptical orbits:** Satellites in these orbits spend part of each orbit closer to the Earth and part farther from the Earth, following a more egg-shaped path.
- **Prograde:** Satellites in prograde orbits appear from the ground to fly around the Earth generally from west to east, going in the same direction as the Earth's rotation.
- **Retrograde:** Satellites in retrograde orbits appear from the ground to fly around the Earth generally from east to west, going in the opposite direction as the Earth's rotation.
## Column {.tabset data-width="500"}
```{r libraries, message=FALSE}
rm(list = ls())
pacman::p_load(broom, caret, DataExplorer, flexdashboard, GGally, janitor, NISTunits, patchwork, performance, plotly, pROC, pscl, rlang, see, tidyverse)
default_theme <- theme_minimal
options(ggplot2.discrete.fill = list(c("#2c3e4f","#FD9270"),c("#2c3e4f","#FD9270","#35ba9f","#CC5F5A")))
options(ggplot2.discrete.colour = list(c("#35ba9f","#CC5F5A")))
default_fill_continuous <-
partial(
scale_fill_gradient,
low = "#2c3e4f",
high = "#FD9270")
default_linewidth <- 1
default_size <- 1
default_point_alpha_level <- 0.4
update_geom_defaults("point", list(size = default_size))
update_geom_defaults("line", list(linewidth = default_linewidth))
fig_height <- 3.2
fig_width <- golden_ratio(fig_height)
# Name of "Other" or "Starlink" legend entry
leg_name_strl <- "Satellite type"
# Name of density probability contour
leg_name_raster <- "Probability"
```
```{r read_data}
sat_full <- read_delim("data/UCS-Satellite-Database 5-1-2023_text.txt",
delim = "\t", escape_double = FALSE,
trim_ws = TRUE,name_repair = "universal") |>
clean_names()
sat_full <- sat_full |>
select((!starts_with("x")))
sat_full <- sat_full |>
mutate(date_of_launch = mdy(date_of_launch))
# fix typo
sat_full <- sat_full |>
mutate(
# Vandenberg typo
launch_site = gsub(
"Vandeberg","Vandenberg",launch_site
),
# LEO as LEo
class_of_orbit = toupper(
class_of_orbit
)
)
# remove any rows with all NA values
sat_full <-
sat_full[
!apply(is.na(sat_full),1,all),
]
# Fix typos where GEO put as LEO
pick_rows <-
sat_full$apogee_km > 30000 & sat_full$class_of_orbit == "LEO"
pick_rows[is.na(pick_rows)] <- FALSE
sat_full[pick_rows,]$class_of_orbit <- "GEO"
```
```{r sat-full-mutate}
sat_full <- sat_full |>
mutate(
sat_name = tolower(
iconv(
current_official_name_of_satellite,
from = 'UTF-8',
to = 'ASCII//TRANSLIT')
),
is_starlink = grepl("starlink",sat_name),
sat_type = ifelse(is_starlink,
"Starlink",
"Other")
)
```
### What's up?
Well, a lot of satellites! In 2023, there were around `r sprintf("%d",signif(nrow(sat_full),2))` operational satellites around the Earth [@UCS:2024]. Currently, the oldest operational satellite is `r sat_full |> slice_min(date_of_launch) |> (\(x) x$current_official_name_of_satellite)()`, which was launched in `r sat_full |> slice_min(date_of_launch) |> (\(x) year(x$date_of_launch))()`, and many satellites have been launched from a multitude of countries since then. However, a trend has emerged in recent years. As shown in the graph below, over half of the currently operational satellites in low Earth orbit (LEO) are SpaceX Starlink satellites, all of which were launched in or after `r sat_full |> filter(is_starlink == TRUE) |> slice_min(date_of_launch,n = 1, with_ties = FALSE) |> (\(x) year(x$date_of_launch))()`. SpaceX Starlink satellites make up a low Earth orbit satellite constellation, or a network of satellites that communicate with ground stations or each other, aimed at providing satellite internet services around the world [@Starlink:2025]. But are these satellites different from the other satellites out there? For example, if they have different orbital inclinations (the angle of the orbit relative to the equator), this could indicate that many of the satellites cover different areas of the world. Similarly, the apogee, or the approximate maximum height of a satellite above the Earth's surface, could indicate a different approach to the constellation. Some studies have mentioned general trends in these parameters [@Mohan:2024], but the current study utilizes orbital data from all operational satellites to perform this analysis. This study investigates whether there is a difference between Starlink satellites and other satellites and what implications any similarities or differences might have.
#### Operational LEO satellites by launch year and type
```{r plot-star-LEO, fig.cap="Number of starlink and other LEO satellites in operation.", fig.width=fig_width, fig.height=fig_height, out.width="100%"}
sat_LEO <-
sat_full |>
filter(
class_of_orbit == "LEO"
) |>
arrange(date_of_launch) |>
mutate(
temp1 = 1,
n_sat = cumsum(temp1),
n_strl = cumsum(is_starlink),
n_othr = cumsum(!is_starlink),
)
sat_count_LEO <-
sat_LEO |>
select(
date_of_launch,
n_othr,
n_strl
) |>
pivot_longer(
!date_of_launch
) |>
mutate(
name = factor(
recode(name,
n_othr = "Other",
n_strl = "Starlink"
)
),
name = relevel(name,
ref = "Other"
)
)
sat_count_LEO |>
ggplot(
aes(
x = date_of_launch,
y = value,
colour = name
)
) +
geom_line(
aes(
linetype = name
),
) +
scale_y_log10()+
labs(
x = "Date of launch",
y = "Satellites launched before this date",
colour = leg_name_strl,
linetype = leg_name_strl
) +
default_theme()
```
### Research question
This study answers one major research question:
- Are the orbits of Starlink satellites different from those of other satellites in low Earth orbit?
This leads to two subsequent questions:
- If they are different, what are key characteristics that define Starlink satellites' orbits?
- What does this tell us about these satellites and possibly other satellites, especially other satellite communication constellations?
# EDA
## Column {.tabset data-width="250"}
### Source
The satellite data was published by the Union of Concerned Scientists (UCS) and contained information on all `r nrow(sat_full)` satellites that were operational as of 5/1/2023 [@UCS:2024]. The Union of Concerned Scientists is a nonprofit organization that provides independent information on a variety of topics [@UCS:about:2025].
#### Data cleaning
There were a few issues with the data as imported:
- In one record, "Vandenberg AFB" was misspelled as "Vandeberg AFB"
- In one record, The orbit class "LEO" was misspelled as "LEo"
- Two records (Angosat-2 and WFOV Testbed (Wide Field of View Testbed, USSF 12) ) had apogees that were in the range for GEO satellites, but were incorrectly labelled as LEO satellites in their orbit class. From the sources provided in the data set, it appeared that these satellites were GEO satellites, so the class of these entries were changed and they were excluded from this study given that they were not LEO satellites.
- Multiple records were classified as "LEO" but had orbital periods greater than 2 hours, which violates the definition of LEO in the source. Also, multiple records classified as "LEO" satellites had eccentricity values much greater than the 0.14 limit set by the source for nearly circular orbits like low Earth orbit. Since there were sufficient other entries, all satellites with orbital periods greater than 2 hours and/or eccentricity values greater than or equal to 0.14 were removed.
- One record (KX-09) had a listed period of less than 10 minutes. This is physically impossible for an Earth orbit (the minimum period for a stable circular orbit just above the surface is roughly 84–90 minutes) and is most likely a typo, so this entry was removed by removing all satellites with periods of less than 10 minutes.
- After importing the data, the last row was an entirely blank row, which was removed.
These issues were fixed before any analysis was performed on the data.
#### Variable generation
- **sat_name:** Lowercase satellite names with no special characters. Introduced `r sum(is.na(sat_full$sat_name))` NA values, but none were Starlink satellites from manual inspection.
- **is_starlink:** Logical vector, true if sat_name included "starlink".
- **sat_type:** Character vector, "Starlink" for Starlink satellites, "Other" if otherwise.
### Variable selection
There were over 30 different variable columns in the data, but only a subset were used. The goal of this project was to find differences in how Starlink satellites orbit the Earth compared to other LEO satellites, so only orbit information that pertained to LEO satellites was retained. The class of orbit (LEO, GEO, etc.) was used to subset just the LEO satellites. While kept in the data, the type of orbit variable (non-polar inclined, etc.) was not used for prediction since some categories had only a few entries for LEO satellites and the other orbital information was sufficient. Some other variables were kept as reference for later inspection, but were not used for prediction. All rows containing missing information in the kept variables were removed (see code for details). See the Methods page for the total number of observations in the final data set.
#### Variables used for prediction
Information on these variables came from the user manual for the data [@UCS:manual:2017].
- **Starlink/Other (Response variable):** Generated variables with classes "Starlink" (TRUE) and "Other" (FALSE), encoded in variables sat_type and is_starlink.
- **Class of Orbit:** LEO, MEO, GEO, or elliptical. Only LEO satellites were used in this study.
- **Perigee:** The closest the satellite gets to the Earth's center of mass, measured as the altitude above the Earth's surface in kilometers.
- **Apogee:** The farthest the satellite gets from the Earth's center of mass, measured as the altitude above the Earth's surface in kilometers.
- **Eccentricity:** A measure of how elliptical the orbit is (0 = perfectly circular), dimensionless.
- **Inclination:** The angle between the plane of Earth's equator and the plane of the satellite's orbit, measured in degrees. 0° means the satellite orbits around the equator in a prograde orbit, 90° means the satellite orbits the poles, and \> 90° means the satellite orbits in a retrograde orbit.
- **Period:** How long it takes a satellite to perform 1 full orbit around the Earth, measured in minutes.
## Column {.tabset data-width="750"}
```{r pick-vars}
sat_pick <- sat_full |>
select(
c(
# ID info, not used for regression
name_of_satellite_alternate_names,
current_official_name_of_satellite,
cospar_number,
norad_number,
sat_name,
# Info used to select LEO satellites
class_of_orbit,
# Variables for model inputs
perigee_km,
apogee_km,
eccentricity,
inclination_degrees,
period_minutes,
# Info on satellite type (Starlink/Other)
is_starlink,
sat_type,
# Info kept but not used for model inputs
type_of_orbit,
launch_mass_kg,
users,
purpose,
date_of_launch,
)
) |>
filter(
class_of_orbit == "LEO",
period_minutes <= 120 & period_minutes > 10,
eccentricity < 0.14)
```
### Sample entries
Below are 20 sample entries from the data set. 10 of them are Starlink satellites, and 10 of them are other satellites.
```{r table-examples}
set.seed(300)
sat_sample <-
sat_pick |>
slice_sample(n = 10, by = sat_type)
sat_sample_show <-
sat_sample |>
select( # ID variables, kept but noted as ID's
c(
current_official_name_of_satellite,
perigee_km,
apogee_km,
eccentricity,
inclination_degrees,
period_minutes,
sat_type
)
)
DT::datatable(
sat_sample_show |>
mutate(
across(
where(
is.numeric
),
\(x) sprintf("%.4G",x)
)
),
rownames = FALSE,
colnames = c(
"Satellite name",
"perigee (km)",
"apogee (km)",
"eccentricity",
"inclination (degrees)",
"period (minutes)",
"satellite type")
)
```
### Box plots
```{r plot-box, fig.width=fig_width*1.5, fig.height=fig_height*1.5,out.width = "100%"}
sat_pick |>
select( # ID variables, kept but noted as ID's
!c(norad_number,
cospar_number,
name_of_satellite_alternate_names,
current_official_name_of_satellite,
class_of_orbit,
date_of_launch,
sat_name,
is_starlink)
) |>
select( # not removed, but not used due to other reasons
!c(
users,
purpose,
launch_mass_kg,
type_of_orbit
)
) |>
plot_boxplot(
by = "sat_type",
nrow = 2,
ncol = 3,
ggtheme = default_theme() +
theme(text = element_text(size = 20),
axis.title = element_text(size = 15),
plot.margin = margin(r = 10)))
```
### Pairwise scatter plots
```{r plot-pairs, fig.width=fig_width*1.5, fig.height=fig_height*1.5}
sat_pick |>
select( # ID variables, kept but noted as ID's
!c(norad_number,
cospar_number,
name_of_satellite_alternate_names,
current_official_name_of_satellite,
class_of_orbit,
date_of_launch,
sat_name,
is_starlink,
sat_type)
) |>
select( # not removed, but not used due to other reasons
!c(
users,
purpose,
launch_mass_kg,
type_of_orbit
)
) |>
ggpairs(
mapping = aes(colour = sat_pick$sat_type,
shape = sat_pick$sat_type),
lower = list(continuous = wrap("points", alpha = default_point_alpha_level))) +
default_theme()
```
# Methods
## Column {.tabset}
### Theory
#### Logistic Regression
Binary logistic regression is a type of generalized linear model that can be used to distinguish between two categories of a categorical variable. To do this, the log odds (also called the logit) of the positive category is modelled as a linear combination of the predictors:
$$
\ln\left(\frac{p}{1-p}\right)=\beta_0+\beta_1 x_1 +\beta_2 x_2 +\cdots+\beta_k x_k\qquad,
$$
where $p$ is the probability of the response being the positive category, so $\frac{p}{1-p}$ is the odds of that category. After the model is fitted to the data, the odds can be written as:
$$
\frac{p}{1-p}=e^{\beta_0+\beta_1 x_1 +\beta_2 x_2 +\cdots+\beta_k x_k}\qquad.
$$
From this, a one-unit increase in one of the predictors $x_i$ while holding all others constant multiplies the odds by $e^{\beta_i}$; this multiplier is the odds ratio $OR_i=e^{\beta_i}$. Similarly, a 10-unit increase in $x_i$ multiplies the odds by $e^{10\beta_i}$, and a one-unit decrease in $x_i$ multiplies the odds by $e^{-\beta_i}$.
To fit this model, the function `glm` from the `stats` package in R was utilized (see code for details). No centering or scaling was performed on the predictors with this model to provide the simplest interpretation.
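As a quick illustration of this interpretation, the snippet below (display only, using a made-up coefficient rather than a fitted one) converts a coefficient into odds ratios:
```r
# Illustration only: converting a logistic regression coefficient
# into odds ratios. beta_incl is a hypothetical value.
beta_incl <- -0.09
exp(beta_incl)        # odds multiplier for a 1-unit increase
exp(10 * beta_incl)   # odds multiplier for a 10-unit increase
exp(-10 * beta_incl)  # odds multiplier for a 10-unit decrease
```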
#### K-Nearest Neighbors
K-nearest neighbors classification is a type of machine learning (ML) algorithm that classifies a new point according to the majority class among the $k$ closest points in the training data [@skLearn:neighbor:2025]. The `train` function from the `caret` package in R was utilized with `method = "knn"`. Centering and scaling were performed on the predictors with this model to improve the model fit.
#### Random Forest
Random forest classification is a type of machine learning (ML) algorithm that utilizes an ensemble of partially randomized decision trees and aggregates their votes to predict the response [@skLearn:random-forest:2025]. The `train` function from the `caret` package in R was utilized with `method = "rf"`, and 500 trees were used in the model to balance predictive stability against computational cost. Centering and scaling were performed on the predictors with this model to improve the model fit.
### Selection process
A modified backwards selection process was utilized to select the final logistic regression model. The process proceeded as follows:
1. Start with a full model with all predictors: inclination, apogee, perigee, eccentricity, and period.
2. Check for multicollinearity using variance inflation factors (VIF); if multicollinearity is present, remove one variable heuristically.
3. Repeat step 2 until no multicollinearity is present.
4. Remove any variables with p-values greater than 0.05 for marginal tests on the coefficients.
This process was performed and repeated until two conditions were met:
1. The final included variables contributed significantly to the model.
2. The final included variables had physical significance and straightforward interpretations.
This allowed the model to have both high performance and meaningful interpretation. During this process, variable permutation importance from equivalent random forest models was assessed using the `caret` function `varImp` [@Kuhn:random-forest:2019] to check the importance of the variables with a different modeling method. This helped support the decisions made from the results of the intermediate logistic regression models and aided in the final model selection. A minimal sketch of one selection iteration is shown below.
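The sketch below is for illustration only (it is not run as part of the report) and assumes the `sat_train` data frame constructed in the train-test split code on this page:
```r
# Illustration only: one backward-selection iteration.
# Assumes sat_train as constructed in the split-data chunk.
fit_full <- glm(
  starlink ~ inclination_degrees + apogee_km + perigee_km +
    eccentricity + period_minutes,
  family = binomial, data = sat_train)

performance::check_collinearity(fit_full) # VIFs; drop one collinear variable, refit
summary(fit_full)$coefficients            # marginal p-values for the removal step
```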
## Column {.tabset}
### Stratified train-test split
```{r split-data}
train_prop <- 0.75
sat_fit <- sat_pick |>
mutate(
starlink = factor(is_starlink)
) |>
drop_na()
set.seed(100)
train_index <- createDataPartition(sat_fit$is_starlink,
p = train_prop,
list = FALSE)
sat_train <- sat_fit[train_index,]
sat_test <- sat_fit[-train_index,]
make_table <-
as_tibble(
cbind(
rbind(
table(sat_train$sat_type),
table(sat_test$sat_type)),
c("Train","Test")
),
.name_repair = "universal") |>
rename(name=...3) |>
relocate(name)
```
Before fitting/training the models, all rows containing missing information in the kept variables were removed (see code for details). This resulted in `r nrow(sat_fit)` observations in the data set. Subsequently, a stratified train-test split was performed on the data to provide representative proportions of the two classes in the training and test data using the caret function `createDataPartition`. `r sprintf("%.f",train_prop*100)`% of the data was kept for the training data.
```{r table-split}
DT::datatable(
make_table,
rownames = FALSE,
colnames =
c(
"Partition",
"Starlink count",
"Other count"
)
)
```
### Tuning
Tuning was performed on models with available parameters. During the tuning process, the results from 10-fold cross validation with 3 repeats were averaged, and the accuracy was used as the performance metric to be maximized (positive = "Starlink"). The caret function `trainControl` was utilized to control this process (see code for details).
#### Logistic
No tuning was performed.
#### K-Nearest Neighbors
The number of nearest neighbors, $k$, was tuned.
#### Random forest
Since only 2 predictors were present in the final model, the random forest tuning grid was trivial (`mtry`, the number of variables sampled at each split, could only be 1 or 2), so the model was effectively not tuned.
### Performance metrics
Specific performance metrics for just the logistic regression were utilized:
**Goodness of fit:** An analysis of variance (ANOVA) likelihood ratio test comparing the final model to a null model was performed for statistical significance using the `anova` function from the `stats` package.
**Pseudo-R^2^:** McFadden, Cox & Snell, and Nagelkerke pseudo-R^2^ metrics were utilized to gauge the performance of the model using the `pR2` function from the `pscl` package.
**Multicollinearity:** Variance inflation factors (VIF) were computed to assess multicollinearity using the `check_collinearity` function from the `performance` package.
Other performance metrics were utilized to compare all three models:
**Accuracy:** The proportion of all cases identified correctly as positive (Starlink satellites) or negative (other satellites).
**Sensitivity (Recall):** The proportion of positive cases (Starlink satellites) identified correctly.
**Specificity:** The proportion of negative cases (other satellites) identified correctly.
**Kappa:** Cohen's kappa, a metric that describes how much the predictions agree with the actual classes beyond what would be expected from random guessing. The formulas for these metrics are shown below.
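In terms of the confusion matrix counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the first three metrics are:
$$
\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \text{Sensitivity}=\frac{TP}{TP+FN},\qquad \text{Specificity}=\frac{TN}{TN+FP}\qquad,
$$
and kappa compares the observed accuracy $p_o$ to the accuracy $p_e$ expected from random guessing: $\kappa=\frac{p_o-p_e}{1-p_e}$.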
All metrics were extracted using values from the caret functions `confusionMatrix`, `sensitivity`, and `specificity`. The metrics were computed for both the training data and the test data partitions.
Finally, receiver operating characteristic (ROC) curves and their corresponding area under the curve (AUC) values were computed for all three models on the test data to investigate the performance of the models. An ROC curve plots sensitivity against 1 - specificity as the classification threshold varies; an AUC near 1 indicates near-perfect separation, while an AUC of 0.5 corresponds to random guessing.
# Results
## Column {.tabset}
### Logistic process
To fit the logistic model, all available variables were included. Following the selection process defined on the Methods page, the model was reduced until no multicollinearity was present, leaving only two variables. Both the p-values of the logistic regression models without multicollinearity and the random forest variable importance computed with all predictors indicated that inclination was a critical variable. The contenders for the second variable in the model were:
- Apogee
- Perigee
- Period
Since low Earth orbit (LEO) is a type of nearly circular orbit [@UCS:manual:2017], these are all related quantities [@Sanny:orbits:2016]: in a nearly circular orbit, the apogee and perigee are nearly equal, and together they determine the period through Kepler's third law. Because of this, **apogee** was chosen as the second variable in the final model due to ease of interpretation. The apogee is approximately the farthest a satellite gets from the Earth's surface, so it can be easily visualized and interpreted. The equation and coefficients of the final model using **inclination** and **apogee**, the corresponding odds ratios ($OR_i=e^{\beta_i}$ for coefficient $\beta_i$), and p-values are shown below:
**Final model equation**
```{r fit-orbit-incl-apogee}
fit_orbit_incl_apogee <- sat_train |>
glm(
starlink~
inclination_degrees+
apogee_km,
family = binomial,
data = _)
# Obtain probability of predictions for the test data
test_prob_log <- predict(fit_orbit_incl_apogee,newdata = sat_test, type = "response")
# Summarize results
fit_coeffs_vec <- coef(fit_orbit_incl_apogee)
fit_coeffs_str <- sprintf("%.4G", fit_coeffs_vec)
fit_coeffs_df <-
tidy(fit_orbit_incl_apogee) |>
mutate(
OR = exp(estimate)
) |>
mutate(
term = case_match(term,
"(Intercept)" ~ "(Intercept)",
"apogee_km"~"Apogee (km)",
"inclination_degrees"~"Inclination (degrees)")
) |>
select(
c(
term,
OR,
estimate,
std.error,
p.value
)
) |>
mutate(
across(
where(is.numeric),
\(x) sprintf("%.4G",x)
)
)
```
$$
\ln\left(\widehat{\frac{p}{1-p}}\right)=`r sprintf("%.4G", fit_coeffs_vec["(Intercept)"])``r ifelse(fit_coeffs_vec["inclination_degrees"]>=0,"+","")``r sprintf("%.4G", fit_coeffs_vec["inclination_degrees"])` \cdot inclination `r ifelse(fit_coeffs_vec["apogee_km"]>=0,"+","")``r sprintf("%.4G", fit_coeffs_vec["apogee_km"])` \cdot apogee
$$
**Final model coefficients**
```{r}
DT::datatable(
fit_coeffs_df,
rownames = FALSE,
colnames =
c(
"Term",
"OR",
"Estimate",
"Standard error",
"P-value"
)
)
```
### Logistic performance
The performance of the final logistic regression model with inclination and apogee as the predictors was investigated with the following metrics. Short descriptions can be found on the Methods page.
#### Goodness of fit
The ANOVA table for the likelihood ratio test is shown below:
```{r goodness-log}
fit_null <- sat_train |>
glm(starlink ~ 1,
family = binomial,
data = _)
anova(fit_null,fit_orbit_incl_apogee,test = "Chisq")
```
The small p-value indicated that there is sufficient evidence to conclude that at least one of the predictors contributes to the model, suggesting that the model predicts the type of satellite substantially better than the null model.
#### Pseudo-R^2^
```{r pseudo-rsq}
invisible(
capture.output(
pseudo_rsq <- pR2(fit_orbit_incl_apogee)
)
)
```
- McFadden pseudo-R^2^ = `r sprintf("%.4G",pseudo_rsq["McFadden"])` \> 0.4, which indicates an excellent fit.
- Cox & Snell pseudo-R^2^ = `r sprintf("%.4G",pseudo_rsq["r2ML"])`, which also suggests high performance.
- Nagelkerke pseudo-R^2^ = `r sprintf("%.4G",pseudo_rsq["r2CU"])`, which describes that the model achieves around `r sprintf("%.2G",100*pseudo_rsq["r2CU"])`% of the maximum possible improvement compared to the null model.
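For reference, McFadden's pseudo-R^2^ compares the maximized log-likelihoods of the fitted model and the null model,
$$
R^2_{\text{McFadden}}=1-\frac{\ln\hat{L}_{\text{model}}}{\ln\hat{L}_{\text{null}}}\qquad,
$$
so for a binary logistic model it equals the proportional reduction in deviance relative to the null model.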
#### Multicollinearity
As shown in the output below, all VIF values were below 5 and well below 10, so multicollinearity was not an issue with this model:
```{r log-multicoll}
check_collinearity(fit_orbit_incl_apogee)
```
### ML tuning
```{r model-KNN}
set.seed(100)
ctrl <- trainControl(method="repeatedcv",repeats = 3)
fit_knn <- train(
starlink~
inclination_degrees+
apogee_km,
data = sat_train,
method = "knn",
trControl = ctrl,
preProcess = c("center","scale"),
tuneLength = 20)
# Obtain probability of predictions for the test data
test_prob_knn <- predict(fit_knn,newdata = sat_test, type = "prob")$"TRUE"
```
Both machine learning (ML) models were trained with inclination and apogee as the predictors. Since there were only two predictors, the random forest tuning grid was trivial (`mtry` could only be 1 or 2), so no meaningful tuning was performed for the random forest model. As shown below, the number of neighbors was tuned for the k-nearest neighbors algorithm by optimizing the accuracy of the model. From this, `r fit_knn$bestTune` nearest neighbors was chosen as the optimal parameter value.
#### K-Nearest Neighbors Tuning
```{r plot-knn}
plot(fit_knn)
b <- summary(fit_knn)
```
```{r model-rf}
set.seed(100)
ctrl <- trainControl(method="repeatedcv",repeats = 3)
invisible(
capture.output(
fit_rf <- train(
starlink~
inclination_degrees+
apogee_km,
data = sat_train,
method = "rf",
ntree = 500,
trControl = ctrl,
preProcess = c("center","scale"),
tuneLength = 20)
)
)
# Obtain probability of predictions for the test data
test_prob_rf <- predict(fit_rf,newdata = sat_test, type = "prob")$"TRUE"
```
## Column {.tabset}
### ROC and AUC
#### ROC curves
```{r roc-plot, fig.width=fig_width, fig.height=fig_height, out.width = "80%"}
model_var_name <- "Model"
extract_roc <- function(model_name, roc_output){
roc_df <-
tibble(
Sensitivity = roc_output$sensitivities,
Specificity = roc_output$specificities,
one_sub_spec = 1 - Specificity,
model = model_name
)
return(roc_df)
}
roc_log <-
roc(
response = sat_test$starlink,
predictor = test_prob_log,
levels = c("FALSE","TRUE"),
direction = "<"
)
roc_knn <-
roc(
response = sat_test$starlink,
predictor = test_prob_knn,
levels = c("FALSE","TRUE"),
direction = "<"
)
roc_rf <-
roc(
response = sat_test$starlink,
predictor = test_prob_rf,
levels = c("FALSE","TRUE"),
direction = "<"
)
x1 <-
extract_roc(
"Logistic Regression",
roc_log
)
x2 <-
extract_roc(
"K-Nearest Neighbors",
roc_knn
)
x3 <-
extract_roc(
"Random Forest",
roc_rf
)
roc_df_full <-
rbind(x1, x2, x3) |>
mutate(
model = factor(model),
model = relevel(model,ref = "Logistic Regression")
)
roc_df_full |>
ggplot(
aes(
x = one_sub_spec,
y = Sensitivity,
colour = model,
linetype = model,
)
) +
geom_line() +
geom_abline(
slope = 1,
intercept = 0,
colour = "gray",
linetype = "solid") +
labs(
x = "1 - Specificity",
colour = model_var_name,
linetype = model_var_name
) +
default_theme() +
theme(
axis.text =
element_text(
size = 12
)
)
```
#### AUC values
Logistic regression: `r sprintf("%.4G",roc_log$auc)`
K-nearest neighbors: `r sprintf("%.4G",roc_knn$auc)`
Random forest: `r sprintf("%.4G",roc_rf$auc)`
The ROC curves, with steep initial slopes, along with the AUC values near 1, suggest that all models perform well.
### Performance comparison
The accuracy, sensitivity, specificity, and kappa performance metrics were calculated for all models with both the training data and the test data. While the logistic regression model scored slightly lower than the k-nearest neighbors and random forest models, it still performed quite well relative to the ML models.
#### Plot of metrics on train and test data
```{r collect-perf-metrics, fig.width=fig_width, fig.height=fig_height*1.25, out.width="100%"}
model_names <- c("Logistic Regression","K-Nearest Neighbors","Random Forest")
model_list <- list(fit_orbit_incl_apogee,fit_knn,fit_rf)
model_pred_fncs <-
list(
function(model,data){
factor(
ifelse(
predict(
model,
newdata = data,
type = "response") > 0.5,
TRUE,
FALSE
)
)
},
\(model,data) predict(model,newdata = data),
\(model,data) predict(model,newdata = data)
)
data_list <-
list(sat_test, sat_train)
data_names <-
c("Test","Train")
# collect confusion matrices for all combinations
conf_results_list <- list()
conf_results_name <- c()
conf_results_model <- c()
sen_vec <- c()
spec_vec <- c()
pred_list <- list()
iter <- 1
for(ii in 1:length(model_names)){
# ii is which model
for(jj in 1:length(data_list)){
# jj is which data
pred_list[[iter]] <-
model_pred_fncs[[ii]](
model_list[[ii]],
data_list[[jj]])
conf_results_list[[iter]] <-
confusionMatrix(
pred_list[[iter]],
data_list[[jj]]$starlink,
positive = "TRUE"
)
conf_results_name[iter] <- data_names[jj]
conf_results_model[iter] <- model_names[ii]
sen_vec[iter] <-
sensitivity(pred_list[[iter]],
data_list[[jj]]$starlink,
positive = "TRUE")
spec_vec[iter] <-
# use specificity(), not sensitivity(): proportion of negatives correct
specificity(pred_list[[iter]],
data_list[[jj]]$starlink,
negative = "FALSE")
iter <- iter + 1
}
}
acc_vec <- sapply(conf_results_list,\(x) x$overall["Accuracy"])
kappa_vec <- sapply(conf_results_list,\(x) x$overall["Kappa"])
perf_metrics <- tibble(
model = conf_results_model,
check = conf_results_name,
Accuracy = acc_vec,
Sensitivity = sen_vec,
Specificity = spec_vec,
Kappa = kappa_vec
) |>
mutate(
check = relevel(
as_factor(check),
ref = "Train"
),
model = relevel(
as_factor(model),
ref = "Logistic Regression"
)
)
p_fnc <- function(y_var){
perf_metrics |>
ggplot(
aes(
x = model,
y = {{y_var}},
fill = check
)
) +
geom_col(
position = "dodge"
) +
default_theme()
}
p_acc <- p_fnc(Accuracy)
p_sen <- p_fnc(Sensitivity)
p_spec <- p_fnc(Specificity)
p_k <- p_fnc(Kappa)
p_acc/p_sen/p_spec/p_k +
plot_layout(
axes = "collect",
guides = "collect"
)
```
### Confusion matrices
All models performed well, with very low percentages of incorrect predictions in each data set.
**Note:** The values on the confusion matrices are the percentages of the total number of values in the respective data sets (`r nrow(sat_train)` observations for the training data, `r nrow(sat_test)` observations for the test data).
#### All confusion matrices
```{r conf-matrices, fig.width=fig_width*1.5, fig.height=fig_height*1.75, out.width="100%"}
df_conf_matrix_full <- tibble()
for(iter in 1:length(conf_results_name)){
curr_conf <-
conf_results_list[[iter]]
curr_df <- expand_grid(ii = 1:nrow(curr_conf$table),
jj = 1:ncol(curr_conf$table))
curr_df <-
curr_df |>
mutate(
pred = c("Other","Starlink")[ii],
ref = c("Other","Starlink")[jj],
val = pmap_vec(list(ii,jj),\(curr_row,curr_col) curr_conf$table[curr_row,curr_col]),
perc_val_dec = val/length(pred_list[[iter]]),
model = conf_results_model[iter],
check = conf_results_name[iter]
)
df_conf_matrix_full <-
rbind(
df_conf_matrix_full,
curr_df
)
}
df_conf_matrix_full <-
df_conf_matrix_full |>
mutate(
check = relevel(
as_factor(check),
ref = "Train"
),
model = relevel(
as_factor(model),
ref = "Logistic Regression"
),
pred = relevel(
as_factor(pred),
ref = "Other"
),
ref = relevel(
as_factor(ref),
ref = "Other"
)
)
text_size <- 12
df_conf_matrix_full |>
ggplot(
aes(
x = ref,
y = pred
)
) +
geom_tile(
mapping = aes(fill = perc_val_dec*100)
) +
geom_text(
mapping = aes(label = sprintf("%.1f%%",perc_val_dec*100)),
colour = "white"
) +
scale_y_discrete(
limits = rev(levels(df_conf_matrix_full$pred))
) +
scale_x_discrete(
position = "top"
) +
default_theme() +
default_fill_continuous(
name = "Percent of data set (%)"
) +
facet_grid(model~check) +
labs(
x = "Actual",
y = "Predicted"
) +
theme(
strip.text =
element_text(
size = text_size
),
strip.text.y =
element_text(
angle = -45,
),
axis.text =
element_text(
size = text_size
),
axis.title =
element_text(
size = text_size
),
legend.text =
element_text(
size = text_size
),
legend.title =
element_text(
size = text_size
),
legend.position = "bottom"
)
```
### Comparison
All three models performed very well, with all accuracy values above 90%. On every metric, the logistic regression model performed very similarly to the other machine learning models, indicating that it could provide comparable results while keeping the coefficients simple to interpret. Since its accuracy was still very high, its rates of false positives and false negatives were very low, and it performed very similarly on the training and test data, the logistic regression model was determined to provide the best balance between complexity and accuracy and to be the best model for this study. However, the fact that both machine learning models had almost perfect accuracy scores hinted that some overfitting might be present. To investigate this, decision boundary plots were constructed; see the next page for details.
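As a complementary numerical check, the train-test accuracy gap can be computed directly from the `perf_metrics` tibble built above; a minimal sketch (not evaluated here), where a large positive gap would suggest overfitting:
```{r accuracy-gap, eval=FALSE}
# sketch: train-vs-test accuracy gap per model, from perf_metrics
perf_metrics |>
  select(model, check, Accuracy) |>
  pivot_wider(names_from = check, values_from = Accuracy) |>
  mutate(gap = Train - Test) |>
  arrange(desc(gap))
```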
# Decision boundary
## Column {data-width="550"}
### Decision boundary plots
```{r decision-boundary-setup}
n <- 500
pred_tibble <-
tibble(
apogee_km = rep(
seq(
min(
sat_fit$apogee_km
),
max(
sat_fit$apogee_km
),
length = n
),
n
),
inclination_degrees = rep(
seq(
min(
sat_fit$inclination_degrees
),
max(
sat_fit$inclination_degrees
),
length = n
),
each = n
),
)
build_decision_plot <- function(df){
df |>
ggplot(
aes(
y = apogee_km,
x = inclination_degrees,
z = prediction
)
) +
geom_raster(
aes(fill = prediction)
) +
geom_point(
data = sat_fit,
mapping = aes(
y = apogee_km,
x = inclination_degrees,
z = NULL,
shape = sat_type,
colour = sat_type
),
alpha = default_point_alpha_level,
size = 2
) +
geom_contour(
breaks = c(0.5),
colour = "black",
linetype = "dashed",
linewidth = 0.8
) +
labs(
colour = leg_name_strl,
shape = leg_name_strl,
fill = leg_name_raster
) +
default_fill_continuous(
limits = c(0,1),
name = "Probability of being Starlink"
) +
default_theme()
}
pred_tibble_log <- pred_tibble
pred_tibble_knn <- pred_tibble
pred_tibble_rf <- pred_tibble
pred_tibble_log$prediction <-
predict(fit_orbit_incl_apogee,pred_tibble_log,type = "response")
pred_tibble_knn$prediction <-
predict(fit_knn,pred_tibble_knn,type = "prob")$"TRUE"
pred_tibble_rf$prediction <-
predict(fit_rf,pred_tibble_rf,type = "prob")$"TRUE"
```
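As an aside, the prediction grid above could equivalently be built with `tidyr::expand_grid()`, which may read more clearly than the paired `rep()`/`seq()` calls. A sketch (not evaluated; the row order differs from the original, but the set of grid points is identical):
```{r grid-alternative, eval=FALSE}
# sketch: the same n x n grid of (apogee, inclination) combinations,
# spanning the observed ranges in sat_fit
pred_tibble_alt <-
  expand_grid(
    apogee_km = seq(
      min(sat_fit$apogee_km),
      max(sat_fit$apogee_km),
      length.out = n
    ),
    inclination_degrees = seq(
      min(sat_fit$inclination_degrees),
      max(sat_fit$inclination_degrees),
      length.out = n
    )
  )
```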
```{r decision-log}
d_log <- build_decision_plot(pred_tibble_log) +
labs(
title = "Logistic Regression"
)
```
```{r decision-knn}
d_knn <- build_decision_plot(pred_tibble_knn) +
labs(
title = "K-Nearest Neighbors"
)
```
```{r decision-rf}
d_rf <- build_decision_plot(pred_tibble_rf) +
labs(
title = "Random Forest"
)
```
```{r decision-plot, fig.width=1.5*fig_width,fig.height=1.5*4/3*fig_height,out.width="100%", fig.cap="Decision boundary plots for all models. The decision boundary, plotted with the dashed contour lines, is the threshold where the probability of being a Starlink satellite is 0.5."}
d_combo <-
d_log+
d_knn+
d_rf+
guide_area() +
plot_layout(
guides = "collect",
axes = "collect",
ncol = 2) &
labs(
x = "Inclination (degrees)",
y = "Apogee (km)"
) &
theme(
axis.text = element_text(size = 14)
)
d_combo
```
## Column {data-width="450"}
### Comparison
All models show clear boundaries (dashed contour lines) that capture much of the difference between the two classes. However, the k-nearest neighbors and random forest models both produce quite complicated decision boundaries. This partially explains the very high performance shown in the results section: those metrics could be inflated because the models might be overfitting the data. Although a train-test split was performed, satellites in the same part of a constellation follow very similar orbits, so the training and test sets likely contained satellites from the same "bands," allowing those models to effectively memorize the bands rather than learn a general pattern. The logistic regression model, however, still provided comparable performance with a very simple, easy-to-interpret linear decision boundary. Because of this, the logistic regression model was determined to be the best model for interpretation.
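One way to probe this suspicion, not pursued in this study, would be grouped resampling: holding out entire orbital "bands" or constellations at a time, so a model cannot score well simply by having seen other satellites from the same band. A minimal sketch using `caret::groupKFold()`, assuming the models are refit with `caret::train()` and assuming a hypothetical `band` grouping column (e.g., rounded apogee and inclination):
```{r grouped-cv-sketch, eval=FALSE}
# hypothetical: resample by orbital band so each fold's held-out
# satellites come from bands the model has never seen
band_folds <- groupKFold(sat_train$band, k = 5)  # `band` is assumed
ctrl <- trainControl(method = "cv", index = band_folds)
fit_knn_grouped <- train(
  starlink ~ inclination_degrees + apogee_km,
  data = sat_train,
  method = "knn",
  trControl = ctrl
)
```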
### Logistic interpretation
The logistic regression model indicated that lower apogees and lower inclinations tended to be associated with Starlink satellites. In fact, for the satellites in this study, a decrease in inclination of 10° corresponded to around `r sprintf("%.2G",exp(coef(fit_orbit_incl_apogee)["inclination_degrees"]*-10))` times higher odds of a satellite being a Starlink satellite among satellites with the same apogee. Similarly, a decrease in apogee of 100 km corresponded to around `r sprintf("%.2G",exp(coef(fit_orbit_incl_apogee)["apogee_km"]*-100))` times higher odds of a satellite being a Starlink satellite among satellites with the same inclination.
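These multipliers follow directly from the model: in a logistic regression, changing a predictor by Δ multiplies the odds by exp(β·Δ). A minimal sketch of the arithmetic, using a made-up coefficient purely for illustration:
```{r odds-arithmetic, eval=FALSE}
# illustrative only: this coefficient is hypothetical, not from the fitted model
beta_incl <- -0.15        # hypothetical log-odds change per degree
delta     <- -10          # a 10-degree decrease in inclination
exp(beta_incl * delta)    # odds multiplier (~4.5), holding apogee fixed
```
But why is this important? See the Discussion page for the answer.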
# Discussion {#sec-discussion}
## Column {.tabset data-width="500"}
### Conclusion
This study set out to answer the question, "Are the orbits of Starlink satellites different from those of other satellites in low Earth orbit?" The logistic regression model provided an answer by showing that lower inclinations and lower apogees tend to be associated with Starlink satellites, but why is this important? To understand this, two other LEO satellite communication constellations were compared to Starlink:
- Iridium^®^ NEXT: A constellation of over 70 satellites providing global communication coverage [@Iridium:2019].
- Eutelsat OneWeb: A constellation of around 600 LEO satellites also providing global communication coverage [@OneWeb:2025].
The plot below shows that Starlink is the only one of these three constellations whose satellites are mostly in non-polar, inclined orbits. These non-polar-orbiting satellites do not cover the whole Earth, only a band that excludes extreme northern and southern latitudes such as Sweden, Alaska, and parts of Canada [@Mohan:2024]. This suggests that service could be prioritized for middle-latitude customers. Will this be the trend for future satellite internet constellations? With Amazon starting its planned internet constellation of over 3000 LEO satellites [@Kohnstamm:2025], among others, only time will tell. However, this study demonstrated that logistic regression on this data could provide a simple, accurate, and easy-to-understand model of the differences between Starlink satellites and the other satellites operational in 2023. Whether one is an avid researcher or a nighttime stargazer, this study can help make the unique aspects of the Starlink constellation shine brightly like the stars in the sky.
#### Plot comparing orbit types
```{r comm-sats, fig.width=fig_width, fig.height=fig_height*0.8,out.width="100%"}
const_irid <-
sat_fit |>
filter(grepl("iridium",sat_name)) |>
mutate(
constellation = "Iridium NEXT"
)
const_oneweb <-
sat_fit |>
filter(grepl("oneweb",sat_name)) |>
mutate(
constellation = "Eutelsat OneWeb"
)
const_strl <-
sat_fit |>
filter(is_starlink == TRUE) |>
mutate(
constellation = "Starlink"
)
df_const <-
bind_rows(
const_irid,
const_oneweb,
const_strl
)
df_const |>
ggplot(
aes(
x = type_of_orbit,
fill = type_of_orbit
)
) +
geom_bar(
show.legend = FALSE
)+
facet_wrap(
vars(
constellation
),
scales = "free_y"
) +
labs(
x = ""
) +
default_theme() +
theme(
axis.text.x =
element_text(
angle = 30,
hjust = 0.5,
vjust = 0.75
)
)
```
### Limitations/future direction
The satellite landscape is rapidly changing and developing. New information could therefore emerge that would greatly improve the study, limiting the long-term relevance of its conclusions; a natural future direction is to incorporate such information as it becomes available. Also, the models in this study compared individual satellites, not satellite constellations, which limits how well the conclusions can compare full constellations. Another future improvement would be to group the satellites by constellation and use models that predict membership among more than two satellite groups, e.g. multinomial logistic regression, as sketched below. Despite these limitations, this study was still able to provide an accurate and easy-to-interpret model for understanding the differences between Starlink satellites and other satellites in 2023.
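As a rough illustration of that direction, a multinomial model could be fit on the constellation labels already assembled for the comparison plot. A minimal sketch, assuming the `nnet` package is available and reusing the `df_const` tibble from the Conclusion tab:
```{r multinom-sketch, eval=FALSE}
library(nnet)
# sketch: predict constellation membership (Iridium NEXT, Eutelsat OneWeb,
# or Starlink) from the two orbital parameters used throughout this study
fit_const <- multinom(
  constellation ~ inclination_degrees + apogee_km,
  data = df_const |> mutate(constellation = as_factor(constellation))
)
summary(fit_const)
```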
## Column {data-width="500"}
### References/acknowledgements
Works Cited
::: {#refs .ref-section}
:::
<br>
Much of the other information came from lectures in MTH 543 Linear Models, Fall 2025, taught by Dr. Ying-Ju Tessa Chen. The author would like to acknowledge her teaching in this class and her insightful advice on this project.
**Note:** AI tools were used to provide basic programming examples for particular functions. All other code and all other information were written by the author.
### About the author
Lucas Hung is a graduate student at the University of Dayton pursuing a Master of Science in Aerospace Engineering. He earned a Bachelor of Mechanical Engineering from the University of Dayton with a concentration in Aerospace and a minor in Music. Professionally, Lucas is passionate about aerodynamics, data analytics, machine learning, and other aerospace-related topics. On the side, he likes to build and fly remote-controlled aircraft and play different musical instruments.
LinkedIn:
<https://www.linkedin.com/in/lucaskh/>