The Factors Affecting Soybean Production in Indonesia Using Robust Regression with Least Median of Squares (LMS) Estimation

Soybean is a major source of protein that improves the nutrition of Indonesians. The demand for soybeans is increasing, but domestic production is not sufficient, so soybean production in Indonesia must be increased. This study aims to determine the factors that influence soybean production in Indonesia. The data on soybean production in Indonesia contained outliers. Outliers cause the residuals to be non-normally distributed, so the assumption of normality is violated. This problem was solved using robust regression. The estimator used was the Least Median of Squares (LMS), because this estimator has a high breakdown point. The results of the study show that soybean production in Indonesia was influenced by field area, the number of soybean seeds, and rainfall; ordered by influence, the factors are the number of soybean seeds, field area, and rainfall. To increase soybean production, the government should socialize soybean cultivation practices and ensure the availability of soybean seeds in Indonesia.


Introduction
Soybean is a plant used as a source of vegetable protein to improve the nutrition of Indonesians. In 2018, total soybean production in Indonesia was 82.598 thousand tons, while the demand for soybeans reached 2.5 tons (Subaedah et al., 2019). So far, domestic soybean production has not been able to meet the demand for soybeans in Indonesia. Therefore, Indonesia imports soybeans from other countries to fulfill this demand. If consumption increases, imports and production increase as well; if imports increase, then consumption increases but production decreases (Ningrum et al., 2018). Domestic production meets only about 30% of the demand for soybeans in Indonesia, while the remaining 70% is met by imports from other countries. This indicates the need to increase soybean production in Indonesia. Before soybean production can be increased, the factors influencing it must be known. The factors influencing soybean production in Indonesia are field area, the number of soybean seeds, and rainfall (Herawan, 2015).
Based on research conducted by Nasution (2017) using the regression method, the planted areas of rice, corn, and soybeans had a strong influence on production. According to Mustikawati et al. (2018), the productivity of soybeans on dry land is 64.25% lower than on paddy fields, and it drops further when drought occurs while the pods are forming and filling. According to Suhartini (2018), the area of paddy fields has decreased by 0.24% each year. Ruminta et al. (2020) explained that there is a significant relationship between changes in rainfall and soybean production, productivity, and planting area. According to Rasyid (2013), the soybean seeds used affect plant height and the number of soybeans produced.
Regression analysis is a mathematical model that can be used to find the relationship pattern between two or more variables (Montgomery, 2009). Ordinary Least Squares (OLS) is the method most often used to obtain parameter estimates in regression modeling. For an OLS model to be considered a good model, the classical assumptions must be fulfilled. In some cases these assumptions are not fulfilled, and one of the causes is outliers. Outliers cannot simply be discarded because they might contain important information. To resolve this problem, a method that is resistant to outliers is required; this method is known as robust regression (Febrianto et al., 2015).
Robust regression is a method used when the distribution of the residuals is not normal or when outliers affect the model. There are several robust regression estimators, such as the M, S, MM, LMS, and LTS estimators. Least Median of Squares (LMS) is a robust regression estimator that minimizes the median of the squared residuals (Rousseeuw & Leroy, 1987). LMS has a high breakdown point of 50%. The breakdown point measures the robustness of an estimator: the greater the breakdown point, the more robust the estimator. According to Daniel (2019), the LMS estimator shows better results than the OLS estimator because the resulting regression equation has a smaller error value.
The data on soybean production in Indonesia contain outliers. Outliers cannot simply be discarded because they carry important information. Therefore, this study used robust regression with LMS estimation to determine the factors influencing soybean production in Indonesia.

Regression analysis
Regression analysis is a mathematical model that can be used to find the relationship pattern between two or more variables. Simple linear regression is a regression model with only one independent variable (Neter et al., 1983). The multiple linear regression model is the extension of the simple linear regression model. In general, it can be written as

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ⋯ + β_k X_ik + ε_i,

where Y_i is the dependent variable for observation i, X_ij is the j-th independent variable for observation i, β_0, β_1, …, β_k are the regression coefficient parameters, and ε_i is the residual.
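As an illustration of fitting such a model, the sketch below estimates the coefficients of a two-predictor linear regression by least squares. The data are made up for illustration and are not the study's data:

```python
import numpy as np

# Hypothetical data: fit Y = b0 + b1*X1 + b2*X2 by ordinary least squares.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y = np.array([3.1, 4.9, 9.2, 10.8, 15.1])

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(X1), X1, X2])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta)  # estimated coefficients [b0, b1, b2]
```

The same design-matrix construction extends to any number of predictors by adding further columns.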

F-Test and t-Test
The F-test is used to find out whether the independent variables jointly have a linear relationship with the dependent variable. F_value is compared with F_table at a significance level of 5%; if F_value is greater than F_table, then at least one independent variable has an effect. The formula of the F-test is

F = (R² / k) / ((1 − R²) / (n − k − 1)),    (2)

with F_table = F(k−1, n−k−1; α), where k is the number of independent variables and n is the number of data (Sugiono, 2008). The t-test is used to find out the significance of each independent variable's effect on the dependent variable. The formula of the t-test is

t = r √(n − 2) / √(1 − r²),

with t_table = t(n−k−1; α), where k is the number of independent variables and n is the number of data,

where t is the t-test value, r is the correlation coefficient, and n is the number of data. The conclusion is drawn by comparing t_value with t_table: if t_value is greater than or equal to t_table at a significance level of 5%, then the variable has a significant effect (Sugiono, 2008).
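Both statistics can be computed directly from the quantities defined above. The sketch below uses illustrative values (R² = 0.919 echoes the study's OLS fit, but n = 34 and the correlation r = 0.6 are assumptions, not values from the study's tables):

```python
import math

# F-test from the coefficient of determination, per the formula in the text:
# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
def f_statistic(r_squared, n, k):
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# t statistic from a correlation coefficient r:
# t = r * sqrt(n - 2) / sqrt(1 - r^2)
def t_statistic(r, n):
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

print(round(f_statistic(0.919, 34, 3), 2))  # → 113.46
print(round(t_statistic(0.6, 34), 2))       # → 4.24
```

Each computed statistic is then compared against the corresponding table value at the chosen significance level.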

Classical assumption test
The classical assumption test is a requirement for a model to be considered a good model that can be used in an analysis. The classical assumptions are checked with the normality test, non-multicollinearity test, homoscedasticity test, and non-autocorrelation test.

Normality Test
The normality test aims to determine whether the residuals of a regression model are normally distributed (Ghozali, 2011). The most frequently used statistical test is the Kolmogorov-Smirnov test, with statistic

D = sup_x |S(x) − F_0(x)|,

where S(x) is the empirical cumulative distribution function of the residuals and F_0(x) is the cumulative distribution function of the normal distribution. If the p-value of D is smaller than the significance level, the residuals are not normally distributed.
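A minimal sketch of the Kolmogorov-Smirnov distance against a standard normal, computed on standardized residuals (the residual values are made up for illustration):

```python
import math

def normal_cdf(z):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ks_statistic(residuals):
    """KS distance D between the standardized residuals and N(0, 1)."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = (sum((e - mean) ** 2 for e in residuals) / (n - 1)) ** 0.5
    z = sorted((e - mean) / sd for e in residuals)
    d = 0.0
    for i, zi in enumerate(z, start=1):
        f = normal_cdf(zi)
        # Compare the empirical CDF steps to the normal CDF on both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Illustrative residuals: one extreme value inflates the KS distance.
d = ks_statistic([-1.0, -0.5, 0.0, 0.4, 0.9, 8.0])
print(round(d, 3))
```

In practice D (or its p-value) is compared against the critical value at the 5% level.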

Non-multicollinearity test
The non-multicollinearity test is used to find out whether there is correlation among the independent variables in the regression model. A good regression model does not contain multicollinearity (Ghozali, 2011). Collinearity is measured using Variance Inflation Factors (VIF) with the formula

VIF_j = 1 / (1 − R_j²),   j = 1, 2, …, k,

where k is the number of independent variables and R_j² is the coefficient of determination of the regression of X_j on the other independent variables. A VIF value > 10 indicates strong multicollinearity.
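The VIF formula above can be implemented directly by regressing each predictor on the others. The predictors below are hypothetical, constructed so that one is nearly a copy of another:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return out

# Hypothetical predictors: X3 is nearly a copy of X1, so their VIFs are large.
rng = np.random.default_rng(0)
X1 = rng.normal(size=50)
X2 = rng.normal(size=50)
X3 = X1 + rng.normal(scale=0.01, size=50)
v = vif(np.column_stack([X1, X2, X3]))
print(v)  # VIFs for X1 and X3 exceed 10; X2 stays near 1
```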

Homoscedasticity test
The homoscedasticity test examines whether the variance of the residuals is unequal from one observation to another. The requirement that must be met in the regression model is the absence of heteroscedasticity (Ghozali, 2011). This study used the Breusch-Pagan test, with statistic

BP = ESS / 2,

where ESS is the explained sum of squares of the auxiliary regression of the scaled squared residuals on the independent variables.
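A sketch of the statistic, assuming the classic ESS/2 form of the Breusch-Pagan test; the two residual series below are simulated for illustration, one homoscedastic and one with variance growing in the predictor:

```python
import numpy as np

def breusch_pagan(X, e):
    """Classic Breusch-Pagan statistic: regress the scaled squared residuals
    on the predictors and return ESS / 2 (compared with chi-square(k))."""
    n = len(e)
    g = e ** 2 / (e @ e / n)              # residuals scaled by sigma_hat^2
    Z = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(Z, g, rcond=None)
    g_hat = Z @ coef
    ess = ((g_hat - g.mean()) ** 2).sum()  # explained sum of squares (ESS)
    return ess / 2.0

rng = np.random.default_rng(2)
x = np.linspace(1.0, 10.0, 100)
e_homo = rng.normal(scale=1.0, size=100)     # constant variance
e_hetero = rng.normal(scale=x, size=100)     # variance grows with x
bp_homo = breusch_pagan(x[:, None], e_homo)
bp_hetero = breusch_pagan(x[:, None], e_hetero)
print(bp_homo, bp_hetero)  # the heteroscedastic series gives a larger BP
```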

Non-Autocorrelation test
The non-autocorrelation test examines whether there is correlation between the error in period t and the error in the previous period in the linear regression model; if there is, it is called an autocorrelation problem (Ghozali, 2011). One method that can be used to detect autocorrelation is the Durbin-Watson (DW) test. The decision rule is as follows:

- If the DW value is between DU and 4 − DU, there is no autocorrelation.
- If the DW value is smaller than DL, there is positive autocorrelation.
- If the DW value is greater than 4 − DL, there is negative autocorrelation.
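The DW statistic itself is simple to compute; a sketch on made-up residuals (an alternating series, which exhibits strong negative autocorrelation and pushes DW toward 4):

```python
def durbin_watson(e):
    """DW = sum (e_t - e_{t-1})^2 / sum e_t^2; values near 2 mean no
    autocorrelation, near 0 positive, near 4 negative autocorrelation."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(x * x for x in e)
    return num / den

dw = durbin_watson([1, -1, 1, -1, 1, -1])
print(round(dw, 2))  # → 3.33
```

The computed DW is then placed against the DL and DU bounds from the Durbin-Watson table for the given n and k.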

Outlier detection
An outlier is an observation that is far (extreme) from the other data. One method to identify outliers is DFFITS (Difference in Fitted Values). DFFITS measures the standardized change in the predicted value when a particular observation is removed (Dewi et al., 2016). The formula of DFFITS is

DFFITS_i = t_i √(h_ii / (1 − h_ii)),  with  t_i = e_i √((n − k − 1) / (SSR(1 − h_ii) − e_i²)),

where e_i is the i-th residual, SSR is the sum of squared residuals, and h_ii is the leverage value. An observation is considered an outlier if |DFFITS_i| > 2√(k/n), where k is the number of parameters in the model and n is the number of data.
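The formula above can be sketched as follows on simulated data with one contaminated observation (the data and the contaminated index are assumptions for illustration, not the study's data):

```python
import numpy as np

def dffits(X, y):
    """DFFITS_i = t_i * sqrt(h_ii / (1 - h_ii)), with t_i the studentized
    residual as defined in the text; k counts the model parameters."""
    n, k = X.shape                         # k includes the intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix; h_ii on its diagonal
    h = np.diag(H)
    ssr = e @ e
    t = e * np.sqrt((n - k - 1) / (ssr * (1 - h) - e ** 2))
    return t * np.sqrt(h / (1 - h))

# Simulated data y = 2 + 3x with one moderately contaminated observation.
rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 2 + 3 * x + rng.normal(scale=0.5, size=30)
y[5] += 5.0                                # contaminate observation 5
X = np.column_stack([np.ones(30), x])
flags = np.abs(dffits(X, y)) > 2 * np.sqrt(X.shape[1] / 30)
print(np.where(flags)[0])  # observation 5 should be among the flagged indices
```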

Robust Regression
Robust regression is a method used to overcome the problem of outliers and is an important tool for analyzing data affected by them. When the data in a linear regression contain outliers that result in residuals that are not normally distributed, the robust regression model can be used (Rousseeuw & Leroy, 1987).

LMS Estimation
The basic principle of the LMS robust regression method is to fit the majority of the data after outliers have been identified as points that do not follow the pattern of the data (Rousseeuw & Leroy, 1987). Where OLS minimizes the sum of the squared residuals (∑_{i=1}^{n} e_i²), LMS minimizes the median of the squared residuals:

M_j = min{med e_i²} = min{M_1, M_2, …, M_s},

where e_i² is the squared residual of the fit and each M_j is the median squared residual of one candidate fit. The value of M_1 is obtained by finding subsets of the matrix X of h observations, with

h = ⌊n/2⌋ + ⌊(p + 1)/2⌋,

where n is the number of data and p is the number of parameters. According to Rousseeuw (1984), the weight w_ii is determined from a weight function: w_ii = 1 if |ε_i*| ≤ 2.5 and w_ii = 0 otherwise, where ε_i* = e_i / σ̂ and σ̂ = 1.4826 [1 + 5/(n − p)] √(M_j). After the weights w_ii are calculated, the diagonal weight matrix W can be formed, with off-diagonal entries w_ij = 0 for i ≠ j.
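A minimal sketch of the LMS idea, using exhaustive elemental subsets of p observations on made-up contaminated data (practical implementations sample subsets randomly and add the reweighting step described above; this sketch only does the median-minimizing search):

```python
import itertools
import numpy as np

def lms_fit(X, y):
    """Least Median of Squares sketch: fit exactly through every subset of p
    observations and keep the coefficients whose median squared residual over
    ALL observations is smallest (after Rousseeuw & Leroy, 1987)."""
    n, p = X.shape
    best_beta, best_med = None, float("inf")
    for subset in itertools.combinations(range(n), p):
        Xs, ys = X[list(subset)], y[list(subset)]
        try:
            beta = np.linalg.solve(Xs, ys)   # exact fit through p points
        except np.linalg.LinAlgError:
            continue                          # singular subset, skip it
        med = float(np.median((y - X @ beta) ** 2))
        if med < best_med:
            best_med, best_beta = med, beta
    return best_beta, best_med

# Hypothetical data: y = 1 + 2x with 3 of 12 observations corrupted.
x = np.arange(12.0)
y = 1 + 2 * x
y[[2, 5, 9]] += 30.0
X = np.column_stack([np.ones(12), x])
beta, med = lms_fit(X, y)
print(beta)  # close to [1, 2] despite 25% contamination
```

An OLS fit on the same data would be pulled toward the three corrupted points, which is exactly what the high breakdown point of LMS avoids.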

Data source and variables of the study
The data for this study were obtained from the Badan Pusat Statistik publication Statistik Indonesia 2020 and the Setjen Pertanian RI publication Statistik Sarana Pertanian 2019. The variables used were soybean production by province in 2019 (Y), field area (X_1), the number of soybean seeds (X_2), and rainfall (X_3).

Analysis method
The steps for modeling soybean production in Indonesia with LMS estimation included:
8. Sorting the observations by squared residual, then truncating them according to the value of h.
9. Estimating a new β with the h retained observations.
10. Repeating the first to sixth steps until the β value converges.
11. Calculating the weights w_ii.
12. Estimating the regression coefficients β with the weights w_ii.

Results and Discussion
The model of soybean production in Indonesia in 2019 with OLS was: Ŷ = 29523 + 0.0184 X_1 + 84.3 X_2 − 12.5 X_3. Based on this model, if the field area (X_1) increased by one unit while the other variables were held constant, soybean production would increase by 0.0184. If the number of soybean seeds (X_2) increased by one unit while the other variables were held constant, soybean production would increase by 84.3. If rainfall (X_3) increased by one unit while the other variables were held constant, soybean production would decrease by 12.5. This model had an R² of 91.9%: soybean production (Y) is explained by field area (X_1), the number of soybean seeds (X_2), and rainfall (X_3) to the extent of 91.9%, while the remaining 8.1% is explained by other variables.
The classical assumption tests were then applied. The normality test used Kolmogorov-Smirnov: based on Table 1, the p-value = 0.001588 < 0.05, so it can be concluded that the residuals were not normally distributed. Residuals that are not normally distributed must be subjected to outlier detection. The next assumption test was the non-multicollinearity test, with multicollinearity measured using VIF. Based on Table 2, the VIF values were < 10, so it can be concluded that there was no multicollinearity. The next assumption test was the homoscedasticity test using Breusch-Pagan. Based on Table 3, the p-value = 0.2668 > 0.05, so it can be concluded that the data were homogeneous, with no indication of heteroscedasticity. The last assumption test was the non-autocorrelation test using Durbin-Watson. Because the residuals were not normal, which can be caused by outliers, outlier detection was performed. Based on Table 6, F = 113.14 > F(2;30;0.05) = 0.302, or p-value < 0.05, so at least one independent variable affects soybean production in Indonesia. Based on Table 7, the number of soybean seeds and rainfall have |t| > t(30;0.05) = 2.042, or p-value < 0.05, and therefore significantly affect soybean production in Indonesia; meanwhile, the field area has |t| < t(30;0.05) = 2.042, or p-value > 0.05, and did not significantly affect soybean production in Indonesia.
After the F-test and t-test, the analysis continued with LMS estimation. The model of soybean production in Indonesia in 2019 was: Ŷ = 16440 + 0.0254 X_1 + 85.7 X_2 − 6.84 X_3. Based on this model, if field area (X_1), the number of soybean seeds (X_2), and rainfall (X_3) are all zero, soybean production in Indonesia is 16440 tons. If the field area (X_1) increased by one unit while the other variables were held constant, soybean production would increase by 0.0254. If the number of soybean seeds (X_2) increased by one unit while the other variables were held constant, soybean production would increase by 85.7. If rainfall (X_3) increased by one unit while the other variables were held constant, soybean production would decrease by 6.84. This model had an adjusted R² of 96.8% and an R² of 97.1%, which means that soybean production (Y) is explained by field area (X_1), the number of soybean seeds (X_2), and rainfall (X_3) to the extent of 97.1%, while the remaining 2.9% is explained by other variables. Based on Table 8, at least one independent variable affects soybean production in Indonesia. Based on Table 9, |t| > t(30;0.025) = 2.042, or p-value < 0.05, so the field area, the number of soybean seeds, and rainfall all significantly affect soybean production in Indonesia.
The soybean production model in Indonesia was first estimated using OLS. The classical assumption tests showed that the normality assumption was not fulfilled, while the homoscedasticity, non-autocorrelation, and non-multicollinearity assumptions were fulfilled. Because the normality assumption was not fulfilled, outliers were detected; there were 4 outliers in the data. The LMS estimation was therefore used for the soybean production data in Indonesia. In Table 8, the F-test value of the LMS estimation model is greater than the F-table value, so at least one independent variable has an effect. The t-test was conducted to find out which variables were influential; from Table 9 it can be concluded that the significant independent variables are the field area, the number of soybean seeds, and rainfall.

Conclusion
The model of soybean production in Indonesia in 2019 with LMS estimation was Ŷ = 16440 + 0.0254 X_1 + 85.7 X_2 − 6.84 X_3, with an R² of 97.1%. The most influential factors on soybean production in Indonesia were, in order, the number of soybean seeds, field area, and rainfall.