
Data leakage
Data leakage occurs when information from outside the training dataset unintentionally influences a machine learning model, leading to overly optimistic or misleading results. In other words, the model has access to data or clues that it would not have when making predictions in a real-world setting. This can happen when future data, duplicate records, or data derived from the outcome itself is included in training. As a result, the model may perform well during testing but fail in real use, because it has effectively "cheated" by relying on information it should not have had. Careful data handling, such as keeping training and test data strictly separated at every step, prevents leakage and keeps evaluation results trustworthy.
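
As a rough illustration, consider one common leakage pattern: fitting a preprocessing step (here a scaler) on the full dataset before splitting it, so statistics from the test rows bleed into training. The sketch below assumes scikit-learn and a synthetic dataset generated with make_classification; the specific estimator and parameters are illustrative choices, not a prescribed recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Leaky: the scaler is fitted on ALL rows before the split, so the
# training process indirectly sees statistics of the test set.
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0
)
leaky_model = LogisticRegression().fit(X_train, y_train)

# Safe: split first, then let a pipeline fit the scaler on the
# training portion only; the test set stays unseen until evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
safe_model = make_pipeline(StandardScaler(), LogisticRegression())
safe_model.fit(X_train, y_train)

print("leaky pipeline accuracy:", leaky_model.score(X_test, y_test))
print("safe pipeline accuracy: ", safe_model.score(X_test, y_test))
```

Wrapping preprocessing and the estimator in a single pipeline is a simple way to enforce the "split first, then fit" discipline, since the pipeline only ever sees the training data during fitting.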